Neural network weight distribution from a grid of memory elements
By designing an on-chip network with uninterrupted weight transfer in the memory element grid, the problems of weight delivery delay and congestion in neural networks are solved, achieving efficient and uninterrupted weight transfer, improving the performance and energy efficiency of neural inference chips, and making it suitable for high-performance computing of convolutional neural networks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INTERNATIONAL BUSINESS MACHINES CORPORATION
- Filing Date
- 2021-01-28
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, the delivery of neural network weights and parameters to the computing kernel suffers from latency and congestion, which limits the performance of neural inference chips, especially in large-scale neural network computations, where efficient, uninterrupted weight transfer cannot be achieved.
An on-chip network with uninterrupted weight delivery is adopted. By designing instruction and data buffers in the memory element grid, conflict-free weight distribution is achieved, ensuring that weight parameters are delivered to the computing core in a timely manner under pre-scheduled mode, thus avoiding network congestion and pauses.
It achieves efficient weight transfer without interruption in neural network computation, improving the performance and energy efficiency of neural inference chips, and is suitable for various computing modes, including high-performance evaluation of convolutional neural networks.
[Figure CN115362448B_ABST]
Abstract
Description
Background Technology
[0001] Embodiments of this disclosure relate to neural network processing, and more specifically, to neural network weight distribution from a grid of memory elements.
Summary of the Invention
[0002] According to embodiments of this disclosure, a neural inference chip for computing neural activations is provided. In various embodiments, the neural inference chip includes at least one neural core, a memory array, an instruction buffer, and an instruction memory. The memory array is operatively connected to the at least one neural core and includes a plurality of elements, each element including a memory and a horizontal buffer. The horizontal buffer of each element of the memory array communicates with the horizontal buffer of another element of the memory array or with the at least one neural core. The instruction buffer communicates with the memory array and has a location corresponding to each of the plurality of elements of the memory array. The instruction memory communicates with the instruction buffer and is adapted to provide at least one instruction to the instruction buffer. The instruction buffer is adapted to advance the at least one instruction between locations in the instruction buffer. When the memory of at least one element of the memory array contains data associated with the at least one instruction, the instruction buffer is adapted to provide the at least one instruction to that element from the location associated with that element. Each of the plurality of elements of the memory array is adapted to provide a block of data from its memory to its horizontal buffer in response to the arrival of an associated instruction from the instruction buffer. The horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural core.
[0003] Preferably, the present invention provides a neural inference chip, wherein: the instruction buffer is adapted to advance instructions between positions in the instruction buffer at a rate of one position per cycle, and the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to at least one neural core at a rate of one data block per cycle.
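The one-position-per-cycle behavior described above can be illustrated with a small cycle-level simulation. This is only a sketch under stated assumptions, not the patented controller: it assumes a one-dimensional array, that instructions enter the instruction buffer at position 0, that instructions and data blocks both move toward the cores in the same direction, and that an instruction is consumed when it reaches the position of its target element. The function name `simulate_linear_array`, the `(element, address)` instruction format, and the block labels are all hypothetical.

```python
def simulate_linear_array(memories, program):
    """Cycle-level sketch of a linear memory array (assumed model, see lead-in).

    memories: one dict per element, mapping address -> data block.
    program:  list of (element, address) instructions, issued one per cycle.
    Returns a list of (cycle, block) pairs in the order blocks reach the core.
    """
    k = len(memories)
    instr = [None] * k   # instruction buffer: one location per element
    horiz = [None] * k   # horizontal buffers: one per element
    pending = list(program)
    arrivals = []
    cycle = 0
    while pending or any(s is not None for s in instr) or any(h is not None for h in horiz):
        # The block in the last horizontal buffer is handed to the core.
        if horiz[-1] is not None:
            arrivals.append((cycle, horiz[-1]))
        # Every horizontal buffer passes its block to the next element.
        horiz = [None] + horiz[:-1]
        # Every instruction advances one position; a new one enters at position 0.
        instr = [pending.pop(0) if pending else None] + instr[:-1]
        # An instruction sitting at its target element triggers a memory read.
        for pos in range(k):
            ins = instr[pos]
            if ins is not None and ins[0] == pos:
                assert horiz[pos] is None, "two blocks collided in one buffer"
                horiz[pos] = memories[pos][ins[1]]
                instr[pos] = None
        cycle += 1
    return arrivals


# Three elements, four instructions that touch the elements in arbitrary order.
memories = [{0: "w0a", 1: "w0b"}, {0: "w1a"}, {0: "w2a"}]
program = [(2, 0), (0, 1), (1, 0), (0, 0)]
arrivals = simulate_linear_array(memories, program)
```

Under these assumptions, a block requested by the instruction issued in cycle i occupies the same position the instruction would have occupied, so no two blocks ever meet in one buffer, and the blocks reach the core in issue order, one per cycle, regardless of which element holds them.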
[0004] Preferably, the present invention provides a neural inference chip comprising an array of neural cores, the array of neural cores including the at least one neural core and having multiple rows.
[0005] Preferably, the present invention provides a neural inference chip in which the memory array is one-dimensional and the multiple elements of the memory array are arranged in one row and multiple columns.
[0006] Preferably, the present invention provides a neural inference chip in which the memory array is two-dimensional and multiple elements of the memory array are arranged in multiple rows and columns.
[0007] Preferably, the present invention provides a neural inference chip, wherein each element of the memory array further includes a vertical buffer, and the vertical buffer of each element of the memory array communicates with the vertical buffer of another element of the memory array.
[0008] Preferably, the present invention provides a neural inference chip, wherein: each of a plurality of elements of a memory array is adapted to provide a data block from its memory to its vertical buffer in response to the arrival of an associated instruction from an instruction buffer; each of a plurality of elements of the memory array is adapted to provide a data block from its vertical buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide a data block to the vertical buffer of another element of the memory array.
[0009] Preferably, the present invention provides a neural inference chip, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to at least one neural core at a rate of one data block per cycle, and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle.
[0010] Preferably, the present invention provides a neural inference chip, wherein each element of the memory array further includes a layover buffer, the layover buffer of each element of the memory array communicating with the horizontal and vertical buffers of that element of the memory array.
[0011] Preferably, the present invention provides a neural inference chip, wherein: each of a plurality of elements of a memory array is adapted to provide a data block from its memory to its vertical buffer in response to the arrival of an associated instruction from an instruction buffer; each of a plurality of elements of the memory array is adapted to provide a data block from its vertical buffer to its layover buffer; each of a plurality of elements of the memory array is adapted to provide a data block from its layover buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide a data block to the vertical buffer of another element of the memory array.
[0012] Preferably, the present invention provides a neural inference chip, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to at least one neural core at a rate of one data block per cycle; the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle; and the layover buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of that element of the memory array at a rate of one data block per cycle.
[0013] Preferably, the present invention provides a neural inference chip, wherein: the instruction memory is adapted to provide multiple instructions to the instruction buffer per cycle, each location of the instruction buffer is adapted to store multiple instructions, and the instruction buffer is adapted to advance multiple instructions between locations in the instruction buffer per cycle.
[0014] According to embodiments of the present disclosure, a neural inference chip for computing neural activations is provided. In various embodiments, the neural inference chip includes at least one neural core, a memory array, a plurality of instruction buffers, and a plurality of instruction memories. The memory array is operatively coupled to the at least one neural core and includes a plurality of elements, each element including a memory, a horizontal buffer, and a vertical buffer. The horizontal buffer of each element of the memory array communicates with the horizontal buffer of another element of the memory array or with the at least one neural core, and the vertical buffer of each element of the memory array communicates with the vertical buffer of another element of the memory array. The plurality of instruction buffers communicate with the memory array, each instruction buffer having a location corresponding to one of the plurality of elements of the memory array. The plurality of instruction memories each communicate with one of the plurality of instruction buffers. Each instruction memory is adapted to provide at least one instruction to its instruction buffer. Each instruction buffer is adapted to advance the at least one instruction between locations within the instruction buffer. When the memory of at least one element of the plurality of elements of the memory array contains data associated with the at least one instruction, each instruction buffer is adapted to provide the at least one instruction to that element from the location associated with that element. Each of the plurality of elements of the memory array is adapted to provide a data block from its memory to its vertical buffer in response to the arrival of an associated instruction from the instruction buffer. Each of the plurality of elements of the memory array is adapted to provide a data block from its vertical buffer to its horizontal buffer. The vertical buffer of each element of the memory array is adapted to provide a data block to the vertical buffer of another element of the memory array. The horizontal buffer of each element of the memory array is adapted to provide a data block to the horizontal buffer of another element of the memory array or to the at least one neural core.
[0015] According to embodiments of this disclosure, a method and computer program product for computing neural activations are provided. At least one instruction is provided from an instruction memory to an instruction buffer. The at least one instruction advances between locations within the instruction buffer. The at least one instruction is provided from the instruction buffer to at least one element of a plurality of elements of a memory array when the memory of that element contains data associated with the at least one instruction. The memory array includes the plurality of elements, each element including a memory and a horizontal buffer, the horizontal buffer of each element of the memory array communicating with the horizontal buffer of another element of the memory array or with at least one neural core. In response to the arrival of the at least one instruction from the instruction buffer, a data block is provided from the memory to the horizontal buffer of the at least one element of the plurality of elements. The data block is provided from the horizontal buffer of the at least one element of the plurality of elements to the horizontal buffer of another element of the memory array or to the at least one neural core.
[0016] Preferably, the present invention provides a neural inference chip, wherein: each instruction buffer is adapted to advance instructions between positions in the instruction buffer at a rate of one position per cycle, and the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to at least one neural core at a rate of one data block per cycle.
[0017] Preferably, the present invention provides a neural inference chip comprising an array of neural cores, the array of neural cores including the at least one neural core and having multiple rows.
[0018] Preferably, the present invention provides a neural inference chip in which the memory array is two-dimensional and multiple elements of the memory array are arranged in multiple rows and columns.
[0019] Preferably, the present invention provides a neural inference chip, wherein each element of the memory array further includes a vertical buffer, and the vertical buffer of each element of the memory array communicates with the vertical buffer of another element of the memory array.
[0020] Preferably, the present invention provides a neural inference chip, wherein: each of a plurality of elements of a memory array is adapted to provide a data block from its memory to its vertical buffer in response to the arrival of an associated instruction from an instruction buffer; each of a plurality of elements of the memory array is adapted to provide a data block from its vertical buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide a data block to the vertical buffer of another element of the memory array.
[0021] Preferably, the present invention provides a neural inference chip, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to at least one neural core at a rate of one data block per cycle, and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle.
[0022] Preferably, the present invention provides a neural inference chip, wherein each element of the memory array further includes a layover buffer, the layover buffer of each element of the memory array communicating with the horizontal buffer and the vertical buffer of that element of the memory array.
[0023] Preferably, the present invention provides a neural inference chip, wherein: each of a plurality of elements of a memory array is adapted to provide a data block from its memory to its vertical buffer in response to the arrival of an associated instruction from an instruction buffer; each of a plurality of elements of the memory array is adapted to provide a data block from its vertical buffer to its layover buffer; each of a plurality of elements of the memory array is adapted to provide a data block from its layover buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide a data block to the vertical buffer of another element of the memory array.
[0024] Preferably, the present invention provides a neural inference chip, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to at least one neural core at a rate of one data block per cycle; the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle; and the layover buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of that element of the memory array at a rate of one data block per cycle.
[0025] Preferably, the present invention provides a neural inference chip, wherein: the instruction memory is adapted to provide multiple instructions to the instruction buffer per cycle, each location of the instruction buffer is adapted to store multiple instructions, and the instruction buffer is adapted to advance multiple instructions between locations in the instruction buffer per cycle.
[0026] In another aspect, the present invention provides a method comprising: providing at least one instruction from an instruction memory to an instruction buffer; advancing the at least one instruction between locations in the instruction buffer; providing the at least one instruction from the instruction buffer to at least one element of a plurality of elements of a memory array when the memory of that element contains data associated with the at least one instruction, the memory array comprising the plurality of elements, each element comprising a memory and a horizontal buffer, the horizontal buffer of each element of the memory array communicating with the horizontal buffer of another element of the memory array or with at least one neural core; providing a data block from the memory to the horizontal buffer of the at least one element of the plurality of elements in response to the arrival of the at least one instruction from the instruction buffer; and providing the data block from the horizontal buffer of the at least one element of the plurality of elements to the horizontal buffer of another element of the memory array or to the at least one neural core.
Attached Figure Description
[0027] Figure 1 shows a neural core according to an embodiment of the present disclosure.
[0028] Figure 2 shows an exemplary inference processing unit (IPU) according to an embodiment of the present disclosure.
[0029] Figure 3 shows a multi-core inference processing unit (IPU) according to an embodiment of the present disclosure.
[0030] Figure 4 shows a neural core and associated networks according to embodiments of the present disclosure.
[0031] Figure 5 is a schematic diagram of data distribution from a global memory array according to an embodiment of the present disclosure.
[0032] Figure 6 shows an exemplary memory controller with a linear weight memory array according to an embodiment of the present disclosure.
[0033] Figure 7 illustrates a method of memory distribution using the controller of Figure 6 according to embodiments of this disclosure.
[0034] Figure 8 shows an exemplary memory controller with a two-dimensional weight memory array according to an embodiment of the present disclosure.
[0035] Figure 9 illustrates a method of memory distribution using the controller of Figure 8 according to embodiments of this disclosure.
[0036] Figure 10 shows an exemplary memory controller with a two-dimensional weight memory array and a layover buffer according to an embodiment of the present disclosure.
[0037] Figure 11 shows an exemplary configuration of a plurality of memory controllers including a two-dimensional weight memory array according to an embodiment of the present disclosure.
[0038] Figures 12A-12I illustrate the distribution of instructions and sequential data using a linear weight memory array according to embodiments of the present disclosure.
[0039] Figures 13A-13I illustrate the distribution of instructions and randomly accessed data using a linear weight memory array according to an embodiment of the present disclosure.
[0040] Figures 14A-14M illustrate the distribution of instructions and sequential data using a two-dimensional weight memory array according to an embodiment of the present disclosure.
[0041] Figures 15A-15M illustrate the distribution of instructions and randomly accessed data using a two-dimensional weight memory array according to an embodiment of the present disclosure.
[0042] Figures 16A-16K illustrate the distribution of instructions and data using a two-dimensional memory array and a layover buffer according to an embodiment of the present disclosure.
[0043] Figure 17 shows a method for computing neural activations according to embodiments of the present disclosure.
[0044] Figure 18 depicts a computing node according to an embodiment of the present disclosure.
Detailed Implementation
[0045] An artificial neuron is a mathematical function whose output is a nonlinear function of a linear combination of its inputs. Two neurons are connected if the output of one is an input of the other. A weight is a scalar value that encodes the strength of the connection between the output of one neuron and the input of another neuron.
[0046] A neuron computes its output by applying a nonlinear activation function to a weighted sum of its inputs; this is called activation. The weighted sum is an intermediate result computed by multiplying each input by its corresponding weight and summing the products. A partial sum is a weighted sum of a subset of the inputs. The weighted sum of all inputs can be computed in stages by accumulating one or more partial sums.
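The relationship between a weighted sum, its partial sums, and the resulting activation can be shown in a few lines. This is a minimal sketch for illustration only: the function names are hypothetical, and `math.tanh` is an arbitrary stand-in for the nonlinear activation function, which the text does not specify.

```python
import math


def weighted_sum(inputs, weights):
    """Multiply each input by its corresponding weight and sum the products."""
    return sum(x * w for x, w in zip(inputs, weights))


def weighted_sum_in_stages(inputs, weights, chunk):
    """Compute the same weighted sum by accumulating partial sums,
    each taken over a subset of `chunk` consecutive inputs."""
    total = 0.0
    for start in range(0, len(inputs), chunk):
        partial = weighted_sum(inputs[start:start + chunk],
                               weights[start:start + chunk])
        total += partial
    return total


def activation(inputs, weights, f=math.tanh):
    """Apply a nonlinear function f (tanh is an assumed example) to the weighted sum."""
    return f(weighted_sum(inputs, weights))


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0, 0.25, 2.0]
full = weighted_sum(x, w)                        # all inputs at once
staged = weighted_sum_in_stages(x, w, chunk=2)   # two partial sums, then accumulated
```

Here the two partial sums are -1.5 and 8.75, and accumulating them gives the same value, 7.25, as the single-pass weighted sum.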
[0047] A neural network is a collection of one or more neurons. Neural networks are typically divided into groups of neurons called layers. A layer is a collection of one or more neurons that all receive input from the same layers and all send outputs to the same layers, and typically perform similar functions. The input layer is the layer that receives input from sources outside the neural network. The output layer is the layer that sends outputs to targets outside the neural network. All other layers are intermediate processing layers. A multilayer neural network is a neural network with more than one layer. A deep neural network is a multilayer neural network with many layers.
[0048] A tensor is a multidimensional array of numerical values. A tensor block is a contiguous subarray of the elements in a tensor.
[0049] Each neural network layer is associated with a parameter tensor V, a weight tensor W, an input data tensor X, an output data tensor Y, and an intermediate data tensor Z. The parameter tensor contains all the parameters that control the neuron activation functions σ in the layer. The weight tensor contains all the weights that connect the inputs to the layer. The input data tensor contains all the data consumed by the layer as input. The output data tensor contains all the data computed by the layer as output. The intermediate data tensor contains any data produced by the layer as intermediate computations, such as partial sums.
[0050] The data tensors of a layer (input, output, and intermediate) can be three-dimensional, where the first two dimensions can be interpreted as encoding spatial location, and the third dimension as encoding different features. For example, when a data tensor represents a color image, the first two dimensions encode the vertical and horizontal coordinates within the image, and the third dimension encodes the color at each location. Each element of the input data tensor X can be connected to each neuron via an individual weight, so the weight tensor W typically has six dimensions, connecting the three dimensions of the input data tensor (input row a, input column b, input feature c) with the three dimensions of the output data tensor (output row i, output column j, output feature k). The intermediate data tensor Z has the same shape as the output data tensor Y. The parameter tensor V combines the three dimensions of the output data tensor with an additional dimension o that indexes the parameters of the activation function σ. In some embodiments, the activation function σ needs no additional parameters, in which case the additional dimension is unnecessary. However, in some embodiments, the activation function σ requires at least one additional parameter, which appears in dimension o.
[0051] The elements of the layer's output data tensor Y can be calculated as shown in Equation 1, where the neuron activation function σ is configured by the vector of activation function parameters V[i,j,k,:], and the weighted sum Z[i,j,k] can be calculated as in Equation 2.
[0052] $$Y[i,j,k] = \sigma\big(V[i,j,k,:];\ Z[i,j,k]\big)$$
[0053] Equation 1
[0054] $$Z[i,j,k] = \sum_{a}\sum_{b}\sum_{c} W[i,j,k,a,b,c] \cdot X[a,b,c]$$
[0055] Equation 2
[0056] To simplify the notation, the weighted sum in Equation 2 can be referred to as the output, which is equivalent to using a linear activation function. It should be understood that the same statements apply, without loss of generality, when a different activation function is used.
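The layer computation described above (a weighted sum over input row a, input column b, and input feature c for every output position i, j, k, followed by an activation function configured by V[i,j,k,:]) can be written out directly with nested lists. This is an illustrative sketch only: the function names, the tiny tensor shapes, and the `leaky_relu` activation with a single parameter are assumptions chosen for the example, not taken from the disclosure.

```python
def weighted_sums(X, W):
    """Z[i][j][k] = sum over a, b, c of W[i][j][k][a][b][c] * X[a][b][c]."""
    Z = []
    for W_i in W:
        Z_i = []
        for W_ij in W_i:
            Z_ij = []
            for W_ijk in W_ij:
                z = 0.0
                for a, X_a in enumerate(X):
                    for b, X_ab in enumerate(X_a):
                        for c, x in enumerate(X_ab):
                            z += W_ijk[a][b][c] * x
                Z_ij.append(z)
            Z_i.append(Z_ij)
        Z.append(Z_i)
    return Z


def layer_outputs(V, Z, sigma):
    """Y[i][j][k] = sigma(V[i][j][k][:], Z[i][j][k])."""
    return [[[sigma(V[i][j][k], z) for k, z in enumerate(Z_ij)]
             for j, Z_ij in enumerate(Z_i)]
            for i, Z_i in enumerate(Z)]


def leaky_relu(params, z):
    """Hypothetical activation with one parameter: the negative-side slope."""
    return z if z >= 0 else params[0] * z


# Input tensor with 2 rows, 1 column, 2 features; output tensor with 1 x 1 x 2 elements.
X = [[[1.0, 2.0]], [[3.0, 4.0]]]
W = [[[
    [[[1.0, 0.0]], [[0.0, 1.0]]],    # weights feeding output feature k = 0
    [[[-1.0, -1.0]], [[0.0, 0.0]]],  # weights feeding output feature k = 1
]]]
V = [[[[0.5], [0.5]]]]               # one activation parameter per output element
Z = weighted_sums(X, W)
Y = layer_outputs(V, Z, leaky_relu)
```

With these values the weighted sums are 5.0 and -3.0, and the parameterized activation turns them into 5.0 and -1.5. Passing an identity function as `sigma` reproduces the simplification in which the weighted sum itself is treated as the output.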
[0057] In various embodiments, the computation of the output data tensor as described above is decomposed into smaller problems. Each problem can then be solved in parallel on one or more neural cores, or on one or more cores of a conventional multi-core system.
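One way to picture this decomposition is to split the output neurons into blocks and hand each block to a separate worker. The sketch below uses a thread pool from the Python standard library purely as a stand-in for multiple cores; the function names, the block size, and the choice to split along the output dimension are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_block(x, weight_columns):
    """One sub-problem: the weighted sums for a block of output neurons.
    Each entry of weight_columns holds the weights of one output neuron."""
    return [sum(xi * wi for xi, wi in zip(x, column)) for column in weight_columns]


def compute_layer_in_blocks(x, weight_columns, block_size, workers=2):
    """Split the output neurons into blocks, solve the blocks concurrently,
    and concatenate the block results in their original order."""
    blocks = [weight_columns[i:i + block_size]
              for i in range(0, len(weight_columns), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda block: compute_block(x, block), blocks))
    return [z for block_result in results for z in block_result]


x = [1.0, 2.0]
weight_columns = [[1.0, 0.5], [0.0, 1.0], [-1.0, 1.0], [2.0, 2.0]]
whole = compute_block(x, weight_columns)               # solved as one problem
tiled = compute_layer_in_blocks(x, weight_columns, 2)  # solved as two blocks
```

Because each output neuron depends only on the shared inputs and its own weights, the blocks are independent and the concatenated result matches the undivided computation.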
[0058] As can be clearly seen from the above, neural networks are parallel structures. Neurons in a given layer receive inputs X, with elements $x_i$, from one or more layers or from other inputs. Each neuron computes its state $y \in Y$ based on its inputs and weights W, with elements $w_i$. In various embodiments, the weighted sum of the inputs is adjusted by a bias b, and the result is then passed to a nonlinearity $F(\cdot)$. For example, the activation of a single neuron can be expressed as $y = F\left(b + \sum_i x_i w_i\right)$.
[0059] Because all neurons in a given layer receive input from the same layers and compute their outputs independently, neuron activations can be computed in parallel. Because of this aspect of the overall neural network, performing computation in parallel distributed cores accelerates the overall computation. Furthermore, within each core, vector operations can be computed in parallel. Even with recurrent inputs, such as when a layer projects back to itself, all neurons are still updated simultaneously. In effect, recurrent connections are delayed to align with subsequent inputs to that layer.
[0060] Referring now to Figure 1, a neural core according to an embodiment of the present disclosure is depicted. A neural core 100 is a tileable computational unit that computes one block of an output tensor. The neural core 100 has M inputs and N outputs. In various embodiments, M = N. To compute the output tensor block, the neural core multiplies an M×1 input tensor block 101 with an M×N weight tensor block 102 and accumulates the products into weighted sums, which are stored in a 1×N intermediate tensor block 103. An O×N parameter tensor block contains the O parameters that specify each of the N neuron activation functions, which are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.
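The data flow through a single core described above can be sketched as a small class that holds the 1×N intermediate block, accumulates the product of an input block and a weight block into it, and then applies one parameterized activation function per output. This is a hedged illustration: the class name `NeuralCoreSketch`, its method names, and the `scaled_relu` activation with O = 2 parameters (a scale and an offset) are hypothetical and not part of the disclosure.

```python
class NeuralCoreSketch:
    """Toy model of one core with M inputs and N outputs (names are illustrative)."""

    def __init__(self, n_outputs):
        # 1 x N intermediate block holding the running weighted sums.
        self.z = [0.0] * n_outputs

    def accumulate(self, x_block, w_block):
        """Multiply an M-element input block by an M x N weight block and
        accumulate the products into the intermediate block."""
        for m, x in enumerate(x_block):
            for n, w in enumerate(w_block[m]):
                self.z[n] += x * w

    def activate(self, params, sigma):
        """Apply one activation per output; params is O x N, so column n
        holds the O parameters of the n-th activation function."""
        return [sigma([row[n] for row in params], z) for n, z in enumerate(self.z)]


def scaled_relu(p, z):
    """Assumed two-parameter activation: max(0, scale * z + offset)."""
    return max(0.0, p[0] * z + p[1])


core = NeuralCoreSketch(n_outputs=3)
core.accumulate([1.0, 2.0], [[1.0, 0.0, -1.0],
                             [0.5, 1.0, 1.0]])   # M = 2, N = 3
params = [[1.0, 1.0, 2.0],    # scales, one per output
          [0.0, -3.0, 0.5]]   # offsets, one per output
y = core.activate(params, scaled_relu)
```

Calling `accumulate` more than once before `activate` adds further partial sums into the same intermediate block, which is how a weighted sum over more than M inputs could be built up in stages in this toy model.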
[0061] Multiple neural cores can be tiled in a neural core array. In some embodiments, the array is two-dimensional.
[0062] A neural network model is a set of constants that collectively specify the entire computation performed by the neural network, including the connection graph between neurons and the weight parameters and activation function parameters for each neuron. Training is the process of modifying the neural network model to perform the desired function. Inference is the process of applying the neural network to inputs to produce outputs without modifying the neural network model.
[0063] An inference processing unit is a type of processor that performs neural network inference. A neural inference chip is a specific physical instance of an inference processing unit.
[0064] Referring to Figure 2, an exemplary inference processing unit (IPU) is illustrated according to embodiments of the present disclosure. IPU 200 includes a memory 201 for a neural network model. As described above, the neural network model may include synaptic weights of the neural network to be computed. IPU 200 includes an activation memory 202, which may be temporary. The activation memory 202 may be divided into input and output regions and stores neuron activations for processing. IPU 200 includes a neural computation unit 203 loaded with the neural network model from model memory 201. Input activations are provided from activation memory 202 prior to each computation step. Outputs from neural computation unit 203 are written back to activation memory 202 for processing on the same or another neural computation unit.
[0065] In various embodiments, microengine 204 is included in IPU 200. In such embodiments, all operations in the IPU are directed by the microengine. As described below, central and / or distributed microengines may be provided in various embodiments. A global microengine may be referred to as a chip microengine, while a local microengine may be referred to as a core microengine or a local controller. In various embodiments, a microengine includes one or more microengines, microcontrollers, state machines, CPUs, or other controllers.
[0066] Referring to Figure 3, a multi-core inference processing unit (IPU) is illustrated according to an embodiment of the present disclosure. The IPU 300 includes a memory 301 for a neural network model and instructions. In some embodiments, the memory 301 is divided into a weight portion 311 and an instruction portion 312. As described above, the neural network model may include synaptic weights of the neural network to be computed. The IPU 300 includes an activation memory 302, which may be temporary. The activation memory 302 may be divided into an input region and an output region, and stores neuron activations for processing.
[0067] The IPU 300 includes an array 306 of neural cores 303. Each core 303 includes a computation unit 333 loaded with a neural network model from model memory 301 and operable to perform vector computations. Each core also includes a local activation memory 332. Input activation is provided from the local activation memory 332 before each computation step. Outputs from the computation unit 333 are written back to the activation memory 332 for processing on the same or another computation unit.
[0068] The IPU 300 includes one or more networks-on-chip (NoCs) 305. In some embodiments, a partial sum NoC 351 interconnects the cores 303 and transports partial sums among them. In some embodiments, a separate parameter distribution NoC 352 connects the cores 303 to memory 301 for distributing weights and instructions to the cores 303. It will be understood that various configurations of NoCs 351 and 352 are suitable for use according to this disclosure. For example, broadcast networks, row broadcast networks, tree networks, and switched networks may be used.
[0069] In various embodiments, a global microengine 304 is included in the IPU 300. In various embodiments, a local core controller 334 is included on each core 303. In such embodiments, the global microengine (chip microengine) and the local core controller (core microengine) cooperate to direct operations. Specifically, at 361, compute instructions are loaded from the instruction portion 312 of the model memory 301 to the core controller 334 on each core 303 by the global microengine 304. At 362, parameters (e.g., neural network/synaptic weights) are loaded by the global microengine 304 from the weight portion 311 of the model memory 301 to the neural computation unit 333 on each core 303. At 363, neural network activation data is loaded from the local activation memory 332 to the neural computation unit 333 on each core 303 by the local core controller 334. As described above, the activations are provided to the neurons of the particular neural network defined by the model, and may originate from the same or another neural computation unit, or from outside the system. At 364, the neural computation unit 333 performs the computation to generate output neuron activations as directed by the local core controller 334. Specifically, the computation includes applying the input synaptic weights to the input activations. It should be understood that various methods can be used to perform this computation, including in silico dendrites and vector multiplication units. At 365, the computation results are stored in the local activation memory 332 as directed by the local core controller 334. As described above, these stages can be pipelined to make efficient use of the neural computation unit on each core. It should also be understood that, depending on the requirements of a given neural network, inputs and outputs can be transferred from the local activation memory 332 to the global activation memory 302.
[0070] Therefore, this disclosure provides runtime control of operations within an inference processing unit (IPU). In some embodiments, the microengine is centralized (a single microengine). In some embodiments, IPU computation is distributed (performed by a core array). In some embodiments, runtime control of operations is hierarchical—both the central microengine and the distributed microengines are involved.
[0071] A microengine or multiple microengines direct the execution of all operations within the IPU. Each microengine instruction corresponds to several sub-operations (e.g., address generation, load, compute, store, etc.). Core microcode runs on a core microengine (e.g., 334). In the case of local computation, core microcode includes instructions to perform a complete single tensor operation, such as a convolution between a weight tensor and a data tensor. In the case of distributed computation, core microcode includes instructions to perform a single tensor operation on a locally stored subset of the data tensor (and partial sums). Chip microcode runs on a chip microengine (e.g., 304). Chip microcode includes instructions to perform all tensor operations within a neural network.
[0072] Referring now to Figure 4, an exemplary neural core and associated networks are illustrated according to embodiments of this disclosure. Core 401, which may be embodied as described with reference to Figure 1, is interconnected with additional cores via networks 402…404. In this embodiment, network 402 is responsible for distributing weights and / or instructions, network 403 is responsible for distributing partial sums, and network 404 is responsible for distributing activations. However, it will be understood that various embodiments of this disclosure may combine these networks or further separate them into multiple additional networks.
[0073] Input activations (X) are distributed to core 401 from outside the core via activation network 404 and are written to activation memory 405. Layer instructions are distributed to core 401 from outside the core via weight / instruction network 402 and are written to instruction memory 406. Layer weights (W) and / or parameters are distributed to core 401 from outside the core via weight / instruction network 402 and are written to weight memory 407 and / or parameter memory 408.
[0074] The vector-matrix multiplication (VMM) unit 409 reads the weight matrix (W) from the weight memory 407 and the activation vector (V) from the activation memory 405. The VMM unit 409 then computes the vector-matrix multiplication and provides the result to vector-vector unit 410. Vector-vector unit 410 reads additional partial sums from partial sum memory 411 and receives additional partial sums from outside the core via partial sum network 403. Vector-vector unit 410 computes a vector-vector operation on these source partial sums. For example, the partial sums can be summed sequentially. The resulting target partial sum is written to partial sum memory 411, transmitted outside the core via partial sum network 403, and / or fed back to vector-vector unit 410 for further processing.
[0075] After all computations for the inputs of a given layer are completed, the partial sum results from vector-vector unit 410 are provided to activation unit 412 for computation of output activations. The activation vector (Y) is written to activation memory 405. Layer activations (including the results written to activation memory) are redistributed across cores from activation memory 405 via activation network 404. Upon receipt, they are written to the local activation memory of each receiving core. Upon completion of processing for a given frame, output activations are read from activation memory 405 and transmitted outside the core via network 404.
[0076] Therefore, in operation, the core control microengine (e.g., 413) coordinates the core's data movement and computation. The microengine issues a read activation memory address operation to load an input activation block into the vector-matrix multiplication unit. The microengine issues a read weight memory address operation to load a weight block into the vector-matrix multiplication unit. The microengine issues a compute operation to the vector-matrix multiplication unit, causing it to compute a partial sum block.
[0077] The microengine issues one or more of a partial sum read / write memory address operation, a vector compute operation, or a partial sum communication operation in order to do one or more of the following: read partial sum data from a partial sum source; compute using the partial sum arithmetic unit; or write partial sum data to a partial sum target. Writing partial sum data to a partial sum target may include communicating outside the core via the partial sum network interface, or sending the partial sum data to the activation arithmetic unit.
[0078] The microengine issues an activation function compute operation, causing the activation function arithmetic unit to compute an output activation block. The microengine then issues a write activation memory address operation, and the output activation block is written to the activation memory via the activation memory interface.
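The dataflow through a single core described in paragraphs [0074] and [0075] can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the use of plain Python lists, element-wise addition as the vector-vector operation, and ReLU as the activation function are all assumptions, not details taken from this disclosure.

```python
def vmm(x, w):
    """Vector-matrix multiply: x has n entries, w has n rows and m columns."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def accumulate(a, b):
    """Assumed vector-vector operation: element-wise partial sum addition."""
    return [p + q for p, q in zip(a, b)]


def relu(z):
    """Assumed activation function; the disclosure does not fix one."""
    return [max(0, v) for v in z]


def core_step(x, w, incoming_partial_sum):
    """One pass through the units of a core (hypothetical helper)."""
    z = vmm(x, w)                              # role of VMM unit 409
    acc = accumulate(z, incoming_partial_sum)  # role of vector-vector unit 410
    return relu(acc)                           # role of activation unit 412
```

In a real core the accumulated partial sum could also be written back to partial sum memory or sent over the partial sum network instead of going straight to the activation unit; the sketch follows only the last of those paths.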
[0079] Therefore, various sources, targets, address types, computation types, and control components are defined for a given core.
[0080] The sources of the vector-vector unit 410 include the vector-matrix multiplication (VMM) unit 409, constants from the parameter memory 408, the partial sum memory 411, the partial sum result (TGT partial sum) from the previous cycle, and the partial sum network 403.
[0081] The targets of the vector-vector unit 410 include partial sum memory 411, the partial sum input of subsequent cycles (SRC partial sum), activation unit 412, and partial sum network 403.
[0082] Therefore, a given instruction can read from or write to the activation memory 405, read from the weight memory 407, or read from or write to the partial sum memory 411. The computational operations performed by the core include vector-matrix multiplication in the VMM unit 409, vector (partial sum) operations in the vector-vector unit 410, and activation functions in the activation unit 412.
[0083] Control operations include updating the program counter and the loop and / or sequence counter.
[0084] Therefore, memory operations are issued to read weights from addresses in weight memory, read parameters from addresses in parameter memory, read activations from addresses in activation memory, and read / write partial sums at addresses in partial sum memory. Computation operations are issued to perform vector-matrix multiplication, vector operations, and activation functions. Communication operations are issued to select vector-vector operands, route messages on the partial sum network, and select partial sum targets. Loops over layer outputs and loops over layer inputs are controlled by control operations that specify the program counter, loop counter, and sequence counter in the microengine.
[0085] Referring now to Figure 5, a schematic diagram of data distribution from a global memory array is provided according to embodiments of the present disclosure. The global memory array 501 includes a plurality of elements 502, each element 502 including a memory element 504 and a buffer 503. Weights and instructions are provided from the global memory array 501 to the array 505 of cores 506 via a network 507. An exemplary configuration of cores 303 is discussed above with regard to Figure 3. The cores 303 of Figure 3 can be implemented as described in conjunction with global memory array 501, where core array 306 corresponds to array 505.
[0086] As described above, multi-core architectures for neural inference offer significant advantages in terms of computational power. However, if neural network weights and parameters are not provided to the computational core in a timely manner, the core cannot perform any useful computations. Consequently, the performance of a neural chip can be limited by its ability to deliver neural network weights and parameters to the computational core on the chip. On-chip memory significantly improves memory bandwidth compared to typical off-chip memory such as Dynamic Random Access Memory (DRAM) or High Bandwidth Memory (HBM). Furthermore, on-chip memory is more energy-efficient than off-chip memory, resulting in more energy-efficient neural inference systems. In various embodiments, on-chip memory may include Static Random Access Memory (SRAM) or other embedded memory. However, delivering neural network weights to the core at a rate commensurate with processing speed remains a challenge.
[0087] Specific efficiencies can be achieved with convolutional neural networks (CNNs). In CNNs, the same weight matrix (sometimes called a convolutional filter) is reused. To minimize the amount of on-chip memory used, it is preferable to store a given weight matrix in one location without duplication. To store large neural networks, some embodiments of on-chip memory consist of a collection of many memory elements. It will also be understood that many cores are the targets of the weights held in memory. This leads to a many-to-many communication problem (many memory elements to many cores). Broadcasting weights can congest the network-on-chip (NoC) and can generate many collisions and pipeline stalls, degrading broadcast bandwidth.
[0088] As described above, in various embodiments of the neural inference chip, a grid of neural inference cores is provided to accelerate neural network inference. In various embodiments, instruction pre-scheduling is provided. Neural network evaluation involves regular computational patterns, so instructions can be pre-scheduled to achieve high performance without any stalls. However, this requires that all neural network weights are delivered to the cores in a timely manner according to the pre-scheduled pattern. If the weight delivery network becomes congested and weight delivery stalls, the pre-scheduling of the neural network evaluation fails.
[0089] This disclosure provides a stall-free weight delivery network-on-chip for transferring weight parameters from a grid of memory elements to a grid of compute cores. A one-dimensional scheme is shown first, followed by its extension to a two-dimensional grid scheme. These methods are further extended to support various weight distributions, such as striping (where different rows of cores receive different weights).
[0090] Even with variations in the timing of instruction transfer to the memory elements and of data transfer from the memory elements to the compute cores, the methods described herein operate without conflicts. These methods allow an instruction stream to address any column in any order. This scheme eliminates the constraint that all columns must start simultaneously.
[0091] Referring now to Figure 6, an exemplary memory controller with a linear weight memory array is shown according to embodiments of the present disclosure. The memory controller 601 includes an instruction memory, shown herein as having four instruction slots 611…614. The weight memory array 602 includes a plurality of elements 621…624, each element including a data buffer 625 and a memory 626. The instruction buffer 603 includes a plurality of elements 631…634, each element corresponding to one of the weight memory elements 621…624. As described above with regard to Figure 5, the core grid 604 comprises multiple cores 641.
[0092] Referring to Figure 7, a memory distribution method using the controller of Figure 6 is shown. At 701, an instruction is issued from the instruction memory. The instruction travels along the instruction buffer 603 between elements 631…634. When the instruction reaches the column storing the appropriate data (e.g., memory element 621), the data is read from memory (memory 626 in this example) into a data buffer (buffer 625 in this example). Once read, the data propagates along the data buffers, for example from memory element 621 to 622…624. After reaching the final element 624, the value is passed to the core grid 604.
[0093] It should be understood that sequential instructions can be issued from memory controller 601, for example, one per cycle. The sum of the number of cycles each instruction travels along instruction buffer 603 and the number of cycles its data travels along the data buffers (in memory elements 621…624) is constant, regardless of the data's location. In particular, the total latency of instruction distribution plus the total latency of data distribution is constant. This remains true even in the case of random access.
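The constant-latency property of the linear array can be checked with a small cycle-level simulation. The sketch below is illustrative only and rests on stated assumptions: `simulate_linear_array` is a hypothetical helper, column 0 is taken to be the column nearest the instruction memory and the last column the one nearest the core grid, instructions and data each advance one slot per cycle, and a memory read takes one cycle.

```python
def simulate_linear_array(issue_columns, num_columns):
    """Simulate one instruction issued per cycle to a linear memory array.

    issue_columns[i] is the column read by the instruction issued in cycle i.
    Returns (delivery_cycle, instruction_index) pairs in issue order and
    raises RuntimeError if two data blocks ever share a data-buffer slot.
    """
    occupied = {}    # (cycle, data-buffer slot) -> instruction index
    deliveries = []
    for issue_cycle, column in enumerate(issue_columns):
        if not 0 <= column < num_columns:
            raise ValueError("column out of range")
        # The instruction sits in slot 0 in its issue cycle, advances one slot
        # per cycle, and the memory read takes one further cycle (assumed).
        read_cycle = issue_cycle + column + 1
        # The data block then advances one data-buffer slot per cycle.
        for hops, slot in enumerate(range(column, num_columns)):
            key = (read_cycle + hops, slot)
            if key in occupied:
                raise RuntimeError(f"collision in slot {slot} at cycle {key[0]}")
            occupied[key] = issue_cycle
        # After the last slot the block is handed to the core grid.
        deliveries.append((read_cycle + num_columns - column, issue_cycle))
    return deliveries
```

Under these assumptions every block is delivered `num_columns + 1` cycles after its instruction issues, whatever column it came from, so blocks issued in different cycles can never meet in the same buffer slot and they reach the core grid in issue order.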
[0094] Referring to Figure 8, an exemplary memory controller with a two-dimensional weight memory array is shown according to embodiments of the present disclosure. The memory controller 801 includes an instruction memory, shown herein as having four instruction slots 811…814. The weight memory array 802 includes multiple elements arranged in rows and columns 821…824, each element including a data buffer 825 and a memory 826. In addition to the data buffer 825, the two-dimensional case also includes a second buffer 827 to accommodate communication of data within a column. The instruction buffer 803 includes multiple elements 831…834, each element corresponding to one of the columns 821…824 of weight memory elements. As described above with regard to Figure 5, the core grid 804 includes multiple cores 841.
[0095] Referring to Figure 9, a memory distribution method using the controller of Figure 8 is shown. At 901, an instruction is issued from the instruction memory. At 902, the instruction advances along the instruction buffer 803 between elements 831…834. At 903, when the instruction reaches a column (e.g., column 821) storing the appropriate data, the data is read from memory (memory 826 in this example) into a data buffer (data buffer 825 in this example). At 904, once read, the data propagates vertically along the buffer (e.g., down column 821). At 905, the data propagates along each row, e.g., from column 821 to 822…824. At 906, after reaching the last column 824, the value is delivered to the core grid 804.
[0096] As in the one-dimensional case, the total latency across the steps of instruction dispatch, vertical dispatch, and horizontal dispatch is constant. Specifically, the sum of the latency for instruction dispatch, the latency for vertical dispatch, and the latency for horizontal dispatch is constant. Furthermore, it will be understood that when multiple columns are accessed out of order, the latencies of instruction and data delivery still match one another.
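The latency budget of the two-dimensional case can be written down as a small accounting function. This is a sketch under assumptions, not the disclosed implementation: `latency_breakdown` is a hypothetical name, column 0 is taken to be nearest the instruction memory, each hop and the memory read cost one cycle, and the vertical stage is padded with idle cycles up to a fixed `vertical_budget` so that its length does not depend on where the data sits in the column.

```python
def latency_breakdown(column, vertical_hops, num_columns, vertical_budget):
    """Split the issue-to-delivery latency of one instruction into stages."""
    if not 0 <= column < num_columns:
        raise ValueError("column out of range")
    if not 0 <= vertical_hops <= vertical_budget:
        raise ValueError("vertical hops must fit inside the vertical budget")
    stages = {
        "instruction": column + 1,                # instruction buffer hops plus read
        "vertical": vertical_hops,                # hops actually needed in the column
        "wait": vertical_budget - vertical_hops,  # padding up to the fixed budget
        "horizontal": num_columns - column,       # hops along the row to the cores
    }
    stages["total"] = sum(stages.values())
    return stages
```

The column term cancels between the instruction and horizontal stages, and the padding cancels the variation in the vertical stage, so the total is `num_columns + 1 + vertical_budget` for every instruction.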
[0097] Referring to Figure 10, an exemplary memory controller with a two-dimensional weight memory array and staging buffers is shown according to an embodiment of the present disclosure. The memory controller 1001 includes an instruction memory, shown herein as having four instruction slots 1011…1014. These instruction slots are arranged in multiple columns to allow multiple instructions to be issued simultaneously in one cycle. In this example, slots 1011 and 1013 store instructions to be issued in a first cycle, while slots 1012 and 1014 store instructions for a second cycle. The weight memory array 1002 includes multiple elements arranged in rows and columns 1021…1024, each element including a horizontal buffer 1025 and a memory element 1026. In addition to the horizontal buffer 1025, the two-dimensional case also includes a second (vertical) buffer 1027 to accommodate communication of data within a column. In this example, a staging buffer 1028 is also included between the vertical buffer 1027 and the horizontal buffer 1025. Instruction buffer 1003 includes multiple elements 1031…1034, each element corresponding to one of the columns 1021…1024 of weight memory elements. Each element 1031…1034 can store multiple instructions issued during the same cycle. As described above with regard to Figure 5, the core grid 1004 includes multiple cores 1041.
[0098] In this exemplary embodiment, striping is supported. Specifically, multiple data items can be read from the same column. A staging buffer 1028 is added to support reading multiple data items and striping the data. Sending different data to different rows is useful in various situations: it effectively multiplies the bandwidth that cooperating neural inference cores receive from the memory array.
[0099] The total time for instruction dispatch, vertical dispatch, waiting in the staging buffer, and horizontal dispatch is constant, regardless of where the source data is stored or which row the data is dispatched to. The maximum number of cycles needed to distribute data through the vertical buffer is determined and used as the combined time for vertical dispatch and waiting in the staging buffer. This ensures that all data flows from the staging buffer to the horizontal buffer simultaneously. A counter can be assigned to each vertical dispatch group, counting down each clock cycle. This is one way to ensure that all data is transferred from the staging buffer to the horizontal buffer within the same cycle.
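The countdown counter mentioned above can be sketched as follows. The sketch is illustrative and makes several assumptions that are not in this disclosure: `staging_schedule` is a hypothetical helper, data moves one row per cycle away from the source row in both directions, and the counter is loaded with a fixed `vertical_budget` in the cycle the data is read.

```python
def staging_schedule(read_cycle, source_row, num_rows, vertical_budget):
    """Return (release_cycle, {row: (arrival_cycle, cycles_waited)})."""
    farthest = max(source_row, num_rows - 1 - source_row)
    if farthest > vertical_budget:
        raise ValueError("vertical budget is shorter than the longest vertical path")
    # Countdown counter for this vertical dispatch group: one decrement per cycle.
    counter, cycle = vertical_budget, read_cycle
    while counter > 0:
        counter -= 1
        cycle += 1
    release_cycle = cycle  # the counter has reached zero
    schedule = {}
    for row in range(num_rows):
        # Data reaches a row after as many cycles as rows travelled (assumed),
        # then waits in that row's staging buffer until the counter expires.
        arrival = read_cycle + abs(row - source_row)
        schedule[row] = (arrival, release_cycle - arrival)
    return release_cycle, schedule
```

Rows close to the source wait longer and distant rows wait less, but every row hands its data to the horizontal buffer in the same `release_cycle`.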
[0100] Referring to Figure 11, an exemplary configuration of multiple memory controllers with a two-dimensional weight memory array is shown according to embodiments of the present disclosure. Each memory controller 1101 includes an instruction memory, shown herein as having four instruction slots 1111…1114. The weight memory array 1102 includes multiple elements arranged in rows and columns 1121…1124, each element including a data buffer 1125 and a memory element 1126. In addition to the data buffer 1125, the two-dimensional case also includes a second (vertical) buffer 1127 to accommodate data communication within a column. Each row of the memory array 1102 has a corresponding instruction buffer 1103, which includes multiple elements 1131…1134, each element corresponding to one of the columns 1121…1124 of weight memory elements. As described above with regard to Figure 5, the core grid 1104 includes multiple cores 1141.
[0101] In this example, each row has a separate memory controller, which has an instruction memory and an instruction buffer. Using this approach, each memory element is physically located close to its corresponding instruction buffer. This allows the instruction buffer to control the memory elements without additional pipeline latency.
[0102] Referring to Figures 12A to 12I, the distribution of instructions and data is illustrated according to embodiments of this disclosure. In this example, data is read from left to right, and each figure depicts a successive cycle. Memory control instructions in the instruction memory and the data to be read by those instructions are labeled with the same symbols. For example, a first instruction A0 in the instruction memory will read data A0 stored in the memory element in the leftmost column of the memory array. In Figure 12A, instruction A0 is in the instruction buffer. In Figure 12B, the second instruction is issued, and both A0 and A1 are in the instruction buffer. In Figure 12C, the third instruction is issued, and A0, A1, and A2 are in the instruction buffer. In Figure 12D, the fourth instruction is issued, and A0, A1, A2, and A3 are in the instruction buffer, having arrived at their destination memory array elements. In Figure 12E, instructions A0, A1, A2, and A3 are executed, reading data from their corresponding memories into the data buffers. In Figure 12F, data advances through the data buffer, and data A0 arrives at the core grid. In Figure 12G, data advances through the data buffer, and data A1 arrives at the core grid. In Figure 12H, data advances through the data buffer, and data A2 arrives at the core grid. In Figure 12I, data advances through the data buffer, and data A3 arrives at the core grid.
[0103] Referring to Figures 13A to 13I, the distribution of instructions and data is illustrated according to embodiments of this disclosure. In this example, data is read in a random order, and each figure depicts one cycle. In Figure 13A, instruction A2 is issued to the instruction buffer. In Figure 13B, the second instruction is issued, and both A2 and A1 are in the instruction buffer, with A2 arriving at its destination column. In Figure 13C, the third instruction A3 is issued, instruction A1 advances, and instruction A2 is executed, reading data into the data buffer. A1 and A3 are in the instruction buffer. In Figure 13D, the fourth instruction A0 is issued, instruction A1 advances, data A2 advances through the data buffer, and instruction A3 is executed, reading data into the data buffer. A0 and A1 are in the instruction buffer. In Figure 13E, instruction A0 advances, A1 is executed, reading data from memory, and data A2 and A3 advance through the data buffer. In Figure 13F, data advances through the data buffer, and data A2 arrives at the core grid. In Figure 13G, data advances through the data buffer, and data A1 arrives at the core grid. In Figure 13H, data advances through the data buffer, and data A3 arrives at the core grid. Instruction A0 is executed, reading its data from memory. In Figure 13I, data advances through the data buffer, and data A0 arrives at the core grid. Although the data corresponding to the instructions resides in randomly accessed memory elements, there are no network collisions, because the sum of the number of cycles spent forwarding the instruction and the number of cycles spent forwarding the data is constant. Furthermore, data is delivered to the core grid in the order of the instructions, i.e., A2, A1, A3, and A0.
[0104] Referring to Figures 14A to 14M, the distribution of instructions and data is illustrated according to embodiments of this disclosure. In this example, data is read from a two-dimensional weight memory array, and each figure depicts one cycle. In Figure 14A, instruction A0 is issued to the instruction buffer. In Figure 14B, the second instruction A1 is issued, and A0 and A1 are both in the instruction buffer. In Figure 14C, the third instruction A2 is issued, and instructions A0 and A1 advance. A0, A1, and A2 are in the instruction buffer. In Figure 14D, the fourth instruction A3 is issued, and the previous instructions advance. A0, A1, A2, and A3 are in the instruction buffer. In Figure 14E, all instructions are executed on their corresponding columns, thereby reading data from memory. In Figure 14F, the vertical buffers begin to propagate data along each column. In Figure 14G, vertical data propagation continues. Column A1 has completed its vertical distribution but waits for an additional cycle before starting horizontal propagation, to equalize its delay with the other columns. In Figure 14H, vertical data propagation is complete. In Figure 14I, data is copied to the horizontal buffers. In Figures 14J to 14M, data advances through the horizontal buffers until all data reaches the core array.
[0105] Referring to Figures 15A to 15M, the distribution of instructions and data is illustrated according to embodiments of this disclosure. In this example, data is read from a two-dimensional weight memory array in a random order, and each figure depicts one cycle. In Figure 15A, instruction A2 is issued to the instruction buffer. In Figure 15B, the second instruction A1 is issued, and both A2 and A1 are in the instruction buffer. A2 reaches its target column. In Figure 15C, the third instruction A3 is issued, and instruction A1 advances. A1 and A3 are in the instruction buffer, and the data for A2 is read into the vertical buffer. In Figure 15D, the fourth instruction A0 is issued, and the previous instructions advance. A1 and A0 are in the instruction buffer. Data for A2 is distributed vertically. Data for A3 is read into the vertical buffer. In Figure 15E, A0 advances. Data A1 is read from memory. Data A2 and A3 advance along the vertical buffers. In Figure 15F, the vertical buffers continue to propagate data along each column. A0 advances along the instruction buffer. In Figure 15G, vertical data propagation continues for A1 and A3. Instruction A0 reaches its target column. Its vertical distribution complete, the data of A2 is copied to the horizontal buffer. In Figure 15H, data A0 is read into the corresponding vertical buffer, and vertical propagation continues for A1. Data A2 advances through the horizontal buffer. Data A3 is copied from the vertical buffer to the corresponding horizontal buffer. In Figure 15I, data A2 and A3 advance through the horizontal buffer. Data A1 is copied into the horizontal buffer between data A2 and A3 without conflict. In Figure 15J, the data for the first instruction A2 is delivered to the core grid. Data for A1 and A3 advances through the horizontal buffer, while data for A0 advances through the vertical buffer. In Figure 15K, the data for the second instruction A1 is delivered. The last instruction A0 completes its vertical distribution. In Figure 15L, data A3 is delivered. Data A0 is copied to the horizontal buffer. In Figure 15M, data A0 is delivered to the core array. Although the instructions read data from random locations, the data advances without collisions and is delivered to the core grid in the order of the instructions. To avoid collisions in the vertical buffer, instructions should not read from the same column of memory elements for the duration of vertical data propagation.
[0106] Referring to Figures 16A to 16K, the distribution of instructions and data is illustrated according to embodiments of this disclosure. In this example, data is read from a two-dimensional weight memory array, and each figure depicts one cycle. Instruction pairs are issued within the same cycle. In this example, the first row of memory elements holds data A0, and the second row of memory elements holds data B0. Data A0 is distributed to all even-numbered rows, while B0 is distributed to all odd-numbered rows.
[0107] In Figure 16A, instructions A0 and B0 are issued and arrive at the instruction buffer. In Figure 16B, instructions A1 and B1 are issued, and A0 and B0 advance. In Figure 16C, each pair of instructions advances to its destination column. In Figure 16D, all instructions are executed on their corresponding columns, thereby reading data from memory into the corresponding vertical buffers. In Figure 16E, the vertical buffers begin propagating data along each column. While the vertical buffers transmit data, some data is stored in a staging buffer. The staging buffer is used when data has reached its target row and needs to wait until horizontal distribution begins. In Figure 16F, vertical data propagation continues. In Figure 16G, all data arrives at its target row and is copied to the staging buffers. The stripe pattern is stored in the staging buffers. In Figure 16H, data is copied from the staging buffers to the horizontal buffers. In Figure 16I, data advances through the data buffers. In Figure 16J, data advances through the data buffers, delivering data stripes A0 and B0 to the core grid. In Figure 16K, data advances through the data buffers, delivering data stripes A1 and B1 to the core grid.
[0108] Referring to Figure 17, a method for computing neural activations is illustrated. At 1701, at least one instruction is provided from an instruction memory to an instruction buffer. At 1702, the at least one instruction is advanced between positions in the instruction buffer. At 1703, when the memory of at least one of the plurality of elements of the memory array contains data associated with the at least one instruction, the at least one instruction is provided from the instruction buffer to the at least one element of the plurality of elements. The memory array includes a plurality of elements, each element including a memory and a horizontal buffer, the horizontal buffer of each element of the memory array communicating with the horizontal buffer of another element of the memory array or with at least one neural nucleus. At 1704, in response to the arrival of the at least one instruction from the instruction buffer, a block of data is provided from the memory to the horizontal buffer of the at least one element of the plurality of elements. At 1705, the block of data is provided from the horizontal buffer of the at least one element of the plurality of elements to the horizontal buffer of another element of the memory array or to at least one neural nucleus.
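The stripe pattern held in the staging buffers can be illustrated with a very small sketch. It is only an illustration under assumptions: `stripe_rows` is a hypothetical helper, rows are numbered from zero, and the stripes are assigned round-robin, which reproduces the even / odd split of A0 and B0 described above.

```python
def stripe_rows(num_rows, stripes):
    """Assign one data block to each row of the core grid, round-robin."""
    if not stripes:
        raise ValueError("at least one stripe is required")
    # Row r receives stripes[r mod len(stripes)]; with two stripes this sends
    # the first block to even-numbered rows and the second to odd-numbered rows.
    return [stripes[row % len(stripes)] for row in range(num_rows)]
```

With a single stripe the function degenerates to a plain broadcast, which corresponds to the non-striped distribution of the earlier figures.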
[0109] Various embodiments of this disclosure use combinations of instruction buffers, horizontal buffers, vertical buffers, and registers to provide instruction and data distribution in one-dimensional or two-dimensional memory arrays. It should be understood that the invention can be applied to higher-dimensional arrays with additional buffers. In these embodiments, the time from instruction issuance to data output from the data array is constant, even if each stage may take different amounts of time. Columns can be accessed in a random order. In the case of higher dimensions, two instructions accessing the same column should be separated by the vertical distribution time. In the case of one dimension, the vertical distribution time is zero, and therefore there are no constraints.
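The same-column spacing constraint can be checked on a pre-scheduled instruction stream before it is loaded. The following is a minimal sketch, assuming one instruction is issued per cycle and that "separated by the vertical distribution time" means the issue cycles differ by at least that many cycles; `find_column_conflicts` is a hypothetical helper, not part of this disclosure.

```python
def find_column_conflicts(issue_columns, vertical_time):
    """Return (earlier_cycle, later_cycle, column) for every violation.

    issue_columns[i] is the column addressed by the instruction issued in
    cycle i; vertical_time is the vertical distribution time in cycles.
    """
    last_issue = {}  # column -> cycle of the most recent instruction to it
    conflicts = []
    for cycle, column in enumerate(issue_columns):
        previous = last_issue.get(column)
        # Instructions to one column reach it after the same delay, so the
        # spacing at issue equals the spacing at the column itself.
        if previous is not None and cycle - previous < vertical_time:
            conflicts.append((previous, cycle, column))
        last_issue[column] = cycle
    return conflicts
```

With `vertical_time` set to zero, as in the one-dimensional case, the check can never fail, which matches the statement that the one-dimensional scheme has no such constraint.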
[0110] In various embodiments, a system is provided that includes a memory array, an instruction buffer, and a horizontal data buffer. The sum of the number of cycles used for instruction dispatch and memory dispatch is constant for all instructions.
[0111] In various embodiments, a two-dimensional memory array is provided. A horizontal buffer is provided for each row of the memory array. A vertical buffer is provided for each column of the memory array. The sum of the number of cycles for instruction dispatch, the number of cycles for data dispatch along the vertical buffer, and the number of cycles for data dispatch along the horizontal buffer is constant.
[0112] In various embodiments, a two-dimensional memory array is provided. A staging buffer is provided for each location in the memory array. The sum of the number of cycles for instruction dispatch, the number of cycles for data dispatch along the vertical buffer, the number of cycles for data dispatch along the horizontal buffer, and the number of cycles for data transfers to the staging buffer is constant.
[0113] Referring now to Figure 18, a schematic of an example of a computing node is shown. Computing node 10 is merely one example of a suitable computing node and is not intended to impose any limitation on the scope of use or functionality of the embodiments described herein. Regardless, computing node 10 is capable of implementing and / or performing any of the functions set forth above.
[0114] Within computing node 10, there is a computer system / server 12 that can operate with many other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and / or configurations suitable for use with computer system / server 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the aforementioned systems or devices.
[0115] Computer system / server 12 can be described in the general context of computer system executable instructions, such as program modules executed by the computer system. Typically, program modules can include routines, programs, objects, components, logic, data structures, etc., that perform specific tasks or implement specific abstract data types. Computer system / server 12 can be implemented in a distributed cloud computing environment, where tasks are performed by remote processing devices linked via a communication network. In a distributed cloud computing environment, program modules can reside in local and remote computer system storage media, including memory storage devices.
[0116] As shown in Figure 18, the computer system / server 12 in computing node 10 is illustrated in the form of a general-purpose computing device. Components of the computer system / server 12 may include, but are not limited to, one or more processors or processing units 16, system memory 28, and a bus 18 that couples various system components, including system memory 28, to the processing unit 16.
[0117] Bus 18 represents one or more of several types of bus architectures, including memory buses or memory controllers, peripheral buses, accelerated graphics ports, and processor or local buses using any of the various bus architectures. By way of example and not limitation, these architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, the Peripheral Component Interconnect (PCI) bus, the Peripheral Component Interconnect Express (PCIe) bus, and the Advanced Microcontroller Bus Architecture (AMBA).
[0118] In various embodiments, one or more inference processing units (IPUs, not shown) are coupled to bus 18. In such embodiments, an IPU can receive data from or write data to memory 28 via bus 18. Similarly, an IPU can interact with other components via bus 18 as described herein.
[0119] Computer system / server 12 typically includes various computer system readable media. Such media can be any available media accessible to computer system / server 12, and it includes volatile and non-volatile media, removable and non-removable media.
[0120] System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and / or cache memory 32. Computer system / server 12 may also include other removable / non-removable, volatile / non-volatile computer system storage media. By way of example only, storage system 34 may be provided for reading from and writing to non-removable, non-volatile magnetic media (not shown, and generally referred to as "hard disk drives"). Although not shown, disk drives for reading from and writing to removable, non-volatile disks (e.g., "floppy disks") and optical disk drives for reading from or writing to removable, non-volatile optical disks such as CD-ROMs, DVD-ROMs, or other optical media may be provided. In this case, each may be connected to bus 18 via one or more data media interfaces. As will be further described below, memory 28 may contain at least one program product having a set (e.g., at least one) of program modules configured to perform embodiments of the invention.
[0121] A program / utility 40 having a set (at least one) of program modules 42 may be stored in memory 28, by way of example and not limitation, along with an operating system, one or more applications, other program modules, and program data. Each of the operating system, one or more applications, other program modules, and program data, or some combination thereof, may include an implementation of a networking environment. Program modules 42 typically perform the functions and / or methods of the embodiments described herein.
[0122] The computer system / server 12 can also communicate with one or more external devices 14, such as a keyboard, a pointing device, and a display 24; with one or more devices that enable a user to interact with the computer system / server 12; and / or with any device (e.g., a network interface card, a modem, etc.) that enables the computer system / server 12 to communicate with one or more other computing devices. This communication can occur via input / output (I / O) interface 22. In addition, the computer system / server 12 can communicate with one or more networks, such as a local area network (LAN), a general wide area network (WAN), and / or a public network (e.g., the Internet), via network adapter 20. As shown, network adapter 20 communicates with other components of the computer system / server 12 via bus 18. It should be understood that, although not shown, other hardware and / or software components can be used in conjunction with the computer system / server 12, examples including, but not limited to, microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archive storage systems.
[0123] This disclosure can be implemented as a system, method, and / or computer program product. A computer program product may include one or more computer-readable storage media having computer-readable program instructions thereon for causing a processor to perform aspects of this disclosure.
[0124] Computer-readable storage media can be tangible devices capable of retaining and storing instructions for use by an instruction execution device. Computer-readable storage media can be, for example, but not limited to, electronic storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of computer-readable storage media includes the following: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. As used herein, computer-readable storage media should not be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses passing through fiber optic cables), or electrical signals transmitted through wires.
[0125] The computer-readable program instructions described herein can be downloaded to respective computing / processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, such as the Internet, a local area network (LAN), a wide area network (WAN), and / or a wireless network. The network may include copper transmission cables, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing / processing device.
[0126] Computer-readable program instructions used to perform the operations of this disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages (e.g., Smalltalk, C++, etc.) and conventional procedural programming languages (e.g., the "C" programming language or similar programming languages). The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a stand-alone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or wide area network (WAN), or the connection may be made to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of this disclosure.
[0127] This document describes aspects of the disclosure with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowcharts and / or block diagrams, and combinations of blocks in the flowcharts and / or block diagrams, can be implemented by computer-readable program instructions.
[0128] These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions / actions specified in one or more blocks of a flowchart and / or block diagram. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and / or other devices to operate in a particular manner, such that the computer-readable storage medium in which the instructions are stored comprises an article of manufacture including instructions that implement aspects of the functions / actions specified in one or more blocks of a flowchart and / or block diagram.
[0129] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions, which execute on the computer, other programmable apparatus, or other device, implement the functions / actions specified in one or more blocks of a flowchart and / or block diagram.
[0130] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions comprising one or more executable instructions for implementing a specified logical function. In some alternative embodiments, the functions noted in the blocks may occur out of the order shown in the figures. For example, two blocks shown consecutively may actually be executed substantially simultaneously, or these blocks may sometimes be executed in reverse order, depending on the functions involved. It will also be noted that each block illustrated in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
[0131] Various embodiments of this disclosure have been described for illustrative purposes, but the description is not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope of the described embodiments. The terminology used herein has been chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A neural inference chip for calculating neural activation, the neural inference chip comprising: at least one neural nucleus; a memory array operatively coupled to the at least one neural nucleus, the memory array comprising a plurality of elements, each element including a memory and a horizontal buffer, the horizontal buffer of each element of the memory array communicating with the horizontal buffer of another element of the memory array or with the at least one neural nucleus; an instruction buffer in communication with the memory array, the instruction buffer having a location corresponding to each of the plurality of elements of the memory array; and an instruction memory in communication with the instruction buffer, wherein: the instruction memory is adapted to provide at least one instruction to the instruction buffer; the instruction buffer is adapted to advance the at least one instruction between locations within the instruction buffer; the instruction buffer is adapted to provide the at least one instruction to at least one element of the plurality of elements of the memory array, from that element's associated location in the instruction buffer, when the memory of that element contains data associated with the at least one instruction; each of the plurality of elements of the memory array is adapted to provide a block of data from its memory to its horizontal buffer in response to the arrival of an associated instruction from the instruction buffer; and the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus.
2. The neural inference chip according to claim 1, wherein: the instruction buffer is adapted to advance instructions between locations in the instruction buffer at a rate of one location per cycle; and the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus at a rate of one data block per cycle.
3. The neural inference chip according to claim 1, comprising a neural nucleus array, the neural nucleus array comprising the at least one neural nucleus and having multiple rows.
4. The neural inference chip of claim 1, wherein the memory array is one-dimensional, and the plurality of elements of the memory array are arranged in a single row and multiple columns.
5. The neural inference chip of claim 1, wherein the memory array is two-dimensional, and the plurality of elements of the memory array are arranged in multiple rows and columns.
6. The neural inference chip of claim 5, wherein each element of the memory array further comprises a vertical buffer, the vertical buffer of each element of the memory array communicating with the vertical buffer of another element of the memory array.
7. The neural inference chip according to claim 6, wherein: each of the plurality of elements of the memory array is adapted to provide a block of data from its memory to its vertical buffer in response to the arrival of an associated instruction from the instruction buffer; each of the plurality of elements of the memory array is adapted to provide data blocks from its vertical buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array.
8. The neural inference chip according to claim 7, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus at a rate of one data block per cycle; and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle.
9. The neural inference chip of claim 6, wherein each element of the memory array further comprises a staging buffer, the staging buffer of each element of the memory array communicating with the horizontal buffer and the vertical buffer of that element of the memory array.
10. The neural inference chip according to claim 9, wherein: each of the plurality of elements of the memory array is adapted to provide a block of data from its memory to its vertical buffer in response to the arrival of an associated instruction from the instruction buffer; each of the plurality of elements of the memory array is adapted to provide data blocks from its vertical buffer to its staging buffer; each of the plurality of elements of the memory array is adapted to provide data blocks from its staging buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array.
11. The neural inference chip according to claim 10, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus at a rate of one data block per cycle; the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle; and the staging buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of that element of the memory array at a rate of one data block per cycle.
12. The neural inference chip according to claim 10, wherein: The instruction memory is adapted to provide multiple instructions to the instruction buffer in each cycle, and each location of the instruction buffer is adapted to store multiple instructions. The instruction buffer is adapted to advance multiple instructions between positions in the instruction buffer each cycle.
13. A neural inference chip for calculating neural activation, the neural inference chip comprising: at least one neural nucleus; a memory array operatively coupled to the at least one neural nucleus, the memory array comprising a plurality of elements, each element including a memory, a horizontal buffer, and a vertical buffer, the horizontal buffer of each element of the memory array communicating with the horizontal buffer of another element of the memory array or with the at least one neural nucleus, and the vertical buffer of each element of the memory array communicating with the vertical buffer of another element of the memory array; a plurality of instruction buffers in communication with the memory array, each of the plurality of instruction buffers having a location corresponding to one of the plurality of elements of the memory array; and a plurality of instruction memories, each in communication with one of the plurality of instruction buffers, wherein: each instruction memory is adapted to provide at least one instruction to its instruction buffer; each instruction buffer is adapted to advance the at least one instruction between locations within the instruction buffer; each instruction buffer is adapted to provide the at least one instruction to at least one element of the plurality of elements of the memory array, from that element's associated location in the instruction buffer, when the memory of that element contains data associated with the at least one instruction; each of the plurality of elements of the memory array is adapted to provide a block of data from its memory to its vertical buffer in response to the arrival of an associated instruction from the instruction buffer; each of the plurality of elements of the memory array is adapted to provide data blocks from its vertical buffer to its horizontal buffer; the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array; and the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus.
14. The neural inference chip according to claim 13, wherein: each instruction buffer is adapted to advance instructions between its locations at a rate of one location per cycle; and the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus at a rate of one data block per cycle.
15. The neural inference chip of claim 13, comprising a neural nucleus array, the neural nucleus array including the at least one neural nucleus and having multiple rows.
16. The neural inference chip of claim 13, wherein the memory array is two-dimensional, and the plurality of elements of the memory array are arranged in multiple rows and columns.
17. The neural inference chip according to claim 16, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus at a rate of one data block per cycle; and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle.
18. The neural inference chip of claim 16, wherein each element of the memory array further comprises a staging buffer, the staging buffer of each element of the memory array communicating with the horizontal buffer and the vertical buffer of that element of the memory array.
19. The neural inference chip according to claim 18, wherein: each of the plurality of elements of the memory array is adapted to provide a block of data from its memory to its vertical buffer in response to the arrival of an associated instruction from the instruction buffer; each of the plurality of elements of the memory array is adapted to provide data blocks from its vertical buffer to its staging buffer; each of the plurality of elements of the memory array is adapted to provide data blocks from its staging buffer to its horizontal buffer; and the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array.
20. The neural inference chip according to claim 19, wherein: the horizontal buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of another element of the memory array or to the at least one neural nucleus at a rate of one data block per cycle; the vertical buffer of each element of the memory array is adapted to provide data blocks to the vertical buffer of another element of the memory array at a rate of one data block per cycle; and the staging buffer of each element of the memory array is adapted to provide data blocks to the horizontal buffer of that element of the memory array at a rate of one data block per cycle.
21. The neural inference chip according to claim 19, wherein: the instruction memory is adapted to provide multiple instructions to the instruction buffer per cycle, and each location of the instruction buffer is adapted to store multiple instructions; and the instruction buffer is adapted to advance multiple instructions between locations in the instruction buffer each cycle.
22. A method comprising: providing at least one instruction from an instruction memory to an instruction buffer; advancing the at least one instruction between locations in the instruction buffer; when the memory of at least one of a plurality of elements of a memory array contains data associated with the at least one instruction, providing the at least one instruction from the instruction buffer to the at least one of the plurality of elements, wherein the memory array comprises the plurality of elements, each element including a memory and a horizontal buffer, the horizontal buffer of each element of the memory array communicating with the horizontal buffer of another element of the memory array or with at least one neural nucleus; in response to the arrival of the at least one instruction from the instruction buffer, providing a data block from the memory to the horizontal buffer of the at least one of the plurality of elements; and providing the data block from the horizontal buffer of the at least one of the plurality of elements to the horizontal buffer of another element of the memory array or to the at least one neural nucleus.