Artificial neural network operator, accelerator comprising same, and matrix operation method
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ELECTRONICS & TELECOMM RES INST
- Filing Date
- 2025-12-05
- Publication Date
- 2026-06-18
AI Technical Summary
The challenges of performing large language model computations with large-scale parameters are exacerbated by limited bandwidth and routing difficulties due to numerous wires required for data transfer, which bottleneck data transfer and computation efficiency.
An arithmetic unit comprising a plurality of operation cores with scalar arithmetic units that perform matrix operations through cyclic shifting of operands, reducing wiring complexity by providing operands to only one row and one column of the scalar operator array, and utilizing a global asynchronous local synchronous operation with local power and clock gating.
This approach enhances computation efficiency by minimizing bottlenecks and routing difficulties, enabling efficient data transfer and computation in large language models with reduced wiring complexity and improved performance.
Abstract
Description
Artificial neural network operator and accelerator including the same and matrix operation method
[0001] The present disclosure generally relates to an artificial neural network operator, an accelerator including the same, and a matrix operation method.
[0002] With the rapid growth of large language models, the data of parameters used in computations amounts to several terabytes. Performing large language model computations using such large-scale parameters presents many challenges.
[0003] As the demand for semiconductor performance and memory bandwidth to resolve these difficulties increases, semiconductor manufacturers are attempting to improve performance and memory bandwidth by utilizing heterogeneous integration with chiplets rather than single dies.
[0004] Bottlenecks occur during data transfer due to limited bandwidth, and routing difficulties arise from the numerous wires required to provide data to the computing unit for computation.
[0005] The present disclosure is intended to resolve the difficulties of such prior art.
[0006] An arithmetic unit for performing matrix operations according to the present embodiment comprises: a plurality of operation cores, and each of the operation cores comprises: a scalar arithmetic unit array for performing the matrix operation; an X register for storing and providing a first operand of the matrix operation, a Y register for storing and providing a second operand of the matrix operation, and a result register for storing the result of the matrix operation, wherein the first operand and the second operand are loaded into the plurality of scalar arithmetic units of the array to perform the matrix operation, and the scalar arithmetic unit array performs a cyclic shift of the loaded first operand and the second operand in one direction of the array and the other direction of the array, respectively.
[0007] According to one aspect of the present embodiment, each of the scalar operators includes an operation unit that performs one or more of addition, subtraction, multiplication, and MAC (multiply and accumulate) operations, a register storing the first operand, and a register storing the second operand.
[0008] According to one aspect of the present embodiment, the X register stores a multiplicand matrix that is the subject of the matrix operation and outputs the multiplicand matrix row by row to the scalar operators included in one column of the scalar operator array, and the Y register stores a multiplier matrix that is the subject of the matrix operation and outputs the multiplier matrix column by column to the scalar operators included in one row of the scalar operator array. In this aspect, the scalar operators cyclically shift the outputted first operand along one direction of the array and load it into each scalar operator, and cyclically shift the outputted second operand along the other direction of the array and load it into each scalar operator. In this aspect, the number of cyclic shifts performed by the scalar operators included in adjacent rows of the array differs by one, and the number of cyclic shifts performed by the scalar operators included in adjacent columns differs by one.
[0009] According to one aspect of the present embodiment, the scalar operators perform the matrix operation with the loaded first operand and the second operand, and then perform a cyclic shift in one direction of the array and the other direction of the array.
[0010] According to one aspect of the present embodiment, the X register provides the first operand to only one column of the array, and the Y register provides the second operand to only one row of the array.
[0011] According to one aspect of the present embodiment, while the plurality of scalar operators perform the matrix operation with the shifted first operand and second operand, an operand to be operated on by another of the plurality of operation cores is fetched.
[0012] According to one aspect of the present embodiment, each of the operation cores further comprises an X multiplexer (MUX) that selects one of the first operands provided by the X registers included in the plurality of operation cores and outputs it to the X register of the operation unit, and a Y multiplexer that outputs one of the second operands provided by the Y register included in the plurality of operation cores and the matrix operation results provided by the accumulation register included in the plurality of operation cores to the Y register of the operation unit.
[0013] According to one aspect of the present embodiment, the value stored in the result register is provided to one or more of the plurality of operation cores as either the first operand or the second operand.
[0014] The artificial intelligence computation accelerator of the present embodiment comprises: a tensor operator including k computation cores; a wide bandwidth memory; an internal memory unit including a cache memory for storing data to be computed in the wide bandwidth memory and an instruction cache for storing instructions for the tensor operator; and a control unit for fetching data and instructions from the internal memory and providing them to the tensor operator, wherein the accelerator communicates data with the wide bandwidth memory via a designated pseudo channel.
[0015] According to one aspect of the present embodiment, the accelerator further includes a bus structure that communicates with the wide bandwidth memory and a plurality of the accelerators.
[0016] According to one aspect of the present embodiment, the accelerator operates in a globally asynchronous, locally synchronous (GALS) manner, and local power gating and clock gating are possible.
[0017] According to one aspect of the present embodiment, a plurality of accelerators are included in a single neural network computation unit (NPU) die, and k neural network computation units and j wide bandwidth memory packages are joined to an interposer to form a computation device. (k, j: natural numbers)
[0018] A matrix operation method according to the present embodiment is performed in a scalar operator array comprising an X register storing a first operand of a matrix operation, a Y register storing a second operand, and a plurality of scalar operators, and comprises: an operand providing step in which the X register provides the first operand to the array and the Y register provides the second operand to the array; an operation step in which the scalar operators included in the array perform the matrix operation with the provided first operand and second operand; and a step in which each of the scalar operators included in the array cyclically shifts the provided first operand in one direction of the array and cyclically shifts the second operand in the other direction of the array.
[0019] According to one aspect of the present embodiment, the first operand is a row of a multiplicand matrix and the second operand is a column of a multiplier matrix, and the operand providing step is performed by the X register providing the elements of the first operand row along a column of the scalar operator array and the Y register outputting the elements of the second operand column along a row of the scalar operator array. In this aspect, the matrix operation method further includes a loading step, performed after the operand providing step, in which the scalar operators cyclically shift the outputted first operand along the row of the array and load it into each scalar operator, and cyclically shift the outputted second operand along the column of the array and load it into each scalar operator.
[0020] According to one aspect of the present embodiment, in the operand providing step, the X register provides the first operand to only one column of the array, and the Y register provides the second operand to only one row of the array.
[0021] According to one aspect of the present embodiment, after the matrix operation method is completed, the matrix operation result is further provided as an operand to one or more of the plurality of operation cores.
[0022] According to one aspect of the present embodiment, when the matrix operation method is performed in an arithmetic unit comprising a plurality of X registers, a plurality of Y registers, and a plurality of scalar arithmetic unit arrays, then while the matrix operation method is performed in the operation core of one of the scalar arithmetic unit arrays, at least one of the following steps is performed: storing a first operand in the X register of another scalar arithmetic unit array, and storing a second operand in the Y register of the other scalar arithmetic unit array.
[0023] According to the present embodiment, an artificial neural network operator is provided that can resolve bottlenecks occurring during data movement due to limited bandwidth, as well as routing difficulties caused by the numerous wires needed to provide data to an arithmetic unit for computation.
[0024] FIG. 1 is a schematic diagram illustrating a neural network computing device including the computing unit of the present embodiment.
[0025] FIG. 2 is a diagram illustrating an overview of the operation unit of the present embodiment.
[0026] FIG. 3 is a diagram illustrating the connection relationship between a single NPU die, four HBM memories, and an external host memory.
[0027] FIG. 4 is a diagram schematically illustrating the computational cores, load section, and storage section of the present embodiment.
[0028] FIG. 5 is a flowchart schematically illustrating the operation method of the computational core of the present embodiment.
[0029] FIG. 6 is a diagram illustrating the operand input process.
[0030] FIG. 7 is a diagram illustrating a scalar operator array cyclically shifting a first operand in one direction and cyclically shifting a second operand in the other direction.
[0031] FIGS. 8 to 11 illustrate steps for performing cyclic shifts and operations for each row and column in a 32×32 scalar arithmetic array.
[0032] FIG. 12 is a diagram illustrating an example of an operation core included in an operation unit.
[0033] The present embodiment is described below with reference to the attached drawings. FIG. 1 is a schematic diagram illustrating a neural network computing device (1) including a computing unit (10, see FIG. 2) of the present embodiment. Referring to FIG. 1, the neural network computing device (1) includes a neural network computing unit (NPU) die (2) and a wide bandwidth memory (3), which are connected through an interposer (4). In the computing device (1) of the illustrated embodiment, two NPU (Neural Processing Unit) dies (2) and eight wide bandwidth memories (HBM, 3) are connected through the interposer (4).
[0034] The interposer (4) enables connection between the NPU dies (2) and the wide bandwidth memory (3) and the substrate (not shown), and can improve data transmission speed through dense wiring, reduce signal loss between high-performance semiconductor chips, and enable efficient communication.
[0035] In the embodiment illustrated in FIG. 1, one NPU die (2) can communicate with four HBMs (3) with a data width of 4096 bits. Each wide bandwidth memory (HBM, 3) has 16 64-bit channels, and each channel is divided into two 32-bit pseudo-channels, so the 16 physical channels of each wide bandwidth memory (3) expand into 32 pseudo-channels. In the illustrated embodiment, one NPU die (2) includes 128 arithmetic units (10, see FIG. 2) and communicates with four wide bandwidth memories (3), so each of the 128 arithmetic units (10, see FIG. 2) can communicate with the wide bandwidth memory (3) through a dedicated pseudo-channel.
[0036] The two NPU dies (2) communicate with the wide bandwidth memories (3) within the computing device (1) with a data width of up to 8192 bits. Additionally, the NPU dies (2) can communicate die-to-die with a data width of 1300 bits.
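The channel arithmetic in this paragraph can be checked with a short sketch. The figures are the ones stated above; the variable names are illustrative and not part of the disclosure.

```python
# Figures from the paragraph above; all names are illustrative assumptions.
HBM_PER_DIE = 4            # HBMs (3) served by one NPU die (2)
CHANNELS_PER_HBM = 16      # 64-bit channels per HBM
CHANNEL_WIDTH_BITS = 64
PSEUDO_PER_CHANNEL = 2     # each channel splits into two pseudo-channels
UNITS_PER_DIE = 128        # arithmetic units (10) per NPU die

pseudo_channels = HBM_PER_DIE * CHANNELS_PER_HBM * PSEUDO_PER_CHANNEL
data_width_bits = HBM_PER_DIE * CHANNELS_PER_HBM * CHANNEL_WIDTH_BITS
pseudo_width_bits = CHANNEL_WIDTH_BITS // PSEUDO_PER_CHANNEL

print(pseudo_channels)                   # 128
print(data_width_bits)                   # 4096
print(pseudo_width_bits)                 # 32
print(pseudo_channels // UNITS_PER_DIE)  # 1 pseudo-channel per arithmetic unit
```

The count of 128 pseudo-channels matches the 128 arithmetic units, which is what allows one dedicated pseudo-channel per unit.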
[0037] FIG. 2 is a diagram illustrating an overview of the arithmetic unit (10) of the present embodiment. Referring to FIG. 2, the arithmetic unit (10) of the present embodiment includes an arithmetic core (100) comprising a plurality of scalar arithmetic units; an internal memory (300) comprising a data cache (310) for storing data fetched from a wide bandwidth memory (3) and an instruction cache (320) for storing instructions, and a control unit (200) for controlling the arithmetic unit (10). The arithmetic unit (10) communicates data with the wide bandwidth memory (3) via a designated pseudo channel through a bus unit (400).
[0038] In the embodiment illustrated in FIGS. 1 and 2, 128 ANC units of 4 TFLOPS each are integrated into one NPU die (2), so that a performance of 512 TFLOPS can be obtained. Each die (2) further includes a wide bandwidth memory controller (30) and a physical layer (not shown). The wide bandwidth memory controller (30) controls the HBMs (3) so that one NPU die (2) can read from four wide bandwidth memories (3) simultaneously.
[0039] The arithmetic units (10) according to the exemplary embodiment are implemented in a globally asynchronous, locally synchronous (GALS) manner, and the clock frequency and power of the elements included in each arithmetic unit (10) can be controlled, thereby enabling fine-grained control of operating frequency and power.
[0040] FIG. 3 is a diagram illustrating the connection relationship between a single NPU die (2), four HBM memories (3), and an external host memory. Referring to FIG. 3, each arithmetic unit (10) is connected to the wide bandwidth memory (3) via a dedicated pseudo-channel through the internal memory (300). Data transfer performance can thus be improved by minimizing bottlenecks. The wide bandwidth memory (3) can be connected to the host memory via a PCIe Gen5 ×16 link.
[0041] FIG. 4 is a schematic illustration of the operation cores (100a, 100b, 100c, 100d), load unit (210), and storage unit (220) of the present embodiment. Referring to FIG. 4, an operator (10) that performs matrix operations comprises: a plurality of operation cores (100), and each of the operation cores (100a, 100b, 100c, 100d) comprises: a scalar operator array (110) that performs matrix operations; an X register (OP X0, OP X1, OP X2, OP X3) that stores and provides a first operand of the matrix operation, a Y register (OP Y0, OP Y1, OP Y2, OP Y3) that stores and provides a second operand of the matrix operation, and a result register (ACC 0, ACC 1, ACC 2, ACC 3) that stores the result of the matrix operation.
[0042] Matrix operations are performed by loading a first operand and a second operand into a plurality of scalar operators (112) included in an array (110), and the scalar operator array (110) performs cyclic shifts on the loaded first operand and second operand in one direction of the array (110) and the other direction of the array, respectively.
[0043] Each scalar operator (112) may be a fused multiply-add (FMA) unit that multiplies the provided operands and accumulates the multiplication results. Additionally, the scalar operator (112) may include an operation unit that performs multiplication, addition, subtraction, and MAC operations on the input operands, a row element register that stores elements of the provided first operand, and a column element register that stores elements of the second operand. The row element register and the column element register may provide operands to adjacent scalar operators (112) in the row and column directions of the array. Thus, the operands are shifted along the row and column directions of the array. Additionally, the scalar operator (112) may include an accumulation register that stores the result of performing operations on the input operands, and the accumulation register may be connected to a result register (ACC) to provide the operation result to the result register.
[0044] FIG. 5 is a flowchart illustrating the operation method of the operation core (100) of the present embodiment. Referring to FIG. 5, a matrix operation method performed in a scalar operator array (110) comprising an X register storing a first operand of a matrix operation, a Y register storing a second operand, and a plurality of scalar operators (112) comprises: an operand providing step (S100) in which the X register provides a first operand to the array and the Y register provides a second operand to the array; an operation step (S200) in which a scalar operator (112) included in the array (110) performs a matrix operation with the provided first operand and second operand; and a step (S300) in which each of the scalar operators (112) included in the array (110) cyclically shifts the provided first operand in one direction of the array and cyclically shifts the second operand in the other direction of the array. In one embodiment, the operation step (S200) and the cyclic shift step (S300) may be performed repeatedly until the matrix operation is completed.
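A minimal software model may make the structure of one scalar operator easier to picture. The class below is a hypothetical sketch, not the disclosed circuit: the class name, field names, and the use of floating-point values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScalarOperator:
    """Hypothetical software model of one scalar operator (112)."""
    row_reg: float = 0.0   # row element register (element of the first operand)
    col_reg: float = 0.0   # column element register (element of the second operand)
    acc: float = 0.0       # accumulation register

    def mac(self) -> None:
        # Multiply the two held elements and add the product to the accumulator.
        self.acc += self.row_reg * self.col_reg


op = ScalarOperator(row_reg=2.0, col_reg=3.0)
op.mac()
op.row_reg, op.col_reg = 4.0, 5.0   # stand-ins for values shifted in from neighbours
op.mac()
print(op.acc)  # 26.0
```

In the array, the new register values would arrive from the adjacent operators in the row and column directions rather than being assigned directly as they are here.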
[0045] FIGS. 6 through 11 are drawings illustrating the process of performing matrix multiplication according to the following mathematical formula. In the illustrated embodiment, the multiplicand matrix A is a 32×32 matrix, and the multiplier matrix B is a 32×32 matrix. In the example illustrated below, the scalar operator array (110) includes scalar operators (112) arranged in a 32×32 array. However, the number of scalar operators is for illustrative purposes only and is not intended to limit the invention.
[0046] [Mathematical Formula 1]
[0047]
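The formula itself is not reproduced in this text. Assuming Mathematical Formula 1 denotes the ordinary product of the 32×32 multiplicand matrix A and the 32×32 multiplier matrix B, with zero-based indices as in the element labels used in the paragraphs that follow, each element of the product can be written as:

```latex
(A \times B)_{i,j} = \sum_{k=0}^{31} a_{i,k}\, b_{k,j}, \qquad 0 \le i, j \le 31
```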
[0048] The load unit (210) reads the first operand of the matrix multiplication, the multiplicand matrix A, and the second operand, the multiplier matrix B, which are fetched from the wide bandwidth memory (3) by the control unit (200) and loaded into the data cache (310), and loads them into the X0 register, which stores the first operand, and the Y0 register, which stores the second operand, respectively. As described below, while matrix operations are being performed on one or more of the other operation cores (100a, 100b, 100c, 100d), the load unit (210) may read one or more of the multiplicand matrix A and the multiplier matrix B stored in the wide bandwidth memory (3), store them in the data cache (310) of the internal memory (300), and load them into the X0 register and / or the Y0 register.
[0049] For matrix multiplication operations, the X0 register provides the elements of each row of the first operand to any one column of the array (110) of scalar operators (112). Additionally, the Y0 register provides the elements of each column of the second operand to any one row of the array (110) of scalar operators (112) (S100).
[0050] In one embodiment, the X0 register may be connected to any one column of the array (110) of scalar operators (112) to provide the row elements of the first operand. Additionally, the Y0 register may be connected to any one row of the array (110) of scalar operators (112) to provide the column elements of the second operand.
[0051] In the illustrated embodiment, the present embodiment is described by an example in which the X0 register outputs the row elements of the first operand to the leftmost column (110L) of the array (110) formed by scalar operators (112), and the Y0 register outputs the column elements of the second operand to the topmost row (110T) of the array (110) formed by scalar operators (112). However, this is for ease of understanding and is not intended to limit the invention.
[0052] Referring further to FIG. 6, the X0 register outputs elements of the first operand row by row to the scalar operators included in the leftmost column (110L) of the array. In one embodiment, the scalar operator ($112_{0,0}$) located at row 0 of the leftmost column (110L) may be provided with the elements of the 0th row of the first operand in the order $a_{0,31}, a_{0,30}, a_{0,29}, \ldots, a_{0,0}$, and the scalar operator ($112_{0,0}$) to which the elements are provided can pass them on by shifting them to adjacent scalar operators along the row direction of the array. Therefore, the first-provided $a_{0,31}$ is shifted repeatedly and stored in the register of the scalar operator located in the rightmost column of the array (110). Through this process, the row element data $a_{0,0}, a_{0,1}, a_{0,2}, \ldots, a_{0,31}$ are stored, from left to right, in the registers of the scalar operators of the uppermost row (110T) of the array (110).
[0053] Likewise, the scalar operator ($112_{2,0}$) located in row 2 may be provided with the elements of the second row of the first operand in the order $a_{2,31}, a_{2,30}, a_{2,29}, \ldots, a_{2,0}$, which are shifted from left to right. The scalar operators to which the elements are provided pass them on by shifting them to adjacent scalar operators along the rows of the array. Thus, the last-provided $a_{2,0}$ is stored in the row element register of the scalar operator located in the leftmost column of the array (110).
[0054] The Y0 register outputs elements of the second operand column by column to the scalar operators included in the top row (110T) of the array. In one embodiment, the scalar operator ($112_{0,0}$) located in column 0 may be provided with the elements of column 0 of the second operand in the order $b_{31,0}, b_{30,0}, b_{29,0}, \ldots, b_{0,0}$. The scalar operator ($112_{0,0}$) to which the elements are provided can pass them on by shifting them to adjacent scalar operators along the columns of the array. Therefore, the first-provided $b_{31,0}$ is shifted repeatedly and stored in the register of the scalar operator located in the bottom row of the array (110). Through this process, the column element data $b_{31,0}, b_{30,0}, b_{29,0}, \ldots, b_{0,0}$ are stored, from bottom to top in the diagram, in the registers of the scalar operators of the leftmost column (110L) of the array (110).
[0055] Likewise, the scalar operator ($112_{0,3}$) located in column 3 may be provided with the elements of the third column of the second operand in the order $b_{31,3}, b_{30,3}, b_{29,3}, \ldots, b_{0,3}$. The scalar operators to which the elements are provided shift them to adjacent scalar operators along the columns of the array. Thus, $b_{1,3}$, provided immediately before the last element, is stored in the column element register of the scalar operator ($112_{1,3}$) located immediately below the top row. Shifting the data to be operated on into the scalar operators of the array in this way has the advantage of reducing wiring difficulty.
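The shift-in loading described in the preceding paragraphs can be sketched for a single row. This is an illustrative model only: the function name `feed_row` and the tuple encoding of elements are assumptions, and the entry point is taken to be the leftmost column as in the illustrated embodiment.

```python
N = 32  # array width used in the illustrated embodiment


def feed_row(elements_in_feed_order, n=N):
    """Shift elements into a row of n registers through the leftmost register."""
    regs = [None] * n
    for value in elements_in_feed_order:
        # Every register hands its value to its right-hand neighbour,
        # and the new value enters at the leftmost column.
        regs = [value] + regs[:-1]
    return regs


# Row 0 of the first operand is fed in the order a(0,31), a(0,30), ..., a(0,0).
feed_order = [("a", 0, k) for k in range(N - 1, -1, -1)]
row_regs = feed_row(feed_order)

print(row_regs[0])    # ('a', 0, 0)
print(row_regs[-1])   # ('a', 0, 31)
```

After 32 feed steps the row holds a(0,0) through a(0,31) from left to right, with only the leftmost operator wired to the operand register.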
[0056] As described above, column element registers included in the scalar operator can be connected so that they can be shifted relative to each other within the same column, and the first and last registers of a column can be connected to perform a cyclic shift. Additionally, row element registers included in the scalar operator can be connected so that they can be shifted relative to each other within the same row. The start and end registers of a row can be connected to perform a cyclic shift.
[0057] As illustrated in FIG. 7, the scalar operator array (110) cyclically shifts the first operand provided by the X register in one direction and cyclically shifts the second operand provided by the Y register in the other direction. In one embodiment, the scalar operators located in the nth row of the array (110) each cyclically shift the provided first operand n times in one direction and output it (n: a non-negative integer). Accordingly, the row element registers of the scalar operators located in the nth row load the values of the first operand cyclically shifted n times in one direction.
[0058] For example, the scalar operators ($112_{0,0}, 112_{0,1}, \ldots, 112_{0,31}$) located at row 0 of the array (110) cyclically shift the row elements of the first operand zero times in one direction. Therefore, the scalar operators located at row 0 do not perform cyclic shifts.
[0059] As another example, the scalar operators ($112_{3,0}, 112_{3,1}, \ldots, 112_{3,31}$) located in the 3rd row of the array (110) cyclically shift the first operand three times in one direction. Thus, the scalar operator ($112_{3,0}$) located at the 3rd row of the array (110) receives the first operand row element value that was stored in the scalar operator ($112_{3,3}$), cyclically shifted three times, and loads it into its row element register. The first operand row element value that was stored in the scalar operator ($112_{3,0}$) is cyclically shifted three times, provided to the scalar operator ($112_{3,29}$), and loaded into its row element register.
[0060] In one embodiment, scalar operators located in column k of the array (110) cyclically shift the second operand k times in the other direction (k: a non-negative integer). In one example, scalar operators located in column 0 of the array (110) cyclically shift the second operand 0 times in the other direction. Thus, scalar operators located in column 0 of the array (110) do not cyclically shift the second operand.
[0061] As another example, the scalar operators ($112_{0,2}, 112_{1,2}, \ldots, 112_{31,2}$) located in the 2nd column of the array (110) cyclically shift the second operand twice in the other direction. Therefore, the scalar operator ($112_{0,2}$) receives the second operand column element value that was stored in the scalar operator ($112_{2,2}$), cyclically shifted twice, and loads it into the column element register of the scalar operator ($112_{0,2}$). Likewise, the scalar operator ($112_{31,2}$) receives the second operand column element value that was stored in the scalar operator ($112_{1,2}$), cyclically shifted twice, and loads it into its column element register.
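The row and column skew described in the preceding paragraphs can be checked with index arithmetic. The sketch below assumes that "one direction" is leftward along a row and "the other direction" is upward along a column, which is the reading that reproduces the examples given for rows 0 and 3 and columns 0 and 2; the array names are illustrative.

```python
N = 32

# Before skewing, operator (i, j) holds a(i, j) and b(i, j), stored here as index pairs.
a_regs = [[(i, j) for j in range(N)] for i in range(N)]
b_regs = [[(i, j) for j in range(N)] for i in range(N)]

# Row n: cyclic shift left n times. Column k: cyclic shift up k times.
a_skew = [[a_regs[i][(j + i) % N] for j in range(N)] for i in range(N)]
b_skew = [[b_regs[(i + j) % N][j] for j in range(N)] for i in range(N)]

print(a_skew[3][0], a_skew[3][29])   # (3, 3) (3, 0)
print(b_skew[0][2], b_skew[31][2])   # (2, 2) (1, 2)
```

After the skew, operator (i, j) holds a(i, (i+j) mod 32) and b((i+j) mod 32, j), so the two elements in every operator share the same inner index and can be multiplied directly.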
[0062] In a 32×32 scalar operator array, the state in which cyclic shifts are completed for each row and column, and the row element values and column element values are loaded into the row element register and column element register of each scalar operator, is as exemplified in FIG. 8. For each scalar operator exemplified in FIG. 8, the upper left represents the row element of the first operand stored in the row element register, the upper right represents the column element of the second operand stored in the column element register, and the bottom represents the result value accumulated in the accumulation register. FIG. 8 illustrates a state in which values are loaded into the row element register and column element register of each scalar operator and multiplication has not yet been performed.
[0063] Each scalar operator forms a partial product by multiplying the row element value and the column element value loaded into the row element register and the column element register, and accumulates the partial product onto the value stored in its accumulation register (S200). FIG. 9 is a diagram illustrating the state in which the result of multiplying the loaded row element value and column element value is stored in the accumulation register of each scalar operator. Referring to FIG. 9, the operation unit included in each scalar operator multiplies the data stored in the row element register and the column element register exemplified in FIG. 8 and stores the result in the accumulation register.
[0064] When the multiplication and accumulation of the row element and column element values are completed in each of the scalar operators, the value stored in the row element register is cyclically shifted in one direction, and the value stored in the column element register is cyclically shifted in the other direction (S300). FIG. 10 illustrates the state in which a cyclic shift is performed after the multiplication operation exemplified by FIG. 9 is completed. As exemplified, each scalar operator shifts the value stored in the row element register in one direction and shifts the value stored in the column element register in the other direction.
[0065] FIG. 11 is a diagram illustrating the state in which the results of a multiplication operation are accumulated after a cyclic shift. Referring to FIG. 11, when the cyclic shift exemplified by FIG. 10 is completed, each operation unit included in the scalar operators forms a partial product by multiplying the row element value and the column element value that have been cyclically shifted and loaded into the row element register and the column element register, and stores the result by accumulating it in the partial product already stored in each accumulation register. As the cyclic shift and accumulation processes are repeated in this manner, 32×32 matrix operations can be performed. In one embodiment, the 32×32 matrix multiplication can obtain the operation result through 31 cyclic shifts, 32 multiplications, and 31 partial product accumulations. The accumulation register included in each scalar operator provides the operation result to the result register (ACC), and the storage unit (220) can store the result stored in the result register in the data cache within the internal memory (300).
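The whole sequence of skewed loading, multiply-accumulate (S200), and cyclic shift (S300) can be simulated in software and compared against a direct matrix product. The sketch below is a functional model under the same direction assumptions as the examples above (rows shift left, columns shift up); the function name and the random integer test data are assumptions, and it says nothing about timing or hardware structure.

```python
import random


def cyclic_shift_matmul(a, b):
    """Multiply square matrices with the skew / multiply-accumulate / shift scheme."""
    n = len(a)
    # Skewed load: row i of A shifted left i times, column j of B shifted up j times.
    x = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    y = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    acc = [[0] * n for _ in range(n)]
    shifts = 0
    for step in range(n):
        # S200: every operator multiplies its two registers and accumulates.
        for i in range(n):
            for j in range(n):
                acc[i][j] += x[i][j] * y[i][j]
        if step < n - 1:
            # S300: every row shifts left by one, every column shifts up by one.
            x = [row[1:] + row[:1] for row in x]
            y = y[1:] + y[:1]
            shifts += 1
    return acc, shifts


n = 32
random.seed(0)
A = [[random.randint(-3, 3) for _ in range(n)] for _ in range(n)]
B = [[random.randint(-3, 3) for _ in range(n)] for _ in range(n)]
result, shifts = cyclic_shift_matmul(A, B)
reference = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print(result == reference, shifts)  # True 31
```

Each operator performs 32 multiplications separated by 31 shifts, which agrees with the counts stated in the paragraph above.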
[0066] In the above-described embodiment, the direction in which the first operand register (X0) and the second operand register (Y0) provide the first operand and the second operand to input to their respective scalar arithmetic units, and the direction in which the provided operands are cyclically shifted, are merely examples, and a person skilled in the art can easily modify and implement them from the description of the above-described embodiment.
[0067] The result of the matrix multiplication operation of operands A and B is provided to the accumulation register ACC included in the operation core, and the storage unit (220) accesses the accumulation register ACC to obtain the operation result and can write it to the data cache (310, see FIG. 2).
[0068] As described above, after each scalar operator (112) completes the operation, the provided first operand and second operand are shifted in the row and column directions of the array, respectively, and output. Since the first operand and second operand are provided to adjacent scalar operators (112) in the row or column direction, there is no need to form lines connecting the load unit (210) or the registers providing the operands to all of the scalar operators included in the array, thereby resolving routing difficulties and providing advantages in terms of area.
[0069] Parameters used in modern large language models have a size of several terabytes. Connecting wiring to input several terabytes of operands to each scalar operator is a challenging task. However, according to the present embodiment, the difficulty of wiring can be reduced by providing operands to only one row and one column of the scalar operator array.
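A rough count illustrates the wiring argument. The sketch below compares the number of operator inputs driven directly by the operand registers under two arrangements. The "direct" baseline, in which every operator has its own feed for each operand, is an assumption introduced here for comparison; bit widths and the short neighbour-to-neighbour links are ignored.

```python
def operand_entry_points(n):
    """Count register-to-operator operand connections for an n x n array."""
    # Assumed baseline: every operator gets its own first- and second-operand feed.
    direct = 2 * n * n
    # Scheme in the text: the X register feeds one column, the Y register one row.
    edge_fed = n + n
    return direct, edge_fed


direct, edge_fed = operand_entry_points(32)
print(direct, edge_fed)  # 2048 64
```

Under these assumptions the edge-fed arrangement grows linearly with the array dimension, while the assumed direct baseline grows quadratically.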
[0070] Referring again to FIG. 4, in one embodiment, when one operation core (100a) included in the operation unit (10) performs a matrix operation, the element values of the operands are read from the X0 register, which stores the first operand, and the Y0 register, which stores the second operand, both included in the operation core (100a), and the scalar operation units perform the operation while sequentially shifting the operands.
[0071] In this process, the load unit (210) can read from the data cache (310) the operands for a subsequent operation to be performed by the operation core (100b), and store them in the operand registers X1 and / or Y1 included in the operation core (100b). By operating in a pipelined manner in this way, the time required to read operands and write them to the registers can be hidden behind computation, thereby improving operation efficiency. This improves operation efficiency even when the width of data that the load unit (210) can read at once is limited while the number of operand bits required for the operation is large.
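The benefit of overlapping operand loading with computation can be estimated with a simple timing model. The sketch below assumes that every operation takes the same number of cycles to load and to compute, and that the load for the next operation core may proceed while the current core computes; the cycle counts are invented for illustration and the function name `total_cycles` is hypothetical.

```python
def total_cycles(load, compute, jobs, overlapped):
    """Estimate the cycles needed to run `jobs` matrix operations.

    load:       cycles to fill the operand registers for one operation
    compute:    cycles for the scalar operator array to finish one operation
    overlapped: True if the next load runs while the current operation computes
    """
    if jobs == 0:
        return 0
    if not overlapped:
        # Each operation waits for its own load before computing.
        return jobs * (load + compute)
    # Only the first load is exposed; afterwards each step costs the longer
    # of the two stages, and the last computation drains the pipeline.
    return load + (jobs - 1) * max(load, compute) + compute


print(total_cycles(4, 10, 3, overlapped=False))  # prints 42
print(total_cycles(4, 10, 3, overlapped=True))   # prints 34
```

In this model the saving grows with the number of operations, because every load after the first is hidden whenever it is no longer than the computation.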
[0072] FIG. 9 is a diagram illustrating an embodiment of an operation core (100) included in an operation unit (10). Referring to FIG. 9, the operation cores (100a, 100b, 100c, 100d) of the present embodiment may further include an X multiplexer (MUX) that selects one of the first operands provided by the first operand registers (X0, X1, X2, X3) included in each of the plurality of operation cores and outputs it to the scalar operation units.
[0073] Additionally, the operation cores (100a, 100b, 100c, 100d) may further include a Y multiplexer that outputs, to the array of scalar operators, any one of the second operands output by the second operand registers (Y0, Y1, Y2, Y3) and the matrix operation results output by the accumulation registers (ACC 0, ACC 1, ACC 2, ACC 3) included in the plurality of operation cores.
[0074] Accordingly, the operation core (100a) can perform matrix operations using the first operand register X1, X2, or X3, none of which is included in the operation core (100a). Furthermore, the operation core (100a) can perform the matrix operation C×(A×B) immediately after performing the matrix operation A×B.
[0075] That is, without this configuration, the result of the matrix operation A×B performed by the operation core (100a) would be stored in register ACC 0, the storage unit (220) would store it in the data cache (310), and the load unit (210) would then read the value stored in the data cache (310) again. In the present embodiment, the result held in ACC 0 can be supplied directly as the second operand through the Y multiplexer, so the matrix operation C×(A×B) can be performed immediately and quickly.
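The operand selection performed by the X and Y multiplexers can be modelled in a few lines. The sketch below is a behavioural illustration only: the class `OperatorSketch`, its `run` method, and the `("Y", i)` / `("ACC", i)` selector encoding are hypothetical names, and the model ignores timing, including whether ACC 0 can be read and overwritten in the same operation in hardware.

```python
def matmul(a, b):
    """Plain matrix product, standing in for the scalar operator array."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


class OperatorSketch:
    """Behavioural model of several operation cores sharing X/Y multiplexers."""

    def __init__(self, cores=4):
        self.x = [None] * cores    # first operand registers X0..X3
        self.y = [None] * cores    # second operand registers Y0..Y3
        self.acc = [None] * cores  # accumulation registers ACC 0..ACC 3

    def run(self, core, x_sel, y_sel):
        """Run one matrix operation on `core`.

        x_sel: index of the X register chosen by the X multiplexer
        y_sel: ("Y", i) to choose register Yi, or ("ACC", i) to choose ACC i
        """
        first = self.x[x_sel]
        kind, index = y_sel
        second = self.y[index] if kind == "Y" else self.acc[index]
        self.acc[core] = matmul(first, second)
        return self.acc[core]


op = OperatorSketch()
op.x[0] = [[1, 2], [3, 4]]   # A
op.y[0] = [[5, 6], [7, 8]]   # B
op.x[1] = [[1, 0], [0, 2]]   # C, held in another core's X register
op.run(0, x_sel=0, y_sel=("Y", 0))           # ACC 0 = A x B
print(op.run(0, x_sel=1, y_sel=("ACC", 0)))  # C x (A x B)
# prints [[19, 22], [86, 100]]
```

The second call never touches a data cache: the intermediate product goes straight from the accumulation register back into the array.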
[0077] To aid in understanding the present invention, the embodiments illustrated in the drawings have been described with reference to the examples shown; however, these are merely illustrative examples for implementation, and those skilled in the art will understand that various modifications and equivalent alternative embodiments are possible therefrom. Accordingly, the true technical scope of protection of the present invention should be determined by the appended claims.
Claims
1. An operator that performs matrix operations, the operator comprising a plurality of computational cores, wherein each of the computational cores includes: an array of scalar operators that perform the matrix operation; an X register that stores and provides a first operand of the matrix operation, and a Y register that stores and provides a second operand of the matrix operation; and a result register that stores a result of the matrix operation, wherein the first operand and the second operand are loaded into a plurality of scalar operators of the array to perform the matrix operation, and wherein the array of scalar operators performs a cyclic shift of the loaded first operand and the loaded second operand in one direction of the array and in the other direction of the array, respectively.
2. The operator of claim 1, wherein each of the scalar operators comprises: an operation unit that performs one or more of addition, subtraction, multiplication, and multiply-and-accumulate (MAC) operations; and a register storing the first operand and a register storing the second operand.
3. The operator of claim 1, wherein the X register stores a multiplicand matrix that is a subject of the matrix operation and outputs the multiplicand matrix row by row to the scalar operators included in one column of the array of scalar operators, and the Y register stores a multiplier matrix that is a subject of the matrix operation and outputs columns of the multiplier matrix to the scalar operators included in one row of the array of scalar operators.
4. The operator of claim 3, wherein the scalar operators cyclically shift the output first operand along one direction of the array so that it is loaded into each scalar operator, and cyclically shift the output second operand along the other direction of the array so that it is loaded into each scalar operator.
5. The operator of claim 4, wherein, in the array, the numbers of cyclic shifts performed by the scalar operators included in adjacent rows differ by one, and the numbers of cyclic shifts performed by the scalar operators included in adjacent columns differ by one.
6. The operator of claim 1, wherein the scalar operators perform the matrix operation with the loaded first operand and second operand, and then perform the cyclic shift in the one direction of the array and in the other direction of the array.
7. The operator of claim 1, wherein the X register provides the first operand to only one column of the array, and the Y register provides the second operand to only one row of the array.
8. The operator of claim 1, wherein, while the plurality of scalar operators perform the matrix operation with the shifted first operand and second operand, the operator fetches an operand to be computed on another computational core among the plurality of computational cores.
9. The operator of claim 1, wherein each of the computational cores further comprises: an X multiplexer (MUX) that selects one of the first operands provided by the X registers included in the plurality of computational cores and outputs it to the X register of the operator; and a Y multiplexer that outputs, to the Y register of the operator, any one of the second operands provided by the Y registers included in the plurality of computational cores and the matrix operation results provided by the accumulation registers included in the plurality of computational cores.
10. The operator of claim 1, wherein the value stored in the result register is provided to one or more of the plurality of computational cores as either the first operand or the second operand.
11. An artificial intelligence computation accelerator, comprising: a tensor operator comprising k computation cores; a wide bandwidth memory; an internal memory unit comprising a cache memory for storing data of the wide bandwidth memory that is to be computed and an instruction cache for storing instructions for the tensor operator; and a control unit that fetches data and instructions from the internal memory unit and provides them to the tensor operator, wherein the accelerator communicates data with the wide bandwidth memory via a designated pseudo channel.
12. The accelerator of claim 11, further comprising a bus structure that communicates with the wide bandwidth memory and with a plurality of the accelerators.
13. The accelerator of claim 11, wherein the accelerator operates globally asynchronously and locally synchronously, and is capable of local power gating and clock gating.
14. The accelerator of claim 11, wherein a plurality of the accelerators are included on a single neural network computation unit (NPU) die, and k of the neural network computation units and j of the wide bandwidth memory packages are joined to an interposer to form a computing device (k, j: natural numbers).
15. A matrix operation method performed on a scalar operator array comprising an X register storing a first operand of a matrix operation, a Y register storing a second operand, and a plurality of scalar operators, the matrix operation method comprising: an operand providing step in which the X register provides the first operand to the array and the Y register provides the second operand to the array; an operation step in which the scalar operators included in the array perform the matrix operation with the provided first operand and second operand; and a step in which each of the scalar operators included in the array cyclically shifts the provided first operand in one direction of the array and cyclically shifts the provided second operand in the other direction of the array.
16. The matrix operation method of claim 15, wherein the first operand is a row of a multiplicand matrix and the second operand is a column of a multiplier matrix, and wherein, in the operand providing step, the X register provides elements of the first operand row along any one column of the scalar operator array, and the Y register outputs elements of the second operand column along one row of the scalar operator array.
17. The matrix operation method of claim 16, further comprising a loading step, performed after the operand providing step, in which the output first operand is cyclically shifted along the row of the array and loaded into each scalar operator, and the output second operand is cyclically shifted along the column of the array and loaded into each scalar operator.
18. The matrix operation method of claim 15, wherein, in the operand providing step, the X register provides the first operand to only one column of the array, and the Y register provides the second operand to only one row of the array.
19. The matrix operation method of claim 15, further comprising, after the matrix operation is completed, a step of providing the result of the matrix operation as an operand to one or more of a plurality of operation cores.
20. The matrix operation method of claim 15, wherein, when the matrix operation method is performed on an operator comprising a plurality of X registers, a plurality of Y registers, and a plurality of scalar operator arrays, and while the matrix operation method is being performed on the operation core of any one of the scalar operator arrays, at least one of a step of storing a first operand in the X register of another scalar operator array and a step of storing a second operand in the Y register of another scalar operator array is performed.