Data operation apparatus and method, processing core and electronic device
By dividing the computing unit array into multiple sub-computing arrays and combining them flexibly, the problem of insufficient flexibility and computing power of existing chips when processing algorithms and large amounts of data in different fields is solved, thereby improving data processing efficiency and energy efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- STREAM COMPUTING INC
- Filing Date
- 2020-12-31
- Publication Date
- 2026-06-26
AI Technical Summary
Existing data processing chips lack flexibility and computing power when handling algorithms and large amounts of data in different fields, resulting in the inability to fully utilize computing unit arrays and increasing computing time and power consumption.
By dividing the computing unit array into multiple sub-computing arrays and flexibly combining them according to the grouping parameters in the instructions, the computing power of the data processing device can be improved by utilizing the computing units.
It enables flexible combination of computing units, improves the computing power of data processing devices, and reduces computing time and power consumption.
Figure CN114691087B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of processing core technology, and in particular to a data processing device and method, a processing core, and an electronic device.
Background Technology
[0002] With the development of science and technology, human society is rapidly entering the intelligent era. A key characteristic of the intelligent era is that people are acquiring more and more types of data, the volume of data is increasing, and the demand for faster data processing is also rising.
[0003] Chips are the cornerstone of data processing, fundamentally determining our ability to process data. From an application perspective, chips mainly follow two paths: one is the general-purpose chip path, such as CPUs, which offer great flexibility but have relatively low effective computing power when processing algorithms in specific domains; the other is the dedicated chip path, such as TPUs, which can achieve high effective computing power in certain specific domains, but their processing capabilities are poor or even incapable of handling more general and flexible domains.
[0004] Because the data in the intelligent era is diverse and massive in quantity, chips are required to be highly flexible, capable of handling algorithms from different fields and constantly evolving, and also possess strong processing power, capable of rapidly processing extremely large and rapidly increasing amounts of data.
Summary of the Invention
[0005] (I) Purpose of the Invention
[0006] The purpose of this invention is to provide a data processing device and method, a processing core and an electronic device. The data processing device divides the computing unit array into multiple sub-computing arrays according to the grouping parameters in the instructions, which can realize the flexible combination of computing units in the computing unit array and effectively utilize the computing units to improve the computing power of the data processing device.
[0007] (II) Technical Solution
[0008] To address the aforementioned problems, a first aspect of the present invention provides a data processing apparatus, comprising: a data reading module for receiving instructions and reading a first matrix and a second matrix based on the instructions, the instructions including grouping parameters of a computing unit array, the grouping parameters being parameters for dividing the computing unit array into sub-computing arrays, the grouping parameters being related to rows of the first matrix or columns of the second matrix; the sub-computing arrays reading data from the first matrix and the second matrix and performing operations on the first matrix and the second matrix.
[0009] The data processing device provided in the above embodiments of the present invention can divide the computing unit array into multiple sub-computing arrays according to the grouping parameters in the instructions, which can realize the flexible combination of computing units in the computing unit array and effectively utilize computing units to improve the computing power of the data processing device.
[0010] Optionally, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, wherein the number of rows of the sub-computing arrays is the same as the number of rows of the first matrix; or the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, wherein the number of columns of the sub-computing arrays is the same as the number of columns of the second matrix.
[0011] Optionally, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix; the sub-computing array reads the first matrix column by column, so that each column computing unit of the sub-computing array reads a corresponding column of data from the first matrix; the sub-computing array divides the second matrix into multiple second sub-matrices based on the column dimension of the sub-computing array, and the sub-computing array reads the corresponding second sub-matrices row by row, so that each row computing unit of the sub-computing array reads a corresponding row of data from the second sub-matrices.
[0012] Optionally, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix; the sub-computing array divides the first matrix into multiple first sub-matrices in units of the row dimension of the sub-computing array, and the sub-computing array reads the corresponding first sub-matrices column by column, so that each column computing unit of the sub-computing array reads a column of data of the corresponding first sub-matrices; the sub-computing array reads the second matrix row by row, so that each row computing unit of the sub-computing array reads a row of data of the second matrix.
[0013] Optionally, each computing unit in the sub-computation array is used to accumulate the results of each operation to obtain an output matrix.
[0014] Optionally, the data reading module includes multiple storage areas; when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, the data reading module is used to store the elements of the first matrix into one storage area and group the elements of the second matrix into multiple storage areas based on the grouping parameter; or, when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, the data reading module is used to store the elements of the second matrix into one storage area and group the elements of the first matrix into multiple storage areas based on the grouping parameter.
[0015] Optionally, the data reading module is further configured to, based on the grouping parameters, turn on the corresponding switch from the storage area to the sub-computing unit array, so that each sub-computing unit array can extract the elements of the group corresponding to its current operation.
[0016] Optionally, the data reading module includes: a first data reading module, comprising: a first control unit, configured to receive instructions, extract the first matrix according to the instructions, and generate a first control signal based on the instructions; a first switch array, configured to, based on the first control signal, connect the switches of the storage area corresponding to the first matrix to the sub-computing unit array, so that each sub-computing unit array can extract the elements of its current group corresponding to its current operation; and a second data reading module, comprising: a second control unit, configured to receive instructions, extract the second matrix according to the instructions, and generate a second control signal based on the instructions; and a second switch array, configured to, based on the second control signal, connect the switches of the storage area corresponding to the second matrix to the sub-computing unit array, so that each sub-computing unit array can extract the elements of its current group corresponding to its current operation.
[0017] Optionally, the instructions further include: the storage starting address of the first matrix and the storage starting address of the second matrix; the first control unit includes: a first storage unit; a first address generation unit, which generates a data retrieval address of the first matrix based on the starting address of the first matrix, retrieves the first matrix based on the data retrieval address of the first matrix, and stores the first matrix in the first storage unit according to the grouping parameters; the second control unit includes: a second storage unit; a second address generation unit, which generates a data retrieval address of the second matrix based on the starting address of the second matrix, retrieves the second matrix based on the data retrieval address of the second matrix, and stores the second matrix in the second storage unit according to the grouping parameters.
[0018] According to a second aspect of the invention, a processing core is provided, comprising one or more data processing devices as described in the first aspect.
[0019] According to a third aspect of the present invention, an electronic device is provided, comprising the processing core of the second aspect.
[0020] According to a fourth aspect of the present invention, a chip is provided, comprising one or more processing cores provided in the second aspect.
[0021] According to a fifth aspect of the invention, a board card is provided, comprising one or more chips provided in the fourth aspect.
[0022] According to a sixth aspect of the present invention, an electronic device is provided, comprising one or more board cards provided in the fifth aspect.
[0023] According to a seventh aspect of the present invention, a data processing method is provided, comprising: receiving an instruction; reading a first matrix and a second matrix based on the instruction, the instruction including grouping parameters of a computing unit array, the grouping parameters being parameters for dividing the computing unit array into sub-computing arrays, the grouping parameters being related to rows of the first matrix or columns of the second matrix; the sub-computing arrays reading data from the first matrix and the second matrix and performing operations on the first matrix and the second matrix.
[0024] According to an eighth aspect of the present invention, a computer storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, it implements the data processing method of the seventh aspect.
[0025] According to a ninth aspect of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the data processing method of the seventh aspect.
[0026] According to a tenth aspect of the present invention, a computer program product is provided, comprising computer instructions, wherein when the computer instructions are executed by a computing device, the computing device can execute the data processing method of the seventh aspect.
[0027] (III) Beneficial Effects
[0028] The above-described technical solution of the present invention has the following beneficial technical effects:
[0029] The data processing device provided in this embodiment of the invention can divide the computing unit array into multiple sub-computing arrays according to the grouping parameters in the instruction, which can realize the flexible combination of computing units in the computing unit array and effectively utilize computing units to improve the computing power of the data processing device.
Description of the Drawings
[0030] Figure 1(a) is a schematic diagram of a matrix operation;
[0031] Figure 1(b) is a schematic diagram of the structure of a data processing device;
[0032] Figure 2(a) is a schematic diagram of the data processing device shown in Figure 1(b) performing matrix operations;
[0033] Figure 2(b) is a schematic diagram of the first step of matrix operation performed by the data processing device shown in Figure 1(b);
[0034] Figure 2(c) is a schematic diagram of the second step of matrix operation performed by the data processing device shown in Figure 1(b);
[0035] Figure 3 is a schematic diagram of the data processing device structure provided in an embodiment of the present invention;
[0036] Figure 4(a) is a schematic diagram of the structure of the first data reading module in the data processing device provided in the embodiment of the present invention;
[0037] Figure 4(b) is a schematic diagram of the structure of the second data reading module in the data processing device provided in the embodiment of the present invention;
[0038] Figure 5(a) is a schematic diagram of matrix operations provided by an embodiment of the present invention;
[0039] Figure 5(b) is a schematic diagram of the data processing device provided in the embodiment of the present invention performing matrix operations;
[0040] Figure 5(c) is a schematic diagram of the data processing device provided in the embodiment of the present invention performing matrix operations;
[0041] Figure 5(d) is a schematic diagram of the first step of matrix operation performed by the data processing device provided in the embodiment of the present invention;
[0042] Figure 5(e) is a schematic diagram of the second step of matrix operation performed by the data processing device provided in the embodiment of the present invention;
[0043] Figure 5(f) is a schematic diagram of the third step of matrix operation performed by the data processing device provided in the embodiment of the present invention;
[0044] Figure 5(g) is a schematic diagram of the fourth step of matrix operation performed by the data processing device provided in the embodiment of the present invention;
[0045] Figure 6 is a flowchart of the data processing method provided in the embodiments of the present invention.
Detailed Implementation
[0046] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely exemplary and not intended to limit the scope of the invention. Furthermore, descriptions of well-known structures and techniques are omitted in the following description to avoid unnecessarily obscuring the concept of the invention.
[0047] Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0048] In the description of this invention, it should be noted that the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
[0049] Furthermore, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.
[0050] In neural network operations, matrix operations (including convolution operations, since convolution operations can be converted into matrix operations) account for the vast majority of the total computation. To improve throughput, reduce latency, and enhance the effective computing power of chips in neural network tasks, the key lies in improving the speed of matrix operations.
[0051] To improve the speed of matrix operations, arrays of computing units are generally used to perform matrix operations, thereby achieving high data reuse and improving computational efficiency.
[0052] Figure 1(a) is a schematic diagram of a matrix operation.
[0053] As shown in Figure 1(a), the first matrix M1 is an M-row, K-column matrix, and the second matrix M2 is a K-row, N-column matrix. Multiplying M1 and M2 will output an M-row, N-column output matrix M.
[0054] In the output matrix, the value of C_{i,n}, the element in the i-th row and n-th column of M, is the sum of the products of the corresponding elements in the i-th row of M1 and the n-th column of M2.
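The definition above can be written out as a short sketch. This is only an illustration of the standard matrix product, not the patent's circuit; the function name `matmul` and the list-of-rows representation are assumptions made for the example.

```python
def matmul(m1, m2):
    """Multiply an M x K matrix by a K x N matrix, both given as lists of rows."""
    rows, inner, cols = len(m1), len(m2), len(m2[0])
    assert all(len(row) == inner for row in m1), "inner dimensions must match"
    # Element (i, n) is the sum over k of m1[i][k] * m2[k][n],
    # i.e. row i of M1 combined with column n of M2.
    return [
        [sum(m1[i][k] * m2[k][n] for k in range(inner)) for n in range(cols)]
        for i in range(rows)
    ]


m1 = [[1, 2], [3, 4]]
m2 = [[5, 6], [7, 8]]
print(matmul(m1, m2))  # [[19, 22], [43, 50]]
```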
[0055] Figure 1(b) is a schematic diagram of a data processing device.
[0056] As shown in Figure 1(b), the device includes an M-row, N-column computing unit array PU, which comprises computing units PU_{1,1} to PU_{M,N}.
[0057] M1, M2, and M are the register data buffers for the two input matrices and the output matrix, respectively. This array of computational units allows for full utilization of the data. For example, an element in M1 can be reused by N computational units in the same row, while an element in M2 can be reused by M computational units in the same column. That is, each computational unit can perform calculations on one column of elements in M1 and one row of elements in M2 at a time. For example, the computational unit in the first row and first column performs multiplication and summation on the corresponding elements of the first row of M1 and the first column of M2.
[0058] Figure 2(a) is a schematic diagram of the data processing device shown in Figure 1(b) performing matrix operations.
[0059] In the example shown in Figure 2(a), M1 is a 2x4 matrix, M2 is a 4x8 matrix, and the computing unit array is a 4x4 array. Multiplying M1 and M2 will yield a 2x8 output matrix M.
[0060] Figure 2(b) is a schematic diagram of the first step of matrix operation performed by the data processing device shown in Figure 1(b).
[0061] As shown in Figure 2(b), since there are only 4 computing units in a row in the computing unit array, and M2 is 4x8 with 8 columns, the entire computing process can only be completed in two steps.
[0062] In the first step, M1 and the first four columns of M2 are used to obtain the first half of the output matrix M (the first four columns of its two rows). Because M1 has only two rows, the computing units in the last two rows of the array are idle during the first step. For example, computing unit PU_{0,0} calculates the sum of the element-wise products of the four elements in row 0 of M1 and the four elements in column 0 of M2 to obtain the element in row 0, column 0 of the output matrix M; computing unit PU_{0,1} calculates the sum of the products of the four elements in row 0 of M1 and the four elements in column 1 of M2 to obtain the element in row 0, column 1 of the output matrix M.
[0063] Figure 2(c) is a schematic diagram of the second step of matrix operation performed by the data processing device shown in Figure 1(b).
[0064] In the second step, M1 and the last four columns of M2 are used to obtain the second half of the output matrix M (the last four columns of its two rows). In this step, the calculation is again performed only by the first two rows of the computing unit array; the last two rows remain idle. For example, computing unit PU_{0,0} calculates the sum of the element-wise products of the four elements in row 0 of M1 and the four elements in column 4 of M2 to obtain the element in row 0, column 4 of the output matrix M; computing unit PU_{0,1} calculates the sum of the products of the four elements in row 0 of M1 and the four elements in column 5 of M2 to obtain the element in row 0, column 5 of the output matrix M.
[0065] The aforementioned data processing device has the following drawbacks:
[0066] (1) Once the circuit of the above data processing device is designed, the size of the computing unit array is fixed. Therefore, for matrix operations of certain sizes, for example when the number of rows of the computing unit array is twice or more the number of rows of the first matrix, or the number of columns of the computing unit array is twice or more the number of columns of the second matrix, the effective computing power of the computing unit array cannot be fully utilized, which increases the time spent on matrix calculation.
[0067] (2) Some of the data needs to be retrieved multiple times, which increases power consumption.
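The cost of the fixed array in the Figure 2 example can be estimated with a small sketch. It assumes that each pass produces one tile of output elements and that utilization is the fraction of computing units producing an output element in a pass; the function name `baseline_schedule` and this utilization metric are illustrative assumptions, not terms from the patent.

```python
from math import ceil


def baseline_schedule(m1_rows, m2_cols, array_rows, array_cols):
    """Count passes and average utilization for an ungrouped computing unit array."""
    row_tiles = ceil(m1_rows / array_rows)   # tiles needed along the rows of M1
    col_tiles = ceil(m2_cols / array_cols)   # tiles needed along the columns of M2
    passes = row_tiles * col_tiles
    useful_outputs = m1_rows * m2_cols       # output elements actually required
    utilization = useful_outputs / (passes * array_rows * array_cols)
    return passes, utilization


# Figure 2 example: M1 is 2x4, M2 is 4x8, the array is 4x4.
print(baseline_schedule(2, 8, 4, 4))  # (2, 0.5)
```

Under these assumptions the 4x4 array needs two passes and only half of its computing units do useful work in each pass, which matches the idle last two rows described above.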
[0068] Figure 3 is a schematic diagram of the data processing device provided in an embodiment of the present invention.
[0069] As shown in Figure 3, the data processing device EU includes: a data reading module, used to receive instructions and read a first matrix and a second matrix based on the instructions, the instructions including grouping parameters of the computing unit array, the grouping parameters being parameters used to divide the computing unit array PUA into sub-computing arrays, the grouping parameters being related to the rows of the first matrix or the columns of the second matrix;
[0070] The sub-computing array reads the data from the first matrix and the second matrix and performs operations on the first matrix and the second matrix.
[0071] The data processing device provided in this embodiment of the invention can divide the computing unit array into multiple sub-computing arrays according to the grouping parameters in the instruction, which can realize the flexible combination of computing units in the computing unit array and effectively utilize computing units to improve the computing power of the data processing device.
[0072] In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays by referring to the rows of the first matrix or the columns of the second matrix.
[0073] In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, wherein the number of rows in the sub-computing arrays is the same as the number of rows in the first matrix. For example, assuming the computing unit array is divided according to the number of rows in the first matrix, the information covered by the grouping parameter includes dividing the computing unit array according to the rows of the first matrix, and dividing it into N groups. The specific number of groups can be the quotient of the number of rows in the computing unit array and the number of rows in the first matrix, and the quotient is a positive integer.
[0074] In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, wherein the number of columns in the sub-computing arrays is the same as the number of columns in the second matrix. For example, assuming the computing unit array is divided according to the number of columns in the second matrix, the grouping parameter covers the information that the computing unit array is divided into M groups according to the number of columns in the second matrix. The specific number of groups can be the quotient of the number of columns in the computing unit array and the number of columns in the second matrix, and the quotient is a positive integer.
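The group count described in the two paragraphs above can be sketched as follows. The helper name `group_count` and the choice to raise an error when the quotient is not a positive integer are assumptions for illustration; the patent only states that the quotient is a positive integer.

```python
def group_count(array_extent, matrix_extent):
    """Number of sub-computing arrays along the reference dimension.

    array_extent:  rows (or columns) of the computing unit array
    matrix_extent: rows of the first matrix (or columns of the second matrix)
    """
    if array_extent <= 0 or matrix_extent <= 0:
        raise ValueError("extents must be positive")
    if array_extent % matrix_extent != 0:
        raise ValueError("array extent must be an integer multiple of the matrix extent")
    return array_extent // matrix_extent


# A 4-row array and a 2-row first matrix give two sub-computing arrays of 2 rows each.
print(group_count(4, 2))  # 2
```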
[0075] In some embodiments, the computing unit array is divided into multiple sub-computing arrays according to the rows of the first matrix. In that case the first matrix is called the reference matrix and its rows are called the reference dimension, while the second matrix is called the unreferenced matrix and its columns are called the unreferenced dimension. Conversely, if the computing unit array is divided into multiple sub-computing arrays according to the columns of the second matrix, the second matrix is called the reference matrix and its columns are called the reference dimension, while the first matrix is called the unreferenced matrix and its rows are called the unreferenced dimension.
[0076] In this embodiment, the sub-computing array reads the data of the reference matrix sequentially along the reference dimension (the rows of the first matrix or the columns of the second matrix). The unreferenced matrix is divided into multiple sub-matrices in units of the unreferenced dimension of the sub-computing array, and each sub-computing array reads the elements of its corresponding sub-matrix sequentially.
[0077] Specifically, the grouping parameter is the parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix; the sub-computing array reads the data of the first matrix column by column, so that each column computing unit of the sub-computing array reads a corresponding column of data of the first matrix; the sub-computing array divides the second matrix into multiple second sub-matrices in units of the column dimension of the sub-computing array, and the sub-computing array reads the corresponding second sub-matrices row by row, so that each row computing unit of the sub-computing array reads a corresponding row of data of the second sub-matrices.
[0078] It is understood that in this embodiment, the number of rows in each sub-computation array is the same as the number of rows in the first matrix, and the number of columns is the same as the number of columns in the undivided computation unit array.
[0079] In this embodiment, when performing the multiplication operation between M1 and M2, each sub-computing array reads elements of M1 column by column, ensuring that in one calculation, each column of the sub-computing unit array reads elements corresponding to a column of M1. That is, each row of the sub-computing unit array reads elements from the corresponding row of M1. For example, if the first column of M1 is being read, then each column of the sub-computing array reads the first column of M1, ensuring that the first row of each sub-computing array reads the first row of elements in that column of M1, and the last row of each sub-computing array reads the last row of elements in that column of M1.
[0080] Furthermore, the second matrix is divided into multiple second sub-matrices based on the column dimension of the sub-computation array. Each sub-computation array reads the elements of its corresponding second sub-matrix in M2 row by row, so that each row of the sub-computation array reads the elements of that row of the corresponding second sub-matrix. That is, each column of the first sub-computation array reads the column element corresponding to that row of M2, and the number of columns of that row of M2 read by the first sub-computation array each time is the same as the number of columns of its own computing units. The other sub-computation arrays read their second sub-matrices sequentially, in the order of the second sub-matrices and in the same manner as the first sub-computation array.
[0081] For example, the computing unit array is divided into two sub-computing arrays according to the number of rows of M1. Each sub-computing array has the same number of columns as the computing unit array, which is 5 columns. M2 has 10 columns. Then M2 is divided into a first sub-matrix and a second sub-matrix. The first sub-computing array extracts 5 columns of data from the first sub-matrix row by row each time, and the second sub-computing array extracts 5 columns of data from the second sub-matrix row by row each time.
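The row-based grouping in this example can be simulated in software. The sketch below is a functional model only and makes several assumptions not stated in the source: a 4-row, 5-column array, a 2x3 first matrix, one column block of M2 per sub-computing array per pass, and the hypothetical function name `grouped_matmul`. Each sub-computing array reads one column of M1 and one row of its M2 block per inner step and accumulates the products.

```python
def grouped_matmul(m1, m2, array_rows, array_cols):
    """Multiply m1 (m x K) by m2 (K x N) on an array grouped by the rows of m1.

    Returns the output matrix and the number of passes over the array.
    """
    m, k_dim, n = len(m1), len(m2), len(m2[0])
    if array_rows % m != 0:
        raise ValueError("array rows must be an integer multiple of the rows of m1")
    groups = array_rows // m                      # number of sub-computing arrays
    starts = list(range(0, n, array_cols))        # first column of each M2 block
    out = [[0] * n for _ in range(m)]
    passes = 0
    for first in range(0, len(starts), groups):
        passes += 1
        # In one pass, each sub-computing array handles one column block of M2.
        for start in starts[first:first + groups]:
            width = min(array_cols, n - start)
            for k in range(k_dim):
                col_of_m1 = [m1[i][k] for i in range(m)]   # read M1 column by column
                row_of_block = m2[k][start:start + width]  # read the block row by row
                for i in range(m):
                    for j in range(width):
                        out[i][start + j] += col_of_m1[i] * row_of_block[j]
    return out, passes


m1 = [[1, 2, 3], [4, 5, 6]]                               # 2 rows
m2 = [[r * 10 + c for c in range(10)] for r in range(3)]  # 10 columns
reference = [[sum(m1[i][k] * m2[k][j] for k in range(3)) for j in range(10)]
             for i in range(2)]
out, passes = grouped_matmul(m1, m2, array_rows=4, array_cols=5)
print(out == reference, passes)  # True 1
```

With two sub-computing arrays, both 5-column blocks of M2 are processed in a single pass; calling the same function with `array_rows=2` (no grouping possible) takes two passes.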
[0082] In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix;
[0083] The sub-computing array divides the first matrix into multiple first sub-matrices using the row dimension of the sub-computing array as the unit. The sub-computing array reads the corresponding first sub-matrices column by column, so that each column computing unit of the sub-computing array reads one column of data from the corresponding first sub-matrices. The second matrix is read row by row, so that each row computing unit of the sub-computing array reads one row of data from the second matrix.
[0084] In some embodiments, the operation performed on M1 and M2 is a multiplication operation, and each computing unit in the sub-computation array is used to accumulate the results of each operation to obtain an output matrix.
[0085] In some embodiments, the data reading module includes multiple storage areas; when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, the data reading module is used to store the elements of the first matrix into one storage area and group the elements of the second matrix into multiple storage areas based on the grouping parameter; or, when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, the data reading module is used to store the elements of the second matrix into one storage area and group the elements of the first matrix into multiple storage areas based on the grouping parameter.
[0086] In some optional embodiments, when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, the data reading module divides the second matrix into multiple second sub-matrices according to the number of columns of the computing unit array, and stores the multiple second sub-matrices into different storage areas respectively. Optionally, the multiple second sub-matrices can be stored in multiple consecutive storage areas according to the column order of the second sub-matrices. When the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, the data reading module divides the first matrix into multiple first sub-matrices according to the number of rows of the computing unit array, and stores the multiple first sub-matrices into different storage areas respectively. Optionally, the multiple first sub-matrices can be stored in multiple consecutive storage areas according to the row order of the first sub-matrices.
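The partitioning into storage areas described above can be sketched in software. Here each storage area is modelled as one list entry, kept in column order for the second matrix and in row order for the first; the function names `split_columns` and `split_rows` are hypothetical, and the real device stores the blocks in hardware buffers rather than Python lists.

```python
def split_columns(matrix, block_width):
    """Split a matrix (list of rows) into column blocks, one per storage area."""
    n = len(matrix[0])
    return [
        [row[start:start + block_width] for row in matrix]
        for start in range(0, n, block_width)
    ]


def split_rows(matrix, block_height):
    """Split a matrix (list of rows) into row blocks, one per storage area."""
    return [matrix[start:start + block_height]
            for start in range(0, len(matrix), block_height)]


m2 = [[1, 2, 3, 4],
      [5, 6, 7, 8]]
# Row-based grouping: M2 is split by the number of array columns (2 here).
print(split_columns(m2, 2))  # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]

m1 = [[1], [2], [3], [4]]
# Column-based grouping: M1 is split by the number of array rows (2 here).
print(split_rows(m1, 2))  # [[[1], [2]], [[3], [4]]]
```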
[0087] In some embodiments, the data reading module is further configured to, based on the grouping parameters, turn on the corresponding switch from the storage area to the sub-computing unit array, so that each sub-computing unit array can extract the elements of the array corresponding to its current operation.
[0088] In some embodiments, the data reading module includes a first data reading module LD_M1 and a second data reading module LD_M2. The first data reading module is configured to read the data of M1 from the external storage module according to a received instruction and store the data of M1. The second data reading module is configured to read the data of M2 from the external storage module according to a received instruction and store the data of M2.
[0089] The following example illustrates the operation of the data processing device when the computing unit array is divided into multiple sub-computing arrays according to the number of rows of M1:
[0090] First, the instruction decoding unit ID outside the EU receives instruction I, decodes it, and sends the decoded instruction to the LD_M1, LD_M2, and PUA modules within the EU. Specifically, the decoded instruction I includes control signals and parameters. More specifically, the control signals are sent to the PUA and the parameters are sent to LD_M1 and LD_M2 respectively. The parameters include the starting address of M1's memory, the size of the memory area occupied by M1, the starting address of M2's memory, the size of the memory area occupied by M2, and the PUA's grouping parameters, etc.
[0091] Then, LD_M1 can retrieve the elements of matrix M1 from the storage module and store them in the first storage module DB based on parameters such as the starting address of matrix M1 and the size of the storage area occupied by matrix M1. The first storage module DB consists of multiple storage areas, the number of which is a positive integer X, and the number of rows of the computing unit array is a positive integer M, where M ≥ X ≥ 1. When X = 1, the computing unit array cannot be divided by columns; when X = M, the entire computing unit array can be combined in various ways, with the most extreme case being that the entire array of computing units is divided into M groups. The number of storage areas in the first storage module, X, is the maximum number of equal row groups into which the rows of the computing unit array can be divided. If X for a chip is fixed, then the maximum number of row groups in the computing unit array is also fixed. A row group is a group formed by dividing the computing unit array according to its rows, and each row group contains multiple rows of computing units. If X is fixed, then when combining computing unit arrays, the number of sub-computing arrays is M / X, and each sub-computing array includes X rows of computing units. If the number of storage areas in the first storage module is the same as the number of rows of computing units, then M / X = 1, indicating that the computing unit array has 1 sub-computing array. When the number of row groups is greater than 1, it is necessary to ensure that all row groups contain the same number of rows of computing units.
[0092] According to the parameters in the received instruction, LD_M1 will turn on the corresponding switches in the switch array SM1 in units of row groups, so that when each row computing unit of PUA reads data from the DB of LD_M1, it can read the data of the corresponding row, and each data can be shared by all computing units of the corresponding row in all row groups.
[0093] Furthermore, LD_M2 can retrieve elements from the external storage module M2 and store them in the second storage module DB based on parameters such as the starting address of matrix M2 and the size of the storage area occupied by matrix M2. The second storage module DB also consists of multiple storage areas, with the number of storage areas being X, and different storage areas can be accessed by different row groups.
[0094] Based on the parameters in the received instruction, LD_M2 activates the corresponding switches in the switch array SM2, enabling column computing units within the same row group of the PUA to read data from the DB of LD_M2, and ensuring that each data point is shared by all computing units in that row group. Columns in different row groups access data in the storage areas of different second storage modules.
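The switch selection described in the preceding paragraphs can be sketched as a lookup table from each physical row of the PUA to its row group and to the storage areas it reads. This is only an illustration: the function name `row_group_switches`, the dictionary keys, the zero-based indices, and the assumption that each row group consists of consecutive physical rows are not taken from the figures.

```python
def row_group_switches(array_rows, row_groups):
    """Build one switch-table entry per physical row of the PUA.

    Assumption of this sketch: a row group is a run of consecutive
    physical rows, and all groups contain the same number of rows.
    """
    if row_groups < 1 or array_rows % row_groups != 0:
        raise ValueError("row groups must contain the same number of rows")
    rows_per_group = array_rows // row_groups
    table = []
    for phys_row in range(array_rows):
        group = phys_row // rows_per_group
        table.append({
            "row": phys_row,
            "group": group,
            "local_row": phys_row % rows_per_group,
            # M1 data is shared: every row group reads the same area of LD_M1.
            "m1_area": 0,
            # M2 data is per group: each row group reads its own area of LD_M2.
            "m2_area": group,
        })
    return table


# 4 physical rows split into 2 row groups of 2 rows each.
switches = row_group_switches(array_rows=4, row_groups=2)
```

With these hypothetical parameters, physical rows 0 and 1 form the first row group and read area 0 of LD_M2, while physical rows 2 and 3 form the second row group and read area 1.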
[0095] PUA reads column data sequentially from the DB of LD_M1 and row data sequentially from the DB of LD_M2, and performs calculations.
[0096] Figure 4(a) is a schematic diagram of the structure of the first data reading module in the data processing device provided in the embodiment of the present invention.
[0097] As shown in Figure 4(a), the first data reading module LD_M1 includes: a first control unit Ctrl, used to receive instructions, extract the first matrix M1 according to the instructions, and generate a first control signal based on the instructions; and a first switch array SM, which, based on the first control signal, turns on the switch from the storage area corresponding to the first matrix to the sub-computing unit array, so that each sub-computing unit array can read the elements of its own group corresponding to its current operation.
[0098] In some embodiments, the instructions further include: the storage starting address of the first matrix; the first control unit Ctrl, including: a first storage unit DB, the first storage unit DB including multiple storage areas (two storage areas are shown in FIG4(a), namely DB1 and DB2); a first address generation unit AG, which generates a data retrieval address Addr1 of the first matrix based on the starting address of the first matrix, extracts the first matrix based on Addr1, and stores the first matrix into the DB according to the grouping parameters.
[0099] In this embodiment, the working process of LD_M1 is as follows:
[0100] When Ctrl receives the decoded instruction I_D, it sends the parameters given to LD_M1 in the instruction to the internal modules of LD_M1 respectively. For example, it sends the storage starting address of the matrix M1 to be imported for each calculation, the size of the storage area occupied, the data retrieval method and other parameters to AG1, and the grouping parameters of the calculation cell array to CL1.
[0101] The address generation module AG1 generates the data retrieval address Addr1 for M1, retrieves all or part of the data from M1, and temporarily stores it in the cache DB according to the storage address of the data generated by AG1 in the DB of LD_M1.
[0102] CL1 generates control signals for SM1 based on the grouping parameters of the computing unit array, opening up the channel from DB to each group of computing unit arrays. This allows the computing unit array PUA to directly retrieve the data from M1 during the current operation, i.e., the first data DO1G1 read by the first sub-computing array and the first data DO1G2 read by the second sub-computing array are the same; specifically, the same data is read by the same operation unit in the same row of each sub-computing unit array, which can reduce the number of times the same data in M1 is extracted.
[0103] Figure 4(b) is a schematic diagram of the structure of the second data reading module in the data processing device provided in the embodiment of the present invention.
[0104] As shown in Figure 4(b), the second data reading module LD_M2 includes: a second control unit CL2, used to receive instructions, extract a second matrix according to the instructions, and generate a second control signal based on the instructions; and a second switch array, which, based on the second control signal, turns on the switch from the storage area corresponding to the second matrix to the sub-computing unit array, so that each sub-computing unit array can extract the elements of its own group corresponding to its current operation.
[0105] In one embodiment, the instructions further include: the storage starting address of the second matrix; the second control unit CL2, including: a second storage unit DB; and a second address generation unit AG2, which generates a data retrieval address Addr2 for the second matrix based on the starting address of the second matrix, extracts the second matrix based on the data retrieval address of the second matrix, and stores the second matrix into the second storage unit according to the grouping parameters.
[0106] In this embodiment, the working process of LD_M2 is as follows:
[0107] When Ctrl2 receives the decoded instruction I_D, it sends the parameters given to LD_M2 in the instruction to the internal modules of LD_M2. For example, it sends the storage starting address of the matrix M2 to be imported for calculation, the size of the storage area it occupies, the data retrieval method and other parameters to AG2, and the grouping parameters of the calculation cell array to CL2.
[0108] The address generation module AG2 generates the data retrieval address Addr2 for M2, retrieves all or part of the data from M2, and temporarily stores it in the cache DB according to the storage address of the data generated by AG2 in the DB of LD_M2.
[0109] CL2 generates control signals for the switch array SM2 based on the grouping parameters of the computing unit array, opening the channel from DB of LD_M2 to each group of computing units. This allows the computing unit array PUA to directly retrieve the correct data during computation. Specifically, different columns of computing units in different row groups use different data as the second input. For example, the second data DO2G1 read by the first sub-computing array uses data from DB1 of LD_M2, and the second data DO2G2 read by the second sub-computing array uses data from DB2. Within each sub-computing array, computing units belonging to the same column read the same data; computing units in different sub-computing arrays that do not belong to the same column read data from different DBx, reducing the number of times the same data is retrieved from M2.
[0110] It is understood that the number of storage areas in the storage modules of LD_M1 and LD_M2 may be the same or different, and this embodiment is not limited to this.
[0111] In some embodiments, the grouping parameters may include two parameters, K and X, where K represents the K columns of the original input matrix M1 imported by each sub-computation unit each time, and the K rows of M2 imported sequentially each time; X represents how many groups the computation unit array is divided into by rows, for example, X=2, that is, during the computation process, the computation unit array is divided into 2 sub-computation arrays by rows.
[0112] In some embodiments, the grouping parameters may also include two parameters, K and Y, where K represents the K columns of the original input matrix M1 imported by each sub-computation unit each time, and the K rows of M2 imported sequentially each time; Y represents how many groups the computation unit array is divided into by columns, for example, Y=2, that is, during the calculation process, the computation unit array is divided into 2 sub-computation arrays by columns.
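As an illustration of how such grouping parameters might be represented, the following sketch bundles K with either X or Y and derives the shape of each sub-computing array. The class name `GroupingParams`, its field names, the `by_rows` flag, and the divisibility check are hypothetical and are not defined in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupingParams:
    k: int         # K: columns of M1 and rows of M2 imported each time
    groups: int    # X when splitting by rows, Y when splitting by columns
    by_rows: bool  # True -> use `groups` as X, False -> use it as Y

    def sub_array_shape(self, array_rows, array_cols):
        """Return (rows, cols) of each sub-computing array."""
        split_dim = array_rows if self.by_rows else array_cols
        if self.groups < 1 or split_dim % self.groups != 0:
            # Assumption of this sketch: groups must be equal in size.
            raise ValueError("groups must evenly divide the split dimension")
        if self.by_rows:
            return array_rows // self.groups, array_cols
        return array_rows, array_cols // self.groups


# Hypothetical parameter values for a 4x4 computing unit array.
row_mode = GroupingParams(k=4, groups=2, by_rows=True)   # X = 2
col_mode = GroupingParams(k=4, groups=2, by_rows=False)  # Y = 2
```

For a 4x4 array, `row_mode.sub_array_shape(4, 4)` gives two sub-computing arrays of 2 rows by 4 columns, and `col_mode.sub_array_shape(4, 4)` gives two of 4 rows by 2 columns.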
[0113] The data processing device provided by the above embodiments of the present invention will be discussed in detail below with reference to specific examples. This embodiment takes a 4x4 computing unit array PUA as an example to implement matrix multiplication of a 2x4 input matrix M1 and a 4x8 input matrix M2 to obtain a 2x8 output matrix.
[0114] Figure 5(a) is a schematic diagram of matrix operations provided by an embodiment of the present invention.
[0115] As shown in Figure 5(a), M1 is a 2*4 matrix and M2 is a 4*8 matrix. Multiplying the two matrices yields a 2*8 output matrix M.
[0116] Figure 5(b) is a schematic diagram of the data processing device provided in the embodiment of the present invention performing matrix operations, and Figure 5(c) is a schematic diagram of the data processing device provided in the embodiment of the present invention performing matrix operations.
[0117] As shown in Figure 5(b), the grouping parameters of the computing unit array are divided according to the number of rows of M1. The grouping parameters include the number of rows and columns of each sub-computing unit array. In this embodiment, the number of rows in the grouping parameters is 2 and the number of columns is 4. That is, the grouping parameters indicate that the computing unit array is divided into two sub-computing unit arrays by rows. Each sub-computing array is a 2*4 array. Both sub-computing arrays use all 2 rows and 4 columns of the original input matrix M1. The first sub-computing array reads the data of the 1st to 4th columns of the 4 rows of the input matrix M2, and the second sub-computing array reads the data of the 5th to 8th columns of the 4 rows of the input matrix M2.
[0118] LD_M1 reads M1 according to the instruction and stores M1 as a group of data in DB1 of LD_M1. LD_M2 reads M2 according to the instruction and divides M2 into two groups of data (two second sub-matrices) according to the column dimension of the computing unit array based on the grouping parameters. The two groups of data are then stored in DB1 and DB2 of LD_M2 respectively.
[0119] Specifically, each sub-computing array needs to read 4 columns of data from matrix M1 in DB1; at the same time, each sub-computing array needs to read 4 rows of data from DB1 and DB2 of LD_M2 respectively. Therefore, LD_M1 stores the 4 columns of M1 as a group in DB1, and LD_M2 stores the first 4 columns of M2 in DB1 and the last 4 columns in DB2.
[0120] The switch array SM1 of LD_M1 connects the inputs of both sub-computation arrays to DB1; the switch array SM2 of LD_M2 connects the input of the first sub-computation array Row Group1 to DB1 and the input of the second sub-computation array Row Group2 to DB2.
[0121] At this point, the original 4x4 computing unit array has been recombined into a 2x8 computing unit array, as shown in Figure 5(c).
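The recombination of the 4x4 array into a logical 2x8 array can be expressed as a coordinate mapping. The sketch below assumes zero-based indices and that physical rows 0 and 1 form Row Group 1 and physical rows 2 and 3 form Row Group 2; the function names `to_logical` and `to_physical` are hypothetical.

```python
def to_logical(phys_row, phys_col, array_rows=4, array_cols=4, row_groups=2):
    """Map a unit of the physical array to its place in the recombined array."""
    rows_per_group = array_rows // row_groups
    group = phys_row // rows_per_group
    # Units of later row groups are placed to the right of earlier ones.
    return phys_row % rows_per_group, group * array_cols + phys_col


def to_physical(log_row, log_col, array_rows=4, array_cols=4, row_groups=2):
    """Inverse mapping: recombined coordinates back to the physical array."""
    rows_per_group = array_rows // row_groups
    group, phys_col = divmod(log_col, array_cols)
    return group * rows_per_group + log_row, phys_col


# Physical unit (row 2, column 0) becomes logical unit (row 0, column 4).
example = to_logical(2, 0)
```

Under this mapping, logical row 0 of the 2x8 array covers physical rows 0 and 2, and logical columns 4 to 7 are the physical columns 0 to 3 of the second row group.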
[0122] Figure 5(d) is a schematic diagram of the first step of matrix operation performed by the data processing device provided in the embodiment of the present invention.
[0123] As shown in Figure 5(d), the first step of the matrix operation performed by this data processing device includes: LD_M1 selects the corresponding switch for the first operation according to the grouping parameters, so that the first input data paths of DB1 of LD_M1 are connected to Row Group 1 and Row Group 2 respectively. Both Row Group 1 and Row Group 2 of PUA read the data in the first column of DB1 from LD_M1 as the first input of the two sub-computing arrays. Specifically, the first data "1" in the first column is sent to all the 0th row computing units as the first input, and the second data "0" in the first column is sent to all the 1st row computing units as the first input. It can be understood that the 0th row here refers to the 0th row of the recombined 2x8 computing unit array, which is equivalent to the 0th and 2nd rows of the original 4x4 computing unit array; similarly, the 1st row here refers to the 1st row of the recombined 2x8 computing unit array, which is equivalent to the 1st and 3rd rows of the original 4x4 computing unit array.
[0124] LD_M2 selects the corresponding switch for the first operation based on the grouping parameters, connecting DB1 of LD_M2 with the second data path of RowGroup1 of PUA, and connecting DB2 with the second data path of Row Group2. Row Group1 reads the first row of data from DB1 in LD_M2's DB, as the second input of Row Group1; Row Group2 reads the first row of data from DB2 in LD_M2's DB, as the second input of Row Group2. The specific allocation is as follows: The first data "1" in the first row of DB1 is assigned to all the computational units in column 0 of Row Group 1 as the second input (here, column 0 refers to column 0 of the recombined 2x8 computational unit array, equivalent to the first half of column 0 of the original 4x4 computational unit array, containing only the first and second computational units. The same applies below). This method is followed for all other data in DB1. Similarly, the first data "1" in the first row of DB2 is assigned to all the computational units in column 0 of Row Group 2 as the second input (here, column 0 refers to column 4 of the recombined 2x8 computational unit array, equivalent to the second half of column 0 of the original 4x4 computational unit array, containing only the third and fourth computational units. The same applies below). This method is followed for all other data in DB2.
[0125] Each computation unit in Row Group1 and Row Group2 performs a multiplication operation on the first input data and the second input data to obtain the result of this computation unit. The result output by the array of all computation units in Row Group1 and Row Group2 is the intermediate result matrix M_temp of the first operation.
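The first step can be sketched as a single multiply per computing unit on the recombined array. The first column of M1 ("1" and "0") is taken from the text; for the first rows of DB1 and DB2 only the leading "1" is stated, so the remaining values below are placeholders, and the function name `step_products` is hypothetical.

```python
def step_products(first_inputs, second_inputs_per_group):
    """Compute one multiply step on the recombined array.

    first_inputs: one value per logical row, shared by every row group.
    second_inputs_per_group: one list per row group, one value per column.
    Returns the products laid out as rows of the recombined array.
    """
    return [
        [a * b for group in second_inputs_per_group for b in group]
        for a in first_inputs
    ]


# First column of M1 as stated in the text.
m1_col = [1, 0]
# Placeholder first rows of DB1 and DB2 in LD_M2 (only the leading 1 is given).
db1_row = [1, 2, 3, 4]
db2_row = [1, 2, 3, 4]

m_temp = step_products(m1_col, [db1_row, db2_row])
```

With these placeholder values, logical row 0 of `m_temp` repeats the second inputs unchanged and logical row 1 is all zeros, because its first input is 0.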
[0126] Figure 5(e) is a schematic diagram of the second step of matrix operation performed by the data processing device provided in the embodiment of the present invention.
[0127] As shown in Figure 5(e), in the second step of the calculation, both Row Group 1 and Row Group 2 of PUA read the data in the second column of DB1 from DB1 of LD_M1 as the first input of Row Group 1 and Row Group 2, respectively. Specifically, the first data "0" in the second column is sent to all calculation units in the 0th row as the first input, and the second data "2" is sent to all calculation units in the 1st row as the first input.
[0128] PUA's Row Group 1 reads the second row of data from DB1 in LD_M2's DB, as the second input of Row Group 1; Row Group 2 reads the second row of data from DB2 in LD_M2's DB, as the second input of Row Group 2.
[0129] Each computational unit in Row Group1 and Row Group2 performs a multiplication operation on the two input data of the current operation and accumulates the intermediate results of the previous operation to obtain the result of the current operation. The result output by the array of all computational units in Row Group1 and Row Group2 is the intermediate result matrix M_temp of the second operation.
[0130] Figure 5(f) is a schematic diagram of the third step of matrix operation performed by the data processing device provided in the embodiment of the present invention.
[0131] As shown in Figure 5(f), both Row Group 1 and Row Group 2 of PUA read the data in the 3rd column of DB1 from DB1 of LD_M1 as the first input of their respective row groups. Specifically, the first data "3" in the 3rd column is sent to all 0th row calculation units as the first input, and the second data "0" is sent to all 1st row calculation units as the first input. Row Group 1 of PUA reads the data in the 3rd row of DB1 from DB1 of LD_M2 as its second input; Row Group 2 reads the data in the 3rd row of DB2 from DB2 of LD_M2 as its second input.
[0132] Each computational unit in Row Group1 and Row Group2 performs a multiplication operation on the two input data of the current operation and accumulates the intermediate results of the previous operation to obtain the result of the current operation. The result output by the array of all computational units in Row Group1 and Row Group2 is the intermediate result matrix M_temp of the third operation.
[0133] Figure 5(g) is a schematic diagram of the fourth step of matrix operation performed by the data processing device provided in the embodiment of the present invention.
[0134] As shown in Figure 5(g), both Row Group 1 and Row Group 2 of PUA read the data in the 4th column of DB1 from the DB of LD_M1 as the first input for the two row groups. Specifically, the first data "0" in the 4th column is sent to all the calculation units in the 0th row as the first input, and the second data "4" is sent to all the calculation units in the 1st row as the first input.
[0135] PUA's Row Group 1 reads the 4th row of data from DB1 in LD_M2's DB, as the second input of Row Group 1; Row Group 2 reads the 4th row of data from DB2 in LD_M2's DB, as the second input of Row Group 2.
[0136] Each computational unit in Row Group1 and Row Group2 performs a multiplication operation on the two input data for this step, and accumulates the intermediate results of the previous step to obtain the result of this operation. The output of the array of computational units in Row Group1 and Row Group2 is the result matrix M of M1 and M2. Finally, the result matrix M is output.
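The four steps above amount to accumulating one outer product per step in each row group. The following sketch replays them in software. The values of M1 follow the columns quoted in the text; the values of M2 are placeholders, since the text only gives the leading "1" of the first rows of DB1 and DB2. The function name `grouped_matmul` and the returned `history` list are hypothetical.

```python
def grouped_matmul(m1, m2, array_cols=4):
    """Replay the row-grouped multiply-accumulate steps.

    m2 is split into column blocks of width `array_cols`, one per row group.
    At each step every group reads the same column of m1 and the matching
    row of its own block, multiplies, and accumulates.
    """
    rows, k_dim, total_cols = len(m1), len(m1[0]), len(m2[0])
    groups = total_cols // array_cols
    areas = [[row[g * array_cols:(g + 1) * array_cols] for row in m2]
             for g in range(groups)]
    acc = [[0] * total_cols for _ in range(rows)]
    history = []  # copy of the intermediate matrix after each step
    for step in range(k_dim):
        first = [m1[r][step] for r in range(rows)]  # shared by all groups
        for g in range(groups):
            second = areas[g][step]                 # private to group g
            for r in range(rows):
                for c in range(array_cols):
                    acc[r][g * array_cols + c] += first[r] * second[c]
        history.append([row[:] for row in acc])
    return acc, history


m1 = [[1, 0, 3, 0],
      [0, 2, 0, 4]]
# Placeholder 4x8 second matrix.
m2 = [[1, 2, 3, 4, 1, 2, 3, 4],
      [5, 6, 7, 8, 5, 6, 7, 8],
      [9, 10, 11, 12, 9, 10, 11, 12],
      [13, 14, 15, 16, 13, 14, 15, 16]]

result, history = grouped_matmul(m1, m2)
```

After the fourth step, `result` equals the ordinary product of `m1` and `m2`, and `history` holds the four successive values of the intermediate matrix.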
[0137] The data processing device provided by the above embodiments of the present invention can, on the one hand, flexibly combine the array of processing units according to the dimensional characteristics of the data matrices so that the processing units are used effectively, thereby improving the effective computing power of the chip; on the other hand, the data processing device can reuse more data, improving data utilization and thereby reducing the power consumption caused by data movement.
[0138] According to another embodiment of the present invention, a processing core is provided, including one or more data processing devices provided in the above embodiments.
[0139] In some embodiments, the processing core further includes a decoding unit for decoding the received instructions and sending the decoded instructions to the data processing device.
[0140] According to another embodiment of the present invention, an electronic device is provided, including one or more processing cores provided in the above embodiments.
[0141] According to another embodiment of the present invention, a chip is provided, including one or more processing cores provided in the above embodiments.
[0142] According to another embodiment of the present invention, a board card is provided, comprising one or more chips provided in the above embodiments.
[0143] According to another embodiment of the present invention, an electronic device is provided, comprising one or more chips provided in the above embodiments.
[0144] Figure 6 is a flowchart of the data processing method provided in the embodiments of the present invention.
[0145] As shown in Figure 6, the method includes:
[0146] Step S101: Receive instruction;
[0147] Step S102: Read the first matrix and the second matrix based on the instruction. The instruction includes grouping parameters for the computing unit array. The grouping parameters are parameters used to divide the computing unit array into sub-computing arrays. The grouping parameters are related to the rows of the first matrix or the columns of the second matrix.
[0148] Step S103: The sub-computing array reads the data of the first matrix and the second matrix and performs operations on the first matrix and the second matrix.
[0149] The data processing device provided in the above embodiments of the present invention can divide the computing unit array into multiple sub-computing arrays according to the grouping parameters in the instructions, which can realize the flexible combination of computing units in the computing unit array and effectively utilize computing units to improve the computing power of the data processing device.
[0150] In some embodiments, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, wherein the number of rows of the sub-computing arrays is the same as the number of rows of the first matrix; or, the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, wherein the number of columns of the sub-computing arrays is the same as the number of columns of the second matrix.
[0151] In some embodiments, when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays according to the rows of the first matrix, the sub-computing array reads the first matrix column by column, such that each column computing unit of the sub-computing array reads a corresponding column of data from the first matrix; the sub-computing array divides the second matrix into multiple second sub-matrices in units of column dimensions of the sub-computing array; the sub-computing array reads the corresponding second sub-matrices row by row, such that each row computing unit of the sub-computing array reads a corresponding row of data from the second sub-matrices.
[0152] In some embodiments, when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, the sub-computing array divides the first matrix into multiple first sub-matrices in units of the row dimension of the sub-computing array; the sub-computing array reads the corresponding first sub-matrices column by column, so that each column computing unit of the sub-computing array reads a column of data of the corresponding first sub-matrices; the sub-computing array reads the second matrix row by row, so that each row computing unit of the sub-computing array reads a row of data of the second matrix.
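The column-based grouping can be sketched in the same way as the row-based case, with the roles of the two matrices exchanged: the first matrix is cut into row blocks, one per sub-computing array, and the second matrix is shared. The matrix sizes and values, the function name `column_grouped_matmul`, and the exact-multiple requirement are assumptions made for illustration.

```python
def column_grouped_matmul(m1, m2, array_rows=4):
    """Multiply m1 by m2 with m1 split into row blocks of `array_rows` rows.

    Each block is handled by one sub-computing array. At each step the
    sub-array reads one column of its own block and the matching row of
    the shared second matrix, multiplies, and accumulates.
    """
    if len(m1) % array_rows != 0:
        # Assumption of this sketch: only exact multiples are handled.
        raise ValueError("row count of m1 must be a multiple of array_rows")
    k_dim, cols = len(m2), len(m2[0])
    blocks = [m1[s:s + array_rows] for s in range(0, len(m1), array_rows)]
    out = []
    for block in blocks:
        acc = [[0] * cols for _ in range(array_rows)]
        for step in range(k_dim):
            first = [row[step] for row in block]  # private to this sub-array
            second = m2[step]                     # shared by all sub-arrays
            for r in range(array_rows):
                for c in range(cols):
                    acc[r][c] += first[r] * second[c]
        out.extend(acc)
    return out


# Placeholder 8x4 first matrix and 4x2 second matrix.
m1 = [[r + c for c in range(4)] for r in range(8)]
m2 = [[1, 0], [0, 1], [1, 1], [2, 0]]

result = column_grouped_matmul(m1, m2)
```

In this hypothetical case a 4x4 computing unit array split by columns into two 4x2 sub-computing arrays behaves like an 8x2 array, and `result` is the 8x2 product of `m1` and `m2`.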
[0153] In some embodiments, each computing unit in the sub-computation array is used to accumulate the results of each operation to obtain an output matrix.
[0154] In some embodiments, the method further includes: when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays according to the rows of the first matrix, the data reading module stores the elements of the first matrix into one storage area and groups the elements of the second matrix into multiple storage areas based on the grouping parameter; or, when the grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays according to the columns of the second matrix, the data reading module is used to store the elements of the second matrix into one storage area and group the elements of the first matrix into multiple storage areas based on the grouping parameter.
[0155] In some embodiments, the method further includes: the data reading module, based on the grouping parameters, turns on the corresponding switch from the storage area to the sub-computing unit array, so that each sub-computing unit array can extract the elements of the array corresponding to its current operation.
[0156] According to some embodiments of the present invention, a computer storage medium is provided, on which a computer program is stored, and when the program is executed by a processor, the data processing method of the above embodiments is implemented.
[0157] According to some embodiments of the present invention, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data processing method described in the above embodiments when executing the program.
[0158] According to some embodiments of the present invention, a computer program product is provided, which includes computer instructions. When the computer instructions are executed by a computing device, the computing device can execute the data processing method of the above embodiments.
[0159] It should be understood that the specific embodiments described above are merely illustrative or explanatory of the principles of the invention and do not constitute a limitation thereof. Therefore, any modifications, equivalent substitutions, improvements, etc., made without departing from the spirit and scope of the invention should be included within the protection scope of the invention. Furthermore, the appended claims are intended to cover all variations and modifications falling within the scope and boundaries of the appended claims, or equivalent forms of such scope and boundaries.
Claims
1. A data processing device, characterized in that it includes: a data reading module used to receive instructions and read a first matrix and a second matrix based on the instructions. The instructions include grouping parameters of the computing unit array. The grouping parameters are parameters used to divide the computing unit array into sub-computing arrays. The grouping parameters are related to the rows of the first matrix or the columns of the second matrix. The data reading module includes multiple storage areas and is used to connect the corresponding storage area to the sub-computing array based on the grouping parameters, so that each sub-computing array can extract the elements of the group corresponding to its current operation. The sub-computing array reads the data from the first matrix and the second matrix and performs the current operation on the first matrix and the second matrix.
2. The data processing device according to claim 1, characterized in that, The grouping parameters are parameters for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, wherein the number of rows in the sub-computing arrays is the same as the number of rows in the first matrix; or, The grouping parameter is a parameter for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, wherein the number of columns of the sub-computing arrays is the same as the number of columns of the second matrix.
3. The data processing device according to claim 1 or 2, characterized in that, The grouping parameters are the parameters for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix; The sub-computing array reads the first matrix column by column, so that each column of the sub-computing array reads a corresponding column of data from the first matrix; The sub-computing array divides the second matrix into multiple second sub-matrices based on the column dimension of the sub-computing array. The sub-computing array reads the corresponding second sub-matrices row by row, so that each row of the sub-computing array reads one row of data from the corresponding second sub-matrices.
4. The data processing device as described in claim 1 or 2, characterized in that, The grouping parameters are the parameters for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix; The sub-computing array divides the first matrix into multiple first sub-matrices using the row dimension of the sub-computing array as the unit. The sub-computing array reads the corresponding first sub-matrices column by column, so that each column computing unit of the sub-computing array reads a column of data from the corresponding first sub-matrices. The sub-computing array reads the second matrix row by row, so that each row of the sub-computing array reads one row of data from the second matrix.
5. The data processing device according to claim 4, characterized in that, Each computing unit in the sub-computation array is used to accumulate the results of each operation to obtain an output matrix.
6. The data processing device according to claim 1 or 2, characterized in that, When the grouping parameter is the parameter for dividing the computing unit array into sub-computing arrays based on the rows of the first matrix, The data reading module is used to store the elements of the first matrix into one storage area and group the elements of the second matrix into multiple storage areas based on the grouping parameters; or, When the grouping parameters are parameters for dividing the computing unit array into sub-computing arrays based on the columns of the second matrix, The data reading module is used to store the elements of the second matrix into one storage area and group the elements of the first matrix into multiple storage areas based on the grouping parameters.
7. A processing core, characterized in that it includes one or more data processing devices as described in any one of claims 1-6.
8. An electronic device, characterized in that it includes the processing core as described in claim 7.
9. A data processing method, characterized in that it includes: receiving instructions; reading the first matrix and the second matrix based on the instructions, the instructions including grouping parameters of the computing unit array, the grouping parameters being parameters used to divide the computing unit array into sub-computing arrays, the grouping parameters being related to the rows of the first matrix or the columns of the second matrix; controlling, based on the grouping parameters, the switching connection between multiple storage areas and the sub-computing array, so that each sub-computing array extracts the elements of the group corresponding to its current operation from the corresponding storage area; and the sub-computing array reading the data from the first matrix and the second matrix and performing operations on the first matrix and the second matrix.