A matrix computing device, method, system, circuit, chip and apparatus

By computing directly on the compressed matrix with a vector outer product engine and an accumulator, the limit on computation speed for compressed matrices is removed, giving more efficient computation and flexible matrix output.

CN119149890B · Active · Publication Date: 2026-06-26 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2021-02-08
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, compressed matrix calculations require decompression before computation, which limits computation speed to memory access bandwidth and results in low computational efficiency.

Method used

By computing directly on the compressed matrices with a vector outer product processing engine and an accumulator, the row and column coordinates of the elements are preserved, and element values are accumulated by position-coordinate index to obtain the result matrix, which avoids the decompression step.

Benefits of technology

It improves the computational efficiency of compressed matrix formats, is suitable for different application scenarios, and can output compressed or uncompressed result matrices, saving transmission resources.

✦ Generated by Eureka AI based on patent content.

[Figure CN119149890B_ABST]

Abstract

A matrix computing device, method, system, circuit, chip, and apparatus that can compute directly on matrices in compressed format, thereby improving the computational efficiency for such matrices. The matrix computing device includes a vector outer product processing engine and an accumulator, and computes on a first matrix and a second matrix in compressed format based on vector outer products. During the computation, the vector outer product processing engine calculates the vector outer product of a first column vector, which retains row coordinates, and a first row vector, which retains column coordinates; the accumulator then accumulates, according to the index of the position coordinates, the third element values that share the same position coordinates, to obtain the result matrix of the two compressed-format matrices. Compared with the traditional approach, in which a compressed-format matrix must first be decompressed and the matrix calculation is then performed on the decompressed matrix, the matrix computing device provided in the embodiments of this application can effectively improve the computational efficiency for compressed-format matrices.

Description

[0001] This application is a divisional application. The original application has the application number 202110181498.6 and the original application date is February 8, 2021. The entire contents of the original application are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of computers, and more particularly to a matrix computing device, method, system, circuit, chip, and apparatus.

Background Technology

[0003] Matrix computation is an important type of computation in various application scenarios such as artificial intelligence, scientific computing, and graphics computing. A matrix is a collection of element values arranged in a rectangular array. The element values in a matrix may include both zero and non-zero values. When a matrix has a large number of zero values, only the non-zero element values need to be stored in order to save storage space; that is, the matrix is compressed and stored in a compressed format.

[0004] In current technology, the common method for calculating on compressed matrices is to first decompress the compressed matrix, that is, to convert it into an uncompressed matrix, and then perform matrix calculations on the uncompressed matrix. Because the decompression operation is required and the decompressed data occupies a very large amount of memory, the calculation speed is limited by memory access bandwidth. With a fixed memory access bandwidth, the calculation speed cannot be improved, resulting in low computational efficiency.

Summary of the Invention

[0005] This application provides a matrix computing device, method, system, circuit, chip, and apparatus that can directly compute compressed matrices without decompressing them, thereby improving the computational performance of compressed matrices.

[0006] In a first aspect, embodiments of this application provide a matrix computing device, comprising a vector outer product processing engine and an accumulator. The vector outer product processing engine is configured to calculate the vector outer products of N first column vectors and N first row vectors to obtain N intermediate result matrices, where a first column vector includes first element values and the row coordinates of the first element values, a first row vector includes second element values and the column coordinates of the second element values, and an intermediate result matrix includes third element values and the position coordinates of the third element values; the position coordinates include the row coordinate of a first element value and the column coordinate of a second element value. The N first column vectors are obtained by converting a first matrix in compressed format, the N first row vectors are obtained by converting a second matrix in compressed format, and N is an integer greater than or equal to 1. The accumulator is configured to accumulate, according to the index of the position coordinates of the third element values, the third element values with the same position coordinates in the N intermediate result matrices to obtain a result matrix. In this embodiment, the matrix computing device computes on the first matrix and the second matrix in compressed format based on vector outer products. During the computation, the row coordinates of the element values in the first column vectors and the column coordinates of the element values in the first row vectors are retained; then, based on the index of the position coordinates, the third element values with the same position coordinates are accumulated to obtain the result matrix of the two compressed-format matrices. Compared with the traditional method, which first decompresses the compressed-format matrix and then performs matrix calculations on the decompressed matrix, the matrix computing device provided in this embodiment can effectively improve the computational efficiency for compressed-format matrices.
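The flow described in the first aspect can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions, not the hardware design of the application: the function names `outer_product` and `accumulate`, the list-of-pairs representation of the vectors, and the 2×2 example matrices are all hypothetical.

```python
from collections import defaultdict

def outer_product(col_vec, row_vec):
    """Outer product of one sparse column vector and one sparse row vector.

    col_vec: list of (row_coordinate, value) pairs (first element values)
    row_vec: list of (column_coordinate, value) pairs (second element values)
    Returns one intermediate result matrix as a list of ((row, col), value).
    """
    return [((r, c), a * b) for r, a in col_vec for c, b in row_vec]

def accumulate(intermediates):
    """Add up third element values that share the same position coordinate."""
    result = defaultdict(int)
    for matrix in intermediates:
        for pos, value in matrix:
            result[pos] += value
    return dict(result)

# Hypothetical inputs: A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 0]], so N = 2.
a_cols = [[(0, 1), (1, 2)], [(1, 3)]]   # column k of A, row coordinates kept
b_rows = [[(1, 4)], [(0, 5)]]           # row k of B, column coordinates kept

intermediates = [outer_product(c, r) for c, r in zip(a_cols, b_rows)]
result = accumulate(intermediates)
print(result)  # {(0, 1): 4, (1, 1): 8, (1, 0): 15}
```

Only non-zero inputs are ever multiplied, and the zero entry at position (0, 0) of the product never appears, which is the sense in which no decompression is needed.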

[0007] In one optional implementation, the N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix. The position coordinates of a third element value in the first intermediate result matrix are first position coordinates, and the position coordinates of a third element value in the second intermediate result matrix are second position coordinates. The accumulator is configured to, following the generation order of the N intermediate result matrices, write the third element value of the first intermediate result matrix to the corresponding position in a buffer according to its first position coordinates; then read the cached value at the corresponding position in the buffer according to the second position coordinates of the third element value in the second intermediate result matrix, and accumulate the third element value of the second intermediate result matrix with the cached value, to obtain an uncompressed result matrix. In this optional implementation, the matrix computing device can output an uncompressed matrix according to the application scenario, which increases the applicability of the matrix computing device.
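A minimal sketch of this buffer-based accumulation follows. It assumes a zero-initialized dense buffer, so that the first write and the later read-accumulate-write steps collapse into the same operation; the function name `accumulate_into_buffer` and the sample entries are hypothetical.

```python
def accumulate_into_buffer(intermediates, n_rows, n_cols):
    """Accumulate intermediate result matrices into a dense buffer.

    intermediates: lists of ((row, col), value), taken in generation order.
    Assumption: the buffer starts at zero, so writing the first matrix and
    accumulating the later ones use the same read-add-write step.
    """
    buffer = [[0] * n_cols for _ in range(n_rows)]
    for matrix in intermediates:
        for (r, c), value in matrix:
            cached = buffer[r][c]           # read the cached value at (r, c)
            buffer[r][c] = cached + value   # write back the accumulated value
    return buffer

first = [((0, 0), 2), ((1, 0), 6)]    # first intermediate result matrix
second = [((0, 0), 5), ((1, 1), 7)]   # second intermediate result matrix
dense = accumulate_into_buffer([first, second], n_rows=2, n_cols=2)
print(dense)  # [[7, 0], [6, 7]]
```

The position coordinate acts directly as the buffer address, so no sorting or searching is required; the cost is a buffer sized for the full uncompressed result.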

[0008] In one optional implementation, the matrix computing device further includes a matrix compression unit; the matrix compression unit is used to compress the uncompressed result matrix to obtain a compressed result matrix. In the above optional implementation, the matrix computing device can output a compressed matrix according to different application scenarios, increasing the applicability of the matrix computing device. Furthermore, by compressing the result matrix and outputting a compressed matrix, the matrix computing device saves transmission resources or facilitates subsequent computational operations.

[0009] In one optional implementation, the accumulator is further configured to: sort the third element values in the N intermediate result matrices according to their position coordinates, for example, by row coordinates or by column coordinates; then compare the position coordinates in the sorted N intermediate result matrices, add the third element values with the same position coordinates, and delete the position coordinates of zero-valued elements, to obtain a compressed result matrix. In this optional implementation, a compressed matrix is obtained directly. The compressed matrix output by the matrix computing device can be used in application scenarios where subsequent calculations require a compressed matrix. Furthermore, because the matrix computing device outputs a compressed matrix, the transmission resources needed for subsequent matrix transmission are reduced.
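The sort-and-merge variant can be sketched as follows. The function name `sort_and_merge`, the `by` parameter, and the sample entries are assumptions made for illustration; the application does not prescribe a particular sorting procedure.

```python
def sort_and_merge(intermediates, by="row"):
    """Merge intermediate result matrices into a compressed (COO-like) result.

    intermediates: lists of ((row, col), value).
    by: "row" sorts by (row, col); anything else sorts by (col, row).
    """
    entries = [entry for matrix in intermediates for entry in matrix]
    if by == "row":
        entries.sort(key=lambda e: e[0])
    else:
        entries.sort(key=lambda e: (e[0][1], e[0][0]))

    merged = []
    for pos, value in entries:
        if merged and merged[-1][0] == pos:
            # Same position coordinate as the previous entry: add the values.
            merged[-1] = (pos, merged[-1][1] + value)
        else:
            merged.append((pos, value))
    # Drop positions whose accumulated value is zero.
    return [(pos, value) for pos, value in merged if value != 0]

first = [((0, 0), 2), ((1, 0), 6)]
second = [((0, 0), -2), ((1, 1), 7)]
compressed = sort_and_merge([first, second])
print(compressed)  # [((1, 0), 6), ((1, 1), 7)]
```

After sorting, equal position coordinates are adjacent, so a single pass is enough to add them; position (0, 0) cancels to zero and is removed from the compressed output.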

[0010] In one optional implementation, the matrix computing device further includes a format conversion unit. The format conversion unit is configured to obtain the first matrix and the second matrix, convert the first matrix into N first column vectors while retaining the row coordinates of the first element values in the first column vectors, and convert the second matrix into N first row vectors while retaining the column coordinates of the second element values in the first row vectors, thereby enabling the matrix computing device to compute on the two compressed-format matrices based on vector outer products.
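A minimal sketch of this conversion, assuming both inputs arrive as COO triples of the form (row, column, value). The helper names `to_column_vectors` and `to_row_vectors` are hypothetical, and N is taken to be the shared inner dimension of the two matrices.

```python
def to_column_vectors(coo, n):
    """Split a COO matrix into n column vectors; each entry keeps (row, value)."""
    columns = [[] for _ in range(n)]
    for r, c, v in coo:
        columns[c].append((r, v))   # row coordinate is retained
    return columns

def to_row_vectors(coo, n):
    """Split a COO matrix into n row vectors; each entry keeps (column, value)."""
    rows = [[] for _ in range(n)]
    for r, c, v in coo:
        rows[r].append((c, v))      # column coordinate is retained
    return rows

# Hypothetical inputs: A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 0]].
a_coo = [(0, 0, 1), (1, 0, 2), (1, 1, 3)]
b_coo = [(0, 1, 4), (1, 0, 5)]

a_cols = to_column_vectors(a_coo, n=2)
b_rows = to_row_vectors(b_coo, n=2)
print(a_cols)  # [[(0, 1), (1, 2)], [(1, 3)]]
print(b_rows)  # [[(1, 4)], [(0, 5)]]
```

Column k of the first matrix pairs with row k of the second, which is why the first matrix is grouped by column index and the second by row index.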

[0011] In one optional implementation, the matrix computing device further includes a format conversion unit. This unit acquires a fifth matrix and a sixth matrix, performs format conversion on the fifth matrix to obtain a first matrix, and performs format conversion on the sixth matrix to obtain a second matrix. At least one of the fifth and sixth matrices is an uncompressed matrix. In this optional implementation, the matrix computing device can receive an uncompressed matrix and then convert it into a compressed matrix, thereby enabling the matrix computing device to support matrix calculations in multiple formats.

[0012] In one optional implementation, the matrix computing device further includes a format conversion unit, which is further configured to split the first column vector into X second column vectors and split the first row vector into X second row vectors, where the element values in the second column vectors and second row vectors have a second precision, the element values in the first column vectors and first row vectors have a first precision, the first precision is higher than the second precision, and X is an integer greater than or equal to 2. The vector outer product processing engine is further configured to calculate the vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where each fourth matrix includes fourth element values and the position coordinates of the fourth element values, the position coordinates of a fourth element value include the row coordinate of the first element value and the column coordinate of the second element value, and the precision of the fourth element values is the first precision. The accumulator is further configured to accumulate, according to the index of the position coordinates of the fourth element values, the fourth element values with the same position coordinates in the X² fourth matrices to obtain an intermediate result matrix, where the precision of the third element values in the intermediate result matrix is the first precision. In this optional implementation, high-precision matrix calculation can be achieved on a low-precision matrix computing device, which improves the applicability of the matrix computing device.
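One possible way to realize this splitting is sketched below. The concrete scheme is an assumption: unsigned 8-bit values are cut into X = 2 pieces of 4 bits, and each low-precision product is weighted by a power of two before accumulation. The application only states that a first-precision value is decomposed into several second-precision values; the names `BITS`, `split_vector`, and `high_precision_outer` are hypothetical.

```python
from collections import defaultdict

BITS = 4   # assumed bit width of the second (lower) precision
X = 2      # number of low-precision pieces per first-precision value

def split_vector(vec):
    """Split each (coordinate, value) into X pieces of BITS bits, lowest first.

    Assumes unsigned integer values that fit in X * BITS bits.
    """
    mask = (1 << BITS) - 1
    return [[(coord, (v >> (BITS * i)) & mask) for coord, v in vec]
            for i in range(X)]

def high_precision_outer(col_vec, row_vec):
    """Outer product built from X * X low-precision outer products."""
    col_parts = split_vector(col_vec)
    row_parts = split_vector(row_vec)
    acc = defaultdict(int)
    for i, col_part in enumerate(col_parts):
        for j, row_part in enumerate(row_parts):
            shift = BITS * (i + j)          # weight of this partial product
            for r, a in col_part:
                for c, b in row_part:
                    # Accumulate partial products that share a position.
                    acc[(r, c)] += (a * b) << shift
    return dict(acc)

col = [(0, 200), (2, 17)]   # first column vector with row coordinates
row = [(1, 33)]             # first row vector with column coordinates
result = high_precision_outer(col, row)
print(result)  # {(0, 1): 6600, (2, 1): 561}
```

Each multiplication inside the loops only involves 4-bit operands, while the accumulated values match the full-precision products 200 × 33 and 17 × 33.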

[0013] In a second aspect, embodiments of this application provide a matrix calculation method. The method is applied to a matrix computing device and includes: first, obtaining a first calculation instruction, which includes N first column vectors and N first row vectors; then, calculating the vector outer products of the N first column vectors and the N first row vectors to obtain N intermediate result matrices, where a first column vector includes first element values and the row coordinates of the first element values, a first row vector includes second element values and the column coordinates of the second element values, and an intermediate result matrix includes third element values and the position coordinates of the third element values, the position coordinates including the row coordinate of a first element value and the column coordinate of a second element value; the N first column vectors are obtained by converting a first matrix in compressed format, the N first row vectors are obtained by converting a second matrix in compressed format, and N is an integer greater than or equal to 1; finally, accumulating, according to the index of the position coordinates of the third element values, the third element values with the same position coordinates in the N intermediate result matrices to obtain a result matrix. In this embodiment, the calculation on the first matrix and the second matrix is based on the vector outer products of the N first column vectors and the N first row vectors. During the calculation, the row coordinates of the element values in the first column vectors and the column coordinates of the element values in the first row vectors are retained; then, based on the index of the position coordinates, the third element values with the same position coordinates are accumulated to obtain the result matrix of the two compressed-format matrices. Compared with the traditional method, which first decompresses the compressed matrix and then performs matrix calculations on the decompressed matrix, the matrix calculation method provided in this embodiment can effectively improve the computational efficiency for compressed-format matrices.

[0014] In one optional implementation, the N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix. The position coordinates of a third element value in the first intermediate result matrix are first position coordinates, and the position coordinates of a third element value in the second intermediate result matrix are second position coordinates. Accumulating, according to the index of the position coordinates of the third element values, the third element values with the same position coordinates in the N intermediate result matrices to obtain the result matrix may specifically include: first, following the generation order of the N intermediate result matrices, writing the third element value of the first intermediate result matrix to the corresponding position in a buffer according to its first position coordinates; then reading the cached value at the corresponding position in the buffer according to the second position coordinates of the third element value in the second intermediate result matrix; and accumulating the third element value of the second intermediate result matrix with the cached value, to obtain an uncompressed result matrix. In this optional implementation, the matrix computing device can output an uncompressed matrix according to the application scenario, which increases the applicability of the matrix computing device.

[0015] In one optional implementation, the method further includes: compressing the uncompressed result matrix to obtain a compressed result matrix. In the above optional implementation, the matrix computing device can output a compressed matrix according to different application scenarios, increasing the applicability of the matrix computing device. Furthermore, the matrix computing device compresses the result matrix and outputs a compressed matrix, thereby saving transmission resources or facilitating subsequent calculation operations.

[0016] In one optional implementation, accumulating, according to the index of the position coordinates of the third element values, the third element values with the same position coordinates in the N intermediate result matrices to obtain the result matrix may specifically include: first, sorting the third element values in the N intermediate result matrices according to their position coordinates, for example, by row coordinates or by column coordinates; then comparing the position coordinates in the sorted N intermediate result matrices, adding the third element values with the same position coordinates, and deleting the position coordinates of zero-valued elements, to obtain a compressed result matrix. In this optional implementation, a compressed matrix is obtained directly. The compressed matrix output by the matrix computing device can be used in application scenarios where subsequent calculations require a compressed matrix. Furthermore, because the matrix computing device outputs a compressed matrix, the transmission resources needed for subsequent matrix transmission are reduced.

[0017] In one optional implementation, before obtaining the first calculation instruction, the method further includes: obtaining a second calculation instruction, which includes the first matrix and the second matrix; converting the first matrix into N first column vectors while retaining the row coordinates of the first element values in the first column vectors; and converting the second matrix into N first row vectors while retaining the column coordinates of the second element values in the first row vectors, thereby enabling the matrix computing device to compute on the two compressed-format matrices based on vector outer products.

[0018] In one optional implementation, before obtaining the second calculation instruction, the method further includes: obtaining a third calculation instruction, which includes a fifth matrix and a sixth matrix, where at least one of the fifth matrix and the sixth matrix is an uncompressed matrix; then converting the fifth matrix into the first matrix in compressed format and converting the sixth matrix into the second matrix in compressed format. In this optional implementation, the matrix computing device can receive an uncompressed matrix and convert it into a compressed matrix, thereby supporting matrix calculations on multiple formats.

[0019] In one optional implementation, the first column vector may first be split into X second column vectors, and the first row vector may be split into X second row vectors. The element values in the second column vectors and second row vectors have a second precision, the element values in the first column vectors and first row vectors have a first precision, the first precision is higher than the second precision, and X is an integer greater than or equal to 2. Calculating the vector outer products of the N first column vectors and the N first row vectors to obtain N intermediate result matrices may include: calculating the vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where each fourth matrix includes fourth element values and the position coordinates of the fourth element values, the position coordinates of a fourth element value include the row coordinate of the first element value and the column coordinate of the second element value, and the precision of the fourth element values is the same as the precision of the first element values; and then accumulating, according to the index of the position coordinates of the fourth element values, the fourth element values with the same position coordinates in the X² fourth matrices to obtain an intermediate result matrix, where the precision of the third element values in the intermediate result matrix is the first precision. In this optional implementation, high-precision matrix calculation can be achieved on a low-precision matrix computing device, which improves the applicability of the matrix computing device.

[0020] In a third aspect, a matrix calculation circuit is provided, which is configured to execute the operation steps of the matrix calculation method provided in the second aspect or any possible implementation of the second aspect.

[0021] In a fourth aspect, a matrix computing system is provided, comprising a processor and a matrix computing device, where the processor is configured to send calculation instructions to the matrix computing device, and the matrix computing device is configured to execute the operation steps of the matrix calculation method provided in the second aspect or any possible implementation of the second aspect.

[0022] In a fifth aspect, a chip is provided. The chip includes a processor, and the processor integrates a matrix computing device configured to execute the operation steps of the matrix calculation method provided in the second aspect or any possible implementation of the second aspect.

[0023] In a sixth aspect, a matrix computing apparatus is provided, which includes the matrix computing system provided in the fourth aspect or the chip provided in the fifth aspect.

[0024] In a seventh aspect, a readable storage medium is provided, which stores instructions that, when run on a device, cause the device to perform the operation steps of the matrix calculation method provided in the second aspect or any possible implementation of the second aspect.

[0025] In an eighth aspect, a computer program product is provided that, when run on a computer, causes the computer to perform the operation steps of the matrix calculation method provided in the second aspect or any possible implementation of the second aspect.

[0026] It can be understood that any apparatus, computer storage medium, or computer program product provided above is used to execute the corresponding method provided above. Therefore, for the beneficial effects that can be achieved, refer to the beneficial effects of the corresponding method; details are not repeated here.

[0027] Based on the implementations provided in the above aspects, this application may further combine them to provide more implementations.

Description of the Drawings

[0028] Figure 1A is a schematic diagram of a matrix in COO compressed format in an embodiment of this application;

[0029] Figure 1B is a schematic diagram of a matrix in CSR compressed format in an embodiment of this application;

[0030] Figure 1C is a schematic diagram of a matrix in CSC compressed format in an embodiment of this application;

[0031] Figure 2 is a schematic diagram of the structure of a computing device provided in an embodiment of this application;

[0032] Figure 3 is a schematic diagram of the structure of a processor provided in an embodiment of this application;

[0033] Figure 4A is a schematic diagram of the structure of a matrix computing device provided in an embodiment of this application;

[0034] Figure 4B is a schematic diagram of the structure of a vector outer product processing engine provided in an embodiment of this application;

[0035] Figure 4C is a schematic diagram of the structure of a MAC operation subunit provided in an embodiment of this application;

[0036] Figure 4D is a schematic diagram of the structure of an accumulator provided in an embodiment of this application;

[0037] Figure 4E is a schematic diagram of the structure of an adder provided in an embodiment of this application;

[0038] Figure 5A is a schematic diagram of the matrix computing device in this application converting a first matrix into N first column vectors and a second matrix into N first row vectors;

[0039] Figure 5B is a schematic diagram of the matrix computing device in this application calculating the outer products of N first column vectors and N first row vectors to obtain N intermediate result matrices;

[0040] Figure 5C is a schematic diagram of the element values and position coordinates in the first intermediate result matrix in an embodiment of this application;

[0041] Figure 5D is a schematic diagram of the element values and position coordinates in the second intermediate result matrix in an embodiment of this application;

[0042] Figure 5E is a schematic diagram of the matrix computing device in this application accumulating the third element values with the same position coordinates in N intermediate result matrices to obtain a result matrix;

[0043] Figure 6 is a schematic diagram of one implementation in this application of accumulating the third element values with the same position coordinates in N intermediate result matrices to obtain a result matrix;

[0044] Figure 7 is a schematic diagram of the structure of another matrix computing device provided in an embodiment of this application;

[0045] Figure 8 is a schematic diagram of another implementation in this application of accumulating the third element values with the same position coordinates in N intermediate result matrices to obtain a result matrix;

[0046] Figure 9 is a schematic diagram of the format conversion unit in an embodiment of this application converting an uncompressed matrix into a compressed matrix;

[0047] Figure 10 is a schematic diagram of an integer value of first precision being decomposed into multiple integer values of second precision in an embodiment of this application;

[0048] Figure 11 is a schematic diagram of a first column vector of first precision being decomposed into multiple second column vectors of second precision, and a first row vector of first precision being decomposed into multiple second row vectors of second precision, in an embodiment of this application;

[0049] Figure 12 is a schematic diagram of the matrix computing device in this application calculating the vector outer product of a second column vector and a second row vector to obtain a fourth matrix;

[0050] Figure 13 is a flowchart of the steps of one embodiment of a matrix calculation method in this application;

[0051] Figure 14 is a flowchart of the steps of another embodiment of a matrix calculation method in this application.

Detailed Implementation

[0052] The technical solutions of the embodiments of this application will now be described with reference to the accompanying drawings. The terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0053] To better understand this application, the relevant terms used in this application will be explained first.

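[0054] Matrix: An m×n matrix is a rectangular array of elements arranged in m rows and n columns. For example, matrix A is shown in equation (1) and matrix B is shown in equation (2).

[0055] $$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$ Equation (1); $$B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix}$$ Equation (2).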

[0056] Matrix addition and subtraction: Matrices with the same dimension can be added or subtracted from each other, specifically by adding or subtracting the elements at each position. For example, matrices A and B are both m×n matrices. Adding matrices A and B together yields matrix C, as shown in equation (3) below.

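[0057] $$C = A + B = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix}$$ Equation (3).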

[0058] Matrix multiplication: Two matrices can be multiplied only if the number of columns in the first matrix is equal to the number of rows in the second matrix. For example, if matrix A is an m×n matrix and matrix B is an n×p matrix, the product of matrix A and matrix B is an m×p matrix, and an element of this m×p matrix is shown in equation (4): $$(AB)_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$ Equation (4).

[0059] Where 1≤i≤m, 1≤j≤p.

[0060] A row vector is a matrix of dimension 1×m, where m is a positive integer. For example, a row vector is shown in equation (5):

[0061] $$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix}$$ Equation (5).

[0062] A column vector is a matrix of dimension m×1, where m is a positive integer. For example, a column vector is shown in equation (6) below:

[0063] $$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}$$ Equation (6).

[0064] Vector outer product: The tensor product of two vectors, which is a matrix. For example, given a column vector U of dimension m×1 and a row vector V of dimension 1×n, the outer product of vectors U and V, U×V, is defined as a matrix D of dimension m×n, as shown in equation (7):

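[0065] $$D = U \times V = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix}$$ Equation (7).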
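As background for the later sections, the matrix product of an m×n matrix A and an n×p matrix B can be written as a sum of vector outer products. This is a standard linear-algebra identity rather than a formula quoted from this application; the symbols $a_{:,k}$ (column k of A) and $b_{k,:}$ (row k of B) are notation assumed for this note.

```latex
AB = \sum_{k=1}^{n} a_{:,k}\, b_{k,:},
\qquad
\left( a_{:,k}\, b_{k,:} \right)_{ij} = a_{ik}\, b_{kj},
\qquad 1 \le i \le m,\ 1 \le j \le p
```

Each term is an m×p matrix, so adding the entries that share the same position (i, j) over all k gives element (i, j) of the product.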

[0066] Compressed matrix: When a matrix contains both zero and non-zero elements, to save storage space, the non-zero elements are typically stored in a specific format while the zero elements are not stored. This process is called matrix compression, and the resulting matrix is called a compressed matrix. Methods for matrix compression include, but are not limited to, coordinate (COO) representation, compressed sparse row (CSR), and compressed sparse column (CSC).

[0067] The following provides illustrative examples of the three compression methods: COO, CSR, and CSC.

[0068] COO: A matrix is represented using triples. A triple consists of three values: row number, column number, and element value. The row and column numbers identify the position of the element value. For example, a triple is (row number, column number, element value) or (element value, row number, column number); the specific order of the three values in the triple is not limited. For an example, refer to Figure 1A, which shows a 4×4 matrix Y containing zero and non-zero elements. The non-zero values are 1, 2, 3, 4, 5, 6, 7, 8, and 9. For example, the non-zero value "1" is located in row 0, column 0, and its triple is (0, 0, 1). The non-zero value "2" is located in row 0, column 1, and its triple is (0, 1, 2). The non-zero value "3" is located in row 1, column 1, and its triple is (1, 1, 3). The remaining element values are not described one by one here. The triple form of the compressed matrix is: (0, 0, 1), (0, 1, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5), (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9).

[0069] Alternatively, the compressed matrix Y can be represented as shown in equation (8).

[0070] Row coordinates = ([0,0,1,1,2,2,2,3,3])

[0071] Column coordinates = ([0,1,1,2,0,2,3,1,3])

[0072] Element value = ([1,2,3,4,5,6,7,8,9]), equation (8).
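The COO arrays above can be reproduced with a short Python sketch. The dense layout of Y is reconstructed here from the listed triples, and the function name `dense_to_coo` is a hypothetical helper, not something named in the application.

```python
# 4x4 matrix Y, laid out from the triples listed for Figure 1A.
Y = [
    [1, 2, 0, 0],
    [0, 3, 4, 0],
    [5, 0, 6, 7],
    [0, 8, 0, 9],
]

def dense_to_coo(matrix):
    """Collect (row, column, value) for every non-zero element, row by row."""
    rows, cols, values = [], [], []
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v != 0:
                rows.append(r)
                cols.append(c)
                values.append(v)
    return rows, cols, values

rows, cols, values = dense_to_coo(Y)
print(rows)    # [0, 0, 1, 1, 2, 2, 2, 3, 3]
print(cols)    # [0, 1, 1, 2, 0, 2, 3, 1, 3]
print(values)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```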

[0073] CSR: A matrix is represented using three types of data: element values, column numbers, and row offsets. The element values and column numbers in CSR are represented in the same way as in the COO method described above. The difference between CSR and COO is that a row offset in CSR represents the starting offset of the first element of a given row among all element values. Refer to Figure 1B. First, the non-zero elements of the matrix Y shown in Figure 1B are arranged row by row, giving the element values 1, 2, 3, 4, 5, 6, 7, 8, 9. The first non-zero element in the first row is "1", and its offset among all element values is "0". Similarly, the first non-zero element in the second row is "3", and its offset among all element values is "2". The first non-zero element in the third row is "5", and its offset among all element values is "4". The first non-zero element in the fourth row is "8", and its offset among all element values is "7". Finally, the total number of non-zero elements in the matrix ("9" in this example) is appended to the end of the row offsets.

[0074] The compressed matrix Y can be represented as shown in equation (9).

[0075] Row offset = ([0,2,4,7,9])

[0076] Column coordinates = ([0,1,1,2,0,2,3,1,3])

[0077] Element value = ([1,2,3,4,5,6,7,8,9]), equation (9).
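As a minimal sketch of how the row offsets in equation (9) arise, the following builds the CSR arrays from the COO triples of matrix Y. The helper names `coo_to_csr` and `csr_row` are assumptions made for this illustration.

```python
def coo_to_csr(triples, n_rows):
    """Convert (row, col, value) triples to CSR arrays (row_offsets, cols, values)."""
    ordered = sorted(triples, key=lambda t: (t[0], t[1]))  # row-major order
    counts = [0] * n_rows
    for r, _, _ in ordered:
        counts[r] += 1  # number of non-zeros in each row
    row_offsets = [0]
    for count in counts:
        # Running total; the final entry equals the total number of non-zeros.
        row_offsets.append(row_offsets[-1] + count)
    cols = [c for _, c, _ in ordered]
    values = [v for _, _, v in ordered]
    return row_offsets, cols, values


Y_COO = [(0, 0, 1), (0, 1, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5),
         (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9)]

csr_offsets, csr_cols, csr_vals = coo_to_csr(Y_COO, n_rows=4)
print(csr_offsets)  # [0, 2, 4, 7, 9]


def csr_row(i):
    """Return the (column, value) pairs stored for row i."""
    start, end = csr_offsets[i], csr_offsets[i + 1]
    return list(zip(csr_cols[start:end], csr_vals[start:end]))


print(csr_row(2))  # [(0, 5), (2, 6), (3, 7)]
```

The slice between two consecutive offsets is what makes row access cheap in CSR: no row number needs to be stored per element.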

[0078] CSC: A matrix is represented using three types of data: element values, row numbers, and column offsets. The element values and row numbers in CSC are represented in the same way as in the COO method. The difference between CSC and COO is that the column offset represents the starting offset of the first element of a given column within the list of all element values. As shown in Figure 1C, the non-zero elements of matrix Y are first arranged column-wise, giving the values 1, 5, 2, 3, 8, 4, 6, 7, 9. The first non-zero element in the first column is "1", and its offset within all element values is "0". Similarly, the first non-zero element in the second column is "2", and its offset is "2". The first non-zero element in the third column is "4", and its offset is "5". The first non-zero element in the fourth column is "7", and its offset is "7". Finally, the total number of non-zero elements in the matrix (here, "9") is appended to the end of the column offsets.

[0079] The compressed matrix Y can be represented as shown in equation (10).

[0080] Column offset = ([0,2,5,7,9])

[0081] Row coordinates = ([0,2,0,1,3,1,2,2,3])

[0082] Element value = ([1,5,2,3,8,4,6,7,9]), equation (10).
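The CSC arrays can be derived in the same way, sorting by column instead of by row. This is a sketch under the assumption that elements within a column are ordered by row number; `coo_to_csc` is a hypothetical helper name.

```python
def coo_to_csc(triples, n_cols):
    """Convert (row, col, value) triples to CSC arrays (col_offsets, rows, values)."""
    ordered = sorted(triples, key=lambda t: (t[1], t[0]))  # column-major order
    counts = [0] * n_cols
    for _, c, _ in ordered:
        counts[c] += 1  # number of non-zeros in each column
    col_offsets = [0]
    for count in counts:
        col_offsets.append(col_offsets[-1] + count)
    rows = [r for r, _, _ in ordered]      # row number of each stored value
    values = [v for _, _, v in ordered]
    return col_offsets, rows, values


Y_COO = [(0, 0, 1), (0, 1, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5),
         (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9)]

csc_offsets, csc_rows, csc_vals = coo_to_csc(Y_COO, n_cols=4)
print(csc_offsets)  # [0, 2, 5, 7, 9]
print(csc_vals)     # [1, 5, 2, 3, 8, 4, 6, 7, 9]
print(csc_rows)     # row numbers, listed in the same column-wise order as the values
```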

[0083] As explained by the three matrix compression methods above, each element in a COO compressed matrix has a corresponding row coordinate (row number) and column coordinate (column number). Each element in a CSR compressed matrix has a corresponding column coordinate. Each element in a CSC compressed matrix has a corresponding row coordinate.

[0084] Uncompressed matrix format: The matrix Y shown in Figure 1A is an uncompressed matrix that includes both zero and non-zero elements. It should be noted that, generally, a compressed matrix is also called a sparse matrix, while an uncompressed matrix can be called a dense matrix.

[0085] Numerical types: In computing, numerical types include integers and floating-point numbers. Integer types are mainly used to represent whole numbers, while floating-point types are mainly used to represent numbers with a fractional part. Integer formats include int2, int4, int8, int16, int32, etc. Here "int" denotes the integer type, and the number after "int" indicates the number of binary digits (bits) used for the value, where each bit is either 0 or 1. For example, int4 uses 4 bits (0000 to 1111), which corresponds to the decimal range -8 to 7. Similarly, the value range of int8 is -2⁷ to 2⁷ - 1. The unit of computer storage capacity is the byte, and 1 byte is 8 bits. Therefore, an int8 occupies 1 byte and an int16 occupies 2 bytes. An int16 value occupies 2 bytes, corresponding to the decimal range -32768 to 32767. An int32 value occupies 4 bytes, corresponding to the decimal range -2147483648 to 2147483647.
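The integer ranges quoted above follow from the bit width under two's-complement representation, which is assumed here because it is the usual convention for signed integers. The helper `signed_int_range` is a hypothetical name used only in this sketch.

```python
def signed_int_range(bits):
    """Return (min, max) of a two's-complement signed integer with the given bit width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for bits in (2, 4, 8, 16, 32):
    low, high = signed_int_range(bits)
    print(f"int{bits}: {low} to {high}")
```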

[0086] An integer matrix is a matrix whose elements are integer values. For example, an m × n integer matrix contains m × n elements, all of which are integer values. These integer values can use formats such as int2, int4, int8, int16, or int32. Integer matrices can accordingly be of different integer formats, such as matrices of int8 values, matrices of int16 values, and matrices of int32 values.

[0087] Floating-point numbers (FP) are primarily used to represent numbers with a fractional part and typically consist of three parts: a sign bit, exponent bits, and mantissa bits. The exponent bits can also be called the exponent portion. The sign bit is 1 bit, while the exponent and mantissa can each be multiple bits. Floating-point numbers come in various formats, such as the half-precision, single-precision, and double-precision formats of the IEEE 754 standard. Half-precision floating-point numbers occupy 16 bits (2 bytes) in computer memory and are commonly referred to as FP16. The absolute value range that a half-precision floating-point number can represent is approximately [6.10 × 10⁻⁵, 6.55 × 10⁴]. Single-precision floating-point numbers occupy 32 bits (4 bytes) in computer memory and can be abbreviated as FP32. The absolute value range that a single-precision floating-point number can represent is approximately [1.18 × 10⁻³⁸, 3.40 × 10³⁸]. Double-precision floating-point numbers occupy 64 bits (8 bytes) in computer memory and can be abbreviated as FP64. Double-precision floating-point numbers can represent 15 or 16 significant decimal digits, and the absolute value range of the represented values is approximately [2.23 × 10⁻³⁰⁸, 1.80 × 10³⁰⁸].

[0088] Table 1 below shows one storage format for the three floating-point numbers mentioned above. In FP16, the 16 bits used are allocated as follows: 1 bit for the sign bit, 5 bits for the exponent, and 10 bits for the mantissa. In FP32, the 32 bits used are allocated as follows: 1 bit for the sign bit, 8 bits for the exponent, and 23 bits for the mantissa. In FP64, the 64 bits used are allocated as follows: 1 bit for the sign bit, 11 bits for the exponent, and 52 bits for the mantissa.

[0089] Table 1

[0090]

| Format | Sign bit | Exponent | Mantissa |
|--------|----------|----------|----------|
| FP16   | 1 bit    | 5 bits   | 10 bits  |
| FP32   | 1 bit    | 8 bits   | 23 bits  |
| FP64   | 1 bit    | 11 bits  | 52 bits  |
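The value ranges quoted for FP16, FP32, and FP64 can be derived from the bit allocation in Table 1. The sketch below assumes IEEE 754 conventions (an exponent bias of 2^(e-1) - 1 and an implicit leading 1 for normal numbers) and considers only normal numbers; `fp_normal_range` is a hypothetical helper name.

```python
import sys


def fp_normal_range(exponent_bits, mantissa_bits):
    """Return (smallest, largest) positive normal value for an IEEE 754-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    smallest = 2.0 ** (1 - bias)                            # exponent field = 1, mantissa = 0
    largest = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias   # largest finite exponent, all-ones mantissa
    return smallest, largest


for name, e_bits, m_bits in (("FP16", 5, 10), ("FP32", 8, 23), ("FP64", 11, 52)):
    low, high = fp_normal_range(e_bits, m_bits)
    print(f"{name}: {low:.2e} to {high:.2e}")

# Cross-check against the platform's double-precision limits.
print(fp_normal_range(11, 52) == (sys.float_info.min, sys.float_info.max))
```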

[0091] A floating-point matrix is a matrix whose elements are floating-point numbers. For example, an m × n floating-point matrix contains m × n elements, and these m × n elements can be floating-point numbers. Similar to integer matrices, floating-point matrices can be of different floating-point formats, such as matrices of FP16 values, matrices of FP32 values, and matrices of FP64 values.

[0092] Figure 2 is a schematic diagram of a computing device provided in this embodiment. The computing device can be a terminal, a network device, a server, or another device with computing capabilities. As shown in Figure 2, the computing device may include a memory 201, a processor 202, a communication interface 203, and a bus 204, wherein the memory 201, the processor 202, and the communication interface 203 are interconnected via the bus 204.

[0093] The memory 201 can be used to store data, software programs, and modules, mainly including a program storage area and a data storage area. The program storage area can store the operating system, software applications required for at least one function, and middleware software, etc., while the data storage area can store data created during the use of the device. For example, the operating system may include Linux, Unix, or Windows operating systems, etc.; the software applications required for at least one function may include applications related to artificial intelligence, high-performance computing (HPC), deep learning, or scientific computing, etc.; the middleware software may include linear algebra library functions, etc. In one possible example, the memory 201 includes, but is not limited to, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), or high-speed random access memory, etc. Furthermore, the memory 201 may also include other non-volatile memories, such as at least one disk storage device, flash memory device, or other volatile solid-state storage devices.

[0094] Additionally, the processor 202 is used to control and manage the operation of the computing device, for example by running or executing software programs and/or modules stored in the memory 201 and by calling data stored in the memory 201, so as to perform the various functions of the computing device and process data. In one possible example, the processor 202 includes, but is not limited to, a central processing unit (CPU), a network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, transistor logic devices, logic circuits, or any combination thereof. It can implement or execute the various exemplary logic blocks, modules, and circuits described in conjunction with the disclosure of this application. The processor 202 can also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.

[0095] The communication interface 203 is used to enable communication between the computing device and external devices. The communication interface 203 may include an input interface and an output interface. The input interface can be used to acquire the first matrix and the second matrix in compressed format described in the following embodiments. In some feasible embodiments, there may be only one input interface or multiple input interfaces. The output interface can be used to output the result matrix described in the following embodiments. In some feasible embodiments, the result matrix may be output directly by the processor, or may first be stored in the memory and then output via the memory. In other feasible embodiments, there may be only one output interface or multiple output interfaces.

[0096] The bus 204 can be a Peripheral Component Interconnect Express (PCIe) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus 204 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is drawn as a single thick line in Figure 2, but this does not mean that there is only one bus or one type of bus.

[0097] In this embodiment, the processor 202 may include a matrix computing device, which may be an ASIC, FPGA, or logic circuit, etc. Of course, the device can also be implemented in software, and this application embodiment does not impose specific limitations on this. The matrix computing device can be used to perform matrix calculations related to artificial intelligence, scientific computing, graphics computing, etc.

[0098] Furthermore, the processor 202 may also include one or more other processing units such as a CPU, GPU, or NPU. For example, as shown in Figure 3, taking a processor 202 that includes a CPU 1 and a matrix computing device 2 as an example, the matrix computing device 2 can be integrated with the CPU 1 (for example, inside the SoC where the CPU 1 is located), or it can be provided separately alongside the CPU 1 (for example, in the form of a PCIe card), as shown in Figure 3(a) and Figure 3(b). Furthermore, the CPU 1 may also include a controller 11, one or more arithmetic logic units (ALUs) 12, a cache 13, a memory management unit (MMU) 14, etc. In Figure 3, the memory 201 is illustrated as dynamic random access memory (DRAM).

[0099] In this embodiment, the matrix computing device can perform calculations on compressed matrices. When multiplying two compressed matrices, the matrix computing device first obtains N first column vectors converted from one of the matrices. Each first column vector contains first element values and their row coordinates. It then obtains N first row vectors converted from the other matrix. Each first row vector contains second element values and their column coordinates. The matrix computing device calculates the outer products of the N first column vectors and the N first row vectors to obtain N intermediate result matrices. Each intermediate result matrix includes third element values and their corresponding position coordinates, where the position coordinates consist of the row coordinate of the first element value and the column coordinate of the second element value. The matrix computing device then accumulates, according to the index of the position coordinates, the third element values with the same position coordinates in the N intermediate result matrices to obtain the result matrix. In other words, the matrix computing device performs the calculation on the first and second matrices based on the vector outer product. During the calculation, the row coordinates of the element values in the first column vectors and the column coordinates of the element values in the first row vectors are retained, and the third element values with the same position coordinates are accumulated based on the index of the position coordinates, yielding the result matrix of the two compressed matrices. Compared with the traditional method, which first decompresses the compressed matrices and then performs matrix calculations on the decompressed matrices, the matrix computing device provided in this embodiment can effectively improve the calculation efficiency for compressed matrices.
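The outer-product formulation described above rests on a standard identity for matrix multiplication. Writing A_n for the n-th column of the M×N first matrix and B_n for the n-th row of the N×K second matrix (notation assumed here for illustration), the product is the sum of N rank-one matrices, each of which corresponds to one intermediate result matrix C_n:

$$
A B = \sum_{n=0}^{N-1} A_n \otimes B_n = \sum_{n=0}^{N-1} C_n,
\qquad
(C_n)_{ij} = A_{in}\, B_{nj}
$$

Because (C_n)_{ij} is non-zero only when both A_{in} and B_{nj} are non-zero, only stored element values need to be multiplied, and the pair (i, j) is exactly the position coordinate used for accumulation.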

[0100] This application provides a matrix computing device. As shown in Figure 4A, the matrix computing device includes a vector outer product processing engine 401 and an accumulator 402. Optionally, the matrix computing device includes a first buffer 403, with the vector outer product processing engine 401, the accumulator 402, and the first buffer 403 connected sequentially. The first buffer 403 may be a buffer within the matrix computing device (such as a register), or it may be the cache 13 in the central processing unit 1 shown in Figure 3. Optionally, the matrix computing device further includes a format conversion unit 405 and a second buffer 406. The function of the format conversion unit 405 can be implemented by the central processing unit 1 shown in Figure 3, or by logic circuitry within the matrix computing device. The second buffer 406 can be a register within the matrix computing device, or it can be the cache 13 in the central processing unit 1 shown in Figure 3.

[0101] The format conversion unit 405 is used to convert the first matrix in compressed format into N first column vectors and the second matrix in compressed format into N first row vectors. The first column vector includes the first element value and the row coordinate of the first element value, and the first row vector includes the second element value and the column coordinate of the second element value.

[0102] Optionally, the structure of the vector outer product processing engine 401 is described below. Figure 4B is a schematic diagram of the vector outer product processing engine 401. As shown in Figure 4B, the vector outer product processing engine 401 comprises multiple processing elements (PEs) 4011, which form a two-dimensional array. Each PE 4011 includes a multiply-accumulate (MAC) operation subunit 40110 and a coordinate merging subunit 40111. Each PE receives two sets of input data and outputs one set of data. Each set of input data contains a value and an index. One set of input data includes a first element value and its corresponding row coordinate, and the other set includes a second element value and its corresponding column coordinate.

[0103] Each PE has two functions. One function is to receive two sets of input data and output one set of output data based on them. The output data includes a third element value and its position coordinates. The third element value is obtained by multiplying the first element value and the second element value. The position coordinates are obtained by merging the row coordinate of the first element value and the column coordinate of the second element value. For example, taking the PE in row 0, column 0 of the PE array as an example, one set of input data consists of a first element value (e.g., a0) in the first column vector and its corresponding row coordinate (e.g., i0). The other set of input data consists of a second element value (e.g., b0) in the first row vector and its corresponding column coordinate (e.g., j0). The MAC operation subunit 40110 receives the first and second element values, multiplies them, and outputs their product (i.e., the third element value). The coordinate merging subunit 40111 receives the row coordinate of the first element value and the column coordinate of the second element value and merges them to obtain the position coordinates. The data output by a PE thus includes a third element value and the position coordinates corresponding to that third element value.
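The first function of a PE can be modelled in a few lines. This is a behavioural sketch only, not a description of the circuit: the names `Tagged` and `processing_element` are hypothetical, the input values are arbitrary, and the MAC operation subunit is reduced to a single multiplication.

```python
from typing import NamedTuple, Tuple


class Tagged(NamedTuple):
    """An element value together with the one coordinate that travels with it."""
    value: float
    coord: int  # row coordinate for first element values, column coordinate for second


def processing_element(a: Tagged, b: Tagged) -> Tuple[float, Tuple[int, int]]:
    """Model one PE: multiply the values and merge the coordinates."""
    product = a.value * b.value     # role of MAC operation subunit 40110
    position = (a.coord, b.coord)   # role of coordinate merging subunit 40111
    return product, position


# Illustrative inputs: a first element value 2 in row 3, a second element value 5 in column 1.
print(processing_element(Tagged(2, 3), Tagged(5, 1)))  # (10, (3, 1))
```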

[0104] The other function of a PE is to transmit the first element value and its corresponding row coordinate to the next PE in the row direction, and the second element value and its corresponding column coordinate to the next PE in the column direction. For example, after the first clock cycle, the first PE in the PE array (i.e., the PE in row 0, column 0) transmits a0 and i0 to the next PE in the row direction (i.e., the PE in row 0, column 1), and transmits b0 and j0 to the next PE in the column direction (i.e., the PE in row 1, column 0). Optionally, each PE can transmit data to the next-level PE once per clock cycle; alternatively, within one clock cycle, the first PE can transmit a0 and i0 along the row direction through successive PEs until they reach the last PE in that row (i.e., row 0), and transmit b0 and j0 along the column direction until they reach the last PE in that column (i.e., column 0). In this example, the number of PE levels traversed within each clock cycle can be designed according to actual needs and is not specifically limited.

[0105] Optionally, the structure of the MAC operation subunit in a PE is described below. Figure 4C is a schematic diagram of the MAC operation subunit. As shown in Figure 4C, the MAC operation subunit of each PE includes a sign subunit 40112, an exponent subunit 40113, an integer subunit 40114, and a precision format conversion subunit 40115. The sign subunit 40112, the exponent subunit 40113, and the integer subunit 40114 are all connected to the precision format conversion subunit 40115. Specifically, the sign subunit 40112 handles the sign of the input values. The exponent subunit 40113 handles the decimal point shift calculation when multiplying two floating-point numbers. The integer subunit 40114 calculates the multiplication of two integers. The precision format conversion subunit 40115 outputs a value in a standard numerical format (such as the FP16, FP32, or FP64 floating-point formats, or the int32, int16, or int8 integer formats). In this example, the MAC operation subunit is a general-purpose unit supporting both floating-point and integer formats, which broadens the applicability of the matrix computing device.

[0106] Optionally, the structure of the accumulator 402 in the matrix computing device is described below. Figure 4D is a schematic diagram of the accumulator 402. As shown in Figure 4D, the accumulator 402 includes multiple accumulator (ACC) processing units 4021; that is, the accumulator 402 comprises an array of ACC processing units 4021. Each ACC processing unit 4021 receives a set of data output from the vector outer product processing engine 401. This set of data includes a third element value and the corresponding position coordinates. Each ACC processing unit 4021 includes an adder 40210 and a data acquisition subunit 40211. The data acquisition subunit 40211 receives the position coordinates and outputs them to the first buffer 403, so that the adder 40210 can retrieve the cached value corresponding to those position coordinates from the first buffer 403. The adder 40210 then adds the cached value output from the first buffer 403 to the third element value. For example, an ACC processing unit 4021 receives "third element value 0" (e.g., c00) of the intermediate result matrix C0 output by the vector outer product processing engine 401, together with the position coordinates (i0, j0) corresponding to c00 (e.g., (1, 1)), and writes "third element value 0" into the first buffer 403 according to those position coordinates. Then, when the ACC processing unit 4021 receives "third element value 1" (e.g., c10) of the intermediate result matrix C1 output by the vector outer product processing engine 401, and the position coordinates corresponding to c10 are also (1, 1), the ACC processing unit 4021 receives the cached value (i.e., c00) corresponding to those position coordinates from the first buffer 403. The adder 40210 then adds c00 and c10, and the accumulated value of c00 and c10 is output to the first buffer 403.
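The read-add-write cycle of an ACC processing unit can be sketched behaviourally as follows. The class name `Accumulator` is hypothetical, a Python dictionary stands in for the first buffer 403, and the input values are arbitrary; none of this is intended to describe the actual circuit.

```python
class Accumulator:
    """Behavioural model of coordinate-indexed accumulation."""

    def __init__(self):
        self.buffer = {}  # stands in for the first buffer 403, keyed by position coordinates

    def accumulate(self, value, position):
        # Role of data acquisition subunit 40211: look up the cached value for this position.
        cached = self.buffer.get(position, 0)
        # Role of adder 40210: add the incoming third element value and write the sum back.
        self.buffer[position] = cached + value
        return self.buffer[position]


acc = Accumulator()
acc.accumulate(3, (1, 1))  # first value for position (1, 1) is simply stored
acc.accumulate(4, (1, 1))  # same position: added to the cached value
acc.accumulate(5, (0, 2))  # different position: stored separately
print(acc.buffer)  # {(1, 1): 7, (0, 2): 5}
```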

[0107] Optionally, as shown in Figure 4E, the adder 40210 further includes a sign unit 40215, an exponent unit 40216, an integer unit 40217, and a precision format conversion unit 40218. The sign unit 40215, the exponent unit 40216, and the integer unit 40217 are all connected to the precision format conversion unit 40218. Specifically, the sign unit 40215 handles the sign of the input values. The exponent unit 40216 handles the decimal point shift calculation when multiplying two floating-point numbers. The integer unit 40217 calculates the multiplication of two integers. The precision format conversion unit 40218 outputs a value in a standard numerical format (such as the FP16, FP32, or FP64 floating-point formats, or the int32, int16, or int8 integer formats). In this example, the adder 40210 is a general-purpose unit supporting both floating-point and integer formats, which broadens the applicability of the matrix computing device.

[0108] The specific function of the format conversion unit 405 is explained below. The format conversion unit 405 is used to convert the compressed first matrix into N first column vectors and the compressed second matrix into N first row vectors. Each first column vector includes the first element value and its row coordinate, and each first row vector includes the second element value and its column coordinate. The first matrix has a dimension of M×N, and the second matrix has a dimension of N×K, where M, N, and K are integers greater than or equal to 1.

[0109] As shown in Figure 5A, for ease of explanation, M, N, and K are all taken to be 4 in this example, meaning that both the first and second matrices have a dimension of 4×4. The first matrix is illustrated using matrix A as an example, and the second matrix using matrix B as an example. The compressed format of the first and second matrices is illustrated using COO as an example.

[0110] The format conversion unit 405 splits matrix A by columns into four first column vectors, namely A0, A1, A2, and A3. The format conversion unit 405 also splits matrix B by rows into four first row vectors, namely B0, B1, B2, and B3. Taking A0 as an example of a first column vector, A0 is [a0, a1, a2, a3]ᵀ, where a0, a1, a2, and a3 are element values. Each element value in A0 has a corresponding row coordinate. For example, the row coordinate of a0 is i0, the row coordinate of a1 is i1, the row coordinate of a2 is i2, and the row coordinate of a3 is i3. Taking B0 as an example of a first row vector, B0 is [b0, b1, b2, b3], where b0, b1, b2, and b3 are element values. Each element value in B0 has a corresponding column coordinate. For example, the column coordinate of b0 is j0, the column coordinate of b1 is j1, the column coordinate of b2 is j2, and the column coordinate of b3 is j3. It should be understood that matrix A is split into N columns, and each first column vector obtained after splitting retains only the row coordinate of each element value. Similarly, matrix B is split into N rows, and each first row vector obtained after splitting retains only the column coordinate of each element value. Likewise, A1 is [c0, c1, c2, c3]ᵀ, where c0, c1, c2, and c3 are element values. Each element value in A1 has a corresponding row coordinate. For example, the row coordinate of c0 is k0, the row coordinate of c1 is k1, the row coordinate of c2 is k2, and the row coordinate of c3 is k3. B1 is [d0, d1, d2, d3], where d0, d1, d2, and d3 are element values. Each element value in B1 has a corresponding column coordinate. For example, the column coordinate of d0 is l0, the column coordinate of d1 is l1, the column coordinate of d2 is l2, and the column coordinate of d3 is l3.

[0111] For example, the first matrix in compressed format, represented in COO format, is: (0, 0, 1), (1, 0, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5), (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), (3, 0, 6). The format conversion unit 405 splits the first matrix in compressed format by column. The element values in (0, 0, 1), (1, 0, 2), (2, 0, 5), and (3, 0, 6) all belong to the same column, namely column 0. When the format conversion unit 405 splits matrix A by column, it therefore takes the element values "1", "2", "5", and "6" as the element values of vector A0. Since these element values all belong to the same column, only the row coordinate of each element value is retained. In this example, vector A0 has four element values: "1", "2", "5", and "6". The row coordinate of element value "1" is "0". Similarly, the row coordinate of element value "2" is "1", the row coordinate of element value "5" is "2", and the row coordinate of element value "6" is "3".

[0112] For example, the second matrix in compressed format, represented in COO format, is: (0, 0, 1), (1, 0, 2), (2, 1, 3), (1, 2, 4), (2, 0, 5), (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), (3, 0, 6). The format conversion unit 405 splits the second matrix according to row coordinates; that is, the element values of the same row are treated as element values of the same vector. For example, the four triples (2, 1, 3), (2, 0, 5), (2, 2, 6), and (2, 3, 7) have the same row coordinate. Therefore, the element values of these four triples are split into one row vector (e.g., row vector B0), and the element values "3", "5", "6", and "7" are treated as the element values of the first row vector B0, with each element value in B0 having a corresponding column coordinate. For example, the column coordinate of element value "3" is "1", the column coordinate of element value "5" is "0", the column coordinate of element value "6" is "2", and the column coordinate of element value "7" is "3". It should be noted that the specific values in the COO format mentioned above are merely examples for illustrative purposes and do not limit this application.
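The splitting performed by the format conversion unit can be sketched directly on the example triples above. The helper names `split_by_column` and `split_by_row` are hypothetical, and the dictionaries are keyed by the actual column or row number rather than by the vector labels used in the text.

```python
from collections import defaultdict

# COO triples (row, column, value) of the first and second matrices, as listed in the text.
A_COO = [(0, 0, 1), (1, 0, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5),
         (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), (3, 0, 6)]
B_COO = [(0, 0, 1), (1, 0, 2), (2, 1, 3), (1, 2, 4), (2, 0, 5),
         (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), (3, 0, 6)]


def split_by_column(triples):
    """Group triples by column; each entry keeps only (value, row coordinate)."""
    columns = defaultdict(list)
    for row, col, value in triples:
        columns[col].append((value, row))
    return dict(columns)


def split_by_row(triples):
    """Group triples by row; each entry keeps only (value, column coordinate)."""
    rows = defaultdict(list)
    for row, col, value in triples:
        rows[row].append((value, col))
    return dict(rows)


print(split_by_column(A_COO)[0])  # [(1, 0), (2, 1), (5, 2), (6, 3)]
print(split_by_row(B_COO)[2])     # [(3, 1), (5, 0), (6, 2), (7, 3)]
```

Dropping the shared coordinate is what lets each vector carry a single index per element: the column number is implied by which column vector an element belongs to, and likewise for rows.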

[0113] The vector outer product processing engine 401 calculates the outer products of the N first column vectors and the N first row vectors, obtaining N intermediate result matrices. Each intermediate result matrix includes third element values and the position coordinates of those third element values. The position coordinates consist of the row coordinate of the first element value and the column coordinate of the second element value. For example, as shown in Figure 5B, the vector outer product processing engine 401 calculates the vector outer product of each first column vector and the corresponding first row vector. The vector outer product processing engine 401 calculates the vector outer product of A0 and B0 to obtain an intermediate result matrix C0. Similarly, it calculates the vector outer product of A1 and B1 to obtain an intermediate result matrix C1, the vector outer product of A2 and B2 to obtain an intermediate result matrix C2, and the vector outer product of A3 and B3 to obtain an intermediate result matrix C3. Taking C0 as an example, the intermediate result matrix calculated by the vector outer product processing engine 401 is shown in equation (11) below.

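[0114]

$$
C_0 = A_0 \otimes B_0 =
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix}
\begin{bmatrix} b_0 & b_1 & b_2 & b_3 \end{bmatrix}
=
\begin{bmatrix}
a_0 b_0 & a_0 b_1 & a_0 b_2 & a_0 b_3 \\
a_1 b_0 & a_1 b_1 & a_1 b_2 & a_1 b_3 \\
a_2 b_0 & a_2 b_1 & a_2 b_2 & a_2 b_3 \\
a_3 b_0 & a_3 b_1 & a_3 b_2 & a_3 b_3
\end{bmatrix}
$$

equation (11).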

[0115] The intermediate result matrix C0 includes third element values, each with corresponding position coordinates. These position coordinates are the combination of the row coordinate of the first element value and the column coordinate of the second element value. For example, as shown in Figure 5C, the row coordinate of a0 is i0 and the column coordinate of b0 is j0, so the position coordinates of the third element value a0b0 are (i0, j0). Similarly, the row coordinate of a0 is i0 and the column coordinate of b1 is j1, so the position coordinates of the third element value a0b1 are (i0, j1). The row coordinate of a1 is i1 and the column coordinate of b0 is j0, so the position coordinates of the third element value a1b0 are (i1, j0), and so on. The position coordinates of the remaining third element values in the intermediate result matrix C0 are not listed one by one.

[0116] For example, a0 is 1, a1 is 2, a2 is 5, and a3 is 6. The row coordinate of a0 is "0". Similarly, the row coordinate of a1 is "1", the row coordinate of a2 is "2", and the row coordinate of a3 is "3". b0 is 3, b1 is 5, b2 is 6, and b3 is 7. The column coordinate of b0 is "1", the column coordinate of b1 is "0", the column coordinate of b2 is "2", and the column coordinate of b3 is "3". The third element value a0b0 = 1 × 3 = 3, and the position coordinates of a0b0 consist of the row coordinate of a0 and the column coordinate of b0; that is, the position coordinates of a0b0 are (0, 1). The third element value a0b1 = 1 × 5 = 5, and the position coordinates of a0b1 consist of the row coordinate of a0 and the column coordinate of b1; that is, the position coordinates of a0b1 are (0, 0). The third element value a1b0 = 2 × 3 = 6, and the position coordinates of a1b0 consist of the row coordinate of a1 and the column coordinate of b0; that is, the position coordinates of a1b0 are (1, 1).
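The coordinate-tagged outer product can be checked with a short sketch that uses the element values and coordinates of A0 and B0 given above. The function name `outer_product` is hypothetical; each output entry pairs a third element value with the row coordinate of the first element value and the column coordinate of the second.

```python
# (value, row coordinate) pairs of A0 and (value, column coordinate) pairs of B0.
A0 = [(1, 0), (2, 1), (5, 2), (6, 3)]
B0 = [(3, 1), (5, 0), (6, 2), (7, 3)]


def outer_product(column_vector, row_vector):
    """Return a list of (third element value, (row, column)) entries."""
    return [
        (a * b, (i, j))
        for a, i in column_vector   # each first element value with its row coordinate
        for b, j in row_vector      # each second element value with its column coordinate
    ]


C0 = outer_product(A0, B0)
print(len(C0))  # 16
print(C0[0])    # (3, (0, 1))  -> a0 * b0
print(C0[1])    # (5, (0, 0))  -> a0 * b1
print(C0[4])    # (6, (1, 1))  -> a1 * b0
```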

[0117] As shown in Figure 5D, the vector outer product processing engine 401 obtains the intermediate result matrix according to the vector outer product calculation formula in equation (7) above. Taking C1 as an example, the intermediate result matrix calculated by the vector outer product processing engine 401 is shown in equation (12) below.

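[0118]

$$
C_1 = A_1 \otimes B_1 =
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
\begin{bmatrix} d_0 & d_1 & d_2 & d_3 \end{bmatrix}
=
\begin{bmatrix}
c_0 d_0 & c_0 d_1 & c_0 d_2 & c_0 d_3 \\
c_1 d_0 & c_1 d_1 & c_1 d_2 & c_1 d_3 \\
c_2 d_0 & c_2 d_1 & c_2 d_2 & c_2 d_3 \\
c_3 d_0 & c_3 d_1 & c_3 d_2 & c_3 d_3
\end{bmatrix}
$$

equation (12).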

[0119] Similarly, the intermediate result matrix C1 includes third element values, and each third element value in C1 has corresponding position coordinates. For example, as shown in Figure 5D, the row coordinate of c0 is k0 and the column coordinate of d0 is l0, so the position coordinates of the third element value c0d0 are (k0, l0). Similarly, the row coordinate of c0 is k0 and the column coordinate of d1 is l1, so the position coordinates of the third element value c0d1 are (k0, l1). The row coordinate of c1 is k1 and the column coordinate of d0 is l0, so the position coordinates of the third element value c1d0 are (k1, l0), and so on. The position coordinates of the remaining third element values in the intermediate result matrix C1 are not listed one by one.

[0120] The accumulator 402 is used to accumulate, based on the index of the position coordinates, the third element values with the same position coordinates in the N intermediate result matrices, to obtain the result matrix. As shown in Figure 5B, this example uses four intermediate result matrices. In each of the intermediate result matrices C0, C1, C2, and C3, every third element value has position coordinates. The accumulator 402 accumulates the third element values with the same position coordinates to obtain the result matrix. For example, as shown in Figure 5E, the accumulator 402 adds the four third element values of C0, C1, C2, and C3 whose position coordinates are (0, 0) to obtain one fourth element value of the result matrix. Similarly, the accumulator 402 adds the four third element values of C0, C1, C2, and C3 whose position coordinates are (1, 1) to obtain another fourth element value of the result matrix, and so on. In this example, the specific numerical values of the position coordinates are merely illustrative and not intended to be limiting.
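Putting the steps together, the whole flow (split, outer product, coordinate-indexed accumulation) can be sketched in software and checked against an ordinary dense multiplication. This is an illustrative model only: all function names are hypothetical, a dictionary stands in for the buffer, and the 4×4 matrix Y from the COO example is multiplied by itself purely as test data.

```python
from collections import defaultdict


def spgemm_outer(a_coo, b_coo):
    """Multiply two COO matrices via per-index outer products and accumulation by position."""
    a_columns = defaultdict(list)           # column n of A -> [(value, row coordinate)]
    for i, n, value in a_coo:
        a_columns[n].append((value, i))
    b_rows = defaultdict(list)              # row n of B -> [(value, column coordinate)]
    for n, j, value in b_coo:
        b_rows[n].append((value, j))

    result = defaultdict(int)               # position coordinates -> accumulated value
    for n in sorted(a_columns.keys() & b_rows.keys()):
        for a, i in a_columns[n]:           # outer product of column n of A and row n of B
            for b, j in b_rows[n]:
                result[(i, j)] += a * b     # accumulate values sharing a position
    return dict(result)


def to_dense(entries, n_rows, n_cols):
    """Expand a {(row, column): value} mapping into a dense list of lists."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (i, j), value in entries.items():
        dense[i][j] = value
    return dense


def dense_matmul(x, y):
    """Reference dense multiplication used only to check the sparse result."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


Y_COO = [(0, 0, 1), (0, 1, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5),
         (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9)]
Y_DENSE = to_dense({(i, j): v for i, j, v in Y_COO}, 4, 4)

product = spgemm_outer(Y_COO, Y_COO)
print(product[(0, 1)])  # 8, i.e. 1*2 + 2*3
print(to_dense(product, 4, 4) == dense_matmul(Y_DENSE, Y_DENSE))  # True
```

The result stays in coordinate form until `to_dense` is called, which mirrors the option of outputting either a compressed or an uncompressed result matrix.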

[0121] Optionally, the accumulator 402 accumulates the third element values with the same position coordinates in the N intermediate result matrices according to the index of the position coordinates of the third element values. This can be achieved in at least the following two ways.

[0122] In the first feasible approach, the vector outer product processing engine 401 generates the intermediate result matrices in a specific order when calculating the outer products of the first column vectors and the first row vectors. Taking the outer product calculation shown in Figure 5B as an example, the vector outer product processing engine 401 first calculates the outer product of A0 and B0 to obtain the intermediate result matrix C0, then the outer product of A1 and B1 to obtain C1, then the outer product of A2 and B2 to obtain C2, and finally the outer product of A3 and B3 to obtain C3. The four intermediate result matrices are therefore generated in the order C0, C1, C2, C3. The accumulator 402 receives the four intermediate result matrices from the vector outer product processing engine 401 in the order in which they are generated.

[0123] As shown in Figure 6, the four intermediate result matrices include at least a first intermediate result matrix (e.g., C0) and a second intermediate result matrix (e.g., C1). To distinguish the position coordinates in the first intermediate result matrix from those in the second intermediate result matrix, the position coordinates of the third element values in the first intermediate result matrix are called "first position coordinates," and the position coordinates of the third element values in the second intermediate result matrix are called "second position coordinates."

[0124] For example, the storage space of the first buffer 403 is divided into multiple storage locations by row coordinate identifiers and column coordinate identifiers; with p rows and q columns, the first buffer 403 contains p×q storage locations. After receiving C0, the accumulator 402 writes each third element value in C0 to the corresponding location in the first buffer 403 according to its first position coordinates. For instance, if the position coordinates (i0, j0) of the third element value a0b0 in C0 are (1, 1), the accumulator 402 writes a0b0 to the first row, first column of the first buffer 403. Similarly, if the position coordinates (i0, j1) of a0b1 are (1, 2), the accumulator 402 writes a0b1 to the first row, second column; if the position coordinates (i0, j2) of a0b2 are (1, 3), it writes a0b2 to the first row, third column; and if the position coordinates (i0, j3) of a0b3 are (1, 4), it writes a0b3 to the first row, fourth column. The writing of the other third element values in C0 is not described one by one. In the end, the accumulator 402 has written all third element values in C0 to the first buffer 403 according to the first position coordinates of each third element value.

[0125] Then, when accumulator 402 receives C1, it checks, for each third element value in C1, whether the first buffer 403 holds a cached value at the location given by that value's second position coordinates. If no cached value is found at that location, accumulator 402 writes the third element value into that location of the first buffer 403. If a cached value is found, accumulator 402 reads the cached value from the first buffer 403, adds it to the third element value in C1, and writes the sum back to the same location. For example, if the position coordinates (k0, l0) of the third element value c0d0 in C1 are (1, 0), accumulator 402 queries the first buffer 403 at (1, 0). No cached value is found at the first row, column 0, so accumulator 402 writes c0d0 into that location. If the position coordinates (k0, l1) of the third element value c0d1 in C1 are (1, 1), accumulator 402 queries the first buffer 403 and finds the cached value a0b0 at the first row, first column. Accumulator 402 reads a0b0, adds c0d1 and a0b0, and writes the result (c0d1 + a0b0) back to the first row, first column of the first buffer 403 according to the position coordinates of c0d1. Similarly, if the position coordinates (k0, l2) of the third element value c0d2 in C1 are (1, 2), accumulator 402 finds the cached value a0b1 at the first row, second column, reads it, adds c0d2 and a0b1, and writes the accumulated value (c0d2 + a0b1) back to the first row, second column. The remaining third element values in C1 are not illustrated one by one. In the end, the accumulator 402 has added the third element values with the same position coordinates in C1 and C0 and written the results back to the corresponding locations in the first buffer 403. Similarly, when accumulator 402 receives the intermediate result matrix C2 from the vector outer product processing engine 401, it reads the cached value (the sum of the third element values with the same position coordinates in C0 and C1) from the corresponding location in the first buffer 403 according to the position coordinates of each third element value in C2, adds the third element value of C2 so that the third element values with the same position coordinates in C0, C1, and C2 are summed, and writes the result back to the corresponding location. The processing of C3 by accumulator 402 is similar to the processing of C1 and C2 and is not described in detail here. The final result is the sum of the third element values with the same position coordinates among C0, C1, C2, and C3, which is a fifth element value. The result matrix includes multiple fifth element values. Finally, the first buffer 403 outputs the result matrix.
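The read, add, and write-back procedure of the first approach can be sketched as follows. This is only an illustrative model, assuming the first buffer 403 behaves like a p×q array whose empty locations are marked `None`; the function name `accumulate_into_buffer` and the numeric values are hypothetical.

```python
def accumulate_into_buffer(intermediates, p, q):
    """Accumulate intermediate result matrices into a p x q buffer.

    intermediates: matrices in generation order (C0, C1, ...), each a list
    of ((row, column), value) pairs. None marks a location with no cached value.
    """
    buffer = [[None] * q for _ in range(p)]
    for matrix in intermediates:
        for (r, c), value in matrix:
            cached = buffer[r][c]
            # No cached value: write directly. Otherwise: read, add, write back.
            buffer[r][c] = value if cached is None else cached + value
    return buffer


# Hypothetical values mirroring the example: a0b0 at (1, 1), a0b1 at (1, 2),
# c0d0 at (1, 0), c0d1 at (1, 1), c0d2 at (1, 2).
c0 = [((1, 1), 2.0), ((1, 2), 3.0)]
c1 = [((1, 0), 5.0), ((1, 1), 4.0), ((1, 2), 1.0)]
buf = accumulate_into_buffer([c0, c1], p=3, q=3)
# Row 1 now holds c0d0, a0b0 + c0d1, a0b1 + c0d2.
```

Because every location is addressed directly by its position coordinates, the order in which elements arrive within one intermediate result matrix does not matter in this sketch.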

[0126] It should be noted that a fifth element value is the sum of the third element values from at least one of the intermediate result matrices C0, C1, C2, and C3. For example, as shown in Figure 6, the position coordinates of a0b0 in C0 are (1, 0), while none of the third element values in the other three intermediate result matrices (C1, C2, and C3) has the position coordinates (1, 0). Therefore, the value written to the first row, column 0 of the first buffer 403 is a0b0, and this a0b0 is not accumulated with any third element value of the other intermediate result matrices. Similarly, the position coordinates of a0b1 in C0 and of c0d1 in C1 are both (1, 1), while none of the third element values in the other two intermediate result matrices (C2 and C3) has the position coordinates (1, 1). Therefore, the value written to the first row, first column of the first buffer 403 is the sum of a0b1 and c0d1. In summary, a fifth element value may be a single third element value, the sum of two third element values, the sum of three third element values, the sum of four third element values, and so on. In actual calculations, the number of third element values summed to obtain a given fifth element value equals the number of third element values that share the same position coordinates; the specific number is not limited.

[0127] Furthermore, since a fifth element value is a sum of third element values, it may be zero. In this example, the result matrix is in uncompressed format. Optionally, to save transmission resources or facilitate subsequent calculations, the matrix computing device can compress the result matrix and output a compressed matrix. Figure 7 is another schematic diagram of the matrix computing device. As shown in Figure 7, the matrix computing device further includes a matrix compression unit 404 connected to the first buffer 403. The first buffer 403 outputs the result matrix, and the matrix compression unit 404 receives the result matrix and converts it into a compressed matrix according to the row and column coordinate identifiers in the first buffer 403. The compressed matrix can be in COO format, CSR format, CSC format, etc., and is not specifically limited.
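As an illustration of what the matrix compression unit 404 produces, the sketch below converts an uncompressed result matrix into COO form (parallel lists of row coordinates, column coordinates, and values). The function name `compress_to_coo` and the sample matrix are assumptions for this example; the application does not specify how the unit is implemented.

```python
def compress_to_coo(dense):
    """Convert a dense matrix (list of rows) to COO: (rows, cols, values).

    Only non-zero element values are kept, together with their coordinates.
    """
    rows, cols, values = [], [], []
    for r, row in enumerate(dense):
        for c, v in enumerate(row):
            if v != 0:
                rows.append(r)
                cols.append(c)
                values.append(v)
    return rows, cols, values


result = [[0, 5, 0],
          [7, 0, 0],
          [0, 0, 0]]
coo = compress_to_coo(result)
```

CSR or CSC output would differ only in how the coordinates of one dimension are stored (as per-row or per-column offsets instead of explicit coordinates).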

[0128] It should be understood that during the accumulation process, the first buffer 403 stores the third element values and the results of accumulating them, so its storage space must be at least a certain size. For example, if the dimension of each intermediate result matrix is M×P, that is, each intermediate result matrix includes M×P third element values, then the first buffer 403 needs at least M×P storage locations so that it can store at least M×P values.

[0129] In the first possible implementation, the matrix computing device can output uncompressed or compressed matrices according to different application scenarios, thus increasing the applicable scenarios of the matrix computing device.

[0130] In the second possible implementation, as shown in Figure 8, accumulator 402 sorts the third element values in the N intermediate result matrices according to their position coordinates. It then compares the position coordinates across the sorted intermediate result matrices, adds the third element values with the same position coordinates, and removes zero-valued elements together with their position coordinates to obtain a compressed result matrix. For example, when accumulator 402 receives C0, it first sorts the third element values in C0 according to their position coordinates and writes them into the first buffer 403. Accumulator 402 can sort by row coordinate or, optionally, by column coordinate; this example sorts by row coordinate. When accumulator 402 receives C1, it sorts the third element values in C1 by row coordinate and then compares the position coordinates of the third element values in C1 with those in C0 in sorted order. For example, accumulator 402 first compares row coordinates, such as k0 and i0. If k0 and i0 match, it goes on to compare the column coordinates; if they do not match, it continues by comparing k0 and i1. Once the row coordinates match, accumulator 402 compares the column coordinates in sorted order: l0 with j0, l0 with j1, and so on. When the position coordinates (k0, l0) and (i0, j1) match, accumulator 402 adds the third element value c0d0 at (k0, l0) and the third element value a0b1 at (i0, j1), and writes the sum into the first buffer 403.

[0131] Similarly, when accumulator 402 receives C2, it sorts the third element values in C2 according to their position coordinates. Accumulator 402 reads the position coordinates from the first buffer 403, compares them with the position coordinates of the third element values in C2, adds each third element value to the buffered value with the same position coordinates, and stores the accumulated value in the first buffer 403. The processing of the third element values in C3 is similar to that of C2 and is not elaborated here. In the end, accumulator 402 has accumulated the third element values with the same position coordinates in C0, C1, C2, and C3 to obtain the fifth element values, and has removed zero-valued fifth element values together with their position coordinates. The first buffer 403 outputs a result matrix containing the fifth element values and their position coordinates. Because every fifth element value in this result matrix is non-zero, the result matrix output by the first buffer 403 is a compressed matrix.
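The sort, compare, and add procedure of the second approach resembles a merge of coordinate-sorted lists. The sketch below is one possible software model, assuming each intermediate result matrix is a list of ((row, column), value) pairs with unique coordinates; the names `merge_sorted` and `accumulate_sorted` and the sample values are hypothetical.

```python
def merge_sorted(buffered, incoming):
    """Merge two lists of ((row, col), value), both sorted by (row, col).

    Entries with equal position coordinates are added together.
    """
    merged, i, j = [], 0, 0
    while i < len(buffered) and j < len(incoming):
        pos_b, val_b = buffered[i]
        pos_i, val_i = incoming[j]
        if pos_b == pos_i:            # same coordinates: accumulate
            merged.append((pos_b, val_b + val_i))
            i += 1
            j += 1
        elif pos_b < pos_i:           # row compared first, then column
            merged.append(buffered[i])
            i += 1
        else:
            merged.append(incoming[j])
            j += 1
    merged.extend(buffered[i:])
    merged.extend(incoming[j:])
    return merged


def accumulate_sorted(intermediates):
    """Accumulate all intermediate matrices and drop zero-valued results."""
    buffered = []
    for matrix in intermediates:
        buffered = merge_sorted(buffered, sorted(matrix, key=lambda e: e[0]))
    return [(pos, v) for pos, v in buffered if v != 0]


c0 = [((1, 1), 2.0), ((0, 0), 1.0)]
c1 = [((1, 1), -2.0), ((0, 2), 3.0)]
compressed = accumulate_sorted([c0, c1])
# The entry at (1, 1) sums to zero and is removed with its coordinates.
```

Only entries that actually occur are stored, which is consistent with the smaller buffer requirement of this approach compared with a full p×q array.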

[0132] In the second implementation, a compressed matrix can be directly obtained. The compressed matrix output by the matrix computing device can be applied to application scenarios where subsequent calculations require a compressed matrix. Furthermore, since the matrix computing device outputs a compressed matrix, the transmission resources for subsequent matrix transmission can be reduced. Moreover, in the second implementation, because the third element value cached in the first buffer 403 is cached in order of position coordinates, the implementation in this example can be achieved with less cache space, thus saving cache space in the first buffer 403.

[0133] Optionally, in this example, the format conversion unit 405 is further configured to acquire a fifth matrix and a sixth matrix, at least one of which is an uncompressed matrix, perform format conversion on them to obtain the first matrix and the second matrix in compressed format, and output the first matrix and the second matrix to the second buffer 406. The vector outer product processing engine 401 acquires the compressed first and second matrices from the second buffer 406. In this way, the matrix computing device can receive uncompressed matrices and convert them into compressed matrices, which enables it to support matrix calculations in multiple formats. The fifth and sixth matrices may fall into the following scenarios.

[0134] First scenario: as shown in Figure 9, both the fifth and sixth matrices are uncompressed matrices. In this case, the format conversion unit 405 converts the fifth matrix into a compressed first matrix and the sixth matrix into a compressed second matrix. Optionally, the format conversion unit 405 can convert the fifth and sixth matrices into compressed matrices in COO format. Optionally, since the CSC compressed format retains the row coordinates of the element values, while the CSR compressed format retains the column coordinates of the element values, the format conversion unit 405 can convert the uncompressed fifth matrix into a first matrix in CSC format and the uncompressed sixth matrix into a second matrix in CSR format. Further, the format conversion unit 405 converts the compressed first matrix into N first column vectors, converts the compressed second matrix into N first row vectors, and writes the N first column vectors and the N first row vectors into the second buffer 406. The vector outer product processing engine 401 retrieves the first column vectors and the first row vectors from the second buffer 406.
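The conversion of two uncompressed matrices into N first column vectors (keeping row coordinates, as in CSC) and N first row vectors (keeping column coordinates, as in CSR) can be sketched as follows. The helper names and the sample matrices are assumptions for illustration; N is taken to be the shared inner dimension of the two matrices.

```python
def to_column_vectors(dense):
    """Split a dense matrix into per-column lists of (row_coordinate, value)."""
    n_cols = len(dense[0])
    return [[(r, row[k]) for r, row in enumerate(dense) if row[k] != 0]
            for k in range(n_cols)]


def to_row_vectors(dense):
    """Split a dense matrix into per-row lists of (column_coordinate, value)."""
    return [[(c, v) for c, v in enumerate(row) if v != 0] for row in dense]


fifth = [[1, 0],
         [0, 2],
         [3, 0]]            # 3 x 2, uncompressed
sixth = [[0, 4, 0],
         [5, 0, 6]]         # 2 x 3, uncompressed
cols = to_column_vectors(fifth)   # N = 2 first column vectors
rows = to_row_vectors(sixth)      # N = 2 first row vectors

# Summing the N outer products by position coordinate gives fifth x sixth.
product = {}
for col_vec, row_vec in zip(cols, rows):
    for r, a in col_vec:
        for c, b in row_vec:
            product[(r, c)] = product.get((r, c), 0) + a * b
```

The k-th column of the left matrix is paired with the k-th row of the right matrix, so zero elements never enter the multiplication.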

[0135] Second scenario: one of the fifth and sixth matrices is in uncompressed format, while the other is in compressed format. Take the case where the fifth matrix is uncompressed and the sixth matrix is compressed as an example. If the sixth matrix is a compressed matrix in CSC or CSR format, the format conversion unit 405 converts both the fifth and sixth matrices into compressed matrices in COO format.

[0136] Optionally, the format conversion unit 405 is also used to convert the compressed matrix into a target compressed matrix. For example, when both the first matrix and the second matrix are in CSC or CSR format, the format conversion unit converts both the first matrix and the second matrix into COO format.

[0137] In this example, the matrix computing device can convert an uncompressed matrix into a compressed matrix using a format conversion unit, thus enabling the matrix computing device to support both compressed and uncompressed matrix calculations. Optionally, the format conversion unit can convert a matrix that is not in the target compressed format into a matrix in the target compressed format (such as COO format). In this example, the matrix computing device can convert matrices in other compressed formats into matrices in the target compressed format and perform matrix calculations on the matrices in the target compressed format. The matrix computing device provided in this application can support matrix calculations in various formats.

[0138] Optionally, to improve the applicability of the matrix computing device, high-precision matrix calculation can be implemented on a low-precision matrix computing device. The element values in the first column vector and the first row vector have a first precision. The format conversion unit 405 decomposes the first column vector into X second column vectors and the first row vector into X second row vectors, wherein the second column vectors and the second row vectors contain element values of a second precision, and the first precision is higher than the second precision. Then, the vector outer product processing engine 401 calculates the vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, each comprising fourth element values and their position coordinates. Finally, based on the index of the position coordinates of the fourth element values, accumulator 402 accumulates the fourth element values with the same position coordinates in the X² fourth matrices to obtain the intermediate result matrix. The precision of the third element values in the intermediate result matrix is the first precision.

[0139] The following describes how a first-precision vector is decomposed into second-precision vectors. The format conversion unit 405 converts the compressed first matrix into N first column vectors and the compressed second matrix into N first row vectors. According to the precision of the element values, it can further decompose each first column vector into X second column vectors and each first row vector into X second row vectors. The element values in the first column vector and the first row vector have the first precision, and the element values in the second column vectors and the second row vectors have the second precision. For ease of explanation, a vector containing second-precision element values is called a "second-precision vector," and a vector containing first-precision element values is called a "first-precision vector." X is an integer greater than or equal to 2. The first and second precisions can be integer or floating-point precisions; the specific precision is not limited. Specifically, the format conversion unit 405 splits each element value in the first-precision vector into multiple second-precision values, thereby decomposing the first-precision vector into X second-precision vectors. The following describes two cases: both vectors are integer vectors, and both vectors are floating-point vectors.

[0140] In the first case, both the first-precision vector and the second-precision vector are integers. The first precision is higher than the second precision. For example, if the first precision is int32, the second precision can be int2, int4, int8, or int16. Alternatively, the first precision is int16, and the second precision can be int4 or int8. Or, the first precision is int8, and the second precision is int2 or int4. Or, the first precision is int4, and the second precision is int2, and so on.

[0141] The format conversion unit 405 splits a high-precision integer value into multiple low-precision integer values, in order from the most significant bit to the least significant bit. For example, as shown in Figure 10, take int32 as the first precision and int16 as the second precision. An int32 has a 32-bit value field. Dividing it into two 16-bit segments from the most significant bit to the least significant bit yields two int16 values. Alternatively, dividing it into four 8-bit segments from the most significant bit to the least significant bit yields four int8 values. Similarly, an int16 has a 16-bit value field, and dividing it into two 8-bit segments from the most significant bit to the least significant bit yields two int8 values.
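The segment split of [0141] can be sketched as plain bit-field extraction. This sketch assumes the segments are treated as unsigned bit fields of the two's-complement pattern; how the hardware interprets the sign of each segment is not stated in the text, and the names `split_bits` and `join_bits` are hypothetical.

```python
def split_bits(value, total_bits, segment_bits):
    """Split an integer's bit pattern into segments, most significant first."""
    assert total_bits % segment_bits == 0
    pattern = value & ((1 << total_bits) - 1)   # two's-complement bit pattern
    mask = (1 << segment_bits) - 1
    count = total_bits // segment_bits
    return [(pattern >> (segment_bits * (count - 1 - k))) & mask
            for k in range(count)]


def join_bits(segments, segment_bits):
    """Rebuild the original bit pattern from its segments."""
    pattern = 0
    for segment in segments:
        pattern = (pattern << segment_bits) | segment
    return pattern


halves = split_bits(0x12345678, 32, 16)    # two 16-bit segments
quarters = split_bits(0x12345678, 32, 8)   # four 8-bit segments
restored = join_bits(halves, 16)
```

Each segment keeps an implied weight (its bit offset), which is what later allows the partial products to be recombined by accumulation.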

[0142] In the second case, both the first-precision vector and the second-precision vector are floating-point vectors. For example, the first precision is FP32 and the second precision is FP16; or the first precision is FP64 and the second precision is FP32 or FP16. The following explanation uses the case where the first precision is FP32 and the second precision is FP16 as an example.

[0143] 1. Decompose an FP32 into three FP16s.

[0144] The composition of a standard-format FP32 is shown in Table 1 above. FP32 includes a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa. In addition, there is an omitted (implicit) 1-bit integer part, whose value is 1, so the integer part plus the mantissa of a standard-format FP32 total 24 bits. A standard-format FP16 consists of a 1-bit sign, a 5-bit exponent, and a 10-bit mantissa, plus an omitted 1-bit integer part whose value is 1, so the integer part plus the mantissa of a standard-format FP16 total 11 bits. Because 24 bits do not fit into two 11-bit fields (22 bits), three standard-format FP16 values are needed to represent the integer part and mantissa of one standard-format FP32.

[0145] The integer part and mantissa of a standard-format FP32 can be divided into three parts: the first part is the integer part and the first 10 bits of the mantissa; the second part is mantissa bits 11 to 21; and the third part is mantissa bits 22 to 23. Each of these three parts is represented by one standard-format FP16. It should be noted that when mantissa bits 22 to 23 are represented in standard FP16 format, nine zeros can be appended after bit 23, so that the two mantissa bits together with the appended zeros fill the 11 bits of one standard-format FP16.

[0146] Furthermore, the exponent range of FP16 is -15 to 15, meaning that it can represent a radix point shifted by up to 15 bits to the left or right. When the first part of the aforementioned FP32 is represented in standard FP16 format, the fixed exponent shift value is 0; when the second part is represented, the fixed exponent shift value is -11; and when the third part is represented, the fixed exponent shift value is -22. The fixed exponent shift value of the third part alone already exceeds the exponent range of FP16. Therefore, the corresponding fixed exponent shift value can be extracted from the exponent of each standard-format FP16.

[0147] Therefore, a standard FP32 format can be represented as:

[0148] Where A1 is the standard-format FP32, EA1 is the exponent of A1, a0, a1, and a2 are the three standard-format FP16 values obtained by decomposition, and S1 is the minimum fixed exponent shift value. For standard-format FP16, S1 = 11.
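One form that is consistent with the fixed exponent shift values 0, -11, and -22 and with the definitions of A1, EA1, a0, a1, a2, and S1 is A1 = 2^EA1 × (a0 + 2^(-S1) × a1 + 2^(-2×S1) × a2). This form is an assumption made for the sketch below, not a quotation of the formula in the application. The sketch handles normal FP32 values only, folds the sign into each part, and uses Python's `struct` module to check that every part is exactly representable as a standard FP16; the function names are hypothetical.

```python
import struct


def split_fp32(x):
    """Split a normal FP32 value into an exponent and three FP16-sized parts."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = -1.0 if bits >> 31 else 1.0
    biased = (bits >> 23) & 0xFF
    assert 0 < biased < 255, "sketch handles normal FP32 values only"
    significand = (1 << 23) | (bits & 0x7FFFFF)   # 24 bits incl. integer bit
    p0 = significand >> 13             # integer bit + mantissa bits 1..10
    p1 = (significand >> 2) & 0x7FF    # mantissa bits 11..21
    p2 = (significand & 0x3) << 9      # mantissa bits 22..23 + nine zeros
    a0, a1, a2 = (sign * p / 1024.0 for p in (p0, p1, p2))
    return biased - 127, a0, a1, a2


def is_fp16(v):
    """True if v survives a round trip through the FP16 encoding unchanged."""
    return struct.unpack("<e", struct.pack("<e", v))[0] == v


def join_fp32(e, a0, a1, a2, s1=11):
    """Recombine the parts using the assumed formula with S1 = 11."""
    return 2.0 ** e * (a0 + 2.0 ** -s1 * a1 + 2.0 ** (-2 * s1) * a2)


x = struct.unpack("<f", struct.pack("<f", 3.14159))[0]   # an exact FP32 value
e, a0, a1, a2 = split_fp32(x)
rebuilt = join_fp32(e, a0, a1, a2)
```

Each part carries at most 11 significant bits, so the recombination in double precision is exact and `rebuilt` equals `x` bit for bit.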

[0149] In addition, a common exponent shift value can be extracted from the exponent of each standard FP16 format. Therefore, similarly, for a standard FP32 format, it can be represented as:

[0150] Where a0′, a1′, and a2′ are the three standard formats of FP16 obtained from the decomposition. In the two representation methods described above, the decomposed FP16s have the following relationship: a1 = a1',

[0151] 2. Decompose an FP32 to obtain two FP16s.

[0152] To reduce the number of FP16 values obtained from the decomposition, the standard FP16 format can be adjusted by widening its mantissa to 13 bits while keeping the number of sign and exponent bits unchanged. The adjusted format can be called non-standard-format FP16. The integer part plus the mantissa of a non-standard-format FP16 total 14 bits. Therefore, only two non-standard-format FP16 values are needed to represent the integer part and mantissa of a standard-format FP32, since 24 bits fit into two 14-bit fields.

[0153] The standard FP32 integer and mantissa are divided into two parts: the first part is the integer and the first 13 bits of the mantissa, and the second part is bits 14 to 23. These two parts are then represented by the non-standard FP16.

[0154] It should also be noted that when the second part is represented by a non-standard-format FP16, four zeros can be appended after bit 23. That is, mantissa bits 14 to 23, together with the appended zeros, are represented by one non-standard-format FP16. Similar to the first case above, the corresponding fixed exponent shift value can also be extracted from the exponent of each non-standard-format FP16.

[0155] Therefore, for a standard FP32 format, it can be represented as:

[0156] Where A2 is the standard-format FP32, EA2 is the exponent of A2, a3 and a4 are the two non-standard-format FP16 values obtained by decomposition, and S2 is a fixed exponent shift value. For non-standard-format FP16, S2 = 14.
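The two-part split can be checked with exact rational arithmetic, since no built-in type models a 14-bit significand. The sketch assumes the form A2 = 2^EA2 × (a3 + 2^(-S2) × a4) with S2 = 14, which matches the definitions above but is not quoted from the application; the name `split_fp32_two` is hypothetical, and only normal FP32 values are handled.

```python
import struct
from fractions import Fraction


def split_fp32_two(x):
    """Split a normal FP32 value into sign, exponent and two 14-bit parts."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = -1 if bits >> 31 else 1
    biased = (bits >> 23) & 0xFF
    assert 0 < biased < 255, "sketch handles normal FP32 values only"
    significand = (1 << 23) | (bits & 0x7FFFFF)
    p3 = significand >> 10             # integer bit + mantissa bits 1..13
    p4 = (significand & 0x3FF) << 4    # mantissa bits 14..23 + four zeros
    return sign, biased - 127, Fraction(p3, 1 << 13), Fraction(p4, 1 << 13)


x2 = struct.unpack("<f", struct.pack("<f", 2.718281828))[0]
sign, e2, a3, a4 = split_fp32_two(x2)
rebuilt2 = sign * Fraction(2) ** e2 * (a3 + a4 / 2 ** 14)
```

Both parts fit in 14 significant bits, so two values of the widened format are enough where the standard FP16 split needs three.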

[0157] In addition, a common exponent shift value can be extracted from the exponent of each standard FP16 format. Therefore, similarly, for a standard FP32 format, it can be represented as:

[0158] Here, a3′ and a4′ are two non-standard FP16 formats obtained from the decomposition. The FP16 obtained from the decomposition in the two representation methods described above have the following relationship: a4 = a4′.

[0159] Of course, for the case where the first precision floating-point number is FP64 and the second precision floating-point number is FP32, decomposing FP64 into multiple FP32 numbers can be done in the following ways: decomposing one FP64 floating-point number into three FP32 floating-point numbers; or decomposing one FP64 floating-point number into two FP32 floating-point numbers. Optionally, for the case where the first precision floating-point number is FP64 and the second precision floating-point number is FP16, decomposing FP64 into multiple FP16 numbers can be done in the following ways: decomposing one FP64 floating-point number into five FP16 floating-point numbers; or decomposing one FP64 floating-point number into four FP16 floating-point numbers. The decomposition principle is similar to the case described above where the first precision floating-point number is FP32 and the second precision floating-point number is FP16, and will not be elaborated here.

[0160] For ease of explanation, take splitting the first column vector into two second column vectors and the first row vector into two second row vectors as an example. For instance, the first column vector A0 is a column vector [a0 a1 a2 a3]^T with precision FP32. Using the method described above for splitting one first-precision value into two second-precision values, the format conversion unit 405 splits the FP32 floating-point number a0 into two FP16 floating-point numbers (e.g., a0M and a0L). Similarly, it splits a1 into a1M and a1L, a2 into a2M and a2L, and a3 into a3M and a3L. As shown in Figure 11, the format conversion unit 405 thus splits the FP32 column vector [a0 a1 a2 a3]^T into two FP16 column vectors, [a0M a1M a2M a3M]^T (denoted "second column vector 1") and [a0L a1L a2L a3L]^T (denoted "second column vector 2"). Similarly, B0 is a row vector [b0 b1 b2 b3] with precision FP32, which is split into two FP16 row vectors, [b0M b1M b2M b3M] (denoted "second row vector 1") and [b0L b1L b2L b3L] (denoted "second row vector 2"). It should be noted that the row coordinates of the element values in the second column vectors are the same as those of the corresponding element values in the first column vector: the row coordinate of a0M and a0L is i0, that of a1M and a1L is i1, that of a2M and a2L is i2, and that of a3M and a3L is i3. Likewise, the column coordinates of the element values in the second row vectors are the same as those in the first row vector: the column coordinate of b0M and b0L is j0, that of b1M and b1L is j1, that of b2M and b2L is j2, and that of b3M and b3L is j3.

[0161] Furthermore, the vector outer product processing engine 401 calculates the outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, each comprising fourth element values and their position coordinates. The position coordinates of a fourth element value consist of the row coordinate of the corresponding element value in the first column vector and the column coordinate of the corresponding element value in the first row vector. The precision of the fourth element values is the first precision (e.g., FP32). As shown in Figure 12, take X = 2 as an example. The vector outer product processing engine 401 calculates the outer products of the two second column vectors [a0M a1M a2M a3M]^T and [a0L a1L a2L a3L]^T with the two second row vectors [b0M b1M b2M b3M] and [b0L b1L b2L b3L]. That is, it calculates the outer product of [a0M a1M a2M a3M]^T and [b0M b1M b2M b3M] to obtain fourth matrix 1; the outer product of [a0M a1M a2M a3M]^T and [b0L b1L b2L b3L] to obtain fourth matrix 2; the outer product of [a0L a1L a2L a3L]^T and [b0M b1M b2M b3M] to obtain fourth matrix 3; and the outer product of [a0L a1L a2L a3L]^T and [b0L b1L b2L b3L] to obtain fourth matrix 4. The vector outer product processing engine 401 thus obtains four fourth matrices. The accumulator 402 acquires these four fourth matrices and, based on the index of the position coordinates of the fourth element values, accumulates the fourth element values with the same position coordinates in the four fourth matrices to obtain an intermediate result matrix. The precision of the third element values in the intermediate result matrix is the first precision. For example, the fourth element value a0M·b0M in fourth matrix 1, the fourth element value a0M·b0L in fourth matrix 2, the fourth element value a0L·b0M in fourth matrix 3, and the fourth element value a0L·b0L in fourth matrix 4 have the same position coordinates (i0, j0); accumulator 402 sums these four fourth element values to obtain one third element value in the intermediate result matrix C0. In this example, the method of accumulating the fourth element values with the same position coordinates in the X² fourth matrices is similar to the accumulation method of accumulator 402 in the examples corresponding to Figure 5E, Figure 6, and Figure 8; please refer to the descriptions above, which are not repeated here. In this example, when calculating the outer product of two vectors, each first-precision (high-precision) vector can be further split into multiple second-precision (low-precision) vectors, and the outer products of the low-precision vectors are then calculated. The outer product of the first-precision vectors is obtained by accumulating the results of the outer products of the multiple second-precision vectors, without loss of precision.
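The X = 2 case can be sketched with integers, where equality can be checked exactly. The sketch assumes each value splits as v = vM + vL, with the M part keeping its own scale (analogous to a floating-point part carrying its own exponent); the helper names and numeric values are hypothetical.

```python
def split_high_low(vec, low_bits=8):
    """Split each (coordinate, value) into M and L parts with value == M + L."""
    high = [(c, (v >> low_bits) << low_bits) for c, v in vec]
    low = [(c, v & ((1 << low_bits) - 1)) for c, v in vec]
    return high, low


def outer(col_vec, row_vec):
    """Outer product as a dict keyed by (row, column) position coordinates."""
    return {(r, c): a * b for r, a in col_vec for c, b in row_vec}


def accumulate(matrices):
    """Sum element values that share the same position coordinates."""
    total = {}
    for matrix in matrices:
        for pos, v in matrix.items():
            total[pos] = total.get(pos, 0) + v
    return total


a = [(0, 1000), (2, 515)]       # first column vector: (row coordinate, value)
b = [(1, 300), (3, 70000)]      # first row vector: (column coordinate, value)
a_m, a_l = split_high_low(a)
b_m, b_l = split_high_low(b)
# Fourth matrices 1..4: (aM, bM), (aM, bL), (aL, bM), (aL, bL).
fourth = [outer(x, y) for x in (a_m, a_l) for y in (b_m, b_l)]
intermediate = accumulate(fourth)
```

Since (aM + aL) × (bM + bL) expands into exactly the four partial products, `intermediate` equals the outer product of the unsplit vectors.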

[0162] In this example, if the element values ​​in the first row vector and the first column vector have high precision, the matrix computing device can decompose both the first row vector and the first column vector into multiple low-precision vectors. The matrix computing device performs a vector outer product calculation on the low-precision second column vector and the second row vector to obtain the resulting matrix. This allows for calculations on compressed matrices and enables high-precision matrix calculations based on low-precision matrix computing devices, improving the applicability of the matrix computing device. Furthermore, during the matrix calculation process, upper-layer software applications (such as AI and HPC) based on the matrix computing device are unaware of the specific process of matrix calculation, thereby greatly reducing the cost of software adaptation.

[0163] The matrix computing device provided in this application can yield significant benefits in multiple matrix computing scenarios. For example, in AI training and inference scenarios, the matrix computing device fully supports both compressed and uncompressed matrix calculations. The sparsity of weights and feature data in AI computation averages over 50% (i.e., over 50% are compressed matrices), and the matrix computing device in this application can perform calculations directly on compressed matrices without decompressing them, thus improving computational efficiency by more than four times. Furthermore, in HPC scenarios such as scientific computing, whether for uncompressed matrix calculations that demand high computing power or for matrix calculations constrained by memory bandwidth, the matrix computing device in this application can directly access compressed matrices from memory, enhancing computational efficiency. The matrix computing device supports full-precision numerical calculations and can effectively cover calculations with various precision requirements. For example, AI training scenarios often require FP32 and FP16 floating-point calculations, and the AI training scenarios and HPC scientific computing scenarios that require FP64 can also be fully supported by the matrix computing device. In addition, the MAC in the matrix computing device can support low-to-medium precision integer formats such as INT1, INT2, INT4, and INT8. For AI inference scenarios, this can improve computing power and reduce inference time. It can also provide a solution for inference scenarios with mixed precision, greatly enhancing the applicability of the matrix computing device.

[0164] The above has described an embodiment of a matrix computing device; the following describes the method performed by the matrix computing device. As shown in Figure 13, an embodiment of this application provides a matrix calculation method. The execution subject of this method may be the computing device shown in Figure 2; optionally, it may be the matrix computing device shown in Figure 3; optionally, it may be the matrix computing device shown in Figure 4A; optionally, it may be the matrix computing device shown in Figure 7.

[0165] Step 1301: Obtain the first calculation instruction, which includes N first column vectors and N first row vectors. The N first column vectors are obtained by converting the first matrix in compressed format, and the N first row vectors are obtained by converting the second matrix in compressed format. N is an integer greater than or equal to 1.

[0166] Step 1302: Calculate the vector outer product of the N first column vectors and the N first row vectors to obtain N intermediate result matrices. The first column vector includes the first element value and the row coordinate of the first element value, the first row vector includes the second element value and the column coordinate of the second element value, and the intermediate result matrix includes the third element value and the position coordinate of the third element value. The position coordinate includes the row coordinate of the first element value and the column coordinate of the second element value.
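A minimal Python sketch of one such outer product follows. The data layout is an assumption made for illustration: a compressed column vector is a list of (row coordinate, value) pairs, a compressed row vector is a list of (column coordinate, value) pairs, and the hypothetical function `sparse_outer` returns (row, column, value) triples.

```python
from typing import List, Tuple

# Assumed layout: only non-zero elements are stored, each with its coordinate.
ColVec = List[Tuple[int, float]]   # (row_coord, first element value)
RowVec = List[Tuple[int, float]]   # (col_coord, second element value)
Triple = Tuple[int, int, float]    # (row_coord, col_coord, third element value)


def sparse_outer(col_vec: ColVec, row_vec: RowVec) -> List[Triple]:
    """Multiply every stored element pair and merge their coordinates."""
    return [(r, c, a * b) for r, a in col_vec for c, b in row_vec]


col_vec = [(0, 2.0), (3, -1.0)]   # non-zeros at rows 0 and 3
row_vec = [(1, 4.0), (2, 0.5)]    # non-zeros at columns 1 and 2
intermediate = sparse_outer(col_vec, row_vec)
print(intermediate)
# [(0, 1, 8.0), (0, 2, 1.0), (3, 1, -4.0), (3, 2, -0.5)]
```

Only stored (non-zero) elements are multiplied, so the work scales with the number of non-zeros rather than with the full matrix dimensions.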

[0167] For this step, refer to the description of the functions performed by the vector outer product processing engine 401 in the examples corresponding to Figure 5B, Figure 5C, and Figure 5D in the above matrix calculation device embodiment; the details are not repeated here.

[0168] Step 1303: Based on the index of the position coordinates of the third element value, sum the third element values ​​with the same position coordinates in the N intermediate result matrices to obtain the result matrix.

[0169] The N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix. The position coordinates of the third element value in the first intermediate result matrix are the first position coordinates, and the position coordinates of the third element value in the second intermediate result matrix are the second position coordinates.

[0170] In a first possible implementation, the matrix computing device writes the third element value of the first intermediate result matrix into the corresponding position in the buffer according to the first position coordinate, following the generation order of the N intermediate result matrices. Then, based on the second position coordinate of the third element value in the second intermediate result matrix, it reads the cached value at the corresponding position in the buffer, and accumulates the third element value of the second intermediate result matrix and the cached value to obtain an uncompressed result matrix. Optionally, the matrix computing device compresses the uncompressed result matrix to obtain a compressed result matrix.
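The buffer-based accumulation can be sketched in Python as follows. This is an illustration under assumptions, not the accumulator hardware: the buffer is modeled as a dense list of lists, intermediate result matrices are lists of (row, column, value) triples, and the names `accumulate_in_buffer` and `compress` are hypothetical.

```python
def accumulate_in_buffer(intermediate_matrices, n_rows, n_cols):
    """Accumulate (row, col, value) triples into a dense buffer, in generation order."""
    buffer = [[0.0] * n_cols for _ in range(n_rows)]
    for matrix in intermediate_matrices:
        for r, c, v in matrix:
            cached = buffer[r][c]        # read the cached value at this position
            buffer[r][c] = cached + v    # accumulate and write back
    return buffer


def compress(dense):
    """Optional step: keep only non-zero entries as (row, col, value) triples."""
    return [(r, c, v) for r, row in enumerate(dense) for c, v in enumerate(row) if v != 0]


first = [(0, 0, 1.0), (1, 0, 2.0)]    # first intermediate result matrix
second = [(0, 0, 3.0), (1, 1, 4.0)]   # second intermediate result matrix
dense = accumulate_in_buffer([first, second], n_rows=2, n_cols=2)
print(dense)            # [[4.0, 0.0], [2.0, 4.0]]
print(compress(dense))  # [(0, 0, 4.0), (1, 0, 2.0), (1, 1, 4.0)]
```

In this sketch the position coordinate is used directly as the buffer address, so no sorting or comparison of coordinates is needed; the cost is a buffer sized for the uncompressed result.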

[0171] For the first possible implementation, refer to the description of the function performed by accumulator 402 in the examples corresponding to Figure 5E and Figure 6 in the above matrix calculation device embodiment; the details are not repeated here.

[0172] In the second possible implementation, the matrix computing device sorts the third element values ​​in the N intermediate result matrices according to their position coordinates. The matrix computing device then compares the position coordinates in the sorted N intermediate result matrices, adds the third element values ​​with the same position coordinates, and removes the position coordinates of zero-value elements to obtain a compressed result matrix.
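The sort-and-merge variant can be sketched in Python as follows. The triple layout (row, column, value), the row-major sort order, and the name `sort_and_merge` are assumptions made for illustration.

```python
def sort_and_merge(intermediate_matrices):
    """Sort all triples by (row, col), add equal coordinates, drop zeros."""
    triples = sorted(
        (t for matrix in intermediate_matrices for t in matrix),
        key=lambda t: (t[0], t[1]),
    )
    merged = []
    for r, c, v in triples:
        if merged and merged[-1][0] == r and merged[-1][1] == c:
            merged[-1][2] += v           # same position coordinates: add
        else:
            merged.append([r, c, v])     # new position coordinates
    # Remove entries whose accumulated value is zero.
    return [(r, c, v) for r, c, v in merged if v != 0]


first = [(1, 1, 2.0), (0, 0, 1.0)]
second = [(0, 0, 3.0), (1, 1, -2.0), (0, 2, 5.0)]
print(sort_and_merge([first, second]))
# [(0, 0, 4.0), (0, 2, 5.0)]
```

Here the entry at (1, 1) cancels to zero and is removed, so the output is already in compressed form and no dense buffer is required.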

[0173] For the second possible implementation, refer to the description of the function performed by accumulator 402 in the examples corresponding to Figure 5E and Figure 8 in the above matrix calculation device embodiment; the details are not repeated here.

[0174] In this embodiment, the matrix computing device can perform calculations directly on compressed matrices. Compared with the traditional approach, in which a compressed matrix must first be decompressed and the matrix calculation is then performed on the decompressed matrix, the matrix computing device in this embodiment can improve the computational efficiency of compressed matrices.
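As a hedged end-to-end check, the following Python sketch multiplies two small matrices directly in a COO-style compressed form and compares the result with ordinary dense multiplication. The function names `coo_matmul` and `dense_matmul`, the triple layout, and the pairing of column k of the first matrix with row k of the second are assumptions chosen for illustration.

```python
from collections import defaultdict


def coo_matmul(coo_a, coo_b):
    """Multiply two COO matrices via per-index outer products; returns COO triples."""
    cols, rows = defaultdict(list), defaultdict(list)
    for r, k, v in coo_a:
        cols[k].append((r, v))           # column k of A: (row_coord, value)
    for k, c, v in coo_b:
        rows[k].append((c, v))           # row k of B: (col_coord, value)

    acc = defaultdict(float)             # accumulator indexed by position coordinates
    for k in cols.keys() & rows.keys():
        for r, a in cols[k]:
            for c, b in rows[k]:
                acc[(r, c)] += a * b
    return sorted((r, c, v) for (r, c), v in acc.items() if v != 0)


def dense_matmul(a, b):
    """Reference: plain multiplication of uncompressed matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


a = [[1.0, 0.0, 0.0], [2.0, 0.0, 3.0]]
b = [[0.0, 4.0], [0.0, 0.0], [5.0, 6.0]]
coo_a = [(i, j, v) for i, row in enumerate(a) for j, v in enumerate(row) if v]
coo_b = [(i, j, v) for i, row in enumerate(b) for j, v in enumerate(row) if v]

result = coo_matmul(coo_a, coo_b)
reference = dense_matmul(a, b)
expected = [(i, j, v) for i, row in enumerate(reference) for j, v in enumerate(row) if v]
assert result == expected
print(result)  # [(0, 1, 4.0), (1, 0, 15.0), (1, 1, 26.0)]
```

The compressed path never materializes the zero entries of either input, which is the property the paragraph above relies on.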

[0175] Optionally, as shown in Figure 14, in order to support matrix calculations in multiple formats, the matrix calculation device can convert uncompressed matrices into compressed matrices, thereby enabling calculations on uncompressed matrices. The format conversion of uncompressed matrices is shown in steps 1401 and 1402 below.

[0176] Step 1401: Obtain the third calculation instruction, which includes a fifth matrix and a sixth matrix, wherein at least one of the fifth matrix and the sixth matrix is ​​an uncompressed matrix.

[0177] Step 1402: Convert the format of the fifth matrix to obtain the first matrix in compressed format, and convert the format of the sixth matrix to obtain the second matrix.

[0178] For steps 1401 and 1402, refer to the description of the functions performed by the format conversion unit 405 in the example corresponding to Figure 9 in the above matrix calculation device embodiment; the details are not repeated here.

[0179] In this embodiment, the matrix computing device can convert uncompressed matrices into compressed matrices via a format conversion unit, thus enabling the matrix computing device to support both compressed and uncompressed matrix calculations. Optionally, the format conversion unit can convert matrices in a non-target compressed format into matrices in a target compressed format (such as COO format). In this example, the matrix computing device can convert matrices in other compressed formats into matrices in the target compressed format and perform matrix calculations on the matrices in the target compressed format. The matrix computing device provided in this application can support matrix calculations in various formats.
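The conversion from an uncompressed matrix to a COO-style compressed matrix can be sketched in Python as follows. The triple layout (row, column, value) and the name `dense_to_coo` are assumptions; the actual format conversion unit may store coordinates differently.

```python
def dense_to_coo(dense):
    """Convert an uncompressed matrix into (row, col, value) triples (COO-style)."""
    return [
        (r, c, v)
        for r, row in enumerate(dense)
        for c, v in enumerate(row)
        if v != 0
    ]


fifth_matrix = [
    [0, 5, 0],
    [7, 0, 0],
]
first_matrix = dense_to_coo(fifth_matrix)
print(first_matrix)  # [(0, 1, 5), (1, 0, 7)]
```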

[0180] Step 1403: Obtain the second calculation instruction, which includes the first matrix and the second matrix.

[0181] Step 1404: Convert the first matrix into N first column vectors and the second matrix into N first row vectors.
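One way to picture step 1404 is the Python sketch below, which groups a COO-style first matrix by column index and a COO-style second matrix by row index. The grouping by a shared index k, the dictionary representation, and the function names are assumptions for illustration; they follow the usual outer-product formulation in which column k of the first matrix is paired with row k of the second.

```python
from collections import defaultdict


def to_column_vectors(coo_a):
    """Group matrix A by column index k: k -> [(row_coord, value), ...]."""
    cols = defaultdict(list)
    for r, k, v in coo_a:
        cols[k].append((r, v))
    return cols


def to_row_vectors(coo_b):
    """Group matrix B by row index k: k -> [(col_coord, value), ...]."""
    rows = defaultdict(list)
    for k, c, v in coo_b:
        rows[k].append((c, v))
    return rows


first_matrix = [(0, 0, 1.0), (1, 0, 2.0), (1, 2, 3.0)]   # A in COO form
second_matrix = [(0, 1, 4.0), (2, 0, 5.0), (2, 1, 6.0)]  # B in COO form

col_vecs = to_column_vectors(first_matrix)
row_vecs = to_row_vectors(second_matrix)

# Pair column k of A with row k of B; only indices present in both contribute.
pairs = [(col_vecs[k], row_vecs[k]) for k in sorted(col_vecs.keys() & row_vecs.keys())]
print(pairs)
# [([(0, 1.0), (1, 2.0)], [(1, 4.0)]), ([(1, 3.0)], [(0, 5.0), (1, 6.0)])]
```

Each column vector keeps the row coordinates of its elements and each row vector keeps the column coordinates, which is what later allows the products to be placed by position coordinate.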

[0182] For steps 1403 and 1404, refer to the description of the function performed by the format conversion unit 405 in the example corresponding to Figure 5A in the above matrix calculation device embodiment; the details are not repeated here.

[0183] Optionally, to improve the applicability of the matrix computation unit, high-precision matrix computation can be implemented based on a low-precision matrix computation device. The vector of first precision is split into multiple vectors of second precision, and then the vector outer product is calculated on the vectors of second precision. Please refer to steps 1405 to 1407 below.

[0184] Step 1405: Split the first column vector into X second column vectors, and split the first row vector into X second row vectors. The precision of the element values ​​in the first column vector and the first row vector is the first precision, and the precision of the element values ​​in the second column vector and the second row vector is the second precision. The first precision is higher than the second precision, and X is an integer greater than or equal to 2.

[0185] For this step, refer to the descriptions of how the format conversion unit 405 splits integer values and how it splits floating-point values in the examples corresponding to Figure 10 and Figure 11 in the above matrix calculation device embodiment; the details are not repeated here.

[0186] Step 1406: Calculate the vector outer products of the X second column vectors and the X second row vectors to obtain X^2 fourth matrices. Each fourth matrix includes the fourth element value and the position coordinates of the fourth element value. The position coordinates of the fourth element value include the row coordinate of the first element value and the column coordinate of the second element value. The precision of the fourth element value is the first precision.

[0187] For this step, refer to the description of the functions performed by the vector outer product processing engine 401 in the example corresponding to Figure 12 in the above matrix calculation device embodiment; the details are not repeated here.

[0188] Step 1407: Based on the index of the position coordinates of the fourth element value, accumulate the fourth element values with the same position coordinates in the X^2 fourth matrices to obtain the intermediate result matrix. The precision of the third element value in the intermediate result matrix is the first precision.

[0189] Step 1408: Based on the index of the position coordinates of the third element value, sum the third element values ​​with the same position coordinates in the N intermediate result matrices to obtain the result matrix.

[0190] For steps 1407 and 1408, refer to the description of the function performed by accumulator 402 in the examples corresponding to Figure 5E, Figure 6, and Figure 8 in the above matrix calculation device embodiment; the details are not repeated here.

[0191] In this embodiment of the application, when calculating the outer product of two vectors, a first-precision (high-precision) vector can be further split into multiple second-precision (low-precision) vectors, and the outer products of the low-precision vectors are then calculated. The outer product of the first-precision vectors is thus obtained by accumulating the outer products of multiple second-precision vectors, without loss of precision.
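Why the accumulation can be exact is easiest to see for fixed-point values. As an assumption made here for illustration (the positional weights are not spelled out in the text above), suppose each first-precision element is split into X parts of s bits each, with part index p = 0 denoting the least significant part. The product of two elements then expands into X^2 low-precision products, one per fourth matrix:

```latex
a_i = \sum_{p=0}^{X-1} 2^{sp}\, a_i^{(p)}, \qquad
b_j = \sum_{q=0}^{X-1} 2^{sq}\, b_j^{(q)}
\quad\Longrightarrow\quad
a_i b_j = \sum_{p=0}^{X-1} \sum_{q=0}^{X-1} 2^{s(p+q)}\, a_i^{(p)}\, b_j^{(q)}
```

For X = 2, writing the high part as M and the low part as L, this reduces to a_i b_j = 2^{2s} a_{iM} b_{jM} + 2^{s} (a_{iM} b_{jL} + a_{iL} b_{jM}) + a_{iL} b_{jL}, that is, four terms that share the position coordinate (i, j).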

[0192] In one embodiment of this application, a matrix calculation circuit is provided. This matrix calculation circuit is used to execute one or more of steps 1301-1303 in the above method embodiment, or one or more of steps 1401-1408. In practical applications, this matrix calculation circuit can be an ASIC, FPGA, or logic circuit, etc.

[0193] In another embodiment of this application, a matrix computing system or chip is also provided, the structure of which can be as shown in Figure 3. It includes a processor 1 (taking a central processing unit as an example) and a matrix computing device. The processor 1 is used to send calculation instructions to the matrix computing device, and the matrix computing device is used to execute one or more of steps 1301-1303, or one or more of steps 1401-1408, in the above method embodiment.

[0194] In another embodiment of this application, a matrix computing device is provided, the structure of which can be as shown in Figure 2. This device can specifically be a PCIe card, a SoC, a processor, or a server that includes the aforementioned hardware. Referring to Figure 2, the device includes a memory 201, a processor 202, a communication interface 203, and a bus 204. The communication interface 203 may include an input interface and an output interface.

[0195] The processor 202 can be configured to execute one or more steps 1301-1303, or one or more steps 1401-1408, in the above method embodiments. In some feasible embodiments, the processor 202 may include a matrix calculation unit, which can be used to support the processor in executing one or more steps in the above method embodiments. In practical applications, the matrix calculation unit can be an ASIC, FPGA, or logic circuit, etc. Of course, the matrix calculation unit can also be implemented by software, and this application embodiment does not impose specific limitations on it.

[0196] It should be noted that each component of the matrix calculation circuit, matrix calculation system, and matrix calculation device provided in the embodiments of this application is used to implement the functions of each step in the corresponding method embodiments. Since each step has been described in detail in the foregoing method embodiments, it will not be repeated here.

[0197] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any other combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive (SSD).

[0198] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

Claims

1. A matrix calculation device, characterized in that it includes:
a vector outer product processing engine, used to obtain N first column vectors and N first row vectors, calculate the vector outer product of the N first column vectors and the N first row vectors, and obtain N intermediate result matrices, wherein the first column vectors include a first element value and its row coordinate; the first row vectors include a second element value and its column coordinate; the intermediate result matrices include a third element value and its position coordinate, where the position coordinate includes the row coordinate of the first element value and the column coordinate of the second element value; the third element value is obtained by multiplying the first element value and the second element value, and the position coordinate is obtained by merging the row coordinate of the first element value and the column coordinate of the second element value; the N first column vectors are obtained by converting a first matrix in compressed format, and the N first row vectors are obtained by converting a second matrix in compressed format; N is an integer greater than or equal to 1; and the precision of the element values included in the first column vectors and the first row vectors is a first precision;
an accumulator, used to accumulate the third element values with the same position coordinates in the N intermediate result matrices according to the index of the position coordinates of the third element value, so as to obtain a result matrix;
the vector outer product processing engine is also used to calculate the vector outer product of X second column vectors and X second row vectors to obtain X^2 fourth matrices, each fourth matrix comprising a fourth element value and the position coordinates of the fourth element value, wherein the position coordinates of the fourth element value include the row coordinate of the first element value and the column coordinate of the second element value, and the precision of the fourth element value is the first precision; wherein the X second column vectors are obtained by splitting the first column vector, the X second row vectors are obtained by splitting the first row vector, and the precision of the element values contained in the second column vector and the second row vector is a second precision, wherein the first precision is higher than the second precision, and X is an integer greater than or equal to 2;
the accumulator is further configured to, based on the index of the position coordinates of the fourth element value, accumulate the fourth element values with the same position coordinates in the X^2 fourth matrices to obtain the intermediate result matrix, wherein the precision of the third element value in the intermediate result matrix is the first precision;
the device also includes a format conversion unit;
the format conversion unit is used to obtain the first matrix and the second matrix, convert the first matrix into the N first column vectors, and convert the second matrix into the N first row vectors.

2. The apparatus according to claim 1, characterized in that the N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix, wherein the position coordinates of the third element value in the first intermediate result matrix are the first position coordinates, and the position coordinates of the third element value in the second intermediate result matrix are the second position coordinates;
the accumulator is also specifically used to:
according to the generation order of the N intermediate result matrices, write the third element value in the first intermediate result matrix into the corresponding position in the buffer according to the first position coordinates;
based on the second position coordinates of the third element value in the second intermediate result matrix, read the cached value at the position in the buffer corresponding to the second position coordinates, and accumulate the third element value in the second intermediate result matrix and the cached value to obtain the uncompressed result matrix.

3. The apparatus according to claim 2, characterized in that the device also includes a matrix compression unit; the matrix compression unit is used to compress the uncompressed result matrix to obtain a compressed result matrix.

4. The apparatus according to claim 1, characterized in that the accumulator is also specifically used to:
sort the third element values in the N intermediate result matrices according to the position coordinates of the third element values;
compare the position coordinates in the sorted N intermediate result matrices, add together the third element values with the same position coordinates, and delete the position coordinates with zero element values to obtain the result matrix in compressed format.

5. The apparatus according to claim 1, characterized in that the device also includes a format conversion unit; the format conversion unit is further configured to acquire a fifth matrix and a sixth matrix, perform format conversion on the fifth matrix to obtain the first matrix, and perform format conversion on the sixth matrix to obtain the second matrix, wherein at least one of the fifth matrix and the sixth matrix is an uncompressed matrix.

6. A matrix calculation method, characterized in that it includes:
obtaining the first matrix and the second matrix;
transforming the first matrix into N column vectors, and transforming the second matrix into N row vectors;
obtaining N first column vectors and N first row vectors, wherein the precision of the element values included in the first column vectors and the first row vectors is the first precision;
calculating the outer product of the N first column vectors and the N first row vectors to obtain N intermediate result matrices, wherein each first column vector includes a first element value and its row coordinate; each first row vector includes a second element value and its column coordinate; each intermediate result matrix includes a third element value and its position coordinates; the position coordinates include the row coordinate of the first element value and the column coordinate of the second element value; the third element value is obtained by multiplying the first and second element values; the position coordinates are obtained by merging the row coordinate of the first element value and the column coordinate of the second element value; the N first column vectors are obtained by converting a first matrix in compressed format, and the N first row vectors are obtained by converting a second matrix in compressed format; and N is an integer greater than or equal to 1;
based on the index of the position coordinates of the third element value, accumulating the third element values with the same position coordinates in the N intermediate result matrices to obtain the result matrix;
wherein the step of calculating the outer product of the N first column vectors and the N first row vectors to obtain N intermediate result matrices includes:
calculating the outer product of X second column vectors and X second row vectors to obtain X^2 fourth matrices, each fourth matrix comprising a fourth element value and the position coordinates of the fourth element value, wherein the position coordinates of the fourth element value include the row coordinate of the first element value and the column coordinate of the second element value, and the precision of the fourth element value is the first precision; wherein the X second column vectors are obtained by splitting the first column vector, the X second row vectors are obtained by splitting the first row vector, and the precision of the element values contained in the second column vector and the second row vector is a second precision, wherein the first precision is higher than the second precision, and X is an integer greater than or equal to 2;
the method further includes:
based on the index of the position coordinates of the fourth element value, accumulating the fourth element values with the same position coordinates in the X^2 fourth matrices to obtain the intermediate result matrix, wherein the precision of the third element value in the intermediate result matrix is the first precision.

7. The method according to claim 6, characterized in that the N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix, wherein the position coordinates of the third element value in the first intermediate result matrix are the first position coordinates, and the position coordinates of the third element value in the second intermediate result matrix are the second position coordinates;
the step of accumulating the third element values with the same position coordinates in the N intermediate result matrices according to the index of the position coordinates of the third element value to obtain the result matrix includes:
according to the generation order of the N intermediate result matrices, writing the third element value in the first intermediate result matrix into the corresponding position in the buffer according to the first position coordinates;
based on the second position coordinates of the third element value in the second intermediate result matrix, reading the cached value at the position in the buffer corresponding to the second position coordinates, and accumulating the third element value in the second intermediate result matrix and the cached value to obtain the uncompressed result matrix.

8. The method according to claim 7, characterized in that the method further includes: compressing the uncompressed result matrix to obtain the compressed result matrix.

9. The method according to claim 6, characterized in that the step of accumulating the third element values with the same position coordinates in the N intermediate result matrices according to the index of the position coordinates of the third element value to obtain the result matrix includes:
sorting the third element values in the N intermediate result matrices according to the position coordinates of the third element values;
comparing the position coordinates in the sorted N intermediate result matrices, adding together the third element values with the same position coordinates, and deleting the position coordinates with zero element values to obtain the result matrix in compressed format.

10. The method according to claim 6, characterized in that, before obtaining the second calculation instruction, the method further includes:
obtaining a third calculation instruction, the third calculation instruction including a fifth matrix and a sixth matrix, wherein at least one of the fifth matrix and the sixth matrix is an uncompressed matrix;
converting the fifth matrix to a compressed first matrix, and converting the sixth matrix to a compressed second matrix.

11. A circuit for matrix calculation, characterized in that the matrix calculation circuit is used to perform the method as described in any one of claims 6-10.

12. A matrix computation system, characterized in that the system includes a processor and a matrix calculation apparatus, the processor being configured to send calculation instructions to the matrix calculation apparatus, and the matrix calculation apparatus being configured to perform the method as described in any one of claims 6-10.

13. A chip, characterized in that the chip includes a processor, which integrates a matrix calculation device for performing the matrix calculation method as described in any one of claims 6-10.