Element reading method for matrix multiplication and related apparatus
By optimizing the memory rearrangement process and utilizing a combination of vector registers and matrix registers, the problem of high computational overhead in memory rearrangement in existing matrix multiplication is solved, achieving more efficient memory access and computation.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2025-11-11
- Publication Date
- 2026-06-18
AI Technical Summary
Existing matrix multiplication methods suffer from high computational overhead due to memory rearrangement, which affects computational efficiency. In particular, zip-based schemes have too many instructions, and column-based schemes have high access overhead.
By optimizing the memory reordering process, and using a combination of vector registers and matrix registers, the elements of the input matrix are combined into one element and stored in the matrix register, and then expanded into primitive types in the vector register. This achieves memory reordering, reduces inefficient instructions, and improves memory access continuity.
It effectively reduces memory access latency, improves computing efficiency, reduces the number of instructions, and optimizes CPU pipeline execution.
Description
A method and related apparatus for reading elements in matrix multiplication
[0001] This application claims priority to Chinese Patent Application No. CN202411845060.3, filed on December 13, 2024, entitled "A Method and Apparatus for Reading Elements of Matrix Multiplication", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of matrix calculation technology, and in particular to a method and related apparatus for reading elements of matrix multiplication.
Background Technology
[0003] General Matrix-matrix Multiplication (GEMM) is the general form of matrix-matrix multiplication: it multiplies two input matrices to obtain a single output matrix. Given matrices A, B, and C and scalars α and β, GEMM computes the product of A and B, scales it by α, adds C scaled by β, and stores the result in matrix C, expressed by the formula C←αAB+βC.
[0004] GEMM implementations based on matrix registers typically perform a memory rearrangement operation on the input matrices before matrix multiplication so that memory accesses during the multiplication are contiguous. Existing memory rearrangement schemes include zip-based and column-based schemes. The zip-based scheme executes too many instructions, which impacts CPU pipeline execution, and it occupies many vector registers, making it difficult for the CPU to perform other tasks at the same time. The column-based scheme uses the `gather` instruction, which incurs high overhead because the accessed memory addresses are non-contiguous, so a large amount of time is spent executing read instructions. In other words, existing GEMM implementations incur significant overhead for memory rearrangement, with actual measurements showing that over 50% of the time is spent on it. Therefore, providing an element reading method for matrix multiplication that reduces the computational overhead of memory rearrangement is a pressing technical problem for those skilled in the art.
Summary of the Invention
[0005] This application provides a method and related apparatus for reading elements in matrix multiplication, which reduces memory access latency and improves computational efficiency by optimizing the memory rearrangement process.
[0006] The first aspect of this application provides a method for reading elements in matrix multiplication, including:
[0007] Determine the first and second matrices to be multiplied, with the first matrix stored contiguously along the row dimension;
[0008] Based on the type of the first element in the first matrix and the second matrix, determine the first vector register with a capacity of P×Q, where P and Q are both integers greater than 1;
[0009] Based on the type of the second element in the third matrix, a first matrix register with a capacity of P×P is determined, wherein the element type of the third matrix is determined based on matrix multiplication of the first and second matrices;
[0010] When performing matrix multiplication on the first matrix and the second matrix, in the first matrix, every Q consecutive first elements in the row dimension are merged into a first combined element, and the first combined elements in the first matrix unit are stored in the first matrix register in groups. The first matrix unit is a submatrix of the first matrix that includes P rows and P columns of first combined elements, and a group of first combined elements is P consecutive first combined elements in one direction.
[0011] The first combined element is taken out from the first matrix register in the first dimension and expanded into the first element, and then stored in the first vector register. The first dimension corresponds to the column dimension of the first matrix.
[0012] Read the first element from the first vector register.
[0013] In this embodiment, the corresponding vector register and matrix register are first determined based on the element type of the two input matrices and the element type of the output matrix. Because the output matrix elements are wider than the input matrix elements, multiple input matrix elements stored consecutively in the row dimension are combined into a single element and stored in the matrix register corresponding to the output matrix. The combined elements are then retrieved column-wise from the matrix register and transferred to the vector register corresponding to the input matrix, where they are expanded back to the original input element type, which achieves the memory rearrangement. Elements can subsequently be read from the vector register for the matrix multiplication calculation. Because the rearrangement is performed through the matrix register, memory accesses remain contiguous, fewer instructions are needed, and inefficient instructions are avoided, which effectively reduces memory access latency and improves computational efficiency.
[0014] In one possible implementation, after the first combined element is retrieved from the first matrix register in the first dimension, expanded into a first element, and stored in the first vector register, the method further includes:
[0015] Each time a group of first combined elements is retrieved from the first matrix register, the first combined elements in the second matrix unit are transposed and stored back into the first matrix register. The second matrix unit is the submatrix that follows the first matrix unit in the first matrix.
[0016] In this embodiment, the space of the first matrix register is reused by alternating between rows and columns: the space freed by each retrieved group is refilled with data from the next submatrix, which distributes the memory loading pressure.
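The alternation between rows and columns can be illustrated with a small model. This is a minimal sketch under stated assumptions, not the implementation of this application: the matrix register is modelled as a P×P list of lists, each integer stands for one combined element, and the function name `stream_units` is hypothetical.

```python
def stream_units(units):
    """Return the column groups of each P x P unit while reusing one tile.

    `units` is a list of P x P lists of combined elements. The tile is a
    hypothetical stand-in for the first matrix register.
    """
    P = len(units[0])
    # Initial load: unit 0 is stored row by row.
    tile = [row[:] for row in units[0]]
    out = []
    for n in range(len(units)):
        nxt = units[n + 1] if n + 1 < len(units) else None
        by_column = (n % 2 == 0)  # even units are read by column, odd by row
        for g in range(P):
            if by_column:
                group = [tile[r][g] for r in range(P)]
            else:
                group = tile[g][:]
            out.append(group)
            if nxt is not None:
                # Refill the slice that was just read with row g of the next unit.
                for k in range(P):
                    if by_column:
                        tile[k][g] = nxt[g][k]   # stored transposed
                    else:
                        tile[g][k] = nxt[g][k]   # stored as-is
    return out
```

In this model, every group that leaves the tile is a column of the corresponding unit, even though the tile is read by columns for one unit and by rows for the next, and only one slice is loaded per retrieval.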
[0017] In one possible implementation, the second matrix is stored contiguously along the column dimension;
[0018] After determining the first and second matrices for matrix multiplication, the following steps are also included:
[0019] Based on the type of the second element, determine the second matrix register with a capacity of P×P;
[0020] Based on the type of the first element, determine the second vector register with a capacity of P×Q;
[0021] When performing matrix multiplication on the first and second matrices, the following is also included:
[0022] In the second matrix, every Q consecutive first elements in the column dimension are merged into a second combined element, and the second combined elements in the third matrix unit are stored in the second matrix register in groups. The third matrix unit is a submatrix in the second matrix that includes P rows and P columns of second combined elements.
[0023] The second combined element is retrieved from the second matrix register in the second dimension, expanded into the first element, and then stored in the second vector register. The second dimension corresponds to the row dimension of the second matrix.
[0024] Read the first element from the second vector register.
[0025] In this embodiment of the application, when the second matrix is stored continuously in the column dimension, it can also be rearranged in memory through the matrix register in the same way as the first matrix described above.
[0026] In one possible implementation, after the second combined element is retrieved from the second matrix register in the second dimension, expanded into the first element, and stored in the second vector register, the method further includes:
[0027] Each time a group of second combined elements is retrieved from the second matrix register, the second combined elements in the fourth matrix unit are transposed and stored back into the second matrix register. The fourth matrix unit is the submatrix that follows the third matrix unit in the second matrix.
[0028] In this embodiment of the application, the second matrix can also utilize the space of the second matrix register by alternating rows and columns to distribute the memory loading pressure.
[0029] In one possible implementation, after determining the first and second matrices to be multiplied, the following is also included:
[0030] Based on the type of the second element, determine the third matrix register with a capacity of P×P;
[0031] After reading the first element from the first vector register, and / or reading the first element from the second vector register, the process includes:
[0032] Specify the first vector register and the second vector register to perform the outer product calculation;
[0033] Write the result of the outer product calculation to the third matrix register.
[0034] In this embodiment of the application, because the type of the second element (the element type of the output matrix) is usually wider than int8 and fp16, the number of matrix registers available for the second element type is usually greater than 2. Therefore, the memory rearrangement of the first matrix can be implemented with the first matrix register, the memory rearrangement of the second matrix can be implemented with the second matrix register, and the outer product calculation result can be stored in the third matrix register.
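The role of the third matrix register can be illustrated with a small model of the outer product step. The sketch below is illustrative only: it assumes that the outer product instruction behaves as a widening sum of outer products, in which each vector register holds P groups of Q elements and each group pair contributes a Q-term dot product. The exact instruction semantics are not specified in this passage, and the name `outer_product_accumulate` is hypothetical.

```python
def outer_product_accumulate(tile, va, vb, P, Q):
    """tile[i][j] += sum over q of va[i*Q+q] * vb[j*Q+q] (assumed semantics)."""
    for i in range(P):
        for j in range(P):
            tile[i][j] += sum(va[i * Q + q] * vb[j * Q + q] for q in range(Q))
    return tile

P, Q = 2, 2
va = [1, 2, 3, 4]      # rows 0 and 1 of A, Q elements each
vb = [5, 6, 7, 8]      # columns 0 and 1 of B, Q elements each
tile = outer_product_accumulate([[0] * P for _ in range(P)], va, vb, P, Q)
# tile now equals [[1, 2], [3, 4]] multiplied by [[5, 7], [6, 8]]
```

Under this assumption, the rearranged layout is exactly what the instruction consumes: each vector register already holds Q consecutive elements along the shared dimension for each of P rows or columns.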
[0035] In one possible implementation, after determining the first and second matrices to be multiplied, the following is also included:
[0036] Based on the type of the second element, determine the second and third matrix registers with a capacity of P×P;
[0037] Based on the type of the first element, determine the second and third vector registers with a capacity of P×Q;
[0038] In the second matrix, the first elements of two consecutive columns are stored into the second vector register and the third vector register respectively.
[0039] After reading the first element from the first vector register, the process also includes:
[0040] The first vector register is specified to perform outer product calculations with the second and third vector registers respectively;
[0041] Write the outer product calculation result corresponding to the second vector register to the second matrix register, and write the outer product calculation result corresponding to the third vector register to the third matrix register.
[0042] In this embodiment, only the first matrix needs to be rearranged in memory based on matrix registers. When the memory rearrangement process of the second matrix does not need to be considered or has been pre-rearranged, matrix multiplication can be performed using other matrix registers corresponding to the second element, thereby improving the utilization of matrix registers, eliminating memory write-back during memory rearrangement, and improving performance.
[0043] A second aspect of this application provides an element reading device for matrix multiplication, comprising:
[0044] The acquisition module is used to determine the first and second matrices to be multiplied, with the first matrix being stored contiguously in the row dimension.
[0045] The determination module is used to determine a first vector register with a capacity of P×Q, where P and Q are both integers greater than 1, based on the type of the first element in the first matrix and the second matrix.
[0046] The determination module is also used to determine a first matrix register with a capacity of P×P based on the type of the second element in the third matrix, wherein the element type of the third matrix is determined based on matrix multiplication of the first and second matrices;
[0047] The rearrangement module is used to, when performing matrix multiplication on the first matrix and the second matrix, merge every Q consecutive first elements in the row dimension of the first matrix into a first combined element, and store the first combined elements in the first matrix unit into the first matrix register in groups. The first matrix unit is a submatrix of the first matrix that includes P rows and P columns of first combined elements. A group of first combined elements corresponds to P consecutive first combined elements in one direction.
[0048] The transfer module is used to retrieve the first combined element from the first matrix register in the first dimension direction, expand it into the first element, and then store it into the first vector register. The first dimension corresponds to the column dimension of the first matrix.
[0049] The read module is used to read the first element from the first vector register.
[0050] In one possible implementation, the transfer module is further configured to, after retrieving a group of first combined elements from the first matrix register each time, transpose the first combined elements in the second matrix unit and store them back into the first matrix register, where the second matrix unit is the next submatrix of the first matrix unit in the first matrix.
[0051] In one possible implementation, the second matrix is stored contiguously along the column dimension;
[0052] The determination module is also used to determine a second matrix register with a capacity of P×P based on the type of the second element; and to determine a second vector register with a capacity of P×Q based on the type of the first element.
[0053] The rearrangement module is also used to, when performing matrix multiplication on the first matrix and the second matrix, merge every Q consecutive first elements in the column dimension into a second combined element in the second matrix, and store the second combined elements in the third matrix unit into the second matrix register in groups. The third matrix unit is a submatrix in the second matrix that includes P rows and P columns of second combined elements.
[0054] The transfer module is also used to retrieve the second combined element from the second matrix register in the second dimension direction, expand it into the first element, and then store it into the second vector register. The second dimension corresponds to the row dimension of the second matrix.
[0055] The read module is also used to read the first element from the second vector register.
[0056] In one possible implementation, the transfer module is also used to, after retrieving a group of second combined elements from the second matrix register each time, transpose the second combined elements in the fourth matrix unit and store them in the second matrix register, where the fourth matrix unit is the next submatrix of the third matrix unit in the second matrix.
[0057] In one possible implementation, the determining module is also used to determine the corresponding third matrix register with a capacity of P×P based on the type of the second element;
[0058] It also includes a calculation module, which is used to specify the first vector register and the second vector register to perform the outer product calculation; and to write the outer product calculation result into the third matrix register.
[0059] In one possible implementation, the read module is also used to determine the second matrix register and the third matrix register with a capacity of P×P based on the type of the second element;
[0060] Based on the type of the first element, determine the second and third vector registers with a capacity of P×Q.
[0061] The transfer module is also used to store the first elements of two consecutive columns of the second matrix into the second vector register and the third vector register respectively.
[0062] It also includes a calculation module, which is used to specify the first vector register to perform outer product calculations with the second vector register and the third vector register respectively;
[0063] Write the outer product calculation result corresponding to the second vector register to the second matrix register, and write the outer product calculation result corresponding to the third vector register to the third matrix register.
[0064] A third aspect of this application provides a computing device, including a processor;
[0065] The processor is configured to execute computer programs or computer instructions stored in memory to perform the method described in the first aspect above.
[0066] In one possible implementation, the memory is also included.
[0067] The fourth aspect of this application provides a computer program product containing instructions that, when executed on a computer, cause the computer to perform the implementation as described in the first aspect.
[0068] The fifth aspect of this application provides a computer-readable storage medium including computer instructions that, when executed on a computer, cause the computer to perform the implementation as described in the first aspect.
[0069] For the beneficial effects of the technical solutions provided in the second to fifth aspects above, refer to the beneficial effects of the technical solutions in the first aspect; they are not repeated here.
Attached Figure Description
[0070] Figure 1 shows a schematic diagram of data from some typical matrix registers;
[0071] Figure 2 shows the P and Q values in some typical scenarios;
[0072] Figure 3 is a schematic diagram of memory rearrangement in traditional technology;
[0073] Figure 4 is a flowchart of the element reading method for matrix multiplication provided in the embodiments of this application;
[0074] Figure 5 is a schematic diagram of memory rearrangement in an embodiment of this application;
[0075] Figure 6 is a schematic diagram of the interleaved storage of rows and columns in a matrix register provided in an embodiment of this application;
[0076] Figure 7 is a flowchart of the element reading method for matrix multiplication provided in the embodiments of this application;
[0077] Figures 8a to 8c are schematic diagrams of the principle of the outer product instruction when P=4 and Q=2;
[0078] Figures 9a to 9c are schematic diagrams illustrating the principle of the outer product instruction when P = 8 and Q = 4;
[0079] Figure 10 is a schematic diagram of the matrix multiplication calculation process provided in the embodiments of this application;
[0080] Figure 11 is a schematic diagram of the matrix multiplication calculation process provided in the embodiment of this application;
[0081] Figure 12 is a schematic diagram of the element reading device for matrix multiplication provided in an embodiment of this application;
[0082] Figure 13 is a schematic diagram of the structure of the computing device provided in the embodiment of this application.
Detailed Implementation
[0083] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Those skilled in the art will understand that with the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0084] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not explicitly listed or inherent to such processes, methods, products, or devices. The naming or numbering of steps appearing in this application does not imply that the steps in the method flow must be performed in the chronological / logical order indicated by the naming or numbering. The execution order of named or numbered process steps can be changed according to the desired technical purpose, as long as the same or similar technical effect is achieved. The division of units in this application is a logical division. In practical applications, there may be other division methods. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be through some interface, and the indirect coupling or communication connection between units may be electrical or other similar forms, none of which are limited in this application. Furthermore, the units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed among multiple circuit units. 
Some or all of the units can be selected to achieve the purpose of the solution in this application according to actual needs.
[0085] General Matrix-matrix Multiplication (GEMM) is the general form of matrix-matrix multiplication: it multiplies two input matrices to obtain a single output matrix. Given matrices A, B, and C and scalars α and β, GEMM computes the product of A and B, scales it by α, adds C scaled by β, and stores the result in matrix C, expressed by the formula C←αAB+βC.
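As a point of reference for the formula above, the following is a minimal, unoptimized sketch of C←αAB+βC on lists of lists. It illustrates only the arithmetic; the function name `gemm` and the argument order are chosen here for illustration.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A x B) + beta * C for list-of-lists matrices.

    A is M x K, B is K x N, C is M x N.
    """
    M, K, N = len(A), len(B), len(B[0])
    return [[alpha * sum(A[m][k] * B[k][n] for k in range(K)) + beta * C[m][n]
             for n in range(N)]
            for m in range(M)]

result = gemm(2, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 3, [[1, 1], [1, 1]])
```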
[0086] Matrices can be stored in various formats. Matrix A (dimension [M, K]) is stored contiguously along dimension M when it is column-major without transposition or row-major with transposition, and contiguously along dimension K when it is row-major without transposition or column-major with transposition. Matrix B (dimension [K, N]) is stored contiguously along dimension K when it is column-major without transposition or row-major with transposition, and contiguously along dimension N when it is row-major without transposition or column-major with transposition.
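The relationship between storage order, transposition, and the contiguous dimension can be checked with a short offset calculation. This is a sketch under the usual convention that "transposed" means the buffer holds the transpose of the logical matrix; the helper names `offset` and `contiguous_dim` are hypothetical.

```python
def offset(m, k, M, K, order, transposed):
    """Linear offset of logical element A[m][k] of an M x K matrix.

    order is "row" or "col" (major order of the buffer);
    transposed tells whether the buffer holds the transpose of A.
    """
    if transposed:
        # The buffer stores a K x M matrix, so swap indices and shape.
        m, k, M, K = k, m, K, M
    return m * K + k if order == "row" else k * M + m

def contiguous_dim(order, transposed, M=3, K=5):
    """Return "M" or "K": the logical dimension whose neighbours are adjacent."""
    base = offset(0, 0, M, K, order, transposed)
    step_m = offset(1, 0, M, K, order, transposed) - base
    return "M" if step_m == 1 else "K"
```

For example, `contiguous_dim("col", False)` and `contiguous_dim("row", True)` both give `"M"`, while the other two combinations give `"K"`.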
[0087] In some cases, the result of a matrix computation (matrix C) requires a greater range and precision than the input matrices (matrices A and B). Therefore, the element type of matrix C can be wider than that of matrices A and B. For example, matrices A and B may hold 32-bit floating-point numbers while matrix C holds 64-bit floating-point numbers, or matrices A and B may hold 8-bit integers while matrix C holds 32-bit integers.
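A short calculation shows why the output type needs to be wider. The sketch below accumulates a dot product of int8 values and checks which signed widths can represent the result; the vector length K = 64 is an arbitrary value chosen for illustration.

```python
def fits(value, bits):
    """True if `value` is representable as a signed integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

K = 64
a = [127] * K          # int8 inputs at the top of their range
b = [127] * K
dot = sum(x * y for x, y in zip(a, b))   # 64 * 16129 = 1032256

too_narrow = [bits for bits in (8, 16) if not fits(dot, bits)]
```

Here the accumulated value fits in neither 8 nor 16 bits, but it does fit in 32 bits.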
[0088] Scalable Vector Extension (SVE) is an instruction set extension introduced in the ARMv8-A architecture that provides scalable vector lengths, thereby improving the efficiency and flexibility of vector processing. SVE defines a series of vector registers Zn. Assuming a vector register can hold VL bytes, when it holds int8 data, Zn.B (also written as Zn, where ".B" indicates an 8-bit element width) can store VL int8 elements; when it holds int32 data, Zn.S (also written as Zn, where ".S" indicates a 32-bit element width) can store VL / 4 int32 elements.
[0089] Scalable Matrix Extension (SME) is an instruction set extension introduced in the ARMv9-A architecture. Building upon SVE, it adds the ability to process matrices and can be used to accelerate operations such as GEMM. SME defines a series of matrix registers ZAx. Assuming a vector register can store VL bytes, the matrix registers can store a total of VL×VL bytes. When holding int8 data, matrix register ZA0.B (also written as ZA0) can store VL×VL int8 elements; when holding int32 data, the four matrix registers ZA0.S, ZA1.S, ZA2.S, and ZA3.S can each store (VL / 4)×(VL / 4) int32 elements. Figure 1 shows a schematic diagram of some typical matrix registers.
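The capacities stated above follow from simple arithmetic, sketched below. The helper names `sve_lanes` and `sme_tiles` are hypothetical, and the example vector length of 32 bytes (256 bits) is an assumption chosen to match the later examples.

```python
def sve_lanes(vl_bytes, elem_bits):
    """Number of elements of the given width held by one vector register."""
    return vl_bytes * 8 // elem_bits

def sme_tiles(vl_bytes, elem_bits):
    """Return (number of matrix registers, rows, cols) for an element width."""
    lanes = sve_lanes(vl_bytes, elem_bits)
    total_bytes = vl_bytes * vl_bytes             # total matrix storage
    tile_bytes = lanes * lanes * (elem_bits // 8)
    return total_bytes // tile_bytes, lanes, lanes

int8_layout = sme_tiles(32, 8)     # one 32 x 32 register of int8
int32_layout = sme_tiles(32, 32)   # four 8 x 8 registers of int32
```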
[0090] GEMM implementations based on matrix registers typically perform memory rearrangement operations on the input matrix before matrix multiplication to ensure that memory accesses are as continuous as possible during the matrix multiplication process.
[0091] Suppose a vector register can hold P elements of the element type of matrix C, or P × Q elements of the element type of matrices A and B. Figure 2 shows the values of P and Q in some typical scenarios.
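Figure 2 is not reproduced here, but P and Q follow directly from the definition above. The sketch below derives them from the vector length and the two element widths; the function name `p_and_q` is hypothetical, and the second example (128-bit vectors with 16-bit inputs and 32-bit outputs) is an illustrative combination rather than one taken from Figure 2.

```python
def p_and_q(vector_bits, in_bits, out_bits):
    """Derive P and Q from the vector length and the element widths."""
    P = vector_bits // out_bits   # output-type elements per vector register
    Q = out_bits // in_bits       # input elements per output-type element
    return P, Q

int8_case = p_and_q(256, 8, 32)     # int8 inputs, int32 outputs
half_case = p_and_q(128, 16, 32)    # 16-bit inputs, 32-bit outputs
```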
[0092] If the matrix is stored contiguously along dimension K, memory rearrangement requires traversing a block of P rows and Q columns of the matrix in order and merging it into one row, then merging the next block of P rows and Q columns to its right into one row, and so on; after the current P rows have been processed, the next P rows are processed in the same way. Please refer to Figure 3, which is a schematic diagram of memory rearrangement in traditional techniques, where P = 8 and Q = 4.
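The target layout described in this paragraph can be written down as a reference packing routine. This is a minimal sketch that assumes the matrix dimensions are multiples of P and Q; the function name `rearrange` is hypothetical, and the routine describes the desired result rather than any particular instruction sequence.

```python
def rearrange(A, P, Q):
    """Reference packing of a row-major M x K matrix.

    Assumes M is a multiple of P and K is a multiple of Q.
    Each output row holds one block of P rows and Q columns.
    """
    M, K = len(A), len(A[0])
    packed = []
    for r0 in range(0, M, P):            # one band of P rows at a time
        for k0 in range(0, K, Q):        # move right Q columns at a time
            packed.append([A[r0 + r][k0 + q]
                           for r in range(P) for q in range(Q)])
    return packed

A = [[10 * r + c for c in range(4)] for r in range(2)]
packed = rearrange(A, P=2, Q=2)
```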
[0093] Existing memory rearrangement schemes include:
[0094] The zip-based approach reads P rows and P×Q columns of data into P vector registers and uses multiple zip instructions to move the data, achieving a transpose-like function. During this movement, additional registers are needed to store intermediate data. This approach executes P·log₂P zip instructions, which is excessive and impacts CPU pipeline execution. Furthermore, it requires at least P+1 vector registers (P for memory rearrangement and 1 for storing intermediate data), making it difficult for the CPU to perform other tasks simultaneously.
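The instruction count can be reproduced with a small model of a zip-based transpose. The sketch below assumes zip operations that interleave the low or high halves of two registers, in the style of ZIP1/ZIP2 instructions, and treats each list entry as one combined element. It models only the number of zip operations, not register allocation, and the function names are hypothetical.

```python
def zip1(a, b):
    """Interleave the low halves of a and b (ZIP1-style, assumed semantics)."""
    h = len(a) // 2
    return [x for pair in zip(a[:h], b[:h]) for x in pair]

def zip2(a, b):
    """Interleave the high halves of a and b (ZIP2-style, assumed semantics)."""
    h = len(a) // 2
    return [x for pair in zip(a[h:], b[h:]) for x in pair]

def zip_transpose(rows):
    """Transpose a P x P block (P a power of two) using only zip1/zip2.

    Returns the transposed block and the number of zip operations used.
    """
    P = len(rows)
    count = 0
    rounds = P.bit_length() - 1          # log2(P) rounds
    for _ in range(rounds):
        nxt = []
        for i in range(P // 2):
            nxt.append(zip1(rows[i], rows[i + P // 2]))
            nxt.append(zip2(rows[i], rows[i + P // 2]))
            count += 2                   # P operations per round in total
        rows = nxt
    return rows, count

block = [[8 * r + c for c in range(8)] for r in range(8)]
transposed, n_zips = zip_transpose(block)
```

With P = 8, the model uses 8 operations in each of 3 rounds, which matches the count of P·log₂P.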
[0095] The column-based approach uses the `gather` instruction to load P rows and Q columns of data into a vector register and then writes the vector register back to memory, thereby achieving memory rearrangement. However, because the P rows and Q columns of data are spread over P non-contiguous memory regions, a significant amount of time is spent executing the read instruction, resulting in substantial access overhead.
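The non-contiguity can be made concrete by listing the addresses that one such gather touches. This sketch assumes a row-major matrix whose row stride is given in elements; the helper names and the stride of 64 elements are illustrative assumptions.

```python
def gather_offsets(row_stride, P, Q, k0=0, elem_bytes=1):
    """Byte offsets touched when gathering a P x Q block of a row-major matrix."""
    return [(r * row_stride + k0 + q) * elem_bytes
            for r in range(P) for q in range(Q)]

def contiguous_runs(offsets, elem_bytes=1):
    """Count maximal runs of back-to-back addresses."""
    runs = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur - prev != elem_bytes:
            runs += 1
    return runs

runs = contiguous_runs(gather_offsets(row_stride=64, P=8, Q=4))
```

With P = 8 and Q = 4, the 32 gathered bytes fall into 8 separate runs of 4 bytes each, one per row.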
[0096] To address the aforementioned issues, this application provides a method and related apparatus for reading elements in matrix multiplication. First, the corresponding vector registers and matrix registers are determined based on the element type of the two input matrices and the element type of the output matrix. Because the output matrix elements are wider than the input matrix elements, multiple input matrix elements stored consecutively in the row dimension are combined into a single element and stored in the matrix register corresponding to the output matrix. The combined elements are then retrieved column-wise from the matrix register and transferred to the vector register corresponding to the input matrix, where they are expanded back to the original input element type, which achieves the memory rearrangement. Elements can subsequently be read from the vector register for the matrix multiplication calculation. Because the rearrangement is performed through the matrix register, memory accesses remain contiguous, fewer instructions are needed, and inefficient instructions are avoided, which effectively reduces memory access latency and improves computational efficiency.
[0097] The following describes the element reading method for matrix multiplication provided in the embodiments of this application. Please refer to Figure 4, which is a flowchart of the element reading method for matrix multiplication provided in the embodiments of this application, including:
[0098] 401. Determine the first and second matrices to be multiplied, with the first matrix stored contiguously along the row dimension.
[0099] To perform matrix multiplication, the two input matrices of GEMM, namely the first matrix and the second matrix, must first be determined. The first matrix is stored contiguously along the row dimension, meaning that the elements of each row of the first matrix are read from consecutive memory addresses.
[0100] 402. Based on the type of the first element in the first matrix and the second matrix, determine the first vector register with a capacity of P×Q, where P and Q are both integers greater than 1.
[0101] The type of the first element is the element type of the input matrices. As shown in Figure 2, the vector register parameters depend on the element type of the input matrices and on the vector length. For example, with a vector length of 256 bits and a first element type of int8, the first vector register has P=8 and Q=4, i.e., it holds 32 int8 elements.
[0102] 403. Based on the type of the second element in the third matrix, determine the first matrix register with a capacity of P×P, wherein the element type of the third matrix is determined based on matrix multiplication of the first and second matrices.
[0103] The type of the second element is the element type of the output matrix. As shown in Figure 2, when the vector length is 256 bits and the type of the first element is int8, the corresponding type of the second element is int32. In this case, P=8 for the first matrix register, i.e., it holds 8×8 int32 elements.
[0104] 404. When performing matrix multiplication on the first matrix and the second matrix, in the first matrix, every Q consecutive first elements in the row dimension are merged into a first combined element, and the first combined elements in the first matrix unit are stored in the first matrix register in groups. The first matrix unit is a submatrix of the first matrix that includes P rows and P columns of first combined elements. A group of first combined elements corresponds to P consecutive first combined elements in one direction.
[0105] When performing matrix multiplication on the first matrix and the second matrix, the elements of the input matrices need to be rearranged in memory. In this embodiment, since the first matrix is stored contiguously in the row dimension, every Q consecutive first elements in a row of the first matrix are merged into one first combined element. The first combined element is treated as having the same type as the second element of the output matrix, because Q first elements occupy the width of one second element. Therefore, the Q first elements can be stored in one register unit of the matrix register corresponding to the output matrix.
[0106] Specifically, the first matrix includes multiple first elements, which form multiple submatrices of P rows and P × Q columns. After every Q first elements are merged into a first combined element, each submatrix includes P rows and P columns of first combined elements. These P × P first combined elements can be stored in the first matrix register row by row or column by column (in the first direction). Since Q first elements occupy the same width as one second element, the first matrix register holds the equivalent of P × P elements of the second element type.
[0107] Please refer to Figure 5 for details. Figure 5 is a schematic diagram of memory rearrangement in an embodiment of this application, where the first element is of type int8, the second element is of type int32, P=8, and Q=4.
[0108] As shown in Figure 5, the first matrix is stored continuously in the row dimension. The first element type (int8) of P rows and P×Q columns in the first matrix is reinterpreted as the first combined element of P rows and P columns. The type of the first combined element is equivalent to the second element type (int32). It is stored in the first matrix register, so that the first matrix register includes P×P (8 rows and 8 columns) first combined elements (of type int32).
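The reinterpretation shown in Figure 5 can be sketched in a few lines of Python. This is only an illustrative model, not the implementation of this application: the helper name `pack_block`, the sample data, and the little-endian byte order are assumptions made for the example.

```python
import struct

P, Q = 8, 4  # values from the Figure 5 example: int8 inputs, int32 combined elements

def pack_block(rows):
    """View each row of P*Q contiguous int8 values as P int32 combined elements."""
    tile = []
    for row in rows:
        raw = struct.pack(f"{P * Q}b", *row)             # P*Q signed bytes, contiguous
        tile.append(list(struct.unpack(f"<{P}i", raw)))  # same bytes read as P int32
    return tile

# Hypothetical P x (P*Q) int8 block of the first matrix (values span -128..127).
block = [[(r * P * Q + c) % 256 - 128 for c in range(P * Q)] for r in range(P)]

# Model of the first matrix register: P x P combined elements.
tile = pack_block(block)
print(len(tile), "rows of", len(tile[0]), "combined elements")
```

No arithmetic is performed on the data here: the bytes are only viewed under a wider type, which is why the original int8 values can be recovered later by the reverse reinterpretation.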
[0109] 405. The first combined element is taken out from the first matrix register in the first dimension direction, expanded into the first element, and then stored in the first vector register. The first dimension corresponds to the column dimension of the first matrix.
[0110] Understandably, after storing the first element of the first matrix in the form of the first combined element into the first matrix register, a memory rearrangement operation is performed. Specifically, the first combined element is retrieved from the first matrix register in groups according to the first dimension direction. The first dimension refers to the dimension corresponding to the column dimension of the first matrix. For example, when the form in which the first combined element is stored in the first matrix register is the same as the arrangement form of the first matrix, which is equivalent to the storage form from the first figure to the second figure in Figure 5, then when the first combined element is retrieved from the first matrix register, it is retrieved column by column. If the first combined element is transposed when stored in the first matrix register, that is, the arrangement form after the first combined element is stored in the first matrix register is equivalent to transposing the matrix in the second figure of Figure 5, with the first row being combined elements 0 to 7, the second row being combined elements 8 to 15, and so on, then when the first combined element is retrieved from the first matrix register, it is retrieved row by row.
[0111] 406. Read the first element from the first vector register.
[0112] After memory rearrangement using the method provided in this application embodiment, each first element in the first matrix is stored in the corresponding first vector register, so the first element can be read from the first vector register for matrix multiplication.
[0113] The matrix multiplication element reading method provided in this application first determines the corresponding vector register and matrix register based on the element types of the two input matrices and the element type of the output matrix. Given that the output matrix elements are wider than the input matrix elements, multiple elements of the input matrix stored consecutively in the row dimension are combined into a single element and stored in the matrix register corresponding to the output matrix. Then, the combined elements are retrieved column-wise from the matrix register and transferred to the vector register corresponding to the input matrix. In the vector register, they are expanded back to the original input matrix element type, achieving memory rearrangement. This facilitates subsequent element reading from the vector register for matrix multiplication calculations. This method allows the matrix register to be used for memory rearrangement, with contiguous memory access, fewer instructions, and no inefficient instructions, effectively reducing memory access latency and improving computational efficiency.
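Steps 404 to 406 can be modelled end to end as follows. This is a minimal sketch under stated assumptions: P = 4 is chosen only to keep the example small, the helper names `pack_row` and `expand` are hypothetical, and Python lists stand in for the matrix and vector registers.

```python
import struct

P, Q = 4, 4  # assumed small sizes: int8 inputs, int32 combined elements

def pack_row(row):
    """Step 404: view P*Q contiguous int8 values as P int32 combined elements."""
    return list(struct.unpack(f"<{P}i", struct.pack(f"{P * Q}b", *row)))

def expand(group):
    """Step 405: view P int32 combined elements as P*Q int8 values again."""
    return list(struct.unpack(f"{P * Q}b", struct.pack(f"<{P}i", *group)))

# Hypothetical P x (P*Q) int8 block of the first matrix, stored row by row.
block = [[r * P * Q + c - 32 for c in range(P * Q)] for r in range(P)]

# Model of the first matrix register, filled row by row.
tile = [pack_row(row) for row in block]

# Retrieve column j of the tile and expand it into the first vector register.
vectors = [expand([tile[r][j] for r in range(P)]) for j in range(P)]

# Vector j should hold, for each of the P rows, the Q elements in columns j*Q..j*Q+Q-1.
expected = [[block[r][j * Q + q] for r in range(P) for q in range(Q)]
            for j in range(P)]
print(vectors == expected)
```

Each resulting vector is a P × Q slice of the block in row-major order, which is the layout the outer product instruction expects for its first operand.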
[0114] In one possible implementation, after step 405 of the embodiment corresponding to FIG4, the method further includes:
[0115] 4051. Each time a group of first combined elements is retrieved from the first matrix register, first combined elements in the second matrix unit are transposed and stored in the first matrix register, where the second matrix unit is the next submatrix after the first matrix unit in the first matrix.
[0116] In conventional techniques, when a batch of elements (i.e., the first matrix unit, comprising P rows and P columns of first combined elements) is rearranged in memory, the next batch of elements can only be stored after all the elements in the first matrix register have been retrieved. This tends to concentrate memory load instructions in one place, which hinders CPU pipeline execution.
[0117] To address this issue, this application proposes that after the i-th column of the first matrix register is retrieved, that column can be filled with the i-th row of the next batch of elements (the second matrix unit, comprising P rows and P columns of first combined elements); similarly, after the i-th row of the first matrix register is retrieved, that row can be filled with the i-th row of the next batch of elements. By alternating between rows and columns, the space of the first matrix register is fully utilized and the memory loading pressure is spread out.
[0118] The method will now be described in detail with reference to the accompanying drawings. Please refer to Figure 6, which is a schematic diagram of the interleaved storage of rows and columns in a matrix register provided in an embodiment of this application.
[0119] Step ① in Figure 6 corresponds to step 404 in the embodiment of Figure 4: the first combined elements in the first matrix unit are stored into the first matrix register. Step ② in Figure 6 corresponds to step 405 in the embodiment of Figure 4: the first combined elements are retrieved in groups from the first matrix register along the first dimension. At this time, the first dimension corresponds to the column dimension of the first matrix, and also to the column dimension of the first matrix unit. Step ③ in Figure 6 corresponds to step 4051 above: after a group of first combined elements is retrieved from the first matrix register, a group of combined elements is taken from the second matrix unit (the submatrix following the first matrix unit in the first matrix), transposed, and stored in the vacated space of the first matrix register, replacing the retrieved first combined elements. Steps ④ and ⑤, ⑥ and ⑦, and ⑧ and ⑨ in Figure 6 are similar to steps ② and ③: each column of first combined elements is retrieved from the first matrix register, and for each retrieved group, a row of first combined elements from the second matrix unit is transposed and stored in the vacated space of the first matrix register, until all the first combined elements in the second matrix unit are stored in the first matrix register. The submatrix following the second matrix unit in the first matrix is the third matrix unit; please refer to steps ⑩ and ⑨. Because the first combined elements of the second matrix unit were transposed when stored into the first matrix register, the first combined elements are again retrieved from the first matrix register in groups along the first dimension, and the first dimension still corresponds to the column dimension of the first matrix. However, that column dimension has now been transposed and lies along the row dimension of the first matrix register, so the groups need to be retrieved row by row from the first matrix register. Similarly, for each group of first combined elements retrieved from the first matrix register, a group of first combined elements from the third matrix unit is stored into the corresponding position in the first matrix register.
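The row-column interleaving of Figure 6 can be checked with a small simulation. The sketch below is illustrative only: combined elements are modelled as plain integers, P = 4 is an assumed size, and the names `make_unit` and `interleaved_groups` are hypothetical.

```python
P = 4  # assumed tile size; combined elements are modelled as plain integers

def make_unit(u):
    """Hypothetical matrix unit u: element (r, c) is encoded as u*100 + r*10 + c."""
    return [[u * 100 + r * 10 + c for c in range(P)] for r in range(P)]

def interleaved_groups(units):
    """Yield the groups read from one tile while the next unit is stored in place."""
    tile = [row[:] for row in units[0]]        # load the first unit row by row
    for u in range(len(units)):
        nxt = units[u + 1] if u + 1 < len(units) else None
        by_column = (u % 2 == 0)               # even units sit untransposed in the tile
        for i in range(P):
            if by_column:
                yield [tile[r][i] for r in range(P)]   # read column i
                if nxt is not None:
                    for r in range(P):
                        tile[r][i] = nxt[i][r]         # row i of next unit fills column i
            else:
                yield tile[i][:]                       # read row i
                if nxt is not None:
                    tile[i] = nxt[i][:]                # row i of next unit fills row i

units = [make_unit(u) for u in range(3)]
groups = list(interleaved_groups(units))

# Every group should be one column of an original unit, in unit order.
expected = [[unit[r][c] for r in range(P)] for unit in units for c in range(P)]
print(groups == expected)
```

In this model each load of the next unit directly follows a read of the current unit, so loads are spread across the batch instead of being issued together after the tile has been emptied.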
[0120] It is understandable that, since the second matrix is also an input matrix, in one possible implementation, when the second matrix is stored contiguously in the column dimension, it can also be memory-rearranged through the matrix register in the same way as the first matrix described above, and the matrix register can be used for row-column interleaving. Please refer to Figure 7, which is a flowchart of the element reading method for matrix multiplication provided in this application embodiment. Based on the embodiment corresponding to Figure 4, it further includes:
[0121] 701. Based on the type of the second element, determine the second matrix register with a capacity of P×P;
[0122] 702. Based on the type of the first element, determine the second vector register with a capacity of P×Q;
[0123] When performing matrix multiplication on the first and second matrices, the following is also included:
[0124] 703. In the second matrix, every Q consecutive first elements in the column dimension are merged into a second combined element, and the second combined elements in the third matrix unit are stored in the second matrix register in groups. The third matrix unit is a submatrix in the second matrix that includes P rows and P columns of second combined elements.
[0125] 704. The second combined element is retrieved from the second matrix register in the second dimension, expanded into the first element, and then stored in the second vector register. The second dimension corresponds to the row dimension of the second matrix.
[0126] 705. Read the first element from the second vector register.
[0127] Steps 701 and 702 follow step 401 in the embodiment corresponding to Figure 4. The key matrix instruction of GEMM is the outer product instruction, which specifies vector registers Zn and Zm, corresponding to the two input matrices respectively. As shown in the embodiment corresponding to Figure 4, the first vector register corresponds to the first matrix, and therefore the second vector register corresponds to the second matrix. Furthermore, depending on the element type of the output matrix, when the second element type is wider than int8 or fp16, it corresponds to at least four matrix registers. As explained in the embodiment corresponding to Figure 4, the first matrix register is used for memory rearrangement of the first matrix; therefore, the second matrix register can be used to rearrange the second matrix.
[0128] When performing matrix multiplication on the first and second matrices, steps 703 to 705 are equivalent to steps 404 to 406 in the corresponding embodiment of Figure 4, that is, the memory rearrangement of the second matrix is achieved through the second matrix register. It should be noted that since the second matrix is stored contiguously along the column dimension, when merging the corresponding second combined elements, Q first elements along the column dimension are merged. Furthermore, when retrieving the second combined elements in groups from the second matrix register along the second dimension, this second dimension corresponds to the row dimension of the second matrix. For other related descriptions, please refer to the relevant content in steps 404 to 406 in the corresponding embodiment of Figure 4, which will not be repeated here.
[0129] For GEMM scenarios where the input elements are of type int8 and the output elements are of type int32 (i.e., the first and second matrices are of type int8 and the third matrix is of type int32), memory rearrangement can reinterpret four int8 elements as one int32 element. Similarly, for GEMM scenarios where the input elements are of type fp16 and the output elements are of type fp32 (i.e., the first and second matrices are of type fp16 and the third matrix is of type fp32), memory rearrangement can reinterpret two fp16 elements as one fp32 element. For GEMM scenarios where the input elements are of type fp16 and the output elements are of type fp64 (i.e., the first and second matrices are of type fp16 and the third matrix is of type fp64), memory rearrangement can reinterpret four fp16 elements as one fp64 element. For GEMM scenarios where the input elements are of type fp32 and the output elements are of type fp64 (i.e., the first and second matrices are of type fp32 and the third matrix is of type fp64), memory rearrangement can reinterpret two fp32 elements as one fp64 element.
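The merge factor Q in each of these scenarios is simply the ratio of the output element width to the input element width. The short sketch below tabulates it; the function name `merge_factor` and the table `ELEMENT_BITS` are hypothetical names introduced for illustration.

```python
# Bit widths of the element types mentioned in the text.
ELEMENT_BITS = {"int8": 8, "fp16": 16, "int32": 32, "fp32": 32, "fp64": 64}

def merge_factor(input_type, output_type):
    """Number of input elements reinterpreted as one output-width element (Q)."""
    in_bits, out_bits = ELEMENT_BITS[input_type], ELEMENT_BITS[output_type]
    if out_bits % in_bits:
        raise ValueError("output width must be a multiple of input width")
    return out_bits // in_bits

SCENARIOS = [("int8", "int32"), ("fp16", "fp32"), ("fp16", "fp64"), ("fp32", "fp64")]
factors = {pair: merge_factor(*pair) for pair in SCENARIOS}
for (src, dst), q in factors.items():
    print(f"{src} -> {dst}: Q = {q}")
```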
[0130] Furthermore, in one possible implementation, based on the above steps, after step 704, the following is also included:
[0131] 7041. After each set of second combination elements is retrieved from the second matrix register, the second combination elements in the fourth matrix unit are transposed and stored in the second matrix register. The fourth matrix unit is the next submatrix of the third matrix unit in the second matrix.
[0132] Understandably, to avoid the situation where memory load instructions are concentrated in one place because all elements in the second matrix register must be retrieved before elements can be stored in it, the second matrix register can also adopt the row-column interleaving storage method shown in Figure 6. In this embodiment, the third matrix unit and the fourth matrix unit are sub-matrices of the second matrix, and the fourth matrix unit is the next sub-matrix of the third matrix unit. For related details, please refer to the relevant content in the embodiment corresponding to Figure 6 above, which will not be repeated here.
[0133] In one possible implementation, considering that the type of the second element (the element of the output matrix) is usually wider than int8 and fp16, the number of matrix registers allocated for the second element type is usually greater than 2 (see Figure 1). For example, int32 and fp32 correspond to 4 matrix registers, and fp64 corresponds to 8 matrix registers. Based on this, this application embodiment also provides a method to improve the utilization of matrix registers, specifically:
[0134] Following step 401, the following is also included:
[0135] 706. Based on the type of the second element, determine a third matrix register with a capacity of P×P;
[0136] Additionally, following step 406 and / or step 705, the following is also included:
[0137] 707. Specify the first vector register and the second vector register to perform the outer product calculation;
[0138] 708. Write the outer product calculation result into the third matrix register.
[0139] When performing GEMM on matrices A and B, the key matrix instruction is the outer product instruction. This instruction specifies the vector registers Zn and Zm (each holding P×Q elements of the input element type of matrices A and B) and the matrix register ZAx (holding P×P elements of the output element type of matrix C). Zn is arranged into a P×Q matrix in row-major order, and Zm is arranged into a Q×P matrix in column-major order. The product of these two matrices is a P×P matrix, which is then added to ZAx. For example, please refer to Figures 8a to 8c, which illustrate the principle of the outer product instruction when P = 4 and Q = 2.
[0140] As shown in Figures 8a to 8c, the GEMM implementation based on the matrix register can be understood as consisting of three loops: the outermost loop is M, the next outermost loop is N, and the innermost loop is K. The first two loops iterate through M and N, ensuring that the innermost loop only needs to handle the matrix multiplication of P rows of matrix A and P columns of matrix B. This process is equivalent to dividing the P rows of matrix A into several P×Q blocks and the P columns of matrix B into several Q×P blocks, then using the outer product instruction to calculate and accumulate the results onto ZAx.
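The semantics of the outer product instruction and the innermost K loop described above can be modelled as follows. This is a sketch for illustration, not the hardware definition: registers are Python lists, the function names are hypothetical, and K is assumed to be a multiple of Q.

```python
P, Q = 4, 2  # sizes used in the Figures 8a to 8c illustration

def outer_product_accumulate(za, zn, zm):
    """za (P x P) += zn viewed as P x Q row-major times zm viewed as Q x P column-major."""
    for i in range(P):
        for j in range(P):
            za[i][j] += sum(zn[i * Q + q] * zm[j * Q + q] for q in range(Q))

def gemm_block(a_rows, b_cols):
    """Innermost K loop: P rows of A (each of length K) times P columns of B."""
    k = len(a_rows[0])
    za = [[0] * P for _ in range(P)]
    for k0 in range(0, k, Q):
        zn = [a_rows[i][k0 + q] for i in range(P) for q in range(Q)]  # P x Q block of A
        zm = [b_cols[j][k0 + q] for j in range(P) for q in range(Q)]  # Q x P block of B
        outer_product_accumulate(za, zn, zm)
    return za

# Hypothetical data with K = 6.
a_rows = [[i + k for k in range(6)] for i in range(P)]
b_cols = [[j * k - 3 for k in range(6)] for j in range(P)]
c = gemm_block(a_rows, b_cols)
print(c[0])
```

The M and N loops would simply call `gemm_block` once per P × P block of the output matrix, each time with a different set of P rows of A and P columns of B.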
[0141] In this embodiment of the application, the memory rearrangement of the first matrix can be implemented by the first matrix register, the memory rearrangement of the second matrix can be implemented by the second matrix register, and the outer product calculation result can be stored by the third matrix register.
[0142] To facilitate understanding, the above scheme will be described in detail below with reference to the accompanying diagram. When each element in input matrix A (the first matrix) and input matrix B (the second matrix) is of type int8, and each element in output matrix C (the third matrix) is of type int32, assuming a vector register can hold P int32 elements, or P × Q int8 elements, then Q = 4. The value of P depends on the platform: on a 256-bit vector platform, P = 8; on a 512-bit vector platform, P = 16.
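The relationship between the vector width and the values of P and Q can be written out directly. The sketch below assumes, as stated above, that one vector register holds P output-width elements; the function name `tile_parameters` is hypothetical.

```python
def tile_parameters(vector_bits, input_bits=8, output_bits=32):
    """Return (P, Q) for a vector register of the given width in bits."""
    if vector_bits % output_bits or output_bits % input_bits:
        raise ValueError("widths must divide evenly")
    p = vector_bits // output_bits   # output-type elements per vector register
    q = output_bits // input_bits    # input elements merged into one combined element
    return p, q

for bits in (256, 512):
    p, q = tile_parameters(bits)
    print(f"{bits}-bit vectors: P = {p}, Q = {q}, {p * q} int8 elements per vector")
```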
[0143] In this embodiment, matrix A is stored contiguously in the row dimension, and matrix B is stored contiguously in the column dimension. The key matrix instruction for GEMM with input elements of type int8 is a 4-way outer product instruction. This instruction specifies the vector registers Zn.B (P × 4 int8) and Zm.B (P × 4 int8), and the matrix register ZAx.S (P × P int32). Zn.B is arranged into a P × 4 matrix in row-major order, and Zm.B is arranged into a 4 × P matrix in column-major order. The matrix multiplication of these two matrices is calculated to obtain a P × P int32 matrix, which is then added to ZAx.S. On a 256-bit vector platform (P = 8), please refer to Figures 9a to 9c. Figures 9a to 9c are schematic diagrams of the outer product instruction when P = 8 and Q = 4. The GEMM, whose input elements are of type int8, uses int32 type matrix registers. There are four P×P int32 matrix registers, namely ZA0.S, ZA1.S, ZA2.S, and ZA3.S. In this embodiment, during calculation, ZA0.S is used for memory rearrangement of matrix A, ZA1.S for memory rearrangement of matrix B, ZA2.S for matrix multiplication calculation, and ZA3.S is reserved. As shown in Figure 10, Figure 10 is a schematic diagram of the matrix multiplication calculation process provided in this embodiment, including:
[0144] 1. Store the P rows and P×4 columns of matrix A as int32 type into ZA0.S.
[0145] 2. Store the P rows and P×4 columns of matrix B as int32 type into ZA1.S.
[0146] 3. Clear ZA2.S.
[0147] 4. The first batch of matrix calculations follows the procedure:
[0148] a) The first matrix calculation process is as follows:
[0149] i. Take the first column of ZA0.S and store it in the vector register Z0.S.
[0150] ii. Store the first row of the next batch of P rows and P×4 columns of matrix A into the first column of ZA0.S as an int32 type.
[0151] iii. Take out the first column of ZA1.S and store it in the vector register Z1.S.
[0152] iv. Store the first row of the next batch of P rows and P×4 columns of matrix B into the first column of ZA1.S as an int32 type.
[0153] v. Calculate SMOPA, specifying Z0.B, Z1.B, ZA2.S.
[0154] b) The second matrix calculation repeats the above process with every "first" (index 1) replaced by "second" (index 2).
[0155] c) This process continues until P matrix calculations are performed, at which point this batch of calculations is complete.
[0156] 5. For the second batch of matrix calculations, compared to the first batch, accessing ZA0.S and ZA1.S requires row and column swapping (row and column interleaving). The process is as follows:
[0157] a) The first matrix calculation process is as follows:
[0158] i. Retrieve the first row of ZA0.S and store it in the vector register Z0.S.
[0159] ii. Store the first row of the next batch of P rows and P×4 columns of matrix A into the first row of ZA0.S as an int32 type.
[0160] iii. Take out the first row of ZA1.S and store it in the vector register Z1.S.
[0161] iv. Store the first row of the next batch of P rows and P×4 columns of matrix B into the first row of ZA1.S as an int32 type.
[0162] v. Calculate SMOPA, specifying Z0.B, Z1.B, ZA2.S.
[0163] b) The second matrix calculation repeats the above process with every "first" (index 1) replaced by "second" (index 2).
[0164] c) This process continues until P matrix calculations are performed, at which point this batch of calculations is complete.
[0165] 6. The matrix calculation for the third batch is the same as that for the first batch.
[0166] 7. Continue in this manner until matrices A and B have been traversed. Write the P×P result of ZA2.S back to memory.
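The procedure of Figure 10 can be simulated to confirm that it produces the same result as a direct matrix product. The sketch below is a simplified model under stated assumptions: P = Q = 4 is chosen for brevity, combined elements are modelled as tuples of Q values rather than packed int32 words, matrix B is given as P columns stored contiguously, and all function names are hypothetical.

```python
P, Q = 4, 4  # assumed sizes for brevity; the text uses P = 8 or P = 16

def unit(lines, b):
    """Batch b of P contiguous lines as P x P combined elements (tuples of Q values)."""
    base = b * P * Q
    return [[tuple(line[base + c * Q: base + (c + 1) * Q]) for c in range(P)]
            for line in lines]

def take_and_refill(tile, i, by_column, next_unit):
    """Read group i from the tile, then store row i of the next unit in its place."""
    if by_column:
        group = [tile[r][i] for r in range(P)]
        if next_unit is not None:
            for r in range(P):
                tile[r][i] = next_unit[i][r]
    else:
        group = tile[i][:]
        if next_unit is not None:
            tile[i] = next_unit[i][:]
    return [x for elem in group for x in elem]   # expand combined elements

def smopa(za, zn, zm):
    """Model of the outer product instruction: za += zn (P x Q) times zm (Q x P)."""
    for i in range(P):
        for j in range(P):
            za[i][j] += sum(zn[i * Q + q] * zm[j * Q + q] for q in range(Q))

def gemm_block(a_rows, b_cols):
    batches = len(a_rows[0]) // (P * Q)
    za0, za1 = unit(a_rows, 0), unit(b_cols, 0)  # steps 1 and 2
    za2 = [[0] * P for _ in range(P)]            # step 3
    for b in range(batches):                     # steps 4 to 7
        last = (b + 1 == batches)
        next_a = None if last else unit(a_rows, b + 1)
        next_b = None if last else unit(b_cols, b + 1)
        for i in range(P):
            z0 = take_and_refill(za0, i, b % 2 == 0, next_a)
            z1 = take_and_refill(za1, i, b % 2 == 0, next_b)
            smopa(za2, z0, z1)
    return za2

K = 3 * P * Q  # three batches, so both column-wise and row-wise reads are exercised
a_rows = [[(r * 7 + k * 3) % 11 - 5 for k in range(K)] for r in range(P)]
b_cols = [[(c * 5 + k * 2) % 13 - 6 for k in range(K)] for c in range(P)]
result = gemm_block(a_rows, b_cols)
print(result[0])
```

The next unit is computed up front here only as a modelling convenience; in the described procedure the corresponding row would be loaded from memory at the moment it is stored into the matrix register.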
[0167] In one possible implementation, for example in a linear layer scenario of deep learning, the second matrix is the weight matrix, which does not change during inference, while the first matrix holds the activation values, which change during inference. Therefore, the second matrix can be rearranged in memory in advance, and GEMM can then be performed on the first and second matrices.
[0168] In this embodiment of the application, based on the embodiment corresponding to FIG4, after step 401, the following is further included:
[0169] 4011. Based on the type of the second element, determine the second and third matrix registers with a capacity of P×P;
[0170] 4012. Based on the type of the first element, determine the second and third vector registers with a capacity of P×Q;
[0171] 4013. In the second matrix, the first elements of two consecutive columns are stored in the second vector register and the third vector register respectively.
[0172] In addition, after step 406, the following is also included:
[0173] 4061. Specify that the first vector register is used to perform outer product calculations with the second and third vector registers respectively;
[0174] 4062. Write the outer product calculation result corresponding to the second vector register to the second matrix register, and write the outer product calculation result corresponding to the third vector register to the third matrix register.
[0175] In this embodiment, it is also considered that the type of the second element (the element of the output matrix) is usually wider than int8 and fp16. Therefore, the number of matrix registers allocated for the second element type is usually greater than 2. For example, int32 and fp32 correspond to 4 matrix registers, and fp64 corresponds to 8 matrix registers. Since the second matrix does not change, it can be rearranged in memory in advance without occupying matrix registers. The first matrix does change, so one matrix register is occupied for its memory rearrangement. When the second element is of type int32, 3 matrix registers remain, which can be used for matrix multiplication calculations.
[0176] In step 4011, based on the second element type, a second matrix register and a third matrix register are also determined. In step 4062, the second matrix register and the third matrix register are used to store the results of the two outer product calculations, which indicates that multiple matrix registers can be used to perform GEMM in this embodiment of the application.
[0177] Correspondingly, in step 4012, based on the type of the first element, a second vector register and a third vector register are determined, both used to store first elements of the second matrix, which has been rearranged in memory in advance (see step 4013). In step 4061, after the first elements of the first matrix have been rearranged and stored in the first vector register, the first vector register can be used to perform outer product calculations with the second vector register and the third vector register respectively, thereby improving matrix calculation efficiency.
[0178] To facilitate understanding, the above steps will be described below with reference to the accompanying drawings. In this embodiment of the application, in the GEMM scenario where the input elements are of type int8, matrix A (the first matrix) is stored continuously in the row dimension, and matrix B (the second matrix) needs to be rearranged in memory before the GEMM.
[0179] GEMM with input elements of type int8 uses int32 matrix registers, as shown in Figure 1, corresponding to four P×P int32 matrix registers, namely ZA0.S, ZA1.S, ZA2.S, and ZA3.S. In this embodiment, ZA0.S is used for memory rearrangement of matrix A, and ZA1.S, ZA2.S, and ZA3.S are used for matrix multiplication. It is understood that this embodiment uses three matrix registers for multiplication as an example; in practical applications, the number of matrix registers can be set to other numbers depending on the element type. This application does not limit the specific number of matrix registers.
[0180] Compared to the calculation process in the embodiment corresponding to Figure 10, this embodiment has more matrix registers available for matrix multiplication, thus allowing simultaneous matrix multiplication of P rows of matrix A with P×3 columns of matrix B. As shown in Figure 11, which is a schematic diagram of the matrix multiplication calculation process provided by this embodiment, it includes:
[0181] 1. Store the P rows and P×4 columns of matrix A as int32 type into ZA0.S.
[0182] 2. Clear ZA1.S, ZA2.S, and ZA3.S.
[0183] 3. The first batch of matrix calculations follows the procedure:
[0184] a) The first matrix calculation process is as follows:
[0185] i. Take the first column of ZA0.S and store it in the vector register Z0.S.
[0186] ii. Store the first row of the next batch of P rows and P×4 columns of matrix A into the first column of ZA0.S as an int32 type.
[0187] iii. Divide the first 4 rows and 3P columns of matrix B into three blocks of 4 rows and P columns, and load them into Z1.B, Z2.B, and Z3.B respectively.
[0188] iv. Calculate SMOPA, specifying Z0.B, Z1.B, ZA1.S.
[0189] v. Calculate SMOPA, specifying Z0.B, Z2.B, ZA2.S.
[0190] vi. Calculate SMOPA, specifying Z0.B, Z3.B, ZA3.S.
[0191] b) The second matrix calculation repeats the above process with every "first" (index 1) replaced by "second" (index 2).
[0192] c) This process continues until P matrix calculations are performed, at which point this batch of calculations is complete.
[0193] 4. For the second batch of matrix calculations, compared to the first batch, accessing ZA0.S requires row and column swapping (row-column interleaving); the detailed process is omitted.
[0194] 5. The matrix calculation for the third batch is the same as that for the first batch.
[0195] 6. Continue in this manner until matrices A and B have been traversed. Write the P×3P results of ZA1.S, ZA2.S, and ZA3.S back to memory.
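The use of three accumulating matrix registers with a pre-arranged matrix B can be sketched as follows. This is an illustrative model only: the rearrangement of matrix A through ZA0.S is replaced by a direct gather (`a_vector`) to keep the example short, P = Q = 4 is an assumed size, and all function names and the exact layout chosen in `prearrange_b` are assumptions rather than details given in this application.

```python
P, Q, TILES = 4, 4, 3  # assumed sizes; TILES models ZA1.S, ZA2.S and ZA3.S

def prearrange_b(b):
    """Offline step: b is K x (TILES*P), row-major. Returns vectors[kg][t], each a
    list of P*Q values laid out as P columns of Q consecutive k values."""
    return [[[b[kg * Q + q][t * P + j] for j in range(P) for q in range(Q)]
             for t in range(TILES)]
            for kg in range(len(b) // Q)]

def a_vector(a_rows, kg):
    """Stand-in for the ZA0.S rearrangement: P rows x Q values of A for group kg."""
    return [a_rows[i][kg * Q + q] for i in range(P) for q in range(Q)]

def smopa(za, zn, zm):
    """Model of the outer product instruction: za += zn (P x Q) times zm (Q x P)."""
    for i in range(P):
        for j in range(P):
            za[i][j] += sum(zn[i * Q + q] * zm[j * Q + q] for q in range(Q))

def gemm_three_tiles(a_rows, b):
    vectors = prearrange_b(b)                       # done once, before inference
    za = [[[0] * P for _ in range(P)] for _ in range(TILES)]
    for kg in range(len(b) // Q):
        z0 = a_vector(a_rows, kg)                   # one A vector is reused ...
        for t in range(TILES):
            smopa(za[t], z0, vectors[kg][t])        # ... against three B vectors
    return za

K = 8
a_rows = [[(i * 3 + k) % 7 - 3 for k in range(K)] for i in range(P)]
b = [[(k * 5 + n * 2) % 9 - 4 for n in range(TILES * P)] for k in range(K)]
tiles = gemm_three_tiles(a_rows, b)
print(tiles[0][0])
```

Reusing each rearranged A vector for three outer products means that the cost of rearranging A is shared across P×3 output columns.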
[0196] As can be understood (referring to Figure 1), for a GEMM scenario where the input elements are of type int8 and the output elements are of type int32 (i.e., the first and second matrices are of type int8, and the third matrix is of type int32), the first and second matrices can be rearranged in memory using ZAx.S, and the remaining ZAx.S registers are used for matrix multiplication. The corresponding matrix instruction parameters are specified as Zn.B, Zm.B, and ZAx.S. For a GEMM scenario where both the input and output elements are of type fp16 (i.e., the first, second, and third matrices are all of type fp16), the first or second matrix can be rearranged in memory using Z0.H, and matrix multiplication can be performed using Z1.H. The corresponding matrix instruction parameters are specified as Zn.H, Zm.H, and ZAx.H. For a GEMM scenario where the input elements are of type fp16 and the output elements are of type fp32 (i.e., the first and second matrices are of type fp16 and the third matrix is of type fp32), after two fp16 elements are reinterpreted as one fp32 element, the first and second matrices can be rearranged in memory using ZAx.S, and the remaining ZAx.S registers are used for matrix multiplication. The corresponding matrix instruction parameters are specified as Zn.H, Zm.H, and ZAx.S. For a GEMM scenario where the input elements are of type fp16 and the output elements are of type fp64 (i.e., the first and second matrices are of type fp16 and the third matrix is of type fp64), after four fp16 elements are reinterpreted as one fp64 element, both the first and second matrices can be rearranged in memory using ZAx.D, and the remaining ZAx.D registers are used for matrix multiplication. The corresponding matrix instruction parameters are specified as Zn.H, Zm.H, and ZAx.D. For a GEMM scenario where the input elements are of type fp32 and the output elements are of type fp64 (i.e., the first and second matrices are of type fp32 and the third matrix is of type fp64), after two fp32 elements are reinterpreted as one fp64 element, both the first and second matrices can be rearranged in memory using ZAx.D, and the remaining ZAx.D registers are used for matrix multiplication. The corresponding matrix instruction parameters are specified as Zn.S, Zm.S, and ZAx.D. For a GEMM scenario where both the input and output elements are of type fp32 (i.e., the first, second, and third matrices are all of type fp32), the first and second matrices can be rearranged in memory using ZAx.S, and the remaining ZAx.S registers are used for matrix multiplication. The corresponding matrix instruction parameters are specified as Zn.S, Zm.S, and ZAx.S. For a GEMM scenario where both the input and output elements are of type fp64 (i.e., the first, second, and third matrices are all of type fp64), the first and second matrices can be rearranged in memory using ZAx.D, and the remaining ZAx.D registers are used for matrix multiplication. The corresponding matrix instruction parameters are specified as Zn.D, Zm.D, and ZAx.D.
[0197] In summary, the matrix multiplication element reading method provided in this application achieves continuous memory access, fewer instructions, and no inefficient instructions by proposing a memory rearrangement scheme for merging elements, allowing matrix registers to be used for memory rearrangement; by proposing a row-column interleaved matrix register element storage scheme, it can maximize the utilization of matrix register space and distribute memory loading pressure; by proposing to use some matrix registers corresponding to the second element type for memory rearrangement of the input matrix, and use the remaining matrix registers for matrix multiplication calculation, it can improve matrix register utilization, eliminate memory write-back of memory rearrangement, and significantly improve performance.
[0198] This application also provides an element reading device for matrix multiplication. Please refer to Figure 12, which is a schematic diagram of the structure of the element reading device for matrix multiplication provided in an embodiment of this application, including:
[0199] The acquisition module 1201 is used to determine the first and second matrices to be multiplied, wherein the first matrix is stored continuously in the row dimension.
[0200] The determination module 1202 is used to determine a first vector register with a capacity of P×Q based on the type of the first element in the first matrix and the second matrix, where P and Q are both integers greater than 1;
[0201] The determination module 1202 is also used to determine a first matrix register with a capacity of P×P based on the type of the second element in the third matrix, wherein the element type of the third matrix is determined based on matrix multiplication of the first matrix and the second matrix;
[0202] The rearrangement module 1203 is used to, when performing matrix multiplication on the first matrix and the second matrix, merge every Q consecutive first elements in the row dimension of the first matrix into a first combined element, and store the first combined elements in the first matrix unit into the first matrix register in groups. The first matrix unit is a submatrix of the first matrix that includes P rows and P columns of first combined elements. A group of first combined elements consists of P consecutive first combined elements in one direction.
[0203] The transfer module 1204 is used to retrieve the first combined element from the first matrix register in the first dimension direction, expand it into the first element, and store it into the first vector register. The first dimension corresponds to the column dimension of the first matrix.
[0204] The read module 1205 is used to read the first element from the first vector register.
[0205] In one possible implementation, the transfer module 1204 is further configured to, after retrieving a group of first combined elements from the first matrix register each time, transpose the first combined elements in the second matrix unit and store them back into the first matrix register, wherein the second matrix unit is the next submatrix of the first matrix unit in the first matrix.
[0206] In one possible implementation, the second matrix is stored contiguously along the column dimension;
[0207] The determination module 1202 is also used to determine a second matrix register with a capacity of P×P based on the type of the second element; and to determine a second vector register with a capacity of P×Q based on the type of the first element.
[0208] The rearrangement module 1203 is also used to, when matrix multiplication is performed on the first matrix and the second matrix, merge every Q consecutive first elements along the column dimension of the second matrix into a second combined element, and store the second combined elements of the third matrix unit into the second matrix register in groups. The third matrix unit is a submatrix of the second matrix that includes P rows and P columns of second combined elements.
[0209] The transfer module 1204 is also used to retrieve the second combined element from the second matrix register in the second dimension direction, expand it into the first element, and store it into the second vector register. The second dimension corresponds to the row dimension of the second matrix.
[0210] The read module 1205 is also used to read the first element from the second vector register.
[0211] In one possible implementation, the transfer module 1204 is further configured to, after retrieving a group of second combined elements from the second matrix register each time, transpose the second combined elements in the fourth matrix unit and store them in the second matrix register, where the fourth matrix unit is the next submatrix of the third matrix unit in the second matrix.
[0212] In one possible implementation, the determination module 1202 is also used to determine a third matrix register with a capacity of P×P based on the type of the second element;
[0213] The apparatus also includes a calculation module, which is used to specify the first vector register and the second vector register to perform the outer product calculation, and to write the outer product calculation result into the third matrix register.
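A minimal software sketch of this step follows. The text only states that an outer product is computed from two vector registers, so the Q-way summation inside each output element (similar in spirit to widening outer-product instructions) is an assumption of this example, as is the function name `outer_product_accumulate`.

```python
def outer_product_accumulate(zc, va, vb, p, q):
    """Assumed form: zc[i][j] += sum over t of va[i*q + t] * vb[j*q + t]."""
    for i in range(p):
        for j in range(p):
            zc[i][j] += sum(va[i * q + t] * vb[j * q + t] for t in range(q))
    return zc

P, Q = 2, 2                    # illustrative sizes only
va = [1, 2, 5, 6]              # first vector register: rows 0 and 1 of A, Q values each
vb = [1, 0, 0, 1]              # second vector register: columns 0 and 1 of B, Q values each
zc = [[0] * P for _ in range(P)]   # third matrix register (accumulator)
outer_product_accumulate(zc, va, vb, P, Q)
print(zc)  # [[1, 2], [5, 6]]
```

Here `vb` encodes a 2×2 identity block of the second matrix, so the accumulator ends up equal to the corresponding block of the first matrix.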
[0214] In one possible implementation, the determination module 1202 is also used to determine, based on the type of the second element, a second matrix register and a third matrix register, each with a capacity of P×P;
[0215] and to determine, based on the type of the first element, a second vector register and a third vector register, each with a capacity of P×Q.
[0216] The transfer module 1204 is also used to store, in sequence, two consecutive groups of first elements along the column dimension of the second matrix into the second vector register and the third vector register.
[0217] The apparatus also includes a calculation module, which is used to specify the first vector register to perform outer product calculations with the second vector register and the third vector register respectively;
[0218] and to write the outer product calculation result corresponding to the second vector register into the second matrix register, and the outer product calculation result corresponding to the third vector register into the third matrix register.
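The reuse of one first-matrix vector against two second-matrix vectors can be sketched as below. This is an illustrative model under the same assumption of a Q-way summation per output element; the names `outer_acc`, `v1`, `v2`, `v3`, `z2`, and `z3` are hypothetical and the values are made up for the example.

```python
def outer_acc(z, va, vb, p, q):
    """Assumed form: z[i][j] += sum over t of va[i*q + t] * vb[j*q + t]."""
    for i in range(p):
        for j in range(p):
            z[i][j] += sum(va[i * q + t] * vb[j * q + t] for t in range(q))

P, Q = 2, 2                    # illustrative sizes only
v1 = [1, 2, 3, 4]              # first vector register (from the first matrix)
v2 = [1, 0, 0, 1]              # second vector register (first group of B columns)
v3 = [1, 1, 1, 1]              # third vector register (next group of B columns)
z2 = [[0] * P for _ in range(P)]   # second matrix register
z3 = [[0] * P for _ in range(P)]   # third matrix register

outer_acc(z2, v1, v2, P, Q)    # v1 is loaded once and used twice
outer_acc(z3, v1, v3, P, Q)

# Side by side, the two accumulators cover a P x 2P block of the result.
c_block = [z2[i] + z3[i] for i in range(P)]
print(c_block)  # [[1, 2, 3, 3], [3, 4, 7, 7]]
```

In this arrangement only the first matrix passes through the matrix-register rearrangement path, while both matrix registers named in this implementation serve as accumulators.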
[0219] It is understood that the matrix multiplication element reading device provided in this application corresponds to the matrix multiplication element reading method in the embodiments of Figures 4 and 7 above. For relevant descriptions, please refer to the above text, and they will not be repeated here.
[0220] This application also provides a computing device. Please refer to Figure 13, which is a schematic diagram of the structure of the computing device provided in an embodiment of this application. The computing device can be used to execute computer programs or computer instructions stored in memory to perform the methods shown in the embodiments of Figure 4 or Figure 7. Refer to the relevant descriptions in the above method embodiments.
[0221] The computing device includes a processor 1101. Optionally, the computing device may also include a memory 1102 and a transceiver 1103.
[0222] This application also provides a computer program product including instructions that, when run on a computer, cause the computer to perform the method of the embodiment shown in Figure 4 or Figure 7 above.
[0223] This application also provides a computer-readable storage medium including computer instructions that, when executed on a computer, cause the computer to perform the method of the embodiment shown in Figure 4 or Figure 7 above.
[0224] This application also provides a chip device, including a processor for connection to a memory, and for calling a program stored in the memory so that the processor executes the method of the embodiment shown in Figure 4 or Figure 7 above.
[0225] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of a program implementing the methods of the embodiments shown in Figure 4 or Figure 7. The memory mentioned above can be read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or random access memory (RAM).
[0226] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above correspond to the processes in the foregoing method embodiments, and will not be repeated here.
[0227] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0228] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0229] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0230] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
Claims
1. A method for reading elements in matrix multiplication, characterized in that the method comprises:
determining a first matrix and a second matrix to be multiplied, wherein the first matrix is stored contiguously in the row dimension;
determining, based on the type of the first element in the first matrix and the second matrix, a first vector register with a capacity of P×Q, where P and Q are both integers greater than 1;
determining, based on the type of the second element in a third matrix, a first matrix register with a capacity of P×P, wherein the element type of the third matrix is determined based on matrix multiplication of the first matrix and the second matrix;
when performing matrix multiplication on the first matrix and the second matrix, merging every Q consecutive first elements in the row dimension of the first matrix into a first combined element, and storing the first combined elements in a first matrix unit into the first matrix register in groups, wherein the first matrix unit is a submatrix of the first matrix that includes P rows and P columns of the first combined elements, and a group of the first combined elements is P consecutive first combined elements in one direction;
retrieving the first combined elements from the first matrix register along a first dimension, expanding them into first elements, and storing them in the first vector register, wherein the first dimension corresponds to the column dimension of the first matrix; and
reading the first element from the first vector register.
2. The method according to claim 1, characterized in that, after the first combined elements are retrieved from the first matrix register along the first dimension, expanded into first elements, and stored in the first vector register, the method further comprises:
each time a group of the first combined elements is retrieved from the first matrix register, transposing the first combined elements in a second matrix unit and storing them in the first matrix register, wherein the second matrix unit is the next submatrix after the first matrix unit in the first matrix.
3. The method according to claim 1 or 2, characterized in that the second matrix is stored contiguously along the column dimension;
after determining the first matrix and the second matrix to be multiplied, the method further comprises:
determining, based on the type of the second element, a second matrix register with a capacity of P×P;
determining, based on the type of the first element, a second vector register with a capacity of P×Q;
when performing matrix multiplication on the first matrix and the second matrix, the method further comprises:
merging every Q consecutive first elements in the column dimension of the second matrix into a second combined element, and storing the second combined elements in a third matrix unit into the second matrix register in groups, wherein the third matrix unit is a submatrix of the second matrix that includes P rows and P columns of the second combined elements;
retrieving the second combined elements from the second matrix register along a second dimension, expanding them into first elements, and storing them in the second vector register, wherein the second dimension corresponds to the row dimension of the second matrix; and
reading the first element from the second vector register.
4. The method according to claim 3, characterized in that, after the second combined elements are retrieved from the second matrix register along the second dimension, expanded into first elements, and stored in the second vector register, the method further comprises:
each time a group of the second combined elements is retrieved from the second matrix register, transposing the second combined elements in a fourth matrix unit and storing them in the second matrix register, wherein the fourth matrix unit is the next submatrix after the third matrix unit in the second matrix.
5. The method according to claim 3 or 4, characterized in that, after determining the first matrix and the second matrix to be multiplied, the method further comprises:
determining, based on the type of the second element, a third matrix register with a capacity of P×P;
after reading the first element from the first vector register, and/or after reading the first element from the second vector register, the method further comprises:
specifying the first vector register and the second vector register to perform the outer product calculation; and
writing the result of the outer product calculation into the third matrix register.
6. The method according to claim 1 or 2, characterized in that, after determining the first matrix and the second matrix to be multiplied, the method further comprises:
determining, based on the type of the second element, a second matrix register and a third matrix register, each with a capacity of P×P;
determining, based on the type of the first element, a second vector register and a third vector register, each with a capacity of P×Q;
storing two consecutive groups of the first elements along the column dimension of the second matrix into the second vector register and the third vector register respectively;
after reading the first element from the first vector register, the method further comprises:
specifying the first vector register to perform outer product calculations with the second vector register and the third vector register respectively; and
writing the outer product calculation result corresponding to the second vector register into the second matrix register, and writing the outer product calculation result corresponding to the third vector register into the third matrix register.
7. An element reading device for matrix multiplication, characterized in that the device comprises:
an acquisition module, used to determine a first matrix and a second matrix to be multiplied, wherein the first matrix is stored contiguously in the row dimension;
a determination module, used to determine a first vector register with a capacity of P×Q based on the type of the first element in the first matrix and the second matrix, where P and Q are both integers greater than 1;
the determination module being further used to determine a first matrix register with a capacity of P×P based on the type of the second element in a third matrix, wherein the element type of the third matrix is determined based on matrix multiplication of the first matrix and the second matrix;
a rearrangement module, used to, when matrix multiplication is performed on the first matrix and the second matrix, merge every Q consecutive first elements in the row dimension of the first matrix into a first combined element, and store the first combined elements in a first matrix unit into the first matrix register in groups, wherein the first matrix unit is a submatrix of the first matrix that includes P rows and P columns of the first combined elements, and a group of the first combined elements is P consecutive first combined elements in one direction;
a transfer module, used to retrieve the first combined elements from the first matrix register along a first dimension, expand them into first elements, and store them into the first vector register, wherein the first dimension corresponds to the column dimension of the first matrix; and
a read module, used to read the first element from the first vector register.
8. A computing device, characterized in that it comprises a processor, wherein the processor is configured to execute computer programs or computer instructions stored in a memory to perform the method as described in any one of claims 1 to 6.
9. The computing device according to claim 8, characterized in that it further comprises the memory.
10. A computer program product containing instructions, characterized in that, when the instructions are executed by a computing device, the computing device performs the method as described in any one of claims 1 to 6.
11. A computer-readable storage medium, characterized in that it includes computer program instructions which, when executed by a computing device, cause the computing device to perform the method as described in any one of claims 1 to 6.