A heterogeneous processor-oriented batch matrix multiplication optimization implementation method and system
By optimizing matrix partitioning and implementing a double-buffering mechanism on heterogeneous processors, the method addresses insufficient hardware resource utilization and realizes an efficient batch matrix multiplication algorithm. This improves computation speed and resource utilization, notably for deep learning workloads.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2023-08-14
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies for batch matrix multiplication on heterogeneous processors do not fully utilize hardware resources, resulting in inefficient performance, especially in convolutional neural network computations in the field of deep learning.
By performing decision parameter calculations on the CPU side, optimizing matrix multiplication operations in blocks, and employing a double buffering mechanism and DMA data transfer overlap technology on the DSP side, an efficient batch matrix multiplication algorithm is achieved. This includes the application of the principles of spatial locality and temporal locality to optimize matrix block division and batch calculation.
It improves the computing speed and resource utilization of heterogeneous processors, reduces memory access latency, and enhances computing efficiency in fields such as deep learning.
Figure CN117150194B_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of computer technology, specifically relating to a batch processing matrix multiplication optimization implementation method and system for heterogeneous processors.
Background Technology
[0002] The Basic Linear Algebra Subprograms (BLAS) is a numerical library interface standard for basic linear algebra operations. Its most central interface is General Matrix Multiplication (GEMM), which has wide applications in deep learning, signal processing, physics, computational chemistry, and other fields.
[0003] In recent decades, academia and industry have made extensive optimizations to the GEMM implementation. With the development of chip technology and the continuous improvement of hardware computing power, how to fully utilize hardware resources and improve the parallelism of problem solving has become a research hotspot in recent years. The current trend is to decompose large-scale tasks into multiple smaller subtasks, with the processor solving these subtasks in parallel. However, if the computational workload of the subtasks is too small, hardware computing resources are not fully utilized, thus failing to fully leverage processor performance and achieve the desired acceleration. Therefore, batch processing can be introduced to solve multiple subtasks at once, improving resource utilization.
[0004] To meet the widespread demand for batch processing in various fields, experts have extended the original BLAS interface and added a batch processing interface standard (Batched BLAS). The corresponding batch matrix multiplication (Batched GEMM, BGEMM) has also become the most widely used interface.
[0005] Taking deep learning as an example, in a Convolutional Neural Network (CNN) architecture, 90% of the model's computation time is consumed in fully connected layers and convolutional layers, and this computation process can be transformed into BGEMM computation. Therefore, optimizing the implementation of BGEMM is of great significance for accelerating AI training and inference.
[0006] Each hardware manufacturer has made specific optimizations for its chip architecture and provided dedicated computing libraries. Currently, well-known computing libraries have implemented BGEMM interfaces optimized for specific instruction set architectures. For example, there is ARMPL (Arm Performance Libraries) for ARM architecture, oneMKL (Intel oneAPI Math Kernel Library) for x86 architecture, cuDNN (NVIDIA CUDA Deep Neural Network library) and MAGMA (Matrix Algebra on GPU and Multi-core Architectures) for NVIDIA GPU architecture.
[0007] The above optimizations all target the instruction sets of specific chip architectures. They cannot be directly reused on a chip that uses a new instruction set architecture, and simply porting them leads to a decrease in performance.
Summary of the Invention
[0008] The purpose of this application is to provide a batch processing matrix multiplication optimization method and system for heterogeneous processors, which enables collaborative work between the host and device ends; and implements an efficient batch processing matrix multiplication algorithm on the device end, thereby solving at least one of the technical problems involved in the background art.
[0009] To solve the above-mentioned technical problems, this application provides the following technical solution:
[0010] An optimized implementation method for batch matrix multiplication on heterogeneous processors includes the following steps:
[0011] Step S1: The CPU allocates space for the matrix in shared DDR memory using the hthread_malloc function;
[0012] Step S2: The CPU calculates the decision parameters through a decision algorithm. The decision parameters include the matrix block size parameter and the m_batch size parameter.
[0013] Step S3: Start the DSP function based on the decision parameters. This DSP function uses the following formula for batch matrix multiplication:
[0014] $C_i = \alpha_i \cdot A_i \times B_i + \beta_i \cdot C_i$
[0015] Where $i = 1, 2, 3, \ldots, N$ is the index of each matrix multiplication; $A_i$ and $B_i$ are the input matrices, $C_i$ is the output matrix; and $\alpha_i$ and $\beta_i$ are scalars representing the coefficients of the operation.
[0016] Optionally, in step S2, the CPU calculates decision parameters through a decision algorithm, including matrix block size parameter calculation and m_batch size parameter calculation.
[0017] Optionally, the calculation of the matrix block size parameter includes:
[0018] Utilizing the principles of spatial and temporal locality, and based on the actual storage space of the DSP, the multiplication of two matrices is refined into a set of panel-panel multiplication operations and further refined into a set of block-panel multiplication operations, so that the data that the DSP computing unit needs for each calculation is stored in the scalar storage space and the vector storage space.
[0019] By satisfying the following formula constraints on the parameters of the sub-blocks of the matrix and solving for the maximum computation-to-memory access ratio, the matrix block size parameters are determined:
[0020] $M_{SM} \cdot K_{CG} \cdot 2 \cdot w \le \mathrm{size}(SM)$
[0021] $(M_{CG} + K_{CG}) \cdot N_{AM} \cdot 2 \cdot w \le \mathrm{size}(AM)$
[0022] In the formulas, $w$ is the size of the data type in bytes; $M_{SM} \times K_{CG}$ is the size of a sub-block of matrix A; $K_{CG} \times N_{AM}$ is the size of a sub-block of matrix B; $M_{SM} \times N_{AM}$ is the size of a sub-block of matrix C; SM is the scalar storage space and AM is the vector storage space.
[0023] Optionally, the calculation of the m_batch size parameter includes:
[0024] The index of the matrix that each core computes is assigned based on the core number, and the parameters satisfy the following constraint:
[0025] $m\_batch \cdot M_{CG} \cdot K_{CG} \cdot w \le \mathrm{size}(GSM)$
[0026] In the formula, GSM is the global shared memory within an acceleration cluster.
[0027] Optionally, in step S3, the step of starting the DSP function based on decision parameters includes:
[0028] Step S31: Prepare the calculation data. The DMA transfers the data of the first batch of m_batch sub-blocks of matrix A from DDR to GSM. Each core moves the data of the corresponding m_batch sub-blocks of matrix B and matrix C from DDR to the first buffer of AM.
[0029] Step S32: Further divide the sub-blocks of matrix A in the GSM into blocks, and the DMA moves the data of the sub-blocks of matrix A from the on-chip GSM to the first buffer of the SM.
[0030] Step S33: The DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly kernel for calculation; at the same time, the DMA moves the data of the sub-block of matrix A from the on-chip GSM to the second buffer of SM, and moves the data of the sub-blocks of matrix B and matrix C from DDR to the second buffer of AM.
[0031] Step S34: After the calculation in step S33 is completed, the DMA moves the calculation results in the first buffer of AM to DDR, and at the same time the DSP computing unit reads the data stored in the second buffers of AM and SM for calculation.
[0032] Step S35: The DSP computing unit continues to read the data stored in the second buffers of AM and SM for calculation; at the same time, the DMA moves the data of the sub-block of matrix A from the on-chip GSM to the first buffer of SM, and moves the data of the sub-blocks of matrix B and matrix C from DDR to the first buffer of AM.
[0033] Step S36: Repeat steps S31-S35 until the GEMM calculation of the m_batch matrices is completed.
[0034] Step S37: Take the next m_batch matrices for calculation, and continue until all N matrices have been calculated.
[0035] Optionally, the heterogeneous processor is the MT7032 heterogeneous many-core microprocessor.
[0036] This application also provides a batch processing matrix multiplication optimization implementation system for heterogeneous processors for the method described, the system comprising:
[0037] The space allocation module is used by the CPU to allocate space for the matrix on shared DDR memory using the hthread_malloc function;
[0038] The decision parameter calculation module is used by the CPU to calculate decision parameters through a decision algorithm.
[0039] The core computing module is used to launch DSP functions based on decision parameters to achieve batch matrix multiplication optimization.
[0040] The beneficial effects of this application are as follows:
[0041] 1. A high-efficiency batch matrix multiplication (BGEMM) algorithm has been implemented on DSP, which can effectively accelerate applications in multiple fields, including deep learning;
[0042] 2. By refining the multiplication of two matrices into a set of panel-panel multiplication operations and further into a set of block-panel multiplication operations, the data that the DSP computing unit needs for each calculation can be stored in SM and AM, thereby reducing the memory access latency of the DSP computing unit and improving the calculation speed.
[0043] 3. A dual-buffer mechanism is employed during DSP calculation, dividing SM, AM, and GSM into two parts: one for computation and the other for data transfer, thus masking memory access time with computation time. Two buffers are set up for sub-blocks of matrix A in the scalar storage space (SM), and two buffers are set up for sub-blocks of matrices B and C in the vector storage space (AM). This DMA dual-buffering approach overlaps core computation with DMA data transfer, hiding memory access time and improving the computational efficiency of GEMM.
Attached Figure Description
[0044] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort, wherein:
[0045] Figure 1 is a DSP memory hierarchy diagram provided in the embodiments of this application;
[0046] Figure 2 is a matrix block computation diagram provided in the embodiments of this application;
[0047] Figure 3 is a structural block diagram of a batch processing matrix multiplication optimization implementation system for heterogeneous processors provided in an embodiment of this application.
Detailed Implementation
[0048] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0049] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such use of data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class and the number of objects is not limited; for example, a first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0050] The following description, in conjunction with the accompanying drawings, details the batch processing matrix multiplication optimization implementation method for heterogeneous processors provided in this application through specific embodiments and application scenarios.
[0051] This application provides a batch processing matrix multiplication optimization method for heterogeneous processors. It should be noted that the heterogeneous processor is the MT7032 heterogeneous many-core microprocessor, the verification chip for the prototype of the Tianhe next-generation supercomputer, which was independently developed and designed in China and offers high computing performance. The MT7032 consists of a 16-core ARMv8 CPU and four independent DSP acceleration clusters. Each DSP cluster consists of eight DSP cores, which share 6MB of on-chip Global Shared Memory (GSM) and 32GB of off-chip DDR space. Each DSP core mainly includes an Instruction Fetch Unit (IFU), a Scalar Processing Unit (SPU), 64KB of Scalar Memory (SM), a Vector Processing Unit (VPU), 768KB of Array Memory (AM), and a Direct Memory Access (DMA) unit. Each VPU contains 16 Vector Processing Elements (VPEs), and each VPE can execute three floating-point multiply-accumulate instructions simultaneously.
[0052] The method includes the following steps:
[0053] Step S1: The CPU allocates space for the matrix in shared DDR memory using the hthread_malloc function;
[0054] Step S2: The CPU calculates the decision parameters through a decision algorithm. The decision parameters include the matrix block size parameter and the m_batch size parameter.
[0055] Step S3: Start the DSP function based on the decision parameters. This DSP function uses the following formula for batch matrix multiplication:
[0056] $C_i = \alpha_i \cdot A_i \times B_i + \beta_i \cdot C_i$
[0057] Where $i = 1, 2, 3, \ldots, N$ is the index of each matrix multiplication; $A_i$ and $B_i$ are the input matrices, $C_i$ is the output matrix; and $\alpha_i$ and $\beta_i$ are scalars representing the coefficients of the operation.
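The batch formula above can be illustrated with a minimal reference sketch in plain Python. It is not the DSP implementation described in this application; the function names `gemm` and `bgemm` and the list-of-lists matrix layout are assumptions made only for illustration. Each of the N multiplications is independent, which is what allows them to be distributed across cores.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha * (A x B) + beta * C for row-major lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    return [
        [alpha * sum(A[r][p] * B[p][c] for p in range(k)) + beta * C[r][c]
         for c in range(n)]
        for r in range(m)
    ]


def bgemm(alphas, As, Bs, betas, Cs):
    """Apply C_i = alpha_i * A_i x B_i + beta_i * C_i independently for each i."""
    return [gemm(a, A, B, b, C)
            for a, A, B, b, C in zip(alphas, As, Bs, betas, Cs)]
```

A call such as `bgemm([1, 2], As, Bs, [0, 1], Cs)` returns one updated output matrix per entry in the batch.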
[0058] In step S2, the CPU calculates decision parameters through a decision algorithm, including matrix block size parameter calculation and m_batch size parameter calculation.
[0059] The calculation of the matrix block size parameter includes:
[0060] In the BGEMM calculation process, the storage resources required by the DSP can be divided into a four-layer pyramid structure from top to bottom, as shown in Figure 1. Within this hierarchy, the DSP accesses AM and SM quickly but accesses DDR memory slowly. This application therefore divides the matrices into blocks: using the principles of spatial and temporal locality, and based on the actual storage capacity of the DSP, it refines the matrix multiplication operation into a set of panel-panel multiplication operations and further refines each of these into a set of block-panel multiplication operations. Specifically, as shown in Figure 2, the data that the DSP computing unit needs for each calculation is stored in SM and AM, thereby reducing the memory access latency of the DSP computing unit and improving the computing speed.
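The two-level refinement described above can be sketched as a blocked multiplication in plain Python. This is only an illustration of the loop structure: the names `blocked_matmul`, `m_blk`, and `k_blk` are hypothetical, the N dimension is left unsplit for brevity, and nothing here models the actual SM, AM, or GSM placement.

```python
def naive_matmul(A, B):
    """Reference product used to check the blocked version."""
    K, N = len(B), len(B[0])
    return [[sum(row[p] * B[p][c] for p in range(K)) for c in range(N)]
            for row in A]


def blocked_matmul(A, B, m_blk, k_blk):
    """Compute A x B as panel-panel updates, each split into block-panel updates."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    # Panel-panel: each K-slice pairs a column panel of A with a row panel of B.
    for k0 in range(0, K, k_blk):
        k1 = min(k0 + k_blk, K)
        # Block-panel: each row block of the A panel multiplies the same B panel.
        for m0 in range(0, M, m_blk):
            m1 = min(m0 + m_blk, M)
            for r in range(m0, m1):
                for p in range(k0, k1):
                    a = A[r][p]
                    for c in range(N):
                        C[r][c] += a * B[p][c]
    return C
```

Because the B panel is reused by every row block of the A panel, a small block of A and one panel of B are the only operands that need to be resident in fast memory at a time.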
[0061] Considering the capacity limitations of vector memory, scalar memory, and on-chip global shared memory, for a data type of $w$ bytes, the parameters of the sub-block of matrix A ($M_{SM} \times K_{CG}$), the sub-block of matrix B ($K_{CG} \times N_{AM}$), and the sub-block of matrix C ($M_{SM} \times N_{AM}$) need to meet the following constraints:
[0062] $M_{SM} \cdot K_{CG} \cdot 2 \cdot w \le \mathrm{size}(SM)$
[0063] $(M_{CG} + K_{CG}) \cdot N_{AM} \cdot 2 \cdot w \le \mathrm{size}(AM)$
[0064] $m\_batch \cdot M_{CG} \cdot K_{CG} \cdot w \le \mathrm{size}(GSM)$
[0065] Under the above constraints, the parameter values that maximize the computation-to-memory-access ratio are then solved for, and these values are taken as the decision parameters.
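A brute-force version of this decision step might look like the following sketch. The three capacity checks follow the constraints above, with the SM, AM, and GSM sizes taken from the MT7032 description. Everything else is an assumption made for illustration: the application does not give the exact objective, so `ratio` uses a simple flops-per-byte estimate for one block-panel update; the candidate grid, the condition `m_cg >= m_sm`, and the 4-byte data type in the usage note are likewise hypothetical.

```python
from itertools import product

SM_BYTES = 64 * 1024          # scalar memory per DSP core
AM_BYTES = 768 * 1024         # array (vector) memory per DSP core
GSM_BYTES = 6 * 1024 * 1024   # global shared memory per DSP cluster


def fits(m_sm, m_cg, k_cg, n_am, m_batch, w):
    """Check the three capacity constraints for element size w in bytes."""
    return (m_sm * k_cg * 2 * w <= SM_BYTES
            and (m_cg + k_cg) * n_am * 2 * w <= AM_BYTES
            and m_batch * m_cg * k_cg * w <= GSM_BYTES)


def ratio(m_sm, k_cg, n_am, w):
    """Assumed objective: flops per byte moved for one block-panel update."""
    flops = 2 * m_sm * k_cg * n_am
    bytes_moved = w * (m_sm * k_cg + k_cg * n_am + 2 * m_sm * n_am)
    return flops / bytes_moved


def choose_parameters(m_batch, w, candidates):
    """Return the feasible (m_sm, m_cg, k_cg, n_am) with the highest ratio."""
    feasible = [
        (m_sm, m_cg, k_cg, n_am)
        for m_sm, m_cg, k_cg, n_am in product(candidates, repeat=4)
        if m_cg >= m_sm and fits(m_sm, m_cg, k_cg, n_am, m_batch, w)
    ]
    if not feasible:
        raise ValueError("no candidate satisfies the capacity constraints")
    return max(feasible, key=lambda p: ratio(p[0], p[2], p[3], w))
```

For example, `choose_parameters(2, 4, [16, 32, 64, 128, 256])` searches power-of-two block sizes for two matrices per batch and 4-byte elements. A real decision algorithm would likely also account for kernel register blocking and DMA granularity, which this sketch ignores.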
[0066] Consider a single computation of m_batch matrices on a DSP cluster, which has 8 cores. If m_batch = 2, cores 0-3 compute the first matrix and cores 4-7 compute the second matrix; that is, matrix indices are assigned based on the core number. The computation performed on each matrix is a GEMM operation, and the computations of multiple cores together constitute a BGEMM operation.
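The core-to-matrix assignment in this example can be written as a one-line mapping. The sketch below assumes that m_batch divides the number of cores evenly, which holds for the m_batch = 2 case above but is not stated as a general rule in this application; the function name `matrix_for_core` is hypothetical.

```python
def matrix_for_core(core_id, m_batch, num_cores=8):
    """Map a core number to the index of the matrix it works on within a batch."""
    if not 0 <= core_id < num_cores:
        raise ValueError("core_id out of range")
    if num_cores % m_batch != 0:
        # Assumption of this sketch: cores are split into equal groups.
        raise ValueError("m_batch must divide num_cores evenly")
    cores_per_matrix = num_cores // m_batch
    return core_id // cores_per_matrix
```

With `m_batch=2`, cores 0-3 map to matrix 0 and cores 4-7 map to matrix 1, matching the example in the text.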
[0067] In step S3, the step of starting the DSP function based on decision parameters includes:
[0068] Step S31: Prepare the calculation data. The DMA transfers the data of the first batch of m_batch sub-blocks of matrix A from DDR to GSM. Each core moves the data of the corresponding m_batch sub-blocks of matrix B and matrix C from DDR to the first buffer of AM.
[0069] Step S32: Further divide the sub-blocks of matrix A in the GSM into blocks, and the DMA moves the data of the sub-blocks of matrix A from the on-chip GSM to the first buffer of the SM.
[0070] Step S33: The DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly kernel for calculation; at the same time, the DMA moves the data of the sub-block of matrix A from the on-chip GSM to the second buffer of SM, and moves the data of the sub-blocks of matrix B and matrix C from DDR to the second buffer of AM.
[0071] Step S34: After the calculation in step S33 is completed, the DMA moves the calculation results in the first buffer of AM to DDR, and at the same time the DSP computing unit reads the data stored in the second buffers of AM and SM for calculation.
[0072] Step S35: The DSP computing unit continues to read the data stored in the second buffers of AM and SM for calculation; at the same time, the DMA moves the data of the sub-block of matrix A from the on-chip GSM to the first buffer of SM, and moves the data of the sub-blocks of matrix B and matrix C from DDR to the first buffer of AM.
[0073] Step S36: Repeat steps S31-S35 until the GEMM calculation of the m_batch matrices is completed.
[0074] Step S37: Take the next m_batch matrices for calculation, and continue until all N matrices have been calculated.
[0075] It should be noted that the DSP function employs a double-buffer mechanism during computation, dividing the Scalar Memory (SM), AM, and GSM into two parts: one for computation and the other for data transfer, thus masking memory access time with computation time. Two buffers are set up for sub-blocks of matrix A in the Scalar Memory (SM), and two buffers are set up for sub-blocks of matrices B and C in the Vector Memory (AM). This DMA double-buffering approach overlaps core computation with DMA data transfer, hiding memory access time and improving the computational efficiency of GEMM.
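The ping-pong ordering of the double-buffer mechanism can be sketched as follows. This is a sequential Python illustration only: `load` stands in for a DMA transfer and `compute` for the assembly kernel, both hypothetical callables, and the write-back of results to DDR is omitted. On the DSP the prefetch would run asynchronously alongside the computation; here the overlap is expressed only by the order of events in the trace. `pipeline_time` is an idealized timing model, not a measurement.

```python
def double_buffered_run(tiles, load, compute):
    """Process tiles with two buffers, prefetching tile t+1 before computing tile t."""
    results, trace = [], []
    if not tiles:
        return results, trace
    buffers = [None, None]
    buffers[0] = load(tiles[0])              # initial fill of the first buffer
    trace.append(("load", 0, 0))
    for t in range(len(tiles)):
        cur, nxt = t % 2, (t + 1) % 2
        if t + 1 < len(tiles):
            # On hardware this would be an asynchronous DMA into the idle buffer.
            buffers[nxt] = load(tiles[t + 1])
            trace.append(("load", t + 1, nxt))
        results.append(compute(buffers[cur]))
        trace.append(("compute", t, cur))
    return results, trace


def pipeline_time(n_tiles, t_load, t_compute):
    """Idealized time without and with overlap of transfer and computation."""
    serial = n_tiles * (t_load + t_compute)
    overlapped = t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute
    return serial, overlapped
```

In the timing model, only the first load and the last computation are exposed; every other step costs the larger of the transfer time and the computation time, which is why the transfer is hidden whenever computation takes at least as long.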
[0076] As shown in Figure 3, this application also provides a batch processing matrix multiplication optimization implementation system for heterogeneous processors, which implements the method described above. The system includes: a space allocation module 1, a decision parameter calculation module 2, and a core calculation module 3.
[0077] The space allocation module 1 is used by the CPU to allocate space for the matrix on the shared DDR memory through the hthread_malloc function;
[0078] The decision parameter calculation module 2 is used by the CPU to calculate decision parameters through a decision algorithm;
[0079] The core computing module 3 is used to start the DSP function based on decision parameters to achieve batch processing matrix multiplication optimization.
[0080] The beneficial effects of this application are as follows:
[0081] 1. A high-efficiency batch matrix multiplication (BGEMM) algorithm has been implemented on DSP, which can effectively accelerate applications in multiple fields, including deep learning;
[0082] 2. By refining the multiplication of two matrices into a set of panel-panel multiplication operations and further refining them into a set of block-panel multiplication operations, the data that the DSP computing unit needs for each calculation can be stored in SM and AM, thereby reducing the memory access latency of the DSP computing unit and improving the calculation speed.
[0083] 3. A dual-buffer mechanism is employed during DSP computation, dividing SM, AM, and GSM into two parts: one for computation and the other for data transfer, thus masking memory access time with computation time. Two buffers are set up for sub-blocks of matrix A in the scalar storage space (SM), and two buffers are set up for sub-blocks of matrices B and C in the vector storage space (AM). This DMA dual-buffering approach overlaps core computation with DMA data transfer, hiding memory access time and improving the computational efficiency of GEMM.
[0084] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.
Claims
1. A batch processing matrix multiplication optimization implementation method for heterogeneous processors, characterized in that it includes the following steps:
Step S1: The CPU allocates space for the matrix in shared DDR memory using the hthread_malloc function;
Step S2: The CPU calculates decision parameters using a decision algorithm, the decision parameters including a matrix block size parameter and an m_batch size parameter;
the calculation of the matrix block size parameter includes: utilizing the principles of spatial and temporal locality, and based on the actual storage space of the DSP, the multiplication of two matrices is refined into a set of panel-panel multiplication operations and further refined into a set of block-panel multiplication operations, so that the data that the DSP computing unit needs for each calculation is stored in the scalar storage space and the vector storage space; by requiring the parameters of the sub-blocks of the matrix to satisfy the following constraints and solving for the maximum computation-to-memory-access ratio, the matrix block size parameters are determined:
$M_{SM} \cdot K_{CG} \cdot 2 \cdot w \le \mathrm{size}(SM)$
$(M_{CG} + K_{CG}) \cdot N_{AM} \cdot 2 \cdot w \le \mathrm{size}(AM)$
In the formulas, $w$ is the size of the data type in bytes; $M_{SM} \times K_{CG}$ is the size of a sub-block of matrix A; $K_{CG} \times N_{AM}$ is the size of a sub-block of matrix B; SM is the scalar storage space and AM is the vector storage space;
the calculation of the m_batch size parameter includes: the index of the matrix that each core computes is assigned based on the core number, and the parameters satisfy the following constraint:
$m\_batch \cdot M_{CG} \cdot K_{CG} \cdot w \le \mathrm{size}(GSM)$
In the formula, GSM is the global shared memory within an acceleration cluster;
Step S3: Start the DSP function based on the decision parameters. This DSP function uses the following formula for batch matrix multiplication:
$C_i = \alpha_i \cdot A_i \times B_i + \beta_i \cdot C_i$
where $i$ is the index of each matrix multiplication; $A_i$ and $B_i$ are the input matrices, $C_i$ is the output matrix; and $\alpha_i$ and $\beta_i$ are scalars representing the coefficients of the operation;
starting the DSP function based on the decision parameters includes:
Step S31: Prepare the calculation data. The DMA transfers the data of the first batch of m_batch sub-blocks of matrix A from DDR to GSM. Each core moves the data of the corresponding m_batch sub-blocks of matrix B and matrix C from DDR to the first buffer of AM.
Step S32: Further divide the sub-blocks of matrix A in the GSM into blocks, and the DMA moves the data of the sub-blocks of matrix A from the on-chip GSM to the first buffer of the SM.
Step S33: The DSP computing unit reads the data stored in the first buffers of AM and SM and calls the assembly kernel for calculation; at the same time, the DMA moves the data of the sub-block of matrix A from the on-chip GSM to the second buffer of SM, and moves the data of the sub-blocks of matrix B and matrix C from DDR to the second buffer of AM.
Step S34: After the calculation in step S33 is completed, the DMA moves the calculation results in the first buffer of AM to DDR, and at the same time the DSP computing unit reads the data stored in the second buffers of AM and SM for calculation.
Step S35: The DSP computing unit continues to read the data stored in the second buffers of AM and SM for calculation; at the same time, the DMA moves the data of the sub-block of matrix A from the on-chip GSM to the first buffer of SM, and moves the data of the sub-blocks of matrix B and matrix C from DDR to the first buffer of AM.
Step S36: Repeat steps S31-S35 until the GEMM calculation of the m_batch matrices is completed.
Step S37: Take the next m_batch matrices for calculation, and continue until all N matrices have been calculated.
2. The method according to claim 1, characterized in that, The heterogeneous processor is the MT7032 heterogeneous many-core microprocessor.
3. A batch processing matrix multiplication optimization implementation system for heterogeneous processors for running the method according to any one of claims 1-2, characterized in that, The system includes: The space allocation module is used by the CPU to allocate space for the matrix on shared DDR memory using the hthread_malloc function; The decision parameter calculation module is used by the CPU to calculate decision parameters through a decision algorithm. The core computing module is used to launch DSP functions based on decision parameters to achieve batch matrix multiplication optimization.