A matrix multiplication processing method and device for a digital signal processor and a medium
By transferring the matrix in blocks to different memories and performing matrix multiplication operations, the bandwidth pressure problem caused by matrix multiplication algorithms in digital signal processors is solved, thereby reducing memory access and improving operational efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-05-29
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, the matrix multiplication algorithm of digital signal processors results in high bandwidth pressure and high memory access volume for double-rate synchronous dynamic random access memory, which has not been effectively reduced.
By transferring the matrix in blocks from the double-rate synchronous dynamic random access memory to the array memory and the global shared memory, and storing the matrix in the corresponding array memory and scalar memory when a matrix multiplication operation instruction is detected, the matrix multiplication operation is performed, and data transmission and computation are carried out in a dual-pipeline manner.
It reduces the amount of memory access required by the digital signal processor, reduces the bandwidth pressure on the double-rate synchronous dynamic random access memory, and improves the efficiency of matrix multiplication operations.
Figure CN116701833B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of vector processors, and in particular to a matrix multiplication processing method, apparatus and medium for digital signal processors.
Background Technology
[0002] To approach peak performance in matrix multiplication, computation must hide the cost of data movement. Digital Signal Processors (DSPs) move data by Direct Memory Access (DMA). DSPs use a multi-core architecture with Global Shared Memory (GSM), and each core has its own scalar and vector caches, so the memory hierarchy is more complex than that of a Central Processing Unit (CPU). Matrix multiplication algorithms with optimal memory access have been proposed for CPUs, but algorithms with lower memory access requirements have not yet been proposed for DSPs. When the memory access volume is large, Double Data Rate Synchronous Dynamic Random Access Memory (DDR) comes under significant bandwidth pressure.
[0003] Therefore, providing a matrix multiplication processing method for digital signal processors that reduces memory access, and thus the bandwidth pressure on the double data rate synchronous dynamic random access memory, is a technical problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0004] The purpose of this application is to provide a matrix multiplication processing method, apparatus, and medium for digital signal processors, which reduces memory access and thus reduces the bandwidth pressure of double-rate synchronous dynamic random access memory.
[0005] To address the aforementioned technical problems, this application provides a matrix multiplication processing method for digital signal processors, comprising:
[0006] Obtain the first, second, and third matrices to be multiplied;
[0007] The first matrix is transferred from the double-rate synchronous dynamic random access memory to the array memory;
[0008] The second and third matrices are transferred from the double-rate synchronous dynamic random access memory to the global shared memory;
[0009] Upon detecting an instruction to perform the matrix multiplication operation, the second matrix is stored from the global shared memory to the array memory, and the third matrix is stored from the global shared memory to the scalar memory;
[0010] The matrix multiplication operation is performed on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory.
[0011] Preferably, before performing the matrix multiplication operation on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory, the method further includes:
[0012] The first matrix, the second matrix, and the third matrix are split into blocks, and the block parameters of each block are obtained; wherein, the block parameters include at least the number of rows and columns of the block.
[0013] When the block parameters meet the preset requirements, the matrix multiplication operation is performed on each of the matrix blocks corresponding to the first matrix stored in the array memory, each of the matrix blocks corresponding to the second matrix, and each of the matrix blocks corresponding to the third matrix stored in the scalar memory.
[0014] Preferably, the preset requirement is determined based on at least one of the following:
[0015] The storage capacity of the global shared memory, the size of the thread grid consisting of all the cores of the digital signal processor, the storage capacity of the array memory, and the storage capacity of the scalar memory.
[0016] Preferably, before transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory, the method further includes:
[0017] Obtain the size of the global shared memory;
[0018] Based on the size of the global shared memory, the first matrix is split into first matrix blocks, the second matrix is split into second matrix blocks, and the third matrix is split into third matrix blocks; wherein, the number of rows in the first matrix block is the same as the number of rows in the second matrix block, the number of columns in the first matrix block is the same as the number of columns in the third matrix block, and the number of columns in the second matrix block is the same as the number of rows in the third matrix block;
[0019] Correspondingly, the step of transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory includes:
[0020] Each of the first matrix blocks is transferred from the double-rate synchronous dynamic random access memory to the array memory;
[0021] The step of transferring the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory includes:
[0022] Each of the second matrix blocks and each of the third matrix blocks are transferred from the double-rate synchronous dynamic random access memory to the global shared memory.
[0023] Preferably, after splitting the first matrix into a first matrix block, the second matrix into a second matrix block, and the third matrix into a third matrix block according to the size of the global shared memory, and before transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory, the method further includes:
[0024] Obtain the number of cores of the digital signal processor and the size of the thread grid composed of all the cores;
[0025] Based on the size of the thread grid, the first matrix block is split into a fourth matrix block, the second matrix block is split into a fifth matrix block, and the third matrix block is split into a sixth matrix block;
[0026] Correspondingly, the step of transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory includes:
[0027] Each of the fourth matrix blocks is transferred from the double-rate synchronous dynamic random access memory to the array memory;
[0028] The step of transferring the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory includes:
[0029] Each of the fifth matrix blocks and each of the sixth matrix blocks are transferred from the double-rate synchronous dynamic random access memory to the global shared memory.
[0030] Preferably, after transferring each of the fifth matrix blocks and each of the sixth matrix blocks from the double-rate synchronous dynamic random access memory to the global shared memory, and before storing the second matrix from the global shared memory to the array memory and the third matrix from the global shared memory to the scalar memory, the method further includes:
[0031] Obtain the storage capacity of the array memory and the storage capacity of the scalar memory;
[0032] Based on the storage capacity of the array memory and the storage capacity of the scalar memory, the fourth matrix block is split into the seventh matrix block, the fifth matrix block is split into the eighth matrix block, and the sixth matrix block is split into the ninth matrix block;
[0033] Correspondingly, storing the second matrix from the global shared memory to the array memory and storing the third matrix from the global shared memory to the scalar memory includes:
[0034] The eighth matrix block is stored from the global shared memory to the array memory, and the ninth matrix block is stored from the global shared memory to the scalar memory;
[0035] The matrix multiplication operation performed on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory includes:
[0036] The matrix multiplication operation is performed on the seventh matrix block, the eighth matrix block stored in the array memory, and the ninth matrix block stored in the scalar memory.
[0037] Preferably, the matrix multiplication operation and the transfer of the first matrix, the second matrix, and the third matrix from the double-rate synchronous dynamic random access memory are performed in a dual-pipeline manner.
[0038] To address the aforementioned technical problems, this application also provides a matrix multiplication processing apparatus for a digital signal processor, comprising:
[0039] The acquisition module is used to acquire the first, second, and third matrices to be multiplied.
[0040] The first transmission module is used to transmit the first matrix from the double-rate synchronous dynamic random access memory to the array memory.
[0041] The second transmission module is used to transmit the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory;
[0042] A storage module is configured to, upon detecting an instruction to perform the matrix multiplication operation, store the second matrix from the global shared memory to the array memory and the third matrix from the global shared memory to the scalar memory;
[0043] A matrix multiplication module is used to perform the matrix multiplication operation on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory.
[0044] To address the aforementioned technical problems, this application also provides a matrix multiplication processing apparatus for a digital signal processor, comprising:
[0045] Memory, used to store computer programs;
[0046] A processor is used to implement the steps of the matrix multiplication processing method for a digital signal processor described above when executing the computer program.
[0047] To address the aforementioned technical problems, this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the matrix multiplication processing method for digital signal processors described above.
[0048] The matrix multiplication processing method for digital signal processors provided in this application includes: acquiring a first matrix, a second matrix, and a third matrix on which a matrix multiplication operation is to be performed; transferring the first matrix from the Double Data Rate Synchronous Dynamic Random Access Memory (DDR) to an array memory; transferring the second and third matrices from the DDR to the global shared memory (GSM); upon detecting an instruction to perform the matrix multiplication operation, storing the second matrix from the GSM to the array memory and the third matrix from the GSM to a scalar memory; and performing the matrix multiplication operation on the first and second matrices stored in the array memory and the third matrix stored in the scalar memory. In this method, before the matrix multiplication operation is performed, the first matrix is stored in the array memory, and the second and third matrices are stored in the GSM. This ensures that during the matrix multiplication operation the first matrix is read only once, while the second and third matrices are shared by multiple cores, which lowers the memory access volume of the digital signal processor and thus reduces the bandwidth pressure on the DDR.
[0049] In addition, this application also provides a matrix multiplication processing apparatus for a digital signal processor and a computer-readable storage medium, which have the same or corresponding technical features as the matrix multiplication processing method for a digital signal processor mentioned above, and have the same effect.
Attached Figure Description
[0050] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0051] Figure 1 is a schematic diagram of an ideal cache model provided in an embodiment of this application;
[0052] Figure 2 is a flowchart of a matrix multiplication processing method for digital signal processors provided in an embodiment of this application;
[0053] Figure 3 is a structural diagram of a matrix multiplication processing apparatus for a digital signal processor provided in an embodiment of this application;
[0054] Figure 4 is a structural diagram of a matrix multiplication processing apparatus for a digital signal processor provided in another embodiment of this application.
Detailed Implementation
[0055] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.
[0056] The core of this application is to provide a matrix multiplication processing method, apparatus, and medium for digital signal processors, which reduces memory access and thus reduces the bandwidth pressure of double-rate synchronous dynamic random access memory.
[0057] In theoretical computer science, I/O complexity is used to quantify the amount of memory access an algorithm performs. Figure 1 is a schematic diagram of an ideal cache model provided in an embodiment of this application. Assume a processor with only two memory levels: a fast cache 1 (Cache) of size S and an infinitely large slow storage disk 2 (DISK). Data must be loaded from the DISK into the Cache before the CPU 3 can process it. Under the ideal cache model, the minimum amount of data movement is defined as the algorithm's I/O complexity (primarily considering data movement from the DISK to the Cache).
[0058] Taking Matrix-DSP as an example, the cores share the global shared memory (GSM), and each core has its own private array memory (AM) and scalar memory (SM). Scalar processors can access only scalar memory, not array memory. Vector processors can access only array memory, not scalar memory. Data movement between the Double Data Rate Synchronous Dynamic Random Access Memory (DDR), global shared memory, array memory, and scalar memory is carried out by DMA. The matrix multiplication performed is C = b*C + a*A*B, where A, B, and C are matrices, and a and b are scalars.
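As a point of reference for the operation C = b*C + a*A*B, the following is a minimal sketch in plain Python. It ignores the memory hierarchy entirely; the function name `gemm` and the list-of-rows matrix layout are assumptions made for illustration and do not come from the application.

```python
def gemm(a, A, B, b, C):
    """Return b*C + a*(A*B) for matrices stored as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    # Shape checks: A is m x k, B is k x n, C is m x n.
    assert all(len(row) == k for row in A)
    assert len(C) == m and all(len(row) == n for row in C)
    # Start from the scaled copy of C, then accumulate a*A*B into it.
    out = [[b * C[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a * A[i][p]
            for j in range(n):
                out[i][j] += a_ip * B[p][j]
    return out
```

Every blocked variant discussed later must produce the same result as this straightforward triple loop; only the order and location of the data accesses change.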
[0059] The I/O complexity F of the matrix multiplication algorithm is calculated as follows:
[0060]
[0061] Where m, n, and k are the row and column dimensions of the matrices, and S is the size of the cache in the ideal cache model.
[0062] When 2mnk is much larger than 2S, the I/O complexity F of the matrix multiplication algorithm takes... The computation-to-memory-access ratio is:
[0063]
[0064] The computation-to-memory-access ratio of current algorithms is far smaller than... This results in high bandwidth pressure on the Double Data Rate Synchronous Dynamic Random Access Memory (DDR). Therefore, the matrix multiplication processing method for digital signal processors proposed in this application reduces the memory access volume, thereby reducing the bandwidth pressure on the DDR.
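To make the notion of a computation-to-memory-access ratio concrete, the sketch below counts the words moved by a simple square-tiled multiplication in a two-level memory model. This is not the formula of the application: the tile size `t`, the assumption that `t` divides m, n and k, and both function names are illustrative assumptions.

```python
def tiled_traffic(m, n, k, t):
    """Words moved between slow and fast memory with t x t tiles.

    Assumes t divides m, n and k, and that three t x t tiles
    fit in fast memory at the same time.
    """
    assert m % t == 0 and n % t == 0 and k % t == 0
    c_tiles = (m // t) * (n // t)
    # Each tile of C is loaded once and stored once.
    c_words = 2 * c_tiles * t * t
    # Each tile of C streams k/t tiles of A and k/t tiles of B.
    ab_words = c_tiles * (k // t) * 2 * t * t
    return c_words + ab_words


def compute_to_memory_ratio(m, n, k, t):
    """Arithmetic operations (2mnk) per word moved."""
    return 2 * m * n * k / tiled_traffic(m, n, k, t)
```

In this model the ratio grows with the tile size, which is why the blocking described in the following paragraphs tries to keep the blocks as large as the memories allow.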
[0065] To enable those skilled in the art to better understand the present application, it is described in further detail below with reference to the accompanying drawings and specific embodiments. Figure 2 is a flowchart of a matrix multiplication processing method for digital signal processors provided in an embodiment of this application. As shown in Figure 2, the method includes:
[0066] S10: Obtain the first, second, and third matrices on which the matrix multiplication operation is to be performed;
[0067] S11: Transfer the first matrix from the double-rate synchronous dynamic random access memory to the array memory;
[0068] S12: Transfer the second and third matrices from the double-rate synchronous dynamic random access memory to the global shared memory;
[0069] S13: Upon detecting an instruction to perform a matrix multiplication operation, store the second matrix from the global shared memory to the array memory and store the third matrix from the global shared memory to the scalar memory;
[0070] S14: Perform matrix multiplication on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory.
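Steps S10 to S14 can be traced with a small model in which each memory is a Python dictionary. This is only a sketch of the data flow: real transfers are DMA operations on blocks, and the names `run_flow`, `move`, and the log labels are hypothetical. The operand names follow the example used in this embodiment (C is the first matrix, A the second, B the third).

```python
def run_flow(A, B, C, a, b):
    """Trace S10 to S14 and return (result, transfer log)."""
    ddr = {"A": A, "B": B, "C": C}   # S10: operands start in DDR
    gsm, am, sm = {}, {}, {}
    log = []

    def move(src, dst, name, label):
        # Model one transfer by copying the reference and logging it.
        dst[name] = src[name]
        log.append(label)

    move(ddr, am, "C", "DDR->AM:C")    # S11
    move(ddr, gsm, "A", "DDR->GSM:A")  # S12
    move(ddr, gsm, "B", "DDR->GSM:B")
    move(gsm, am, "A", "GSM->AM:A")    # S13
    move(gsm, sm, "B", "GSM->SM:B")

    # S14: C = b*C + a*A*B using the operands now held in AM and SM.
    k = len(sm["B"])
    result = [[b * c_ij + a * sum(am["A"][i][p] * sm["B"][p][j]
                                  for p in range(k))
               for j, c_ij in enumerate(row)]
              for i, row in enumerate(am["C"])]
    return result, log
```

The log makes the key property visible: C is read from DDR once, while A and B pass through the shared memory before reaching the private memories.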
[0071] This embodiment uses a first matrix C, a second matrix A, and a third matrix B as an example to illustrate the matrix multiplication process. Matrix C is transferred from the Double Data Rate Synchronous Dynamic Random Access Memory (DDR) to the array memory, and matrices A and B are transferred from the DDR to the global shared memory (GSM). Upon detecting an instruction to perform a matrix multiplication operation, matrix A is stored from the GSM to the array memory, and matrix B is stored from the GSM to the scalar memory. The matrix multiplication operation is then performed on matrices C and A stored in the array memory and matrix B stored in the scalar memory, i.e., C = b*C + a*A*B. In practice, to improve the efficiency of matrix multiplication, in a preferred embodiment the matrix multiplication operation and the transfer of the first, second, and third matrices from the DDR are performed in a dual-pipeline manner.
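The dual-pipeline idea can be illustrated by listing, for each time step, which block is being transferred and which is being computed. The sketch assumes equal-length transfer and compute stages and a hypothetical function name; it models only the schedule, not the DMA engine.

```python
def dual_pipeline_schedule(num_blocks):
    """Return one (transfer, compute) pair of block indices per time step.

    While block i is being computed, block i + 1 is being transferred.
    None marks an idle stage at the start or end of the pipeline.
    """
    steps = []
    for t in range(num_blocks + 1):
        transfer = t if t < num_blocks else None  # block loaded in this step
        compute = t - 1 if t >= 1 else None       # block loaded in the previous step
        steps.append((transfer, compute))
    return steps
```

For three blocks the schedule has four steps instead of the six needed when each transfer and each computation run back to back, which is the overlap the preferred embodiment relies on.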
[0072] The matrix multiplication processing method for digital signal processors provided in this embodiment includes: acquiring a first matrix, a second matrix, and a third matrix on which a matrix multiplication operation is to be performed; transferring the first matrix from the Double Data Rate Synchronous Dynamic Random Access Memory (DDR) to an array memory; transferring the second and third matrices from the DDR to the global shared memory (GSM); upon detecting an instruction to perform the matrix multiplication operation, storing the second matrix from the GSM to the array memory and the third matrix from the GSM to a scalar memory; and performing the matrix multiplication operation on the first and second matrices stored in the array memory and the third matrix stored in the scalar memory. In this method, before the matrix multiplication operation is performed, the first matrix is stored in the array memory, and the second and third matrices are stored in the GSM. This ensures that during the matrix multiplication operation the first matrix is read only once, while the second and third matrices are shared by multiple cores, which lowers the memory access volume of the digital signal processor and reduces the bandwidth pressure on the DDR.
[0073] When the first matrix is transferred to the array memory, the second and third matrices are transferred to the global shared memory, the second matrix is stored in the array memory, and the third matrix is stored in the scalar memory, the three matrices must be split into blocks because of limits such as the size of the global shared memory and the storage capacities of the array memory and the scalar memory. Therefore, in a preferred embodiment, before the matrix multiplication operation is performed on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory, the matrix multiplication processing method further includes:
[0074] The first matrix, the second matrix, and the third matrix are split into blocks respectively, and the block parameters of each block are obtained; wherein the block parameters include at least the number of rows and columns of the matrix block;
[0075] If the block parameters meet the preset requirements, matrix multiplication operations are performed on the matrix blocks corresponding to the first matrix stored in the array memory, the matrix blocks corresponding to the second matrix, and the matrix blocks corresponding to the third matrix stored in the scalar memory.
[0076] Taking the aforementioned matrices A, B, and C as examples, due to limitations such as the size of the global shared memory, the storage capacity of the array memory, and the storage capacity of the scalar memory, it is necessary to divide matrices A, B, and C into corresponding matrix blocks, and then perform matrix multiplication operations on the matrix blocks of matrices A, B, and C.
[0077] In practice, to improve the efficiency of matrix multiplication, a group of multiple threads is used to perform the matrix multiplication calculations.
[0078] So that the selected block parameters are reasonable when the blocks are divided, in a preferred implementation the preset requirement on the block parameters is determined by at least one of the following:
[0079] The storage capacity of global shared memory, the size of the thread grid consisting of all the cores of a digital signal processor, the storage capacity of array memory, and the storage capacity of scalar memory.
[0080] In the method provided in this embodiment, the matrix is divided according to the storage capacity of the array memory, the storage capacity of the scalar memory, the size of the global shared memory, etc., so that the size of the divided matrix blocks is more reasonable.
[0081] Specifically, before transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory, the process also includes:
[0082] Obtain the size of the global shared memory;
[0083] Based on the size of the global shared memory, the first matrix is split into first matrix blocks, the second matrix is split into second matrix blocks, and the third matrix is split into third matrix blocks; wherein the number of rows in the first matrix block is the same as the number of rows in the second matrix block, the number of columns in the first matrix block is the same as the number of columns in the third matrix block, and the number of columns in the second matrix block is the same as the number of rows in the third matrix block.
[0084] Correspondingly, transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory includes:
[0085] Each first matrix block is transferred from the double-rate synchronous dynamic random access memory to the array memory;
[0086] The transfer of the second and third matrices from double-rate synchronous dynamic random access memory to global shared memory includes:
[0087] Each second matrix block and each third matrix block are transferred from the double-rate synchronous dynamic random access memory to the global shared memory.
[0088] Transfers from the double data rate synchronous dynamic random access memory are overlapped with computation through a dual pipeline. The constraint imposed by the size of the global shared memory is:
[0089]
[0090] Matrix C is divided into m2×n2 matrix blocks, i.e., the first matrix blocks; matrix A is divided into m2×k2 matrix blocks, i.e., the second matrix blocks; and matrix B is divided into k2×n2 matrix blocks, i.e., the third matrix blocks. Each m2×n2 matrix block is transferred from the Double Data Rate Synchronous Dynamic Random Access Memory (DDR) to the array memory, and each m2×k2 and k2×n2 matrix block is transferred from the DDR to the global shared memory. That is, A and B are shared across multiple cores and stored in the global shared memory, while C is distributed across the cores and stored in their array memories. The specific program is as follows:
[0091]
[0092]
[0093] In the method provided in this embodiment, the size of the matrix blocks of each matrix is set according to the size of the global shared memory, so that the block division is more reasonable.
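The block division by m2, n2, and k2 can be sketched as a three-level loop nest over blocks. The code below is an illustrative plain-Python model, not the program of the application: it runs on one thread, performs no transfers, and the name `blocked_gemm` and the list-of-rows layout are assumptions.

```python
def blocked_gemm(a, A, B, b, C, m2, n2, k2):
    """Compute b*C + a*A*B block by block, with loops ordered m, n, k."""
    m, k, n = len(A), len(B), len(B[0])
    out = [[b * C[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, m2):            # block rows of C and A
        for j0 in range(0, n, n2):        # block columns of C and B
            for p0 in range(0, k, k2):    # blocks along the shared dimension
                # One m2 x k2 block of A times one k2 x n2 block of B
                # accumulates into one m2 x n2 block of C.
                for i in range(i0, min(i0 + m2, m)):
                    for p in range(p0, min(p0 + k2, k)):
                        a_ip = a * A[i][p]
                        for j in range(j0, min(j0 + n2, n)):
                            out[i][j] += a_ip * B[p][j]
    return out
```

Because the block of C stays fixed for the whole k-loop, it needs to be brought in only once, while the blocks of A and B stream past it.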
[0094] Based on the above embodiments, to improve the efficiency of the matrix multiplication operation, in a preferred embodiment, after the first matrix is split into first matrix blocks, the second matrix into second matrix blocks, and the third matrix into third matrix blocks according to the size of the global shared memory, and before the first matrix is transferred from the double data rate synchronous dynamic random access memory to the array memory, the method further includes:
[0095] Obtain the number of cores in the digital signal processor and the size of the thread grid composed of all the cores;
[0096] Based on the size of the thread grid, the first matrix block is split into the fourth matrix block, the second matrix block into the fifth matrix block, and the third matrix block into the sixth matrix block;
[0097] Correspondingly, transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory includes:
[0098] Each fourth matrix block is transferred from the double-rate synchronous dynamic random access memory to the array memory;
[0099] The transfer of the second and third matrices from double-rate synchronous dynamic random access memory to global shared memory includes:
[0100] Each fifth matrix block and each sixth matrix block are transferred from the double-rate synchronous dynamic random access memory to the global shared memory.
[0101] With the loops ordered m, then n, then k, matrix C is divided into m2×n2 matrix blocks, matrix A into m2×k2 matrix blocks, and matrix B into k2×n2 matrix blocks. For a multi-core thread grid of $p_m \times p_n$ threads, before the k-loop begins, each m2×n2 matrix block is further divided into m1×n1 matrix blocks, and each thread is responsible for computing one of them. Note that when the m2×n2 matrix block is divided into m1×n1 matrix blocks, the m2×k2 matrix block of A is divided into m1×k2 matrix blocks along the m dimension, and the k2×n2 matrix block of B is divided into k2×n1 matrix blocks along the n dimension.
[0102] The operation is as follows. Before the k-loop begins, each thread transfers the m1×n1 matrix block it is responsible for from the Double Data Rate Synchronous Dynamic Random Access Memory (DDR) to the array memory. After the k-loop begins, in the first step the threads are divided into two groups, which transfer the m2×k2 matrix block of A and the k2×n2 matrix block of B, respectively, from the DDR to the global shared memory. In the second step, the gemmMacroKernel function is called to perform the calculation. With the k-loop unrolled, the first and second steps form a dual pipeline. The specific program is as follows:
[0103]
[0104] The i and j loops are completed in parallel by multiple threads, with each thread starting execution directly from the ii and jj loops.
[0105] In the method provided in this embodiment, the efficiency of matrix multiplication is improved by using a multi-thread group; and the first matrix is further split according to the thread group, so that the thread group can complete the calculation of the matrix block.
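The assignment of m1×n1 blocks to threads can be sketched as an index mapping. The code assumes the thread grid has p_m rows and p_n columns and that m2 = p_m × m1 and n2 = p_n × n1; the function names and the use of Python `range` objects to describe a block are illustrative choices, not part of the application.

```python
def thread_block(ti, tj, i0, j0, m1, n1):
    """Row and column ranges of the m1 x n1 block of C owned by thread (ti, tj).

    (i0, j0) is the top-left corner of the enclosing m2 x n2 block.
    """
    rows = range(i0 + ti * m1, i0 + (ti + 1) * m1)
    cols = range(j0 + tj * n1, j0 + (tj + 1) * n1)
    return rows, cols


def partition(p_m, p_n, m1, n1, i0=0, j0=0):
    """Map every thread of the p_m x p_n grid to its block of C."""
    return {(ti, tj): thread_block(ti, tj, i0, j0, m1, n1)
            for ti in range(p_m) for tj in range(p_n)}
```

Each element of the m2×n2 block belongs to exactly one thread, so the threads can update their parts of C without synchronizing with each other.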
[0106] Because the storage capacities of the array memory and the scalar memory limit how much of the second and third matrices can be stored in them, in a preferred embodiment, after each fifth matrix block and each sixth matrix block is transferred from the double data rate synchronous dynamic random access memory to the global shared memory, and before the second matrix is stored from the global shared memory to the array memory and the third matrix from the global shared memory to the scalar memory, the method further includes:
[0107] Obtain the storage capacity of the array memory and the storage capacity of the scalar memory;
[0108] Based on the storage capacity of the array memory and the storage capacity of the scalar memory, the fourth matrix block is split into the seventh matrix block, the fifth matrix block is split into the eighth matrix block, and the sixth matrix block is split into the ninth matrix block;
[0109] Correspondingly, storing the second matrix from the global shared memory to the array memory and storing the third matrix from the global shared memory to the scalar memory include:
[0110] Store the eighth matrix block from the global shared memory to the array memory and store the ninth matrix block from the global shared memory to the scalar memory;
[0111] Performing matrix multiplication on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory includes:
[0112] Perform matrix multiplication on the seventh and eighth matrix blocks stored in the array memory and the ninth matrix block stored in the scalar memory.
[0113] Based on the above embodiments, the m2×k2 matrix block is divided into m0×k2 matrix blocks, and the k2×n2 matrix block is divided into k2×n0 matrix blocks.
[0114] In the first step, the m0×k2 matrix block of A is stored from the global shared memory to the array memory, and the k2×n0 matrix block of B is stored from the global shared memory to the scalar memory;
[0115] In the second step, the microkernel program, which is written in assembly language, is called to complete the calculation.
[0116] With the ii and jj loops unrolled, the first and second steps form a dual pipeline.
[0117] The constraint imposed by the size of the array memory is:
[0118] $(m1 \cdot n1 + 2 \cdot k2 \cdot n0) \le S_{AM}$
[0119] The constraint imposed by the size of the scalar memory is:
[0120] $(2 \cdot k2 \cdot m0) \le S_{SM}$
[0121] The block parameter relationships are as follows:
[0122] $m2 = p_m \cdot m1$
[0123] $n2 = p_n \cdot n1$
[0124] $m1 = N_m \cdot m0$
[0125] $n1 = N_n \cdot n0$
[0126] Where $N_m$ is the number of iterations along the m dimension and $N_n$ is the number of iterations along the n dimension.
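The two private-memory constraints and the block parameter relationships can be checked mechanically. In the sketch below the fourth relationship is read as n1 = N_n × n0, by symmetry with m1 = N_m × m0, which is an assumption; sizes are counted in matrix elements, and the function names are hypothetical.

```python
def block_params(p_m, p_n, n_m, n_n, m0, n0):
    """Derive m1, n1, m2, n2 from the block parameter relationships."""
    m1, n1 = n_m * m0, n_n * n0
    return {"m1": m1, "n1": n1, "m2": p_m * m1, "n2": p_n * n1}


def fits_private_memories(m1, n1, k2, m0, n0, s_am, s_sm):
    """Check the array-memory and scalar-memory limits (sizes in elements)."""
    array_ok = (m1 * n1 + 2 * k2 * n0) <= s_am
    scalar_ok = (2 * k2 * m0) <= s_sm
    return array_ok and scalar_ok
```

The factor 2 in each constraint is consistent with keeping two buffers per streamed block so that one can be filled while the other is used.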
[0127] The cores form a thread grid of $p = p_m \times p_n$ threads, and the work is partitioned along the m and n dimensions and distributed across the cores. The corresponding computation-to-memory-access ratios are:
[0128]
[0129]
[0130] Where $Q_{GSM}$ is the computation-to-memory-access ratio of the global shared memory and $Q_{DDR}$ is the computation-to-memory-access ratio of the double data rate synchronous dynamic random access memory.
[0131] When K is large enough
[0132] From the AM-GM inequality, we get
[0133] The equality holds if and only if m2 = n2.
[0134] Let's take m2 = n2 for discussion.
[0135]
[0136] $Q_{DDR} = m2$
[0137] Since the bandwidth of the on-chip global shared memory is sufficient, the main consideration is the bandwidth pressure on the double data rate synchronous dynamic random access memory. The objective is therefore:
[0138] $\max Q_{DDR} = m2$
[0139] The constraints are:
[0140]
[0141]
[0142]
[0143] p = p m *p n
[0144] The parameters to be solved are m2, k2, and p. m pn N m N n .
[0145] In current digital signal processors, $S_{GSM} \le p S_{AM}$. Therefore, the theoretical maximum computation-to-memory-access ratio is
[0146] Since the array memory must set aside some space to store matrix B, it follows that
[0147] The less space matrix B occupies in the array memory, the closer the scheme comes to the lower bound on memory access, as in the CPU case. That is, the upper bound of the computation-to-memory-access ratio for the double data rate synchronous dynamic random access memory is... Since this agrees with I/O complexity theory, the implementation of this application is optimal in memory access.
[0148] Furthermore, $m_2$ is solved by exhaustive search. Because $m_2$, $k_2$, $p_m$, $p_n$, $N_m$, and $N_n$ are all integers, and $p_m \times p_n$ equals the number of cores, the search space is very limited. Since a larger $m_2$ is better, the search can start from... and proceed in decreasing order. The search stops as soon as a parameter set satisfies the above constraints, which completes the parameter setting. Generally, $m_2$ is an integer between 0 and 2000.
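The decreasing exhaustive search can be sketched as follows. This is an illustration under stated assumptions rather than the application's procedure. The array memory and scalar memory limits are the ones given above, but the global shared memory limit `2 * k2 * (m2 + n2) <= s_gsm` is a hypothetical stand-in. The microkernel tile sizes `m0` and `n0` are treated as fixed inputs, capacities are counted in matrix elements, the starting value `m2_start` is a free parameter, and choosing the largest feasible `k2` is a design choice of this sketch.

```python
from typing import NamedTuple, Optional


class Blocking(NamedTuple):
    m2: int
    k2: int
    p_m: int
    p_n: int
    N_m: int
    N_n: int


def search_blocking(cores: int, s_gsm: int, s_am: int, s_sm: int,
                    m0: int, n0: int, m2_start: int = 2000) -> Optional[Blocking]:
    """Return the first feasible blocking found while decreasing m2 (with n2 = m2)."""
    # Every factorisation p_m * p_n that equals the number of cores.
    grids = [(pm, cores // pm) for pm in range(1, cores + 1) if cores % pm == 0]
    for m2 in range(m2_start, 0, -1):
        n2 = m2  # the source takes m2 = n2
        for p_m, p_n in grids:
            if m2 % p_m or n2 % p_n:
                continue  # m2 = p_m * m1 and n2 = p_n * n1 need integer m1, n1
            m1, n1 = m2 // p_m, n2 // p_n
            if m1 % m0 or n1 % n0:
                continue  # m1 = N_m * m0 and n1 = N_n * n0 need integer N_m, N_n
            k2_am = (s_am - m1 * n1) // (2 * n0)  # m1*n1 + 2*k2*n0 <= S_AM
            k2_sm = s_sm // (2 * m0)              # 2*k2*m0 <= S_SM
            k2_gsm = s_gsm // (2 * (m2 + n2))     # ASSUMED global shared memory limit
            k2 = min(k2_am, k2_sm, k2_gsm)
            if k2 >= 1:
                return Blocking(m2, k2, p_m, p_n, m1 // m0, n1 // n0)
    return None
```

For example, `search_blocking(4, 10000, 1000, 64, 2, 4, m2_start=100)` walks down from 100 and stops at the first `m2` for which some core grid satisfies both the divisibility conditions and all three capacity limits. Because the outer loop is at most a couple of thousand iterations and the number of core grids is small, the search cost is negligible compared with the multiplication itself.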
[0149] The matrix multiplication processing method provided in this embodiment achieves the minimum memory access by using constraints determined by the storage capacity of the global shared memory, the size of the thread grid composed of all cores of the digital signal processor, the storage capacity of the array memory, and the storage capacity of the scalar memory. This achieves optimal memory access and reduces the bandwidth pressure of the double-rate synchronous dynamic random access memory.
[0150] In the above embodiments, a matrix multiplication processing method for digital signal processors has been described in detail. This application also provides embodiments of a matrix multiplication processing apparatus for digital signal processors. It should be noted that this application describes the apparatus embodiments from two perspectives: one based on functional modules and the other based on hardware.
[0151] Figure 3 is a structural diagram of a matrix multiplication processing apparatus for a digital signal processor provided in one embodiment of this application. This embodiment, described in terms of functional modules, includes:
[0152] The acquisition module 10 is used to acquire the first matrix, the second matrix, and the third matrix to be multiplied.
[0153] The first transmission module 11 is used to transmit the first matrix from the double-rate synchronous dynamic random access memory to the array memory;
[0154] The second transmission module 12 is used to transmit the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory;
[0155] Storage module 13 is used to store the second matrix from global shared memory to array memory and the third matrix from global shared memory to scalar memory when an instruction for performing matrix multiplication is detected.
[0156] The matrix multiplication module 14 is used to perform matrix multiplication operations on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory.
[0157] Since the embodiments of the apparatus section correspond to the embodiments of the method section, please refer to the description of the embodiments of the method section for the embodiments of the apparatus section, and they will not be repeated here. Furthermore, it has the same beneficial effects as the matrix multiplication processing method for digital signal processors mentioned above.
[0158] Figure 4 is a structural diagram of a matrix multiplication processing apparatus for a digital signal processor, provided as another embodiment of this application. This embodiment is described from a hardware perspective. As shown in Figure 4, the matrix multiplication processing device for a digital signal processor includes:
[0159] Memory 20 is used to store computer programs;
[0160] The processor 21 is used to implement the steps of the matrix multiplication processing method for digital signal processors as described in the above embodiments when executing a computer program.
[0161] The processor 21 may include one or more processing cores, such as a quad-core processor or an octa-core processor. The processor 21 may be implemented using at least one hardware form selected from DSP, Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor. The main processor, also known as the CPU, is used to process data in the wake-up state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, the processor 21 may integrate a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor 21 may also include an Artificial Intelligence (AI) processor, which is used to handle computational operations related to machine learning.
[0162] The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In this embodiment, the memory 20 is used to store at least the following computer program 201, which, after being loaded and executed by the processor 21, is capable of implementing the relevant steps of the matrix multiplication processing method for digital signal processors disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202 and data 203, and the storage method may be temporary or permanent storage. The operating system 202 may include Windows, Unix, Linux, etc. The data 203 may include, but is not limited to, the data involved in the matrix multiplication processing method for digital signal processors mentioned above.
[0163] In some embodiments, the matrix multiplication processing device for a digital signal processor may further include a display screen 22, an input / output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
[0164] Those skilled in the art will understand that the structure shown in Figure 4 does not constitute a limitation on the matrix multiplication processing device for a digital signal processor, which may include more or fewer components than shown.
[0165] The matrix multiplication processing apparatus for digital signal processors provided in this application includes a memory and a processor. When the processor executes the program stored in the memory, it implements the matrix multiplication processing method for digital signal processors described above, with the same beneficial effects.
[0166] Finally, this application also provides an embodiment corresponding to a computer-readable storage medium. The computer-readable storage medium stores a computer program, which, when executed by a processor, implements the steps described in the above method embodiments.
[0167] It is understood that if the methods in the above embodiments are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and executes all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0168] The computer-readable storage medium provided in this application stores a program that implements the matrix multiplication processing method for digital signal processors mentioned above, and has the same beneficial effects.
[0169] The foregoing provides a detailed description of a matrix multiplication processing method, apparatus, and medium for digital signal processors. The various embodiments are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from the principles of this application, and these improvements and modifications also fall within the protection scope of the claims of this application.
[0170] It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
Claims
1. A matrix multiplication processing method for digital signal processors, characterized in that it comprises: obtaining the first, second, and third matrices to be multiplied; transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory; transferring the second and third matrices from the double-rate synchronous dynamic random access memory to the global shared memory; upon detecting an instruction to perform the matrix multiplication operation, storing the second matrix from the global shared memory to the array memory and the third matrix from the global shared memory to the scalar memory; and performing the matrix multiplication operation on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory.
2. The matrix multiplication processing method for digital signal processors according to claim 1, characterized in that, before performing the matrix multiplication operation on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory, the method further comprises: splitting the first matrix, the second matrix, and the third matrix into blocks, and obtaining the block parameters of each block, wherein the block parameters include at least the number of rows and columns of the block; and, when the block parameters meet the preset requirements, performing the matrix multiplication operation on each of the matrix blocks corresponding to the first matrix stored in the array memory, each of the matrix blocks corresponding to the second matrix, and each of the matrix blocks corresponding to the third matrix stored in the scalar memory.
3. The matrix multiplication processing method for digital signal processors according to claim 2, characterized in that the preset requirement is determined based on at least one of the following: the storage capacity of the global shared memory, the size of the thread grid consisting of all the cores of the digital signal processor, the storage capacity of the array memory, and the storage capacity of the scalar memory.
4. The matrix multiplication processing method for digital signal processors according to claim 1, characterized in that, before transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory, the method further comprises: obtaining the size of the global shared memory; and, based on the size of the global shared memory, splitting the first matrix into a first matrix block, the second matrix into a second matrix block, and the third matrix into a third matrix block, wherein the number of rows in the first matrix block is the same as the number of rows in the second matrix block, the number of columns in the first matrix block is the same as the number of columns in the third matrix block, and the number of columns in the second matrix block is the same as the number of rows in the third matrix block; correspondingly, the step of transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory comprises: transferring each of the first matrix blocks from the double-rate synchronous dynamic random access memory to the array memory; and the step of transferring the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory comprises: transferring each of the second matrix blocks and each of the third matrix blocks from the double-rate synchronous dynamic random access memory to the global shared memory.
5. The matrix multiplication processing method for digital signal processors according to claim 4, characterized in that, after splitting the first matrix into a first matrix block, the second matrix into a second matrix block, and the third matrix into a third matrix block according to the size of the global shared memory, and before transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory, the method further comprises: obtaining the number of cores of the digital signal processor and the size of the thread grid composed of all the cores; and, based on the size of the thread grid, splitting the first matrix block into a fourth matrix block, the second matrix block into a fifth matrix block, and the third matrix block into a sixth matrix block; correspondingly, the step of transferring the first matrix from the double-rate synchronous dynamic random access memory to the array memory comprises: transferring each of the fourth matrix blocks from the double-rate synchronous dynamic random access memory to the array memory; and the step of transferring the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory comprises: transferring each of the fifth matrix blocks and each of the sixth matrix blocks from the double-rate synchronous dynamic random access memory to the global shared memory.
6. The matrix multiplication processing method for digital signal processors according to claim 5, characterized in that, after transferring each of the fifth matrix blocks and each of the sixth matrix blocks from the double-rate synchronous dynamic random access memory to the global shared memory, and before storing the second matrix from the global shared memory to the array memory and the third matrix from the global shared memory to the scalar memory, the method further comprises: obtaining the storage capacity of the array memory and the storage capacity of the scalar memory; and, based on the storage capacity of the array memory and the storage capacity of the scalar memory, splitting the fourth matrix block into a seventh matrix block, the fifth matrix block into an eighth matrix block, and the sixth matrix block into a ninth matrix block; correspondingly, storing the second matrix from the global shared memory to the array memory and storing the third matrix from the global shared memory to the scalar memory comprises: storing the eighth matrix block from the global shared memory to the array memory and the ninth matrix block from the global shared memory to the scalar memory; and performing the matrix multiplication operation on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory comprises: performing the matrix multiplication operation on the seventh matrix block, the eighth matrix block stored in the array memory, and the ninth matrix block stored in the scalar memory.
7. The matrix multiplication processing method for digital signal processors according to any one of claims 1 to 6, characterized in that the matrix multiplication operation and the transfer of the first matrix, the second matrix, and the third matrix from the double-rate synchronous dynamic random access memory are performed in a dual-pipeline manner.
8. A matrix multiplication processing device for digital signal processors, characterized in that it comprises: an acquisition module, used to acquire the first, second, and third matrices to be multiplied; a first transmission module, used to transmit the first matrix from the double-rate synchronous dynamic random access memory to the array memory; a second transmission module, used to transmit the second matrix and the third matrix from the double-rate synchronous dynamic random access memory to the global shared memory; a storage module, configured to, upon detecting an instruction to perform the matrix multiplication operation, store the second matrix from the global shared memory to the array memory and the third matrix from the global shared memory to the scalar memory; and a matrix multiplication module, used to perform the matrix multiplication operation on the first matrix, the second matrix stored in the array memory, and the third matrix stored in the scalar memory.
9. A matrix multiplication processing device for digital signal processors, characterized in that it comprises: a memory, used to store computer programs; and a processor, configured to implement the steps of the matrix multiplication processing method for a digital signal processor as described in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the matrix multiplication processing method for a digital signal processor as described in any one of claims 1 to 7.