Aggregation method, system, and storage medium
Ping-pong buffers and iterative merging optimize memory management in sparse matrix-matrix multiplication, addressing its low computational throughput and poor storage efficiency and enabling efficient computation and memory utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2022-07-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies have failed to effectively optimize computational throughput and storage efficiency in sparse matrix-matrix multiplication, resulting in irregular memory access and low computational efficiency.
A memory-efficient accumulation method is employed: sparse matrices are obtained by a processor associated with a cache, and ping-pong buffers and iterative merging reduce memory consumption and maximize memory reuse, thereby optimizing memory management and improving computational efficiency.
The method significantly improves the computational and storage efficiency of sparse matrix-matrix multiplication, reduces memory usage, increases computational throughput, and reduces latency.
Figure CN117370224B_ABST
Description
Technical Field
[0001] This disclosure generally relates to accumulation methods, systems, and storage media for sparse matrix-matrix multiplication that utilize memory efficiently.
Background Technology
[0002] General Sparse Matrix-Matrix Multiplication (SpGEMM) is a fundamental and expensive computational primitive in many practical applications that operate on sparse matrices. For example, the publicly available SuiteSparse Matrix Collection is a large and rapidly growing collection of sparse matrices from a wide range of fields, such as semiconductor devices, computer graphics and vision, robotics and kinematics, quantum chemistry, and chemical process simulation.
[0003] Although SpGEMM is considered a memory-bound algorithm, most existing work focuses on optimizing computational throughput rather than storage efficiency. In fact, a more efficient storage design for SpGEMM could allow most data accesses and computations to be performed in the cache, further improving computational throughput. Therefore, there is an urgent need for a performance model that comprehensively considers both computational throughput and storage overhead.
Summary of the Invention
[0004] Various embodiments of this specification may include hardware circuitry, systems, and methods for performing accumulation in SpGEMM applications.
[0005] According to one aspect, a computer-implemented method for memory-efficient accumulation during SpGEMM computation is described. This method reduces memory consumption and maximizes memory reuse, allowing the memory-intensive step of accumulating intermediate products to be performed in low-latency memory. The method includes: obtaining, via a processor associated with a cache, a first sparse matrix and a second sparse matrix for performing SpGEMM computation; allocating, via the processor, a pair of buffers from the cache, with a first pointer and a second pointer pointing to the pair of buffers respectively; for each first row in the first sparse matrix containing multiple non-zero elements, identifying, via the processor, multiple second rows in the second sparse matrix corresponding to those non-zero elements; obtaining, via the processor, multiple intermediate lists, each computed from one non-zero element in the first row and the second row corresponding to that non-zero element; and storing, via the processor, the multiple intermediate lists in the buffer pointed to by the first pointer. At this point, the multiple intermediate lists are ready to be merged into a final merged list, which will be used as part of the output matrix of SpGEMM. The merging process can be iterative: the processor executes an iterative process that includes merging the multiple intermediate lists in the buffer pointed to by the first pointer into a smaller number of intermediate lists; storing the smaller number of intermediate lists into the buffer pointed to by the second pointer; swapping the first and second pointers; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists includes a final merged list. After the final merged list is obtained, it can be migrated as output to system memory.
[0006] In some embodiments, the method may further include: allocating an offset buffer comprising a plurality of offsets, the plurality of offsets corresponding to a plurality of intermediate lists, wherein each offset points to the first unmerged element in the corresponding intermediate list.
[0007] In some embodiments, the method may further include: updating the offset buffer in response to merging multiple intermediate lists into a smaller number of intermediate lists, such that each offset points to the first unmerged element in one of the smaller number of intermediate lists.
[0008] In some embodiments, merging multiple intermediate lists of a buffer pointed to by a first pointer into a smaller number of intermediate lists and storing the smaller number of intermediate lists into a buffer pointed to by a second pointer includes: for two adjacent intermediate lists of the multiple intermediate lists: (1) determining two memory offsets in the buffer pointed to by the first pointer, the two memory offsets pointing to the first unmerged elements of the two adjacent intermediate lists respectively; (2) determining a target memory offset in the buffer pointed to by the second pointer; (3) extracting the column indices of the two unmerged elements at the two memory offsets in the buffer pointed to by the first pointer; (4) in response to the two unmerged elements having the same column index, aggregating the values of the two unmerged elements to obtain an aggregated value, and storing the aggregated value into a merged list starting from the target memory offset in the buffer pointed to by the second pointer; (5) in response to the two unmerged elements having different column indices, storing the one of the two unmerged elements having the smaller column index into the merged list starting from the target memory offset in the buffer pointed to by the second pointer; and repeating steps (1)-(5) until the two intermediate lists are merged into a merged list in the buffer pointed to by the second pointer.
[0009] In some embodiments, the first sparse matrix and the second sparse matrix are stored in a compact data format, wherein the compact data format excludes zero-value data in the first sparse matrix and the second sparse matrix.
[0010] In some embodiments, the method may further include: determining the buffer size of the pair of buffers by performing symbolic computation based on the indices of the non-zero elements in the first sparse matrix and the row sizes of the second sparse matrix, wherein the symbolic computation estimates the maximum number of floating-point multiplication operations (FLOPs) in a hypothetical multiplication between each of the plurality of first rows and the second sparse matrix, and wherein allocating the pair of buffers includes: allocating each of the pair of buffers using the buffer size.
[0011] In some embodiments, symbolic computation may be performed as follows: for each of the plurality of first rows that includes one or more non-zero elements, identify one or more corresponding second rows in the second sparse matrix; determine the value of FLOP for each first row based on the number of non-zero elements in the corresponding one or more second rows in the second sparse matrix; and use the maximum value of FLOP for the plurality of first rows as the buffer size.
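The symbolic computation described in this paragraph can be sketched as follows. This is a minimal illustration, assuming both matrices are stored in CSR form; the function name `max_row_flop` and the array names `a_row_ptr`, `a_col_idx`, and `b_row_ptr` are hypothetical and do not come from the disclosure. Only index information is read, so no multiplications are performed.

```python
def max_row_flop(a_row_ptr, a_col_idx, b_row_ptr):
    """Return the largest per-row FLOP count over all rows of A.

    Assumes CSR inputs (an assumption of this sketch): a_row_ptr/a_col_idx
    describe A, and b_row_ptr describes B. The FLOP count of row i of A is
    the sum of nnz(B[k, :]) over the column indices k of the non-zeros in
    that row.
    """
    max_flop = 0
    for i in range(len(a_row_ptr) - 1):
        flop = 0
        for p in range(a_row_ptr[i], a_row_ptr[i + 1]):
            k = a_col_idx[p]                       # column of the non-zero in A
            flop += b_row_ptr[k + 1] - b_row_ptr[k]  # nnz of row k of B
        max_flop = max(max_flop, flop)
    return max_flop


# A has 2 rows: row 0 has non-zeros at columns 1 and 3, row 1 at column 0.
a_row_ptr = [0, 2, 3]
a_col_idx = [1, 3, 0]
# B has 4 rows holding 2, 1, 0, and 3 non-zeros respectively.
b_row_ptr = [0, 2, 3, 3, 6]

buffer_size = max_row_flop(a_row_ptr, a_col_idx, b_row_ptr)
print(buffer_size)  # 4: row 0 needs 1 + 3 products, row 1 needs 2
```

Each buffer of the pair would then be allocated with `buffer_size` entries, which is an upper bound on the number of intermediate products any single row can generate.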
[0012] In some embodiments, the processor may include a multi-core processor, and the method further includes: dividing the rows of the first sparse matrix into multiple groups of first rows according to the multiple cores in the multi-core processor; and assigning the multiple groups of first rows to the multiple cores for parallel processing, wherein each core is assigned a corresponding pair of buffers.
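One simple way to divide the rows among cores is a contiguous, near-equal split, sketched below. The disclosure does not specify the partitioning policy, so the contiguous split, the function name `partition_rows`, and the placeholder `buffer_size` value are assumptions made for illustration only.

```python
def partition_rows(num_rows, num_cores):
    """Split row indices 0..num_rows-1 into num_cores contiguous groups.

    Group sizes differ by at most one row. This policy is an assumption of
    the sketch, not something stated in the disclosure.
    """
    base, extra = divmod(num_rows, num_cores)
    groups, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


groups = partition_rows(10, 4)

# Each core receives its own pair of buffers (hypothetical size of 8 entries).
buffer_size = 8
buffer_pairs = [([None] * buffer_size, [None] * buffer_size) for _ in groups]

print([list(g) for g in groups])  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Because every core works on its own rows and its own buffer pair, no synchronization is needed during accumulation in this sketch. A FLOP-balanced split could be substituted if row costs vary widely.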
[0013] In some embodiments, a multi-core processor may include a multi-core CPU or a GPU, and the multiple cores include multiple streaming multiprocessors (SMs) of the GPU.
[0014] In some embodiments, the offset buffer includes a pair of index lists corresponding to a pair of buffers.
[0015] In some embodiments, the method may further include: determining the buffer size of the offset buffer based on the maximum number of non-zero data in each row of the first sparse matrix.
[0016] According to another aspect, an accumulation system for accelerating SpGEMM computation with efficient memory utilization is described. The system may include: one or more processors; and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instructions executable by the one or more processors to cause the system to perform operations including: acquiring a first sparse matrix and a second sparse matrix for performing SpGEMM; allocating a pair of buffers, with a first pointer and a second pointer pointing to the pair of buffers respectively; for each first row in the first sparse matrix comprising a plurality of non-zero elements, identifying a plurality of second rows in the second sparse matrix corresponding to the plurality of non-zero elements; acquiring a plurality of intermediate lists, the plurality of intermediate lists being calculated based on each of the plurality of non-zero elements in the first row and a second row in the plurality of second rows corresponding to that non-zero element; storing the plurality of intermediate lists into the buffer pointed to by the first pointer; performing an iterative process, the iterative process including: merging the plurality of intermediate lists in the buffer pointed to by the first pointer into a smaller number of intermediate lists; storing the smaller number of intermediate lists into the buffer pointed to by the second pointer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists includes a final merged list; and migrating the final merged list as a row of the output matrix of SpGEMM to system memory.
[0017] According to another aspect, a non-transitory computer-readable storage medium is described for accumulation operations in accelerated SpGEMM computations with efficient memory utilization. Instructions stored in a non-transitory computer-readable storage medium, when executed by one or more processors, cause the processors to perform operations including: acquiring a first sparse matrix and a second sparse matrix for performing SpGEMM; allocating a pair of buffers, with a first pointer and a second pointer pointing to the pair of buffers respectively; for each first row in the first sparse matrix comprising a plurality of non-zero elements, identifying a plurality of second rows in the second sparse matrix corresponding to the plurality of non-zero elements; acquiring a plurality of intermediate lists, the plurality of intermediate lists being calculated based on each of the plurality of non-zero elements in the first row and a second row in the plurality of second rows corresponding to the non-zero elements; storing the plurality of intermediate lists into the buffer pointed to by the first pointer; performing an iterative process, the iterative process including: merging the plurality of intermediate lists in the buffer pointed to by the first pointer into a smaller number of intermediate lists; storing the smaller number of intermediate lists into the buffer pointed to by the second pointer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists includes a final merged list; and migrating the final merged list as a row of the output matrix of SpGEMM to system memory.
[0018] These and other features of the systems, methods, and hardware devices of this disclosure, as well as the operation and function of related elements of the structure, and the economics of component assembly and manufacture, will become more apparent upon consideration of the following description and appended claims with reference to the accompanying drawings, which form part of this specification, wherein similar reference numerals denote corresponding portions in the drawings. However, it should be understood that the drawings are for illustration and description only and are not intended to be a definition of the limits of this disclosure.
Attached Figure Description
[0019] Figure 1 illustrates an exemplary hardware environment for memory-efficient accumulation in General Sparse Matrix-Matrix Multiplication (SpGEMM) according to some embodiments.
[0020] Figure 2 shows an exemplary row-based method for performing SpGEMM with efficient memory allocation according to some embodiments.
[0021] Figure 3 shows a block diagram of an exemplary multi-core processor with distributed accumulation for memory-efficient SpGEMM according to some embodiments.
[0022] Figure 4 shows an exemplary memory layout and workflow for performing memory-efficient SpGEMM accumulation according to some embodiments.
[0023] Figure 5 shows an exemplary method for performing memory-efficient SpGEMM accumulation according to some embodiments.
[0024] Figure 6 shows an exemplary memory-efficient accumulation method for SpGEMM according to some embodiments.
[0025] Figure 7 shows a block diagram of a hardware device for memory-efficient SpGEMM accumulation according to some embodiments.
Specific Implementation
[0026] This disclosure is intended to enable those skilled in the art to make and use embodiments, and is provided in the context of specific applications and their requirements. Various modifications to the embodiments of this disclosure will be apparent to those skilled in the art, and the general principles defined herein can be applied to other embodiments and applications without departing from the spirit and scope of this disclosure. Therefore, this disclosure is not limited to the embodiments shown, but is accorded the widest scope consistent with the principles and features of this disclosure.
[0027] In sparse matrices, the number of non-zero elements (NNZ) is typically much smaller than the number of zero elements. When storing sparse matrices in computer systems, compact data structures are often used to save storage space. These data structures can include coordinate lists (COO), compressed sparse rows (CSR), bitmaps, etc. General sparse matrix-matrix multiplication (SpGEMM) involves multiplying two sparse matrices and is a fundamental but computationally expensive operation in many scientific computing applications and graph algorithms (e.g., algebraic multigrid solvers, triangle counting, multi-source breadth-first search, etc.).
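As an illustration of one of the compact formats named above, the following sketch converts a small dense matrix to CSR. The function name `dense_to_csr` and the three-array layout (`row_ptr`, `col_idx`, `values`) follow the common CSR convention and are not taken from the disclosure.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i inside
    col_idx (column indices) and values (numerical values).
    """
    row_ptr, col_idx, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:              # zero-value data is not stored
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


dense = [[0, 2, 0, 3],
         [0, 0, 0, 0],
         [1, 0, 4, 0]]
row_ptr, col_idx, values = dense_to_csr(dense)
print(row_ptr)  # [0, 2, 2, 4]
print(col_idx)  # [1, 3, 0, 2]
print(values)   # [2, 3, 1, 4]
```

The number of non-zeros in row i is `row_ptr[i + 1] - row_ptr[i]`, which can be read without touching the values.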
[0028] SpGEMM is well known to be difficult to optimize because of its inherent irregularity. For example, it exhibits irregular access patterns to the memory system and computational irregularity that leads to load imbalance in multithreaded designs. This disclosure provides algorithmic optimizations for SpGEMM on modern multi-core CPU architectures because 1) CPU platforms are widely used by researchers and research institutes, 2) applications incorporating SpGEMM kernels are more flexible to develop on CPU platforms, making CPUs the most widely used platform for these irregular applications, and 3) previous SpGEMM algorithms have been observed to fall short of an optimal algorithmic design on CPUs. However, the memory-efficient accumulation method disclosed herein is also applicable to other processor architectures, such as GPUs with appropriate memory configurations. Those skilled in the art will understand the architectural differences and that the method is platform-independent.
[0029] For ease of understanding, several terms used in the following description are defined herein, unless otherwise stated.
[0030] Uppercase letters A, B, and C represent matrices. A and B are the input matrices, with sizes (M, K) and (K, N) respectively. C is the result matrix, with size (M, N). A subscripted lowercase letter such as "$c_{ij}$" represents the non-zero element in the i-th row and j-th column of the corresponding matrix. "$c_{i*}$" represents all non-zero elements in the i-th row.
[0031] "$\mathrm{nnz}(c_{i*})$" represents the number of non-zero elements in the i-th row.
[0032] "$\mathrm{flop}(c_{i*})$" represents the number of multiplications involved in computing the i-th row.
[0033] FLOP refers to the number of floating-point multiplication operations performed to compute the output matrix or a row of the output matrix. In this disclosure, floating-point multiplication is used as an example. In some embodiments, multiplication operations may be based on integers.
[0034] FLOPR refers to the number of floating-point multiplication operations used to calculate a row of the output matrix.
[0035] NNZ refers to the number of non-zero elements in a matrix or a row of a matrix.
[0036] NNZR refers to the number of non-zero elements in a row of a matrix.
[0037] Compression ratio (CR) refers to the FLOP / NNZ of the output matrix or a row of the output matrix.
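For a single output row computed row by row, the definitions above can be written compactly as follows. This is a restatement under the assumption that $a_{ik}$ denotes an element of A, $b_{k*}$ the non-zero elements of row k of B, and $c_{i*}$ the non-zero elements of row i of C; the symbol $\mathrm{CR}(c_{i*})$ is introduced here for illustration only.

```latex
\mathrm{flop}(c_{i*}) = \sum_{k \,:\, a_{ik} \neq 0} \mathrm{nnz}(b_{k*}),
\qquad
\mathrm{CR}(c_{i*}) = \frac{\mathrm{flop}(c_{i*})}{\mathrm{nnz}(c_{i*})}
```

For example, if row i of A has non-zeros in columns 1 and 3, and rows 1 and 3 of B hold 2 and 1 non-zeros with one shared column index, then $\mathrm{flop}(c_{i*}) = 3$, $\mathrm{nnz}(c_{i*}) = 2$, and the compression ratio is 1.5.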
[0038] Because matrix multiplication can be organized in many ways and sparse matrix storage formats are irregular, SpGEMM can employ a very wide range of computational routines. At a high level, SpGEMM algorithms are commonly classified into four categories: inner-product SpGEMM, outer-product SpGEMM, row-wise SpGEMM, and column-wise SpGEMM.
[0039] In inner-product SpGEMM, the atomic task is defined as the inner product of the i-th row of A and the j-th column of B, which yields one element $c_{ij}$ of C. This scheme may require the multiplication of two sparse vectors, which is inefficient due to numerous wasted data accesses to zero elements. Furthermore, the result of multiplying two sparse vectors is likely to be zero, contributing nothing to the output matrix in the sparse storage format. Therefore, due to its extremely low efficiency, this scheme is infeasible in practice.
[0040] In outer-product SpGEMM, the first atomic task is defined as the outer product of the i-th column of A and the i-th row of B, which yields one partial result matrix. This task produces K partial result matrices in total. The second atomic task is to merge these K partial result matrices row by row to obtain the final result matrix. To complete the second atomic task, memory space needs to be allocated for all intermediate partial matrices, or a hash data structure can be used to dynamically write each partial matrix to a result buffer. In high-performance settings, multi-threaded operation is the norm. Therefore, large memory footprints or concurrent writes may limit the performance of outer-product SpGEMM.
[0041] In row-wise SpGEMM, the atomic task is defined as multiplying the i-th row of A by the corresponding rows of B and accumulating the products to obtain the i-th row of C. Row-wise SpGEMM is naturally easy to parallelize at single-row granularity without encountering concurrent write issues. Furthermore, it offers flexibility in employing various allocation methods to handle intermediate result matrices. More details on row-based SpGEMM can be found in Figure 2 and the corresponding description.
[0042] The computation process and memory access pattern of column-based SpGEMM are the dual of row-based SpGEMM. Therefore, the row-based SpGEMM algorithm can be easily converted into a column-based SpGEMM algorithm with the same theoretical performance. For the sake of brevity, the following description focuses on the design of the row-based SpGEMM algorithm.
[0043] Figure 1 shows a schematic diagram of an exemplary hardware environment for memory-efficient accumulation in General Sparse Matrix-Matrix Multiplication (SpGEMM) according to some embodiments.
[0044] As shown in Figure 1, the hardware environment includes a memory pool 210, processing circuitry 220, and SpGEMM accumulator circuitry 230. The layout of the components in the hardware environment is for illustrative purposes and can be implemented differently depending on the actual hardware configuration. In some embodiments, the SpGEMM accumulator circuitry 230 may be implemented as a stand-alone hardware accelerator, decoupled from the processing circuitry 220 (e.g., one or more CPUs or GPUs). In some embodiments, the SpGEMM accumulator circuitry 230 may be implemented as part of the processing circuitry 220 (e.g., part of one or more CPUs or GPUs) to improve the efficiency of memory management. The memory pool 210 may refer to external storage devices, system RAM, other types of memory resources, or any combination thereof.
[0045] In some embodiments, the processing circuitry 220 may include one or more processors 222 and a cache 221 shared by the one or more processors 222. Each processor 222 may include an instruction fetching unit (IFU) 223, an instruction decoding unit (IDU) 224, an instruction transmitting unit (ITU) 225, and an instruction execution unit (IEU) 226.
[0046] In some embodiments, IFU 223 may fetch the instructions or data to be executed from memory pool 210 into register file 229. In some embodiments, the instructions or data to be executed may be fetched into cache 221 and sent to IFU 223 via microcontroller unit (MCU) 227. After obtaining the instructions or data, processing circuitry 220 enters the instruction decoding stage. IDU 224 decodes the obtained instructions according to a preset instruction format to determine operand fetch information, wherein the operands are required to execute the obtained instructions. In some embodiments, operand fetch information may include immediate values, registers, or pointers or addresses of other software / hardware that provide the operands.
[0047] In some embodiments, ITU 225 can be configured to receive decoded instructions from IDU 224 and perform instruction scheduling and management. It can efficiently distribute instructions to different IEUs 226 for parallel processing. In some embodiments, after ITU 225 distributes an instruction to an IEU 226, the IEU 226 can execute that instruction.
[0048] In some embodiments, the spGEMM accumulator circuit 230 may receive instructions from the processing circuit 220, access data from the memory pool 210, and perform accumulation of intermediate products to generate a row of the spGEMM output matrix. The spGEMM accumulator circuit 230 may send that row of the output matrix back to the processing circuit 220 for aggregation.
[0049] In some embodiments, the spGEMM accumulation circuit 230 may include a microprocessor for executing instructions to determine a buffer size, request / allocate a buffer according to the determined size, calculate intermediate products, perform accumulation of intermediate products using the buffer, and generate the output of the SpGEMM. In some embodiments, the spGEMM accumulation circuit 230 may include an allocation module 232, a calculation module 233, an iterative merging module 234, and an output module 235. In some embodiments, before triggering the spGEMM accumulation circuit 230, a first sparse matrix and a second sparse matrix for performing the SpGEMM may be loaded and stored in a memory pool 210 in a compact data format. The compact data format may exclude zero-value data in the first and second sparse matrices and store only information about non-zero data (index information and numerical information). The SpGEMM operation to be performed between the first and second sparse matrices may be distributed across multiple processors 222 in the processing circuit 220 for parallel processing. Each processor 222 can process computations between row subsets of the first and second sparse matrices and generate numerical values for the row subsets in the SpGEMM output matrix. Here, for illustrative purposes, computation refers to row matrix multiplication. The following description uses one processor 222 as an example to illustrate the workflow of the spGEMM accumulator circuit 230.
[0050] In some embodiments, allocation module 232 may be configured to determine an upper limit buffer size for storing intermediate products and allocate a pair of buffers from the memory system. Each buffer has an upper limit buffer size. A first pointer and a second pointer point to the pair of buffers, respectively. This pair of buffers may be referred to as ping-pong buffers and are used to perform the accumulation of intermediate products with efficient memory utilization. For example, the multiplication of the first row of a first sparse matrix with the corresponding row in a second sparse matrix can generate the output row of the SpGEMM output matrix. In this process, a list of intermediate products can be generated, and the intermediate product list can be accumulated to generate the output row.
[0051] In some embodiments, the computation module 233 may be configured to perform multiplication between the first row of the first sparse matrix and the corresponding row in the second sparse matrix to generate an intermediate product list. For example, based on row-matrix multiplication, the multiplication between the first row and the corresponding row may include: obtaining the column index of the non-zero data in the first row; identifying multiple corresponding rows in the second sparse matrix based on the column index of the non-zero data in the first row; and multiplying each non-zero data in the first row by the non-zero data in the corresponding row of the second sparse matrix to obtain an intermediate product list. That is, each non-zero data in the first row will generate an intermediate product list. In some embodiments, all non-zero data in the first row may be processed sequentially to generate multiple intermediate product lists. These intermediate product lists may be stored sequentially in the first buffer of the pair of buffers (e.g., the buffer pointed to by the first pointer) for accumulation.
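The generation of intermediate product lists for one row can be sketched as follows. This is a minimal sketch assuming CSR inputs passed as `(row_ptr, col_idx, values)` tuples; the function name `intermediate_lists` and the use of `(column, value)` tuples as buffer entries are assumptions for illustration. The returned `offsets` list marks where each intermediate list starts and ends in the buffer.

```python
def intermediate_lists(i, a_csr, b_csr):
    """Build the intermediate product lists for row i of A.

    Each non-zero a_ik produces one list: a_ik times every non-zero of
    row k of B. Lists are stored back to back in `buffer`; list n occupies
    buffer[offsets[n]:offsets[n + 1]]. Entries are (column, value) pairs.
    """
    a_ptr, a_col, a_val = a_csr
    b_ptr, b_col, b_val = b_csr
    buffer, offsets = [], [0]
    for p in range(a_ptr[i], a_ptr[i + 1]):
        k = a_col[p]                                  # matching row of B
        for q in range(b_ptr[k], b_ptr[k + 1]):
            buffer.append((b_col[q], a_val[p] * b_val[q]))
        offsets.append(len(buffer))
    return buffer, offsets


# Row 0 of A: 2.0 at column 1 and 3.0 at column 3.
a_csr = ([0, 2], [1, 3], [2.0, 3.0])
# B: row 1 holds 5.0 at column 0 and 1.0 at column 2; row 3 holds 4.0 at column 2.
b_csr = ([0, 0, 2, 2, 3], [0, 2, 2], [5.0, 1.0, 4.0])

buffer, offsets = intermediate_lists(0, a_csr, b_csr)
print(buffer)   # [(0, 10.0), (2, 2.0), (2, 12.0)]
print(offsets)  # [0, 2, 3]
```

If the column indices within each row of B are sorted, as is common for CSR, every intermediate list is already sorted by column index, which is what the subsequent pairwise merge relies on.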
[0052] In some embodiments, the iterative merging module 234 can be configured to iteratively merge multiple intermediate product lists into a single product list using a ping-pong buffer. This merging can be a necessary step to obtain the output row, as some intermediate products may contribute to the same output value (e.g., these intermediate products may be referred to as partial products, which need to be aggregated to generate the final product in the output row). In some embodiments, the merging process may include: merging multiple intermediate lists from a buffer pointed to by a first pointer into a smaller number of intermediate lists; storing the smaller number of intermediate lists into a buffer pointed to by a second pointer; swapping the first and second pointers; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists contains a final merged list. For example, a final list indicates that the smaller number of intermediate lists have been iteratively merged such that only one merged list becomes the final result.
[0053] In some embodiments, every two adjacent intermediate lists can be merged into a new list. The merging of two adjacent intermediate lists may include: (1) determining two memory offsets, called source offsets, in the buffer pointed to by the first pointer, wherein the two source offsets respectively point to the first unmerged element of each of the two adjacent intermediate lists; (2) determining a memory offset, called the target offset, in the buffer pointed to by the second pointer; (3) extracting the column indices of the two unmerged elements at the two source offsets; (4) in response to the two unmerged elements having the same column index, aggregating their values to obtain an aggregated value, storing the aggregated value in the merged list at the target offset in the buffer pointed to by the second pointer, incrementing both source offsets by 1, and incrementing the target offset by 1; (5) in response to the two unmerged elements having different column indices, storing the unmerged element with the smaller column index in the merged list at the target offset in the buffer pointed to by the second pointer, incrementing the source offset of that element by 1, and incrementing the target offset by 1; and (6) repeating steps (1)-(5) until the two intermediate lists are merged into the merged list in the buffer pointed to by the second pointer. That is, the elements of the two adjacent intermediate lists stored in the first buffer are merged, with duplicate entries aggregated, and copied in order to the other buffer. Note that in these steps the aggregated value is stored in the merged list together with its index information. The index information may include column indices used to indicate the position in the output matrix. In some embodiments, an additional buffer may be created to track the memory offset of the next unmerged element in each list to be merged (e.g., an intermediate list). The additional buffer may include a single buffer or another set of ping-pong buffers. For more details, refer to, for example, Figure 4. A round of merging is complete when all intermediate lists in the buffer pointed to by the first pointer have been merged two at a time into the buffer pointed to by the second pointer.
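The pairwise merge in steps (1)-(6) can be sketched as a two-pointer merge. This is an illustrative sketch, not the disclosed implementation: the function name `merge_two`, the `(column, value)` tuple layout, and the explicit end offsets `e1` and `e2` are assumptions. The tail-copy loops handle the case where one list is exhausted before the other, which the steps above leave implicit.

```python
def merge_two(src, s1, e1, s2, e2, dst, t):
    """Merge src[s1:e1] and src[s2:e2] into dst starting at offset t.

    Both input lists hold (column, value) pairs sorted by column.
    Returns the target offset just past the merged list.
    """
    while s1 < e1 and s2 < e2:
        c1, v1 = src[s1]
        c2, v2 = src[s2]
        if c1 == c2:
            # Same column index: aggregate and advance both source offsets.
            dst[t] = (c1, v1 + v2)
            s1 += 1
            s2 += 1
        elif c1 < c2:
            # Different column indices: emit the smaller one.
            dst[t] = src[s1]
            s1 += 1
        else:
            dst[t] = src[s2]
            s2 += 1
        t += 1
    # One list is exhausted: copy the remainder of the other (sketch assumption).
    while s1 < e1:
        dst[t] = src[s1]
        s1 += 1
        t += 1
    while s2 < e2:
        dst[t] = src[s2]
        s2 += 1
        t += 1
    return t


# Two adjacent lists in the source buffer: src[0:2] and src[2:4].
src = [(0, 10.0), (2, 2.0), (2, 12.0), (5, 1.0)]
dst = [None] * len(src)
end = merge_two(src, 0, 2, 2, 4, dst, 0)
print(dst[:end])  # [(0, 10.0), (2, 14.0), (5, 1.0)]
```

The merged list is never longer than the two inputs combined, so a target buffer of the same size as the source buffer is always sufficient.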
[0054] After one round of merging, the multiple intermediate lists stored in the first buffer have been merged into a smaller number of merged lists in the other buffer. For example, the smaller number could be half the number of intermediate lists. Multiple rounds of merging may be performed until only one merged list remains (e.g., the partial products of all intermediate lists have been merged).
[0055] To reuse ping-pong buffers in multiple merge rounds without consuming new memory resources, pointers to the ping-pong buffers can be swapped. This pointer swapping switches the roles of the two buffers without moving data. The buffer that stored the merge list in the previous merge round now becomes the buffer storing the list to be merged, while the buffer that previously stored the original intermediate list now becomes the target buffer receiving the new merge list; the original data in the target buffer is safely discarded. This swapping step allows for efficient reuse of algorithmic logic and corresponding instructions without costly data migration. This iterative merging process may continue until a final merge list is generated.
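The iterative merging with pointer swapping can be sketched end to end as follows. This is a minimal sketch under several assumptions: Python lists stand in for cache-resident buffers, the names `accumulate_row`, `ping`, `pong`, and `merge_two` are hypothetical, and an unpaired last list in a round is simply copied through to the target buffer. The variable swap at the end of each round plays the role of swapping the first and second pointers.

```python
def merge_two(src, s1, e1, s2, e2, dst, t):
    """Two-pointer merge of src[s1:e1] and src[s2:e2] into dst from offset t."""
    while s1 < e1 or s2 < e2:
        if s2 >= e2 or (s1 < e1 and src[s1][0] < src[s2][0]):
            dst[t] = src[s1]
            s1 += 1
        elif s1 >= e1 or src[s2][0] < src[s1][0]:
            dst[t] = src[s2]
            s2 += 1
        else:  # equal column indices: aggregate the two values
            dst[t] = (src[s1][0], src[s1][1] + src[s2][1])
            s1 += 1
            s2 += 1
        t += 1
    return t


def accumulate_row(lists):
    """Merge sorted (column, value) lists into one list using two buffers."""
    size = sum(len(lst) for lst in lists)
    ping, pong = [None] * size, [None] * size   # first / second pointer
    offsets = [0]
    for lst in lists:                           # store lists back to back
        ping[offsets[-1]:offsets[-1] + len(lst)] = lst
        offsets.append(offsets[-1] + len(lst))
    while len(offsets) - 1 > 1:                 # exit when one list remains
        n = len(offsets) - 1
        new_offsets, t = [0], 0
        for j in range(0, n - 1, 2):            # merge adjacent pairs
            t = merge_two(ping, offsets[j], offsets[j + 1],
                          offsets[j + 1], offsets[j + 2], pong, t)
            new_offsets.append(t)
        if n % 2 == 1:                          # unpaired list: copy through
            for p in range(offsets[n - 1], offsets[n]):
                pong[t] = ping[p]
                t += 1
            new_offsets.append(t)
        ping, pong = pong, ping                 # swap roles, no data movement
        offsets = new_offsets
    return ping[:offsets[-1]]


row = accumulate_row([[(0, 10.0), (2, 2.0)], [(2, 12.0)], [(1, 1.0), (2, 1.0)]])
print(row)  # [(0, 10.0), (1, 1.0), (2, 15.0)]
```

With three input lists the loop runs two rounds (3 lists, then 2, then 1), and only the two buffers allocated up front are ever used.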
[0056] In some embodiments, output module 235 can migrate the final merged list from cache 221 of processing circuitry 220 to memory pool 210 (e.g., one or more non-transitory memories), where it serves as a row in the output matrix of SpGEMM. Cache 221 refers to an on-chip cache that is closer to processor 222 and has a faster data access rate than memory pool 210. Even though the input matrices are typically too large to fit into on-chip cache 221, the partial products generated from a few rows of the input matrices can be stored in cache 221. Thus, the most computationally intensive step of SpGEMM, merging partial products into final products, can be performed within the cache, without external memory access. Therefore, the overall efficiency of SpGEMM is significantly improved. Furthermore, the memory footprint for merging partial products is minimized due to the high reusability of the ping-pong buffers.
[0057] Figure 2 illustrates an exemplary row-based method for performing SpGEMM with efficient memory allocation according to some embodiments. In some embodiments, the SpGEMM accumulator circuit 230 in Figure 1 relies on row-based matrix multiplication.
[0058] As shown in Figure 2, SpGEMM involves two sparse input matrices 250 and 260. These sparse matrices can be stored in a data repository 270 (e.g., a database or cloud server) in a compact storage format (e.g., COO, CSR, or bitmap). Such a format stores only the non-zero elements of the sparse matrices (and their index information), which saves storage space and facilitates access to the non-zero elements. In some practical applications, row-based matrix multiplication is used, and the non-zero elements generated as part of SpGEMM can include multiple duplicated data entries. The term "duplicated data entries" refers to multiple values corresponding to the same row and column indices (also called a row-column index pair) in the output matrix; these values need to be aggregated to generate a single output value.
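As an illustration of a compact storage format, the following minimal sketch converts a small dense matrix into CSR arrays. The function name `to_csr` and the array names `row_ptr`, `col_idx`, and `vals` are illustrative assumptions and do not appear in the disclosure.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays.

    row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i;
    col_idx and vals hold their column indices and values.
    """
    row_ptr, col_idx, vals = [0], [], []
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:          # zero-valued entries are not stored
                col_idx.append(col)
                vals.append(value)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


# Hypothetical 3x4 matrix with three non-zero elements.
A = [[0, 2.0, 0, 3.0],
     [0, 0, 0, 0],
     [1.0, 0, 0, 0]]
row_ptr, col_idx, vals = to_csr(A)
```

Because each row is scanned left to right, the column indices within a row come out sorted, which is the property the merge step relies on.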
[0059] In some embodiments, as non-zero elements are read from data repository 270, an iterative process of row-based matrix multiplication can be performed. For example, in Figure 2, the first row (row = 0) of the first sparse matrix A 250 can include multiple non-zero elements (e.g., $A_{01}$ and $A_{03}$ at columns 1 and 3, respectively). For each of these non-zero elements, the corresponding row in the second sparse matrix B 260 can be obtained, where the corresponding row has a row index equal to the column index of that non-zero element. For example, $A_{01}$ corresponds to row $B_{1*}$ of matrix B. Then, the non-zero element in matrix A can be multiplied by all the non-zero elements in the corresponding row of matrix B. For each multiplication between an element at (row = i, column = j) in matrix A and an element at (row = j, column = k) in matrix B, the resulting output value can be placed at position (row = i, column = k) in the output matrix 240. This principle can be expressed as $A_{ij} \times B_{jk} = C_{ik}$. After all non-zero elements in the first row of matrix A have been processed, the next row of matrix A can be processed. In some embodiments, because rows of matrix A can be processed independently in row-based matrix multiplication, parallel processing techniques can be applied to improve computational efficiency. In some embodiments, when multiple processes write multiplication results to the same cell in output matrix 240, a locking mechanism can be used to serialize these write operations.
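The row-based multiplication described above can be sketched as follows. This is a minimal illustration, assuming each sparse row is held as a list of (column, value) pairs sorted by column; the function name `row_partial_products` and the sample values are hypothetical.

```python
def row_partial_products(a_row, B):
    """Multiply one sparse row of A by the matching rows of B.

    a_row: list of (column, value) pairs for the non-zeros of one row of A.
    B:     dict mapping a row index to that row's (column, value) pairs.
    Returns one intermediate list per non-zero element of a_row.
    """
    lists = []
    for j, a_val in a_row:
        b_row = B.get(j, [])  # row of B whose row index equals column j
        lists.append([(k, a_val * b_val) for k, b_val in b_row])
    return lists


# Row 0 of A has non-zeros at columns 1 and 3, as in the example above.
a_row0 = [(1, 2.0), (3, 3.0)]
B = {1: [(0, 1.0), (2, 4.0)], 3: [(2, 5.0), (3, 6.0)]}
partials = row_partial_products(a_row0, B)
```

Here both intermediate lists contain an entry for column 2. These are the duplicated data entries that must later be aggregated into a single output value.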
[0060] From the above description, the difference between standard row-column matrix multiplication and row-based matrix multiplication becomes clear. In standard row-column matrix multiplication, rows of the first matrix are multiplied by corresponding columns of the second matrix, and the results are summed to generate an output value in the output matrix. That is, in standard row-column matrix multiplication, each multiplication operation in one execution cycle generates a final output value for the output matrix. In contrast, the row-based method involves multiplying the first row of the first matrix by multiple corresponding second rows of the second matrix over multiple execution cycles. Each multiplication operation between the first row and each second row can generate one or more partial output values for a row of the output matrix. These partial output values generated from multiple multiplication operations over multiple execution cycles can be aggregated to generate the final output value for that row of the output matrix.
[0061] Figure 3 shows a block diagram of an exemplary multi-core processor 300 with distributed accumulation for SpGEMM with efficient memory utilization according to some embodiments. The multi-core processor 300 may be, for example, a multi-core CPU or a GPU with multiple streaming multiprocessors (SMs). Depending on the architecture of the multi-core processor 300, cache 305 may be an on-chip cache shared by multiple cores or a non-shared cache per core.
[0062] As shown in Figure 3, matrices A 310 and B 320 involved in SpGEMM are to be multiplied to generate an output matrix. Matrices A 310 and B 320 can be loaded into DRAM 330 from external memory. To improve the efficiency of SpGEMM, the computation can be distributed across cores 302 by leveraging the fact that output rows are generated independently in row-wise SpGEMM. For example, in row-wise matrix multiplication, each row of matrix A 310 can be multiplied with the corresponding rows of matrix B 320 to generate one output row of the output matrix. Therefore, rows of matrix A 310 can be distributed among multiple cores 302 for parallel processing. Two cores may need to read the same row of matrix B 320 to perform their respective computations. However, these read operations do not update matrix B 320 or cause data conflicts, and therefore can be performed simultaneously (e.g., without locking or synchronization, and without raising cache coherence issues).
[0063] Figure 4 shows an exemplary memory layout and workflow for performing SpGEMM accumulation with efficient memory utilization according to some embodiments. The memory layout and workflow described in Figure 4 are for illustrative purposes. As shown in Figure 4, the memory layout includes a set of ping-pong buffers 410 and offset buffers 420. Ping-pong buffers 410 refer to the main memory space used for merging intermediate lists (where intermediate products are stored). Offset buffers 420 refer to additional buffers used for storing pointers or memory offsets that point to the next element to be merged in each intermediate list in ping-pong buffers 410.
[0064] In describing the buffer design of Figure 4, it is assumed that there are six intermediate lists to be merged, denoted "List 1" to "List 6". That is, in SpGEMM between two input matrices, a first row in the first input matrix contains six non-zero data points, and each non-zero data point corresponds to a second row in the second input matrix. Here, "corresponds" can be defined as: for each non-zero data point in the first row, the row index of the corresponding second row is the same as the column index of that non-zero data point. The product of each non-zero data point with the corresponding second row can generate an intermediate list. Therefore, for the six non-zero data points in the first row, six intermediate lists can be generated. These six intermediate lists (called partial products) contain intermediate products that contribute to the same output row in the output matrix of SpGEMM. Therefore, it is necessary to merge the six intermediate lists into a final list by aggregating/accumulating these partial products. In some embodiments, a processing unit (e.g., a core in a multi-core CPU) processes the entire first row and can process all the non-zero data points therein sequentially, thereby generating the intermediate lists sequentially.
[0065] In some embodiments, the processor may determine the buffer size of the ping-pong buffers 410 for allocation before initiating the process of merging intermediate lists. In some embodiments, the ping-pong buffers 410 may include a source buffer and a target buffer, wherein the source buffer stores the intermediate lists to be merged in the current round of merging, and the target buffer stores the merged lists generated during the current round. In some embodiments, the source buffer and the target buffer may be allocated the same memory size. Since the source buffer may be initialized (for the first round of merging) to store the intermediate lists generated directly from the multiplications between the non-zero data in the first row and the corresponding rows (i.e., all intermediate products), the memory size of the two ping-pong buffers may be determined by the maximum (upper-limit) FLOP count for computing any result row. The FLOP count can be determined by performing symbolic computation based on the index information of the non-zero data in the first matrix and the row sizes of the rows in the second matrix. Here, the "row size" of a row refers to the number of non-zero data in that row. Further details on determining the buffer size are described with reference to Figure 5.
[0066] Referring again to Figure 4, during the first round of merging, every two consecutive intermediate lists in the source buffer can be merged into a single list and stored in the target buffer. For example, List 1 and List 2 can be merged into List 12. Two consecutive lists can be merged by traversing them with two running entries (e.g., two pointers). When the two running entries have the same column index, their values are added together and the sum is copied to the target buffer; when the column indices differ, the entry with the smaller column index is copied directly to the target buffer. It should be noted that this merging method requires the initial intermediate lists to be sorted by column index, which is naturally guaranteed when the input matrix is stored in standard CSR format. The merged list generated from two sorted lists is itself sorted.
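The two-pointer merge of two sorted lists can be sketched as follows. This is a minimal illustration, assuming each list holds (column, value) pairs sorted by column index; the function name `merge_two` is hypothetical, and for brevity the output is an ordinary Python list instead of a region of a pre-allocated target buffer.

```python
def merge_two(left, right):
    """Merge two column-sorted (column, value) lists, summing equal columns."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        l_col, l_val = left[i]
        r_col, r_val = right[j]
        if l_col == r_col:
            # Same column index: aggregate the two partial products.
            out.append((l_col, l_val + r_val))
            i += 1
            j += 1
        elif l_col < r_col:
            # Different column indices: copy the entry with the smaller index.
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # At most one of the lists still has entries; copy them unchanged.
    out.extend(left[i:])
    out.extend(right[j:])
    return out


list_12 = merge_two([(0, 2.0), (2, 8.0)], [(2, 15.0), (3, 18.0)])
```

The merged list stays sorted by column index, so it can be merged again in a later round without any extra sorting step.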
[0067] In some embodiments, the source buffer and the target buffer are reused in different rounds of merging. In some embodiments, the source buffer in the previous round of merging becomes the target buffer in the next round of merging, and the target buffer in the previous round of merging becomes the source buffer in the next round of merging. This can be achieved by swapping pointers to the two buffers, where the pointers are used as references to the buffers (the processor uses the pointers to look up the corresponding buffer). This pointer swapping method effectively avoids data migration between the ping-pong buffers 410, thereby improving merging efficiency. The ping-pong buffers 410 in Figure 4 are used for three rounds of merging, and the six lists are eventually merged into one list, namely List 123456. Note that the different buffers shown for the different rounds of merging in Figure 4 (the four buffers shown in 410) are only a logical view; they are implemented using only two physical ping-pong buffers by swapping pointers.
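The multi-round merging with pointer swapping can be sketched as follows. This is a minimal illustration under stated assumptions: the two buffers are fixed-size Python lists, list boundaries are tracked in a plain `bounds` list (a stand-in for the offset buffers), and the names `merge_rounds`, `src`, and `dst` are hypothetical.

```python
def merge_rounds(lists):
    """Merge column-sorted (column, value) lists using two fixed buffers."""
    capacity = sum(len(lst) for lst in lists)
    src = [None] * capacity          # buffer holding the lists to be merged
    dst = [None] * capacity          # buffer receiving the merged lists
    bounds = [0]                     # bounds[p]..bounds[p+1] delimits list p
    for lst in lists:
        src[bounds[-1]:bounds[-1] + len(lst)] = lst
        bounds.append(bounds[-1] + len(lst))

    rounds = 0
    while len(bounds) - 1 > 1:
        n = len(bounds) - 1
        new_bounds, w = [0], 0
        for p in range(0, n, 2):
            i, i_end = bounds[p], bounds[p + 1]
            # An unpaired last list is copied through unchanged.
            j, j_end = (i_end, bounds[p + 2]) if p + 1 < n else (i_end, i_end)
            while i < i_end and j < j_end:
                if src[i][0] == src[j][0]:
                    dst[w] = (src[i][0], src[i][1] + src[j][1])
                    i += 1
                    j += 1
                elif src[i][0] < src[j][0]:
                    dst[w] = src[i]
                    i += 1
                else:
                    dst[w] = src[j]
                    j += 1
                w += 1
            while i < i_end:
                dst[w] = src[i]
                i += 1
                w += 1
            while j < j_end:
                dst[w] = src[j]
                j += 1
                w += 1
            new_bounds.append(w)
        # Swap the roles of the two buffers; no data is copied.
        src, dst = dst, src
        bounds = new_bounds
        rounds += 1
    return src[:bounds[-1]], rounds


six_lists = [[(0, 1.0)], [(0, 1.0), (1, 1.0)], [(2, 1.0)],
             [(1, 1.0)], [(3, 1.0)], [(0, 1.0)]]
final_row, num_rounds = merge_rounds(six_lists)
```

With six input lists the loop runs three rounds (6 to 3 to 2 to 1 lists), matching the example of Figure 4, and only the two buffers allocated up front are ever used.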
[0068] At the start of each merge round, the processor can locate the starting point of each list to be merged in the source buffer and iterate through the lists sequentially to merge them. This starting point can be represented as a memory offset. In some embodiments, two offset buffers 420 can be created to track the memory offsets of the lists to be merged in the source buffer and of the merged lists in the target buffer. Different options exist for implementing offset buffers 420. As shown in Figure 4, Option 1 includes creating another set of offset ping-pong buffers 422 that track the offsets in the primary set of ping-pong buffers 410, respectively. The size of each buffer in the offset ping-pong buffers 422 can be determined as the maximum number of lists to be merged, i.e., the maximum row size in matrix A. The "row size" of a row refers to the number of non-zero data in that row.
[0069] In some embodiments, as shown in Option 2 of Figure 4, the memory footprint of offset buffers 420 can be further reduced. Instead of creating a pair of offset buffers 422, only one buffer 424 is created to track the offsets of the lists in both the source and target buffers of the ping-pong buffers 410. At the start of a merge round, this single buffer 424 can include multiple slots, each storing the offset of one intermediate list to be merged. The processor reads the first two slots from the single buffer 424 to determine the memory offsets of the first two intermediate lists in the source buffer of the ping-pong buffers 410. These two intermediate lists can then be merged into a merged list in the target buffer. The offset of the merged list in the target buffer can then be stored in the first slot of the single buffer 424. Similarly, after the third and fourth intermediate lists in the source buffer are merged into a second merged list, the memory offset of the second merged list can be stored in the second slot of the single buffer 424. In this way, the single buffer 424 is updated in a rolling manner: after the processor reads two slots, these two slots can be freed and reused to store the offsets of newly generated lists. Since the merging process always reads more lists than it generates (e.g., it reads two lists and generates one), it is guaranteed that the slots used to store the memory offsets of newly generated lists have already been read and released. Therefore, this rolling update of the single buffer 424 can effectively reuse memory slots and further improve storage efficiency.
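The rolling update of a single offset buffer can be sketched for one merge round as follows. This is a minimal illustration with assumptions that are not in the disclosure: the offset buffer carries one extra end-sentinel slot so that the end of each list is known, and the names `merge_round`, `offsets`, and `n_lists` are hypothetical.

```python
def merge_round(src, dst, offsets, n_lists):
    """Run one merge round, updating `offsets` in place.

    offsets[0..n_lists-1] are list start offsets in src and
    offsets[n_lists] is an end sentinel (an assumption of this sketch).
    Returns the number of merged lists now described by `offsets`.
    """
    write = 0
    out_count = 0
    for p in range(0, n_lists, 2):
        # Read the slots of this pair before any of them is overwritten.
        i, i_end = offsets[p], offsets[p + 1]
        j, j_end = (i_end, offsets[p + 2]) if p + 1 < n_lists else (i_end, i_end)
        start = write
        while i < i_end or j < j_end:
            if j >= j_end or (i < i_end and src[i][0] < src[j][0]):
                dst[write] = src[i]
                i += 1
            elif i >= i_end or src[j][0] < src[i][0]:
                dst[write] = src[j]
                j += 1
            else:  # equal column indices: aggregate the two values
                dst[write] = (src[i][0], src[i][1] + src[j][1])
                i += 1
                j += 1
            write += 1
        # Slot out_count is never beyond p, so it has already been read.
        offsets[out_count] = start
        out_count += 1
    offsets[out_count] = write  # new end sentinel
    return out_count


# Three lists stored contiguously: [(0,1),(2,1)], [(2,1)], [(1,1),(5,1)].
src = [(0, 1.0), (2, 1.0), (2, 1.0), (1, 1.0), (5, 1.0)]
dst = [None] * len(src)
offsets = [0, 2, 3, 5]
remaining = merge_round(src, dst, offsets, 3)
```

Each write lands in a slot at or before the pair just read, which is why a single buffer suffices: the merged lists never outnumber the lists already consumed.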
[0070] Figure 5 illustrates an exemplary method 500 for performing memory-efficient accumulation for SpGEMM according to some embodiments. Method 500 can be executed by a processor (a CPU, a core of a multi-core CPU, a GPU, or a processor within a GPU) to perform SpGEMM computations between matrices A and B. According to method 500, the processor may receive input including index information of matrices A and B, perform symbolic computation based on the index information of matrices A and B to determine the size of one or more buffers, allocate ping-pong buffers and offset buffers based on the determined buffer sizes, read floating-point numbers from matrices A and B (in the rows allocated to the processor for processing), perform floating-point computations to generate intermediate product lists, store the lists in the ping-pong buffer, iteratively merge these lists using the ping-pong buffers with the assistance of the offset buffer, and finally generate the final output row of the SpGEMM.
[0071] As shown in Figure 5, method 500 includes two phases: a multiplication phase and a merging phase. In the multiplication phase, row $a_{i*}$ of matrix A is multiplied by the corresponding rows of B to obtain multiple intermediate lists. These intermediate lists are stored contiguously in one ping-pong buffer. In the merging phase, the $\mathrm{nnz}(a_{i*})$ intermediate lists are merged in a manner similar to merge sort: every two consecutive lists are merged into one list, and the merged list is stored in the other ping-pong buffer using the recorded offsets.
[0072] In addition to the multiplication and merging phases, the processor can also determine the buffer size of the ping-pong buffers. In some embodiments, the buffer size can be determined by performing symbolic computation based on the indices of non-zero data in matrix A and the row sizes of matrix B. Symbolic computation estimates the maximum number of floating-point multiplication operations (FLOPs) in the hypothetical multiplication between each row of matrix A and the corresponding rows of matrix B. Symbolic computation avoids reading and computing actual floating-point numbers, thus saving computational resources such as processing cycles. In some embodiments, symbolic computation may include: for each of a plurality of first rows in matrix A containing one or more non-zero elements, identifying one or more corresponding second rows in matrix B; determining the number of floating-point multiplication operations for each non-zero element in the first row as the number of non-zero elements in the corresponding second row of matrix B; summing these numbers over all non-zero elements in the first row to obtain the FLOP count required for the corresponding row of the result matrix; and using the maximum FLOP count over the plurality of result rows as the buffer size.
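The symbolic computation can be sketched as follows. This is a minimal illustration, assuming both matrices are available as CSR index arrays; the function name `symbolic_buffer_size` and the parameter names are hypothetical. Only index arrays are read, never floating-point values.

```python
def symbolic_buffer_size(a_row_ptr, a_col_idx, b_row_ptr):
    """Return the maximum per-row FLOP count of A x B from index data only."""
    max_flop = 0
    for i in range(len(a_row_ptr) - 1):
        flop = 0
        for p in range(a_row_ptr[i], a_row_ptr[i + 1]):
            j = a_col_idx[p]                          # column of the non-zero in A
            flop += b_row_ptr[j + 1] - b_row_ptr[j]   # row size of row j of B
        max_flop = max(max_flop, flop)
    return max_flop


# A: row 0 has non-zeros at columns 1 and 3; row 1 is empty; row 2 at column 0.
a_row_ptr = [0, 2, 2, 3]
a_col_idx = [1, 3, 0]
# B: four rows with 1, 2, 0, and 3 non-zeros, respectively.
b_row_ptr = [0, 1, 3, 3, 6]
buffer_size = symbolic_buffer_size(a_row_ptr, a_col_idx, b_row_ptr)
```

In this example, row 0 of A needs 2 + 3 = 5 multiplications and row 2 needs 1, so each ping-pong buffer would be sized for five entries.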
[0073] The preceding description illustrates the technical advantages of an efficient memory-utilizing accumulation method using ping-pong buffers. The following sections further quantify these advantages by comparing the described method with several exemplary existing solutions from the perspectives of storage and computational efficiency.
[0074] Existing work on solving intermediate multiplication accumulation in SpGEMM includes heap-based SpGEMM and hash-based SpGEMM.
[0075] Heap-based SpGEMM uses a heap data structure of length k (k = $\mathrm{nnz}(a_{i*})$) to hold k running entries from the k intermediate row lists. It then merges the k partial row lists into a result row that is naturally ordered. The main memory accesses and computation of heap-based SpGEMM are uniformly interleaved, which is unfriendly to cache line maintenance and to active row buffer maintenance in main memory. Furthermore, heap-based SpGEMM has high computational complexity and a high compression ratio. The computational complexity ($C_{heap}$) and memory overhead ($M_{heap}$) of heap-based SpGEMM are shown below, where i = 1 to M indexes the M rows of matrix A:
[0076]
[0077]
[0078] Hash-based SpGEMM uses a hash data structure of length $\mathrm{flop}(c_{i*})$ and inserts each partial result into the hash table with a time complexity of O(1). It then scans the hash table to extract the valid non-zero elements and sorts the result row in ascending order. Hash-based SpGEMM suffers from inefficient cache access patterns. This inefficiency stems from the enormous cache capacity pressure and the random cache access pattern during hash table insertion. To compute each row, a hash table of size $\mathrm{flop}(c_{i*})$ must be initialized to -1; this size is the maximum access footprint per row. During hash table insertion, the hash table is accessed uniformly at random. This may waste cache capacity (e.g., only 8 bytes of double-precision data may be needed, but the entire 64-byte cache line must be accessed). After insertion, the entire hash table must be scanned again, and the $\mathrm{nnz}(c_{i*})$ non-zero elements must be extracted and sorted in another cache location, which causes even greater cache capacity pressure ($\mathrm{flop}(c_{i*}) + \mathrm{nnz}(c_{i*})$). This is far greater than the cache capacity pressure of heap-based SpGEMM and of the memory-efficient accumulation using ping-pong buffers described above. The computational complexity ($C_{hash}$) and memory overhead ($M_{hash}$) of hash-based SpGEMM are shown below:
[0079]
[0080]
[0081] Here, α is a factor slightly greater than 1 due to possible hash table collisions, and β1 is a factor less than 1 because such operations are expected (but not guaranteed) to be performed in the cache rather than in main memory.
[0082] In the memory-efficient accumulation method 500 using ping-pong buffers, merging the i-th row requires $\lceil \log_2(\mathrm{nnz}(a_{i*})) \rceil$ rounds of merging. During the first round, the number of comparison operations is the same as the number of floating-point multiplication operations ($\mathrm{flop}(c_{i*})$). In the following rounds, the number of comparisons decreases because of repeated column indices. In the final round, the number of comparisons equals the number of non-zero elements in the result row ($\mathrm{nnz}(c_{i*})$). Therefore, the computational complexity of method 500 can be given as follows:
[0083]
[0084] In terms of memory complexity, the proposed ping-pong buffered row merging method performs multi-round merging operations only between the two ping-pong buffers. For each row, it requires at most $\mathrm{flop}(c_{i*})$ memory space, and only in the first multiplication phase; the requirement decreases as the merge rounds are executed. Because the required space ($\mathrm{flop}(c_{i*})$) is typically smaller than the L1/L2 cache, cache hit rates are very high during the computationally intensive accumulation phase. Furthermore, modern CPU architectures employ write-back cache policies and LRU cache replacement policies. Therefore, frequent data accesses to the two ping-pong buffers may not even trigger main memory requests. Main memory accesses occur only in the final round, when the final result row is copied from the cached ping-pong buffer to an intermediate matrix (e.g., stored in system memory). It should also be noted that both the main memory accesses and the cache accesses follow streaming patterns, which are the most efficient access patterns. Therefore, the proposed method achieves optimal cache reuse and minimal main memory accesses. The memory complexity of method 500 can be summarized as follows:
[0085]
[0086] Here, β2 is a factor less than 1 because the corresponding operations are expected to be executed in the cache, and β2 is also less than β1 due to higher cache efficiency (i.e., cache reuse). Method 500 has significantly better computational complexity than heap-based methods, with roughly the same memory overhead (memory complexity). Compared to hash-based methods, the proposed method performs better in terms of memory complexity while maintaining comparable computational complexity.
[0087] Figure 6 shows an exemplary accumulation method 600 for SpGEMM that efficiently utilizes memory according to some embodiments. Method 600 can be implemented in the environment shown in Figure 1. Method 600 can be performed by a device, apparatus, or system shown in Figures 1-5, for example, the processor 300 in Figure 3. Depending on the implementation, method 600 may include additional, fewer, or alternative steps executed in various orders or in parallel.
[0088] Referring to the details of method 600, step 610 includes obtaining a first sparse matrix and a second sparse matrix for performing SpGEMM via a processor associated with a cache. In some embodiments, the first sparse matrix and the second sparse matrix are stored in a compact data format, wherein the compact data format excludes zero-value data in the first sparse matrix and the second sparse matrix.
[0089] Step 620 includes allocating a pair of buffers from the cache by the processor, with a first pointer and a second pointer pointing to the pair of buffers respectively. For example, the pair of buffers includes a first buffer and a second buffer, with the first pointer pointing to the first buffer and the second pointer pointing to the second buffer.
[0090] Step 630 includes, for each first row in the first sparse matrix that includes a plurality of non-zero elements, identifying by the processor a plurality of second rows in the second sparse matrix corresponding to the plurality of non-zero elements.
[0091] Step 640 includes obtaining multiple intermediate lists by the processor, the multiple intermediate lists being calculated based on each of the multiple non-zero elements in the first row and a second row in the multiple second rows corresponding to that non-zero element.
[0092] Step 650 includes storing multiple intermediate lists into a buffer pointed to by a first pointer via a processor.
[0093] Step 660 includes executing an iterative process via the processor, the iterative process comprising: merging the multiple intermediate lists in the buffer pointed to by the first pointer into a smaller number of intermediate lists; storing the smaller number of intermediate lists into the buffer pointed to by the second pointer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists includes a final merged list. In some embodiments, merging the multiple intermediate lists in the buffer pointed to by the first pointer into a smaller number of intermediate lists and storing the smaller number of intermediate lists in the buffer pointed to by the second pointer includes, for two adjacent intermediate lists of the multiple intermediate lists: (1) determining two memory offsets in the buffer pointed to by the first pointer, the two memory offsets pointing to the two intermediate lists, respectively; (2) determining a target memory offset in the buffer pointed to by the second pointer; (3) extracting the column indices of the two unmerged elements at the two memory offsets in the buffer pointed to by the first pointer; (4) in response to the two unmerged elements having the same column index, aggregating the values of the two unmerged elements to obtain an aggregated value, and storing the aggregated value in a merged list starting at the target memory offset in the buffer pointed to by the second pointer; (5) in response to the two unmerged elements having different column indices, storing the one of the two unmerged elements that has the smaller column index in the merged list starting at the target memory offset in the buffer pointed to by the second pointer; and repeating steps (1)-(5) until the two intermediate lists are merged into one merged list in the buffer pointed to by the second pointer.
[0094] Step 670 includes migrating a final merged list, which is a row of the output matrix of SpGEMM, from the cache to system memory via the processor.
[0095] In some embodiments, method 600 may further include allocating an offset buffer comprising a plurality of offsets respectively corresponding to the plurality of intermediate lists, wherein each offset points to the first unmerged element in the respective intermediate list. In some embodiments, in response to merging the plurality of intermediate lists into a smaller number of intermediate lists, the offset buffer may be updated such that each offset points to the first unmerged element in one of the smaller number of intermediate lists. In some embodiments, the offset buffer comprises a pair of index lists respectively corresponding to the pair of buffers. In some embodiments, method 600 may further include determining the buffer size of the offset buffer based on the maximum number of non-zero data in any row of the first sparse matrix.
[0096] In some embodiments, method 600 may further include determining a buffer size for a pair of buffers by performing symbolic computation based on the indices of non-zero elements in a first sparse matrix and the row size of a second sparse matrix, wherein the symbolic computation estimates the maximum number of floating-point multiplication operations (FLOPs) in hypothetical multiplications between each of the plurality of first rows and the second sparse matrix, wherein allocating a pair of buffers includes: allocating each of the pair of buffers using the buffer size. In some embodiments, the symbolic computation includes: for each of the plurality of first rows containing one or more non-zero elements, identifying one or more corresponding second rows from the second sparse matrix; determining a FLOP value for each first row based on the number of non-zero elements in the corresponding one or more second rows of the second sparse matrix; and determining the maximum FLOP value of the plurality of first rows as the buffer size.
[0097] In some embodiments, the processor includes a multi-core processor, and method 600 may further include: dividing the rows of the first sparse matrix into multiple groups of first rows according to the multiple cores in the multi-core processor; and allocating the multiple groups of first rows to the multiple cores for parallel processing, wherein each core is allocated a corresponding pair of buffers. The multi-core processor may be a multi-core CPU, or it may be a GPU, in which case the multiple cores are multiple streaming multiprocessors (SMs) of the GPU.
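The distribution of first rows across cores can be sketched as follows. This is a rough illustration under loud assumptions: Python threads stand in for cores and only show the partitioning, not a real speedup; the per-row accumulation uses a plain dictionary as a stand-in and is not the ping-pong merge of method 600; and the names `spgemm_row`, `spgemm_parallel`, and `num_workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def spgemm_row(a_row, B):
    """Compute one output row (dictionary stand-in for the per-core buffers)."""
    acc = {}
    for j, a_val in a_row:
        for k, b_val in B.get(j, []):
            acc[k] = acc.get(k, 0.0) + a_val * b_val
    return sorted(acc.items())


def spgemm_parallel(A_rows, B, num_workers=2):
    """Split the rows of A into one group per worker and process the groups."""
    groups = [list(range(w, len(A_rows), num_workers))
              for w in range(num_workers)]
    C = [None] * len(A_rows)

    def work(group):
        # Each worker writes only its own output rows; B is read-only.
        for i in group:
            C[i] = spgemm_row(A_rows[i], B)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(work, groups))
    return C


A_rows = [[(1, 2.0), (3, 3.0)], [], [(0, 1.0)]]
B = {0: [(0, 1.0)], 1: [(0, 1.0), (2, 4.0)], 3: [(2, 5.0), (3, 6.0)]}
C = spgemm_parallel(A_rows, B)
```

Because every worker owns a disjoint set of output rows and matrix B is only read, no locking is needed between workers in this sketch.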
[0098] Figure 7 shows a block diagram of a hardware device 700 for SpGEMM accumulation with efficient memory utilization according to some embodiments. The components of the hardware device 700 presented below are intended to be illustrative. Depending on the implementation, the hardware device 700 may include additional, fewer, or alternative components.
[0099] Hardware device 700 can be an example implementation of method 600 of Figure 6. Hardware device 700 may include: one or more processors; and one or more non-transitory computer-readable storage media (e.g., one or more memories) coupled to the one or more processors and configured with instructions executable by the one or more processors to cause a system or device (e.g., a processor) to perform the embodiments described above. Hardware device 700 may include various units/modules corresponding to the instructions (e.g., software instructions). Hardware device 700 may be implemented as a GNN accelerator.
[0100] In some embodiments, the hardware device 700 may include an allocation module 710, a calculation module 720, an iterative merging module 730, and an output module 740. These units can be implemented using the hardware devices and electronic circuits shown in Figures 1-5.
[0101] In some embodiments, the allocation module 710 may be configured to determine an upper limit on the memory size for merging multiple intermediate products of SpGEMM and allocate a pair of buffers based on the determined upper limit memory size. The pair of buffers may be pointed to by a first pointer and a second pointer, respectively. In some embodiments, the upper limit on the memory size may be determined by performing symbolic computation based on the indices of non-zero elements in the first sparse matrix and the row dimensions of the second sparse matrix, wherein the symbolic computation estimates the maximum number of floating-point multiplication operations (FLOPs) in hypothetical multiplications between each of the multiple first rows and the second sparse matrix.
[0102] In some embodiments, the calculation module 720 may be configured to, for each first row in a first sparse matrix that includes a plurality of non-zero elements, identify a plurality of second rows in a second sparse matrix that correspond to the plurality of non-zero elements, and obtain a plurality of intermediate lists, which are calculated based on each non-zero element in the plurality of non-zero elements in the first row and a second row in the plurality of second rows that corresponds to the non-zero element.
[0103] In some embodiments, the iterative merging module 730 may be configured to store a plurality of intermediate lists into a buffer pointed to by a first pointer; perform an iterative process, the iterative process including: merging the plurality of intermediate lists in the buffer pointed to by the first pointer into a smaller number of intermediate lists; storing the smaller number of intermediate lists into a buffer pointed to by a second pointer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists includes a final merged list.
[0104] In some embodiments, the output module 740 may be configured to migrate a final merged list as a row of the output matrix of SpGEMM to system memory.
[0105] Each process, method, and algorithm described in the preceding sections can be embodied in a code module executed by one or more computer systems or computer processors including computer hardware, and can be fully or partially automated by that code module. These processes and algorithms can be implemented, partially or entirely, in dedicated circuitry.
[0106] When the functions disclosed herein are implemented as software functional units and sold or used as independent products, they may be stored in a processor-executable, non-volatile, computer-readable storage medium. Specific technical solutions (all or part) disclosed herein, or aspects contributing to the present technology, may be embodied in the form of a software product. The software product includes multiple instructions that may be stored in the storage medium to cause a computing device (which may be a personal computer, server, network device, etc.) to perform all or some steps of the methods of the embodiments of this disclosure. The storage medium may include a flash drive, portable hard disk drive, ROM, RAM, magnetic disk, optical disk, another medium operable for storing program code, or any combination thereof.
[0107] Specific embodiments also provide a system including a processor and a non-transitory computer-readable storage medium storing processor-executable instructions to cause the system to perform operations corresponding to the steps in any of the methods of the above embodiments. Specific embodiments also provide a non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause one or more processors to perform operations corresponding to the steps in any of the methods of the above embodiments.
[0108] The embodiments disclosed herein can be implemented through a cloud platform, server, or group of servers (collectively referred to as the "service system") that interacts with a client. The client can be a terminal device or a client registered by a user on the platform, wherein the terminal device can be a mobile terminal, a personal computer (PC), or any device on which the platform application can be installed.
[0109] The various features and processes described above can be used independently of each other or combined in various ways. All possible combinations and sub-combinations are within the scope of this disclosure. Furthermore, in some embodiments, certain method or process blocks may be omitted. The methods and processes described herein are not limited to any particular order, and the associated blocks or states may be executed in other suitable orders. For example, the described blocks or states may be executed in an order other than the order specified in this disclosure, or multiple blocks or states may be combined in a single block or state. Example blocks or states may be executed serially, in parallel, or in some other manner. Blocks or states may be added to or removed from the example embodiments of this disclosure. The exemplary systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged relative to the example embodiments of this disclosure.
[0110] The various operations of the example methods described herein can be performed at least partially by an algorithm. This algorithm may include program code or instructions stored in memory (e.g., the aforementioned non-transitory computer-readable storage medium). Such an algorithm may include a machine learning algorithm. In some embodiments, the machine learning algorithm may not explicitly program the computer to perform the function, but may learn from training data to build a predictive model for performing that function.
[0111] The various operations of the example methods described herein can be performed at least partially by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors can constitute processor-implemented engines that operate to perform one or more of the operations or functions described herein.
[0112] Similarly, the methods described herein can be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some operations of a method can be performed by one or more processors or processor-implemented engines. Furthermore, one or more processors can also be used to support the performance of related operations in a “cloud computing” environment or as “software as a service” (SaaS). For example, at least some operations can be performed by a set of computers (as an example of machines including processors) that can be accessed via a network (e.g., the Internet) and through one or more appropriate interfaces (e.g., Application Program Interfaces (APIs)).
[0113] The performance of certain operations can be distributed among the processors, not only residing within a single machine but also deployed across multiple machines. In some example embodiments, the processors or processor-implemented engines may reside in a single geographic location (e.g., in a home environment, office environment, or server farm). In other example embodiments, the processors or processor-implemented engines may be distributed across multiple geographic locations.
[0114] Throughout this specification, multiple instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are shown and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order shown. Structures and functions presented as separate components in the example configurations can be implemented as a combined structure or component. Similarly, structures and functions presented as a single component can be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.
[0115] Although an overview of the subject matter has been described with reference to specific example embodiments, various modifications and changes can be made to these embodiments without departing from the broader scope of embodiments of this disclosure. These embodiments of this disclosure may be referred to individually or collectively by the term "this disclosure" merely for convenience, and are not intended to voluntarily limit the scope of this disclosure to any single disclosure or concept, if in fact more than one disclosure or concept is disclosed.
[0116] The embodiments illustrated herein have been described in sufficient detail to enable those skilled in the art to practice the disclosed teachings. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Therefore, the detailed description should not be construed as limiting, and the scope of the various embodiments is defined only by the appended claims and all their equivalents.
[0117] Any process description, element, or block in the flowcharts described herein and / or the accompanying drawings should be understood to potentially represent a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or step in the process. As will be understood by those skilled in the art, alternative implementations are included within the scope of the embodiments described herein, wherein elements or functions may be removed depending on the functionality involved, and elements or functions may be performed in an order different from the order shown or discussed (including substantially simultaneous or reverse order).
[0118] As used herein, “or” is inclusive, not exclusive, unless otherwise expressly indicated or indicated by the context. Therefore, here, “A, B, or C” means “A, B, C, A and B, A and C, B and C, or A, B, and C”, unless otherwise expressly indicated or indicated by the context. Furthermore, “and” is both joint and several, unless otherwise expressly indicated or indicated by the context. Therefore, here, “A and B” means “A and B, jointly or severally”, unless otherwise expressly indicated or indicated by the context. Furthermore, multiple instances may be provided for a resource, operation, or structure described herein as a single instance. Moreover, the boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and specific operations are described within the context of a particular illustrative configuration. Other allocations of functionality are contemplated and may fall within the scope of various embodiments of this disclosure. Generally, structures and functions presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functions presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within the scope of embodiments of this disclosure as represented by the appended claims. Therefore, the specification and drawings are to be regarded as illustrative rather than restrictive.
[0119] The terms “comprising” or “including” are used to indicate the presence of a subsequently stated feature, but do not preclude the addition of other features. Conditional language, such as “can”, “could”, “might”, or “may”, unless explicitly stated otherwise or otherwise understood in the context in which it is used, is generally intended to convey that some embodiments include certain features, elements, and / or steps that are not included in other embodiments. Therefore, such conditional language generally does not imply that features, elements, and / or steps are required in any way by one or more embodiments, or that one or more embodiments must include logic for determining whether such features, elements, and / or steps are included or will be performed in any particular embodiment, with or without user input or prompting.
Claims
1. An accumulation method for sparse matrix-matrix multiplication, comprising: obtaining, by a processor associated with a memory, a first sparse matrix and a second sparse matrix for performing sparse matrix-matrix multiplication; allocating, by the processor, a pair of buffers from the memory, the pair of buffers including a first buffer and a second buffer, with a first pointer pointing to the first buffer and a second pointer pointing to the second buffer; for each first row in the first sparse matrix that includes multiple non-zero elements, identifying, by the processor, multiple second rows in the second sparse matrix that correspond to the multiple non-zero elements; obtaining, by the processor, a plurality of intermediate lists, each calculated based on one non-zero element of the multiple non-zero elements in the first row and the second row, among the multiple second rows, that corresponds to that non-zero element; storing, by the processor, the plurality of intermediate lists into the first buffer; executing, by the processor, an iterative process, the iterative process including: merging the plurality of intermediate lists in the first buffer into a smaller number of intermediate lists; storing the smaller number of intermediate lists in the second buffer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists comprises a final merged list; and migrating, by the processor, the final merged list as a row of the output matrix of the sparse matrix-matrix multiplication from the first buffer to system memory.
2. The accumulation method according to claim 1, further comprising: allocating an offset buffer comprising multiple offsets, each offset corresponding to one of the plurality of intermediate lists, wherein each offset points to the first unmerged element in the corresponding intermediate list.
3. The accumulation method according to claim 2, further comprising: in response to merging the plurality of intermediate lists into the smaller number of intermediate lists, updating the offset buffer such that each offset points to the first unmerged element in one of the smaller number of intermediate lists.
4. The accumulation method according to claim 1, wherein merging the plurality of intermediate lists in the first buffer into a smaller number of intermediate lists and storing the smaller number of intermediate lists in the second buffer includes, for two adjacent intermediate lists of the plurality of intermediate lists: step (1): determining two memory offsets in the first buffer, wherein the two memory offsets respectively point to the two adjacent intermediate lists; step (2): determining a target memory offset in the second buffer; step (3): at the two memory offsets in the first buffer, retrieving the column indices of two unmerged elements; step (4): in response to the two unmerged elements having the same column index, aggregating the values of the two unmerged elements to obtain an aggregated value, and storing the aggregated value in a merged list starting from the target memory offset of the second buffer; step (5): in response to the two unmerged elements having different column indices, storing the one of the two unmerged elements that has the smaller column index into the merged list starting from the target memory offset of the second buffer; and repeating steps (1) through (5) until the two adjacent intermediate lists are merged into the merged list in the second buffer.
5. The accumulation method according to claim 1, wherein the first sparse matrix and the second sparse matrix are stored in a compact data format, wherein the compact data format excludes zero-value data in the first sparse matrix and the second sparse matrix.
6. The accumulation method according to claim 1, further comprising: determining the buffer size of the pair of buffers by performing a symbolic computation based on the indices of non-zero elements in the first sparse matrix and the row size of the second sparse matrix, wherein the symbolic computation estimates the maximum number of floating-point multiplication operations in a hypothetical multiplication between each of the plurality of first rows and the second sparse matrix; wherein allocating the pair of buffers includes allocating each of the pair of buffers using the buffer size.
7. The accumulation method according to claim 6, wherein the symbolic computation includes: for each of the plurality of first rows that includes one or more non-zero elements, identifying one or more corresponding second rows in the second sparse matrix; determining the number of floating-point multiplication operations for each first row based on the number of non-zero elements in the one or more corresponding second rows in the second sparse matrix; and determining the maximum number of floating-point multiplication operations among the plurality of first rows as the buffer size.
8. The accumulation method according to claim 1, wherein the processor includes a multi-core processor, and the accumulation method further includes: dividing the rows of the first sparse matrix into multiple groups of first rows based on the multiple cores in the multi-core processor; and assigning the multiple groups of first rows to the multiple cores for parallel processing, with each core assigned a corresponding pair of buffers.
9. The accumulation method according to claim 8, wherein the multi-core processor includes a multi-core CPU.
10. The accumulation method according to claim 8, wherein the multi-core processor includes a GPU, and the multiple cores include multiple streaming multiprocessors of the GPU.
11. The accumulation method according to claim 2, wherein the offset buffer includes a pair of index lists, each corresponding to one of the pair of buffers.
12. The accumulation method according to claim 2, further comprising: determining the buffer size of the offset buffer based on the maximum number of non-zero data in each row of the first sparse matrix.
13. An accumulation system for sparse matrix-matrix multiplication, comprising: one or more processors; and one or more non-transitory computer-readable memories coupled to the one or more processors and configured with instructions executable by the one or more processors to cause the accumulation system to perform operations including: obtaining a first sparse matrix and a second sparse matrix for performing sparse matrix-matrix multiplication; allocating a pair of buffers in a cache associated with the one or more processors, the pair of buffers including a first buffer and a second buffer, with a first pointer pointing to the first buffer and a second pointer pointing to the second buffer; for each first row in the first sparse matrix that includes multiple non-zero elements, identifying multiple second rows in the second sparse matrix that correspond to the multiple non-zero elements; obtaining a plurality of intermediate lists, each calculated based on one non-zero element of the multiple non-zero elements in the first row and the second row, among the multiple second rows, that corresponds to that non-zero element; storing the plurality of intermediate lists into the first buffer; performing an iterative process, the iterative process including: merging the plurality of intermediate lists in the first buffer into a smaller number of intermediate lists; storing the smaller number of intermediate lists in the second buffer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists comprises a final merged list; and migrating the final merged list from the cache to the one or more non-transitory computer-readable memories as a row of the output matrix of the sparse matrix-matrix multiplication.
14. The accumulation system according to claim 13, wherein the operations further include: determining the buffer size of the pair of buffers by performing a symbolic computation based on the indices of non-zero elements in the first sparse matrix and the row size of the second sparse matrix, wherein the symbolic computation estimates the maximum number of floating-point multiplication operations in a hypothetical multiplication between each of the plurality of first rows and the second sparse matrix; wherein allocating the pair of buffers includes allocating each of the pair of buffers using the buffer size.
15. The accumulation system according to claim 14, wherein the symbolic computation includes: for each of the plurality of first rows that includes one or more non-zero elements, identifying one or more corresponding second rows in the second sparse matrix; determining the number of floating-point multiplication operations for each first row based on the number of non-zero elements in the one or more corresponding second rows in the second sparse matrix; and determining the maximum number of floating-point multiplication operations among the plurality of first rows as the buffer size.
16. The accumulation system according to claim 13, wherein the operations further include: allocating an offset buffer comprising multiple offsets, each offset corresponding to one of the plurality of intermediate lists, wherein each offset points to the first unmerged element in the corresponding intermediate list.
17. The accumulation system according to claim 13, wherein merging the plurality of intermediate lists in the first buffer into a smaller number of intermediate lists and storing the smaller number of intermediate lists in the second buffer includes, for two adjacent intermediate lists of the plurality of intermediate lists: step (1): determining two memory offsets in the first buffer, wherein the two memory offsets respectively point to the two adjacent intermediate lists; step (2): determining a target memory offset in the second buffer; step (3): at the two memory offsets in the first buffer, retrieving the column indices of two unmerged elements; step (4): in response to the two unmerged elements having the same column index, aggregating the values of the two unmerged elements to obtain an aggregated value, and storing the aggregated value in a merged list starting from the target memory offset of the second buffer; step (5): in response to the two unmerged elements having different column indices, storing the one of the two unmerged elements that has the smaller column index into the merged list starting from the target memory offset of the second buffer; and repeating steps (1) through (5) until the two adjacent intermediate lists are merged into the merged list in the second buffer.
18. A non-transitory computer-readable storage medium for sparse matrix-matrix multiplication, the storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations, the operations including: obtaining a first sparse matrix and a second sparse matrix for performing sparse matrix-matrix multiplication; allocating a pair of buffers in a cache associated with the one or more processors, the pair of buffers including a first buffer and a second buffer, with a first pointer pointing to the first buffer and a second pointer pointing to the second buffer; for each first row in the first sparse matrix that includes multiple non-zero elements, identifying multiple second rows in the second sparse matrix that correspond to the multiple non-zero elements; obtaining a plurality of intermediate lists, each calculated based on one non-zero element of the multiple non-zero elements in the first row and the second row, among the multiple second rows, that corresponds to that non-zero element; storing the plurality of intermediate lists into the first buffer; performing an iterative process, the iterative process including: merging the plurality of intermediate lists in the first buffer into a smaller number of intermediate lists; storing the smaller number of intermediate lists in the second buffer; swapping the first pointer and the second pointer; and determining whether an exit condition for exiting the iterative process is met, wherein the exit condition includes whether the smaller number of intermediate lists comprises a final merged list; and migrating the final merged list as a row of the output matrix of the sparse matrix-matrix multiplication to the non-transitory computer-readable storage medium.
19. The non-transitory computer-readable storage medium according to claim 18, wherein the operations further include: determining the buffer size of the pair of buffers by performing a symbolic computation based on the indices of non-zero elements in the first sparse matrix and the row size of the second sparse matrix, wherein the symbolic computation estimates the maximum number of floating-point multiplication operations in a hypothetical multiplication between each of the plurality of first rows and the second sparse matrix; wherein allocating the pair of buffers includes allocating each of the pair of buffers using the buffer size.
20. The non-transitory computer-readable storage medium according to claim 19, wherein the symbolic computation includes: for each of the plurality of first rows that includes one or more non-zero elements, identifying one or more corresponding second rows in the second sparse matrix; determining the number of floating-point multiplication operations for each first row based on the number of non-zero elements in the one or more corresponding second rows in the second sparse matrix; and determining the maximum number of floating-point multiplication operations among the plurality of first rows as the buffer size.
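For illustration only, and without limiting the claims, the symbolic computation recited in claims 6 and 7 (and mirrored in claims 14, 15, 19, and 20) can be sketched as follows. The sketch assumes that both sparse matrices are stored in a compressed sparse row (CSR) layout, which is one possible compact data format. The function name `symbolic_buffer_size` and the parameter names `a_indptr`, `a_indices`, and `b_indptr` are hypothetical.

```python
from typing import Sequence


def symbolic_buffer_size(
    a_indptr: Sequence[int],   # CSR row pointers of the first sparse matrix
    a_indices: Sequence[int],  # CSR column indices of the first sparse matrix
    b_indptr: Sequence[int],   # CSR row pointers of the second sparse matrix
) -> int:
    """Return the largest per-row multiplication count, using indices only (no values)."""
    max_flops = 0
    for i in range(len(a_indptr) - 1):
        flops = 0
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k = a_indices[p]                       # non-zero at column k selects row k of B
            flops += b_indptr[k + 1] - b_indptr[k]  # one multiplication per non-zero in that row
        max_flops = max(max_flops, flops)
    return max_flops


# First matrix: row 0 has non-zeros in columns 0 and 2; row 1 has one in column 1.
# Second matrix: rows 0, 1, and 2 hold 2, 4, and 1 non-zero elements respectively.
size = symbolic_buffer_size([0, 2, 3], [0, 2, 1], [0, 2, 6, 7])
print(size)  # 4
```

Because the count for a row bounds the total number of elements across its intermediate lists before any merging, sizing each buffer to the maximum count allows the same pair of buffers to be reused for every row.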