In-memory compute accelerator and compute method for sparse matrix dense vector multiplication
An in-memory computing accelerator that switches flexibly between sparse and dense computing modes addresses the inability of existing sparse matrix-dense vector multiplication accelerators to handle both highly parallel computation and irregular sparse data storage, balancing high parallelism against energy efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INST OF COMPUTING TECH CHINESE ACAD OF SCI
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-12
AI Technical Summary
Existing sparse matrix-dense vector multiplication accelerators cannot simultaneously achieve high parallelism computation and efficient storage and computation of irregular sparse data, resulting in their performance potential not being fully realized.
This invention provides a flexible in-memory computing accelerator that uses a multi-functional array and a unified data layout method to flexibly switch between sparse and dense computing modes based on the local data characteristics of the sparse matrix. It utilizes the multi-functional array to perform multiply-accumulate operations or index search, and combines it with time-delay floating-point data computing methods to achieve high parallelism computing.
It achieves high parallelism in irregular sparse data while balancing storage and computation, improving computing performance and energy efficiency, and avoiding resource waste and computational latency.
Figure CN122196322A_ABST
Description
Technical Field
[0001] This application relates to the fields of in-memory computing and high-performance computing, and specifically to an in-memory computing accelerator, a computing method, and a computer program product for sparse matrix-dense vector multiplication.
Background Technology
[0002] The statements in this section are merely to provide background information in relation to this application to aid in understanding it, and such background information does not necessarily constitute prior art.
[0003] Sparse matrix-dense vector multiplication (SpMV) is a core computational primitive in high-performance computing, machine learning, and other fields, and its computational efficiency directly impacts the overall performance of various large-scale applications. In SpMV acceleration research, traditional storage-compute separation accelerators suffer from performance bottlenecks and low energy efficiency due to the frequent data migration between storage and compute units. Computing-in-Memory (CIM) technology, by integrating computational functions within the storage unit, effectively reduces data movement overhead and improves system energy efficiency, and is considered a key technological path to overcome this bottleneck. However, existing CIM-based SpMV accelerators suffer from insufficient architectural design flexibility, making it difficult to simultaneously meet the demands of high-parallelism computing and the efficient storage and computation of irregular sparse data, thus failing to fully realize the performance potential of CIM technology.
Summary of the Invention
[0004] In researching and analyzing in-memory computing accelerators for sparse matrix-dense vector multiplication, the inventors found that the fundamental cause of the aforementioned shortcomings is that existing accelerators adopt fixed architectures and fixed computing modes and support only one of dense or sparse computing. They cannot switch adaptively according to the local data characteristics of a sparse matrix, which makes it difficult to achieve both high parallelism and efficient processing of irregular sparse data. This application therefore aims to provide a more flexible in-memory computing acceleration scheme that handles the storage and computation of irregular sparse data while achieving high computational parallelism in sparse matrix-dense vector multiplication.
[0005] The objective of this application is achieved through the following technical solution: According to a first aspect of this application, an in-memory computing accelerator for dense vector multiplication of a sparse matrix is provided. The accelerator includes a controller, multiple multifunction arrays, and a merging module, wherein: the controller is configured to traverse a sparse matrix and calculate the local sparsity of each data block in the sparse matrix that matches a preset array size; when the local sparsity exceeds a preset threshold, it determines to process the corresponding data block in a dense mode; otherwise, it determines to process the corresponding data block in a sparse mode; and loads the data block into a corresponding multifunction array according to the processing mode determined for the corresponding data block for multiplication calculation with a dense vector; the multiple multifunction arrays are configured to perform the functions of a MAC array or a CAM array in accordance with the processing mode determined for the corresponding data block; and the merging module is configured to merge the calculation results of different data blocks to obtain a final calculation result.
[0006] Preferably, the multifunctional array configured to implement the MAC array in dense mode is configured to: store the corresponding dense matrix data blocks using a bit slicing method; perform floating-point multiplication calculations between each element of the stored corresponding data block and the element corresponding to the dense vector to obtain the calculation result.
[0007] Preferably, the multi-functional array configured to implement a CAM array in sparse mode is configured to: store the row index and column index corresponding to each non-zero value of the corresponding data block; match the column index and row index of each non-zero value of the corresponding data block to be calculated with the row index of each non-zero value of the dense vector to obtain a matching result; the multi-functional array configured to implement a MAC array in sparse mode is configured to: store each non-zero value of the corresponding data block; associate and store each non-zero value of the corresponding data block and each non-zero value of the dense vector according to the matching result; and perform floating-point multiplication calculation between each non-zero value of the associated stored corresponding data block and the non-zero value corresponding to the dense vector to obtain a calculation result.
[0008] Preferably, in sparse mode, each bit of a single index is stored in the same row of the CAM array, and the value corresponding to the index in the Nth CAM array is stored in the Nth column of multiple MAC arrays in a bit-slice manner, where N is an integer greater than zero.
[0009] Preferably, the system further includes: a first analog-to-digital converter (ADC), a second ADC, a shift-accumulator circuit, an input buffer, an output buffer, a delay unit, a register file, and a configuration information buffer; wherein, the first ADC is used to convert analog signals into digital signals in dense mode; the second ADC is used to convert analog signals into digital signals in sparse mode; the shift-accumulator circuit is used to sum the exponents of sparse matrices and dense vectors; the input buffer and the output buffer are used to buffer input data and output data, respectively; the delay unit is used to implement floating-point data alignment based on the delay value; the register file is used to store the exponent value information of floating-point data; and the configuration information buffer is used to store the mode configuration information of different arrays.
[0010] Preferably, the multifunctional array comprises internal storage units and peripheral circuitry. The storage units are arranged in rows and columns, and each storage unit includes an N-type metal-oxide-semiconductor (NMOS) transistor and a resistive random access memory (RRAM) element. The NMOS transistor switches the corresponding storage unit on and off, and the RRAM element stores one bit of multi-bit data. The gate of the NMOS transistor is connected to the data line, its source is connected to the sensing line, and its drain is connected to one end of the RRAM element; the other end of the RRAM element is connected to the matching line. The peripheral circuitry includes a data line driver, a matching line driver, a CAM sense amplifier, and a sample-and-hold circuit. The two ends of the matching line are connected to the matching line driver and the CAM sense amplifier respectively, the end of the data line is connected to the data line driver, and the end of the sensing line is connected to the sample-and-hold circuit.
[0011] Preferably, when a multi-functional array is configured as a CAM array, adjacent RRAM elements in the same row form a CAM unit, the row and column indices are input through the data lines, and the matching result of each row is transmitted on the matching line and converted into a digital signal by the CAM sense amplifier. When a multi-functional array is configured as a MAC array, each RRAM element stores one bit of sparse matrix data, and dense vector data is input bit by bit through the matching line, one bit per clock cycle; the calculation result converges on the sensing line as a current, and the result of each clock cycle is acquired and held by the sample-and-hold circuit.
[0012] Preferably, the accelerator employs a time-delay-based floating-point data computation method for SpMV operations. The floating-point data includes an exponent and a mantissa; the exponent is used for data alignment, and the mantissa is used for multiplication and addition operations. The exponents of the sparse matrix and the dense vector are summed using adders in a shift-accumulator circuit, and the summation result is stored in a register file. The delay unit identifies the maximum exponent value corresponding to the data actually involved in the computation in the MAC array, calculates the difference between each exponent and this maximum value, and stores this difference as a delay value in the corresponding array's delay register. When performing floating-point multiplication and addition operations, each mantissa of the dense vector is delayed by a corresponding number of clock cycles before being input into the MAC array, based on its corresponding delay value, thereby achieving exponent alignment between different mantissas. Then, it performs multiplication and addition operations with the sparse matrix data stored in the array. For the calculation result, its mantissa is shifted using a shift-accumulator circuit, and the exponent of the result is adjusted according to the actual number of shifts, finally outputting a floating-point calculation result conforming to a standard format.
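The delay-based alignment described in [0012] can be illustrated with a small software model. The following Python sketch is not taken from this application. It assumes unsigned integer mantissas, values of the form mantissa * 2**exponent, and most-significant-bit-first streaming, so that delaying an input by d clock cycles lowers its weight by a factor of 2**d. The function name `delayed_dot` and the `width` parameter are hypothetical, and the accumulator is modeled as exact, without the truncation or normalization that a hardware shift-accumulator circuit would apply.

```python
def delayed_dot(matrix_row, vector, width=8):
    """Dot product of two lists of (mantissa, exponent) pairs.

    Each value is mantissa * 2**exponent with an unsigned mantissa that
    fits in `width` bits. Returns (mantissa, exponent) of the exact sum.
    """
    # The exponent of each product is the sum of the operand exponents.
    exp_sums = [ea + ex for (_, ea), (_, ex) in zip(matrix_row, vector)]
    e_max = max(exp_sums)
    # Delay value = difference between the maximum exponent and each exponent.
    delays = [e_max - e for e in exp_sums]
    max_delay = max(delays)

    acc = 0
    for cycle in range(width + max_delay):
        line_sum = 0
        for (ma, _), (mx, _), d in zip(matrix_row, vector, delays):
            step = cycle - d  # this input starts streaming d cycles late
            if 0 <= step < width:
                bit = (mx >> (width - 1 - step)) & 1  # MSB first (assumption)
                line_sum += ma * bit
        # Shift-accumulate: earlier cycles carry more weight.
        acc = (acc << 1) + line_sum
    return acc, e_max - max_delay


row = [(3, 0), (5, 2)]  # 3 * 2**0 and 5 * 2**2
vec = [(2, 1), (1, 0)]  # 2 * 2**1 and 1 * 2**0
mantissa, exponent = delayed_dot(row, vec)
print(mantissa, exponent, mantissa * 2.0 ** exponent)
```

For these example operands the model returns mantissa 16 and exponent 1, that is 32, which equals 3 * 2 * 2**1 + 5 * 1 * 2**2. The element with the smaller exponent sum is delayed by one cycle, which is what aligns it with the other term.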
[0013] According to a second aspect of this application, a method for calculating sparse matrix-dense vector multiplication based on the accelerator of the first aspect is provided, comprising: traversing a sparse matrix and calculating the local sparsity of each data block in the sparse matrix that matches a preset array size; determining that the corresponding data block is processed in a dense mode when the local sparsity exceeds a preset threshold, otherwise determining that the corresponding data block is processed in a sparse mode; writing the data blocks of the sparse matrix and the dense vector data into the multifunctional arrays according to the corresponding mode; performing multiply-accumulate operations or index search using the multifunctional arrays, combined with a time-delay-based floating-point data calculation method, to obtain intermediate calculation results; and merging the intermediate calculation results of different data blocks to obtain a final calculation result.
[0014] According to a third aspect of this application, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method of the second aspect of this application.
[0015] According to a fourth aspect of this application, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the method of the second aspect of this application.
[0016] Compared with the prior art, the advantages of the solution in this application are mainly as follows: a multi-functional in-memory computing array can be flexibly configured in the accelerator, and a sparse computing mode or a dense computing mode can be flexibly adopted according to the local data characteristics of the sparse matrix. This allows each data block to adopt the optimal processing mode, thereby taking into account both the storage and computation of irregular sparse data and achieving a high degree of computational parallelism.
Attached Figure Description
[0017] To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Figure 1 is a schematic diagram of the structure of an in-memory computing accelerator for sparse matrix-dense vector multiplication according to an embodiment of this application;
Figure 2 is a schematic diagram of a sparse matrix data layout according to an embodiment of this application;
Figure 3 is a schematic diagram of the stages of an accelerator execution process according to an embodiment of this application;
Figure 4 is a schematic diagram of an accelerator architecture according to an embodiment of this application;
Figure 5 is a schematic diagram of a multi-functional array architecture according to an embodiment of this application;
Figure 6 is a schematic diagram of a floating-point calculation method according to an embodiment of this application.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided through specific embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of this application.
[0019] Existing SpMV accelerators based on in-memory computation can be divided into two main categories according to the storage method of sparse matrix data: dense storage accelerators and sparse storage accelerators. These two types of accelerators differ significantly in their architecture design, working mechanism, and performance.
[0020] Among them, the dense-storage CIM-SpMV accelerator uses resistive random-access memory (ReRAM) as its core to construct a multiply-accumulate (MAC) array that stores the sparse matrix data. Its defining feature is that it maps the sparse matrix data in full into the ReRAM-MAC array in a dense matrix storage format, hence the name dense storage. The workflow of this type of accelerator is as follows: first, the sparse matrix is written into the ReRAM-MAC array in a dense format to complete data storage; then, the dense vector is input into the MAC array, and the SpMV calculation is completed through parallel operations within the array against the sparse matrix data stored there.
[0021] The CIM-SpMV accelerator with sparse storage employs a hybrid architecture combining a ReRAM-MAC array and a content addressable memory (CAM) array. Unlike dense storage methods, this type of accelerator stores sparse matrix data in a native sparse format. The CAM array is specifically used to store the row and column index information of sparse matrix elements, while the ReRAM-MAC array stores the corresponding matrix element values. The computation process is implemented through the index search function of the CAM array. First, the CAM array searches for the target row and column index. Based on the search result, the corresponding dense vector data is loaded, and the row cells in the MAC array participating in the computation are determined, thereby realizing the SpMV computation of the sparse matrix and the dense vector.
[0022] Although both types of CIM-SpMV accelerators are designed based on the concept of in-memory computing, due to the inherent characteristics of storage methods and architecture design, both schemes have obvious performance defects, which restrict their applicability under different sparse characteristic matrices.
[0023] The core drawback of dense storage accelerators is their inability to efficiently process irregular sparse data, making it difficult to fully utilize the parallel computing resources of the MAC array. These accelerators use a dense format to store sparse matrices. In regions of high element density within the matrix, the parallel computing capabilities of the MAC array can achieve high parallelism, leveraging the energy efficiency advantages of in-memory computing. However, when the element distribution in a local region of the matrix is extremely sparse and exhibits irregular characteristics, a large number of storage units in the MAC array remain idle because there are no matrix elements at corresponding locations, resulting in a significant waste of in-memory computing resources. Some solutions address this by migrating such irregular sparse data to external computing units such as CPUs or GPUs. This data migration process not only increases system latency but also significantly reduces the overall computing performance and energy efficiency of the accelerator, ultimately preventing these accelerators from achieving end-to-end in-memory computing integration for irregular sparse data.
[0024] The main drawback of sparse storage accelerators lies in their low overall computational parallelism, making it difficult to leverage the parallel computing potential of locally dense regions within a sparse matrix. While these accelerators can fully store various irregular sparse data using sparse storage formats and perform SpMV operations on the entire sparse data by controlling the MAC array for multiply-accumulate operations via index search using the CAM array, they suffer significant performance degradation due to the serial search-computation mechanism. Index search and multiply-accumulate operations cannot be executed in parallel; the MAC array computation can only begin after the CAM index search determines the matching row and column indices, preventing different row cells in the MAC array from participating in the computation synchronously and directly reducing the array's parallel computing efficiency. The CAM array's index search process itself introduces additional time overhead, further exacerbating the overall system performance degradation and making it difficult for these accelerators to utilize the high parallel computing advantages that locally dense matrix regions should possess.
[0025] While researching in-memory computing accelerators for dense vector multiplication of sparse matrices, the inventors discovered that the shortcomings of existing technologies in simultaneously handling high parallelism computing and the storage and computation of irregular sparse data are due to the fact that they only consider accelerator designs with fixed architectures and computation methods. These fixed-architecture accelerators can only adopt one of dense or sparse methods and cannot flexibly cope with the local data characteristics of different regions in sparse matrices, thus failing to fully leverage the advantages of in-memory computing and efficiently accelerate SpMV.
[0026] To address the aforementioned shortcomings, this application provides a flexible in-memory computing acceleration scheme. Utilizing a multifunctional in-memory computing array and a unified data layout method, the scheme configures modes according to a preprocessing algorithm and executes different modes of computation according to a predetermined computation process. This accelerator can freely switch computation modes based on local data characteristics, processing local data computation in a sparse or dense manner, thereby achieving both storage and computation of irregular sparse data and a high degree of computational parallelism.
[0027] Figure 1 is a schematic diagram of the structure of an in-memory computing accelerator (also referred to herein as MACAM) for sparse matrix-dense vector multiplication according to an embodiment of this application. As shown in Figure 1, the accelerator includes a controller, multiple multi-function arrays, and a merging module. The controller is configured to traverse a sparse matrix and calculate the local sparsity of each data block in the sparse matrix that matches a preset array size. When the local sparsity exceeds a preset threshold, the corresponding data block is processed in a dense mode; otherwise, it is processed in a sparse mode. Based on the processing mode determined for it, each data block is loaded into the corresponding multi-function array for multiplication with the dense vector. Each multi-function array is configured to perform the functions of a MAC array or a CAM array in accordance with the processing mode determined for the corresponding data block. The merging module is configured to merge the calculation results of different data blocks to obtain the final calculation result.
[0028] The embodiments of this application flexibly configure sparse and dense modes based on the local sparsity of the sparse matrix to ensure that each data block adopts the optimal processing mode and achieves a balance between performance and energy efficiency.
[0029] In some embodiments, before the SpMV operation begins, the controller runs a preprocessing algorithm exactly once, which avoids the efficiency loss of repeated preprocessing. The algorithm traverses the entire sparse matrix and divides it into data blocks that match the size of the multi-functional array, so that each data block fits in one or more arrays. At the same time, it calculates the local sparsity of each data block (local sparsity = number of non-zero elements in the data block / total number of elements in the data block; under this definition, a larger value indicates a denser block). During the traversal, empty data blocks (with no non-zero elements) are skipped, so no arrays need to be configured for them, which saves hardware resources. The controller compares the local sparsity of each data block with a preset threshold that is determined by hardware parameters. Specifically, based on the current accelerator's hardware parameters (such as array size, ADC conversion speed, and MAC operation speed), the processing time of a data block is calculated for both dense and sparse modes, and the local sparsity at which the two times are equal is used as the preset threshold. When the local sparsity of a data block exceeds the threshold, the block is deemed suitable for dense mode, and the corresponding array is configured in MAC mode. When the local sparsity does not exceed the threshold, the block is deemed suitable for sparse mode, and the corresponding arrays are configured in a combination of CAM and MAC modes. If a data block in a sparse region is small and its data does not fill the arrays configured in sparse mode (i.e., the arrays have idle storage units), the controller deploys data blocks from other sparse regions to the idle storage units, making full use of the arrays' storage resources and avoiding waste.
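The block partitioning and mode selection of [0029] can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the function name `preprocess`, the dictionary return format, and the treatment of edge blocks as zero-padded are hypothetical, and the threshold is simply passed in as a parameter, whereas the application derives it from the point at which dense-mode and sparse-mode processing times are equal. Note that local sparsity as defined in [0029] is the fraction of non-zero elements, so a larger value means a denser block.

```python
def preprocess(matrix, block, threshold):
    """Assign a processing mode to every non-empty block of `matrix`.

    matrix: list of equal-length rows; block: block edge length;
    threshold: local-sparsity value above which dense mode is chosen.
    Returns {(block_row, block_col): "dense" | "sparse"}.
    """
    modes = {}
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r0 in range(0, n_rows, block):
        for c0 in range(0, n_cols, block):
            cells = [v for row in matrix[r0:r0 + block]
                     for v in row[c0:c0 + block]]
            nnz = sum(1 for v in cells if v != 0)
            if nnz == 0:
                continue  # empty blocks are skipped; no array is configured
            # Edge blocks are treated as zero-padded to block*block (assumption).
            local_sparsity = nnz / (block * block)
            mode = "dense" if local_sparsity > threshold else "sparse"
            modes[(r0 // block, c0 // block)] = mode
    return modes


example = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
print(preprocess(example, block=2, threshold=0.5))
```

In this example the top-left block is full and is assigned dense mode, the bottom-right block holds a single non-zero and is assigned sparse mode, and the two empty blocks do not appear in the result.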
[0030] Regarding the data storage method for sparse matrices, in some embodiments, as shown in Figure 2, sparse matrix data is stored on the multi-functional arrays using a unified data layout in both sparse and dense modes. The layout is built on bit-slice storage. To ensure a one-to-one correspondence between the CAM array indices and the MAC array values, the row dimension, column dimension, and bit-slice size of all multi-functional arrays in the MACAM architecture are uniformly set to 64. In both modes, the MAC mode arrays that store sparse matrix values use the bit-slicing method, meaning that different bits of the same multi-bit value are stored at the same position (same row, same column) in different MAC arrays, with each position storing only one bit. Through the collaboration of multiple arrays, complete storage and parallel computation of multi-bit values are achieved. In sparse mode, there is a strict one-to-one correspondence between the row and column indices stored in the CAM arrays and the values stored in the MAC arrays. For a single processing element (PE), the values corresponding to all indices stored in the first CAM array are stored in the first column of multiple MAC arrays in a bit-slice manner; the values corresponding to all indices stored in the second CAM array are stored in the second column of multiple MAC arrays in a bit-slice manner, and so on. In general, the values corresponding to the indices in the Nth CAM array are stored in the Nth column of multiple MAC arrays in a bit-slice manner, where N is a positive integer, so that an index search result can quickly locate the corresponding value.
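The bit-slice layout of [0030] can be illustrated as follows. This is a minimal Python sketch, assuming unsigned integer values and tiny array sizes for readability (the application fixes the array dimensions and bit-slice size at 64); the function names `bit_slice` and `reassemble` are hypothetical. Each returned plane stands for one MAC-mode array, and bit k of every value sits at the same row and column of plane k.

```python
def bit_slice(block, bits):
    """Split a block of unsigned integers into `bits` one-bit planes.

    planes[k][r][c] is bit k of block[r][c], so the bits of one value
    occupy the same (row, column) position in different planes.
    """
    return [[[(v >> k) & 1 for v in row] for row in block]
            for k in range(bits)]


def reassemble(planes):
    """Recombine one-bit planes into the original multi-bit values."""
    n_rows, n_cols = len(planes[0]), len(planes[0][0])
    return [[sum(planes[k][r][c] << k for k in range(len(planes)))
             for c in range(n_cols)]
            for r in range(n_rows)]


block = [[5, 0],
         [3, 6]]
planes = bit_slice(block, bits=3)
for k, plane in enumerate(planes):
    print("bit", k, plane)
print(reassemble(planes) == block)
```

Because every plane has the same shape as the original block, the same row and column addressing applies to all slices, which is what lets several arrays operate on one multi-bit value in parallel.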
[0031] Dense mode is suitable for regions in a sparse matrix where the local sparsity exceeds a preset threshold (local data is relatively dense). Its core feature is that it uses a dense format to store data and performs operations using only the MAC mode array, without the need for a CAM array. Sparse matrix data is stored in the same format as dense matrices; that is, regardless of whether matrix elements are zero, they are stored in a multi-functional array configured in MAC mode using a bit-slicing method, following the row and column order of the dense matrix. No index information needs to be stored, simplifying the data storage process. Sparse matrix data with a dense data layout is written to the multi-functional array configured in MAC mode, while the exponent and mantissa of the dense vector are written to the register file and input buffer, respectively. The vector mantissa in the input buffer is input to the MAC array with a delay of the corresponding clock cycle based on the delay value generated by the delay unit, performing multiplication and addition operations. Intermediate results are temporarily stored in the output buffer and merge queue; no additional search steps are required, fully leveraging the high parallelism of the MAC array.
[0032] The sparse mode is suitable for regions within a sparse matrix where the local sparsity does not exceed the preset threshold (local data is relatively sparse and irregular). It uses a CAM array (index search) and a MAC array (multiply-accumulate operation) together. Sparse matrix data is stored in coordinate (COO) format, meaning that three parts are stored for each non-zero matrix element: row index, column index, and value. The row and column indices are stored in a multi-functional array configured in CAM mode, while the values are stored in multi-functional arrays configured in MAC mode using the bit-slicing method. The indices and values are in one-to-one correspondence, so that the corresponding value can be located quickly after a search, which exploits the fast search of the CAM array and avoids invalid computations. The values of the sparse matrix are written to the MAC mode arrays, and the row and column indices are written to the CAM mode array. The controller then sends the column index to the CAM mode array to perform a search. Based on the search results, the corresponding dense vector mantissa is loaded into the input buffer, the exponent is loaded into the register file, and a delay value is generated and stored in the delay register. The controller next sends the row index to the CAM mode array to perform a second search, and the search results are stored in the matching information cache. Based on the matching results, the row cells in the MAC array that need to participate in the operation are determined. The vector mantissa is then input with the delay given by its delay value, and multiplication and addition operations are performed. Intermediate results are temporarily stored in the output buffer and the merging module.
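The two-step search of [0032] can be modeled in software as follows. This Python sketch is an assumption-laden illustration rather than the claimed circuit: `cam_search` and `sparse_block_spmv` are hypothetical names, the CAM is modeled as an equality comparison against every stored key, and ordinary Python floats stand in for the mantissa and exponent handling.

```python
def cam_search(stored_keys, query):
    """Return one match flag per stored entry, like the match lines of a CAM."""
    return [key == query for key in stored_keys]


def sparse_block_spmv(coo, vector, n_rows):
    """Multiply a COO block [(row, col, value), ...] by a dense vector."""
    rows = [r for r, _, _ in coo]   # row indices held in the CAM
    cols = [c for _, c, _ in coo]   # column indices held in the CAM
    vals = [v for _, _, v in coo]   # values held in the MAC arrays

    # Step 1: column-index search pairs each stored non-zero with the
    # dense vector element it must be multiplied by.
    operand = [0.0] * len(coo)
    for c, x in enumerate(vector):
        for i, hit in enumerate(cam_search(cols, c)):
            if hit:
                operand[i] = x

    # Step 2: row-index search selects the entries that take part in the
    # multiply-accumulate for each output row.
    result = [0.0] * n_rows
    for r in range(n_rows):
        hits = cam_search(rows, r)
        result[r] = sum(vals[i] * operand[i]
                        for i, hit in enumerate(hits) if hit)
    return result


coo_block = [(0, 1, 2.0), (2, 0, 3.0), (2, 3, 4.0)]
dense_vector = [1.0, 10.0, 100.0, 1000.0]
print(sparse_block_spmv(coo_block, dense_vector, n_rows=3))
```

Only entries whose row index matches contribute to a given output, which mirrors the statement that only successfully matched row cells participate in the operation.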
[0033] In some embodiments, as shown in Figure 3, the overall process by which the accelerator of this application performs an SpMV operation is divided into a writing stage, a computation stage, and a merging stage. The workflows of dense mode and sparse mode differ in each stage, but both are carried out under the unified scheduling of the controller.
[0034] The task of the writing phase is to write sparse matrix data and dense vector data into the corresponding components of the multi-functional array according to a unified data layout of the corresponding mode, in preparation for the computation phase. The writing process for the two modes is as follows:
(1) Dense mode writing: The controller controls the accelerator to write sparse matrix data with dense data layout in batches to the multi-functional array configured in MAC mode; at the same time, the exponent part of the dense vector is written to the register file and the mantissa part is written to the input buffer; the configuration information buffer is updated synchronously with the MAC mode configuration information of the array to ensure that the subsequent calculation stage can respond quickly.
[0035] (2) Sparse mode writing: The controller controls the accelerator to write the values of the sparse matrix (using the bit-slicing method) to the multi-function arrays configured in MAC mode, and to write the corresponding row and column indices to the multi-function array configured in CAM mode. After the index writing is completed, the controller immediately sends the column index to the CAM mode array to start the search operation. The CAM array receives the column index through the data line (DL) and transmits the matching result on the matching line (ML); the CAM sense amplifier converts the result into a digital signal, which is stored in the matching information cache. According to the matching result, the controller loads the corresponding dense vector mantissa into the input buffer and loads the vector exponent into the register file. Subsequently, the delay unit calculates the difference between each exponent and the maximum exponent (the delay value) and stores the delay value in the delay register of each MAC array, completing the preparation work of the writing stage.
[0036] The task of the computation phase is to perform multiplication and addition operations using the multi-function arrays, obtain intermediate calculation results, and temporarily store them. The computation flow for the two modes is as follows:
(1) Dense mode calculation: The controller sends the operation instruction to the MAC mode array. The dense vector mantissas held in the input buffer are delayed by the number of clock cycles given by the delay values stored in the delay register before being input to the MAC array, which achieves exponent alignment between different mantissas. The aligned mantissas are input into the MAC array bit by bit through the matching line (ML) and undergo multiplication and addition operations with the sparse matrix data stored in the array. The operation result converges on the sensing line (SL) as a current. The sample-and-hold circuit holds the analog signal, and the high-precision ADC converts it into a digital signal (the intermediate result). The intermediate result is scheduled by the controller and stored in the output buffer (value) and the merge queue (row index + value) respectively, completing the calculation of the data block.
[0037] (2) Sparse mode calculation: First, the controller sends the row index to the CAM mode array and performs a second search operation. The CAM array obtains the matching result of the row index according to the above search process and stores it in the matching information cache. Then, the controller determines the row units in the MAC array that need to participate in the operation based on the matching result (only the successfully matched row units participate in the operation to avoid invalid operation). After that, the vector mantissa in the input cache is delayed and input into the MAC array according to the delay value. The mantissa and matrix value are multiplied and added according to the same multiply-add operation process as the dense mode. The low-precision ADC completes the analog signal conversion. The intermediate results are also stored in the output cache and the merging queue to complete the calculation of the data block.
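The sparse-mode flow can be sketched as a purely functional model. The sketch assumes the block is held as (row index, column index, value) triples, that the column-index search simply selects the vector operand for each stored entry, and that the row-index search yields a 0/1 match mask per output row. The name `sparse_mode_block` and its parameters are hypothetical, and analog effects (ADC quantization, mantissa delay) are ignored.

```python
def sparse_mode_block(row_idx, col_idx, values, x, num_rows):
    """Functional model of sparse mode for one data block (hypothetical helper)."""
    # Writing stage: the column-index search picks the dense-vector
    # operand that belongs to each stored non-zero value.
    operands = [x[c] for c in col_idx]

    partials = []
    for r in range(num_rows):
        # Computation stage: the second (row-index) search produces a
        # match mask; only matched entries take part in the MAC.
        match = [1 if ri == r else 0 for ri in row_idx]
        if not any(match):
            continue  # no stored entry for this row, so no work is issued
        acc = sum(v * op for v, op, m in zip(values, operands, match) if m)
        partials.append((r, acc))  # (row index, value) for the merging queue
    return partials


# Three stored non-zeros: two in row 0, one in row 2.
partials = sparse_mode_block(
    row_idx=[0, 0, 2],
    col_idx=[1, 3, 0],
    values=[2.0, 1.0, 4.0],
    x=[1.0, 2.0, 3.0, 4.0],
    num_rows=3,
)
```

Rows without any matching entry produce no partial result, which mirrors the statement that only successfully matched row units participate in the operation.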
[0038] Because the computation speed of sparse mode is slower than that of dense mode, in order to avoid the accumulation of intermediate results and improve the overall computational efficiency, the merging stage is tightly coupled with sparse mode and adopts a real-time merging method. Whenever sparse mode completes the computation of a row and generates an intermediate result, the merging stage is started immediately; the controller controls the merging queue, and according to the row index of the intermediate result, it accumulates and merges the intermediate results of the same row generated by sparse mode and dense mode; after traversing all row indices and merging all intermediate results, the final result of the SpMV operation is obtained and output from the output buffer, completing the entire SpMV computation process.
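A minimal sketch of the merging step follows, assuming intermediate results arrive as (row index, value) pairs and that merging is a plain per-row accumulation. The class name `MergeQueue` and its methods are hypothetical; the real-time coupling with sparse mode is modeled only by calling `push` as soon as each partial result is available.

```python
class MergeQueue:
    """Accumulates partial results by row index (illustrative model only)."""

    def __init__(self, num_rows):
        self.result = [0.0] * num_rows

    def push(self, row, value):
        # Merge immediately instead of buffering all partials first.
        self.result[row] += value


queue = MergeQueue(num_rows=3)

# Partials from a dense-mode block and a sparse-mode block covering the same rows.
dense_partials = [(0, 1.5), (2, 4.0)]
sparse_partials = [(0, 0.5), (1, -2.0), (2, 1.0)]

for row, value in dense_partials + sparse_partials:
    queue.push(row, value)

y = queue.result
```

Accumulating on arrival keeps the queue occupancy bounded by the number of rows rather than by the number of intermediate results.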
[0039] In some embodiments, as shown in Figure 4, the overall architecture of the MACAM accelerator is divided into a processing element (PE) layer and an array layer. It adopts a distributed design in which multiple PEs operate in parallel and each PE contains multiple multi-functional arrays, ensuring high parallelism and flexibility in computation. The operation of all PEs and arrays is uniformly scheduled by the controller, and all components collaborate to complete the entire SpMV computation process. In addition to the multi-functional arrays, controller, and merging queue (merging module), each PE also integrates the following functional components: The ADC module includes a high-precision shared ADC (first analog-to-digital converter) and a low-precision shared ADC (second analog-to-digital converter); sharing the converters saves hardware resources and reduces power consumption. The high-precision shared ADC converts analog computation signals to digital signals in dense mode, meeting the accuracy requirements of highly parallel dense-mode computation. The low-precision shared ADC converts analog signals to digital signals in sparse mode, further reducing energy consumption while retaining sufficient computational accuracy. The two ADCs are switched on and off automatically according to the accelerator's operating mode to avoid unnecessary energy consumption.
[0040] The shift-accumulator circuit has two core functions: first, it performs a summation operation on the floating-point exponents of sparse matrices and dense vectors, providing a basis for subsequent floating-point data alignment; second, it performs shift-accumulation processing on the intermediate results of multiplication and addition operations output by the MAC array, while shifting and adjusting the mantissa of the final result to ensure that the result meets the mantissa bit requirements of the IEEE-754 double-precision floating-point standard, and adjusts the exponent value of the result accordingly based on the number of shifts.
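The mantissa adjustment described above can be illustrated with integer arithmetic. This is a sketch under stated assumptions: the result is held as an integer mantissa and a power-of-two exponent, excess low-order bits are truncated (the rounding behavior is not specified in the text), and the target width defaults to 53 significant bits, which is the IEEE-754 double-precision significand including the implicit leading one. The function name `normalize` is hypothetical.

```python
def normalize(mantissa, exponent, width=53):
    """Shift a non-negative integer mantissa to exactly `width` significant bits.

    The represented value is mantissa * 2**exponent; shifting the mantissa
    by n bits is compensated by adjusting the exponent by n.
    """
    if mantissa == 0:
        return 0, 0
    shift = mantissa.bit_length() - width
    if shift > 0:
        mantissa >>= shift      # too wide: drop low-order bits (truncation)
    elif shift < 0:
        mantissa <<= -shift     # too narrow: pad with zeros on the right
    return mantissa, exponent + shift


# A narrow width is used here so the effect is easy to follow by hand.
wide = normalize(354, -1, width=4)    # 354 * 2**-1 = 177, truncated to 11 * 2**4 = 176
narrow = normalize(3, 0, width=4)     # 3 * 2**0 becomes 12 * 2**-2, value unchanged
```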
[0041] The caching module includes an input cache, an output cache, and a configuration information cache. The input cache is used to cache sparse matrix data and dense vector data (including exponents and mantissas) to be processed, avoiding latency caused by frequent data reads; the output cache is used to temporarily store intermediate results generated by MAC array operations, preventing result loss or corruption; the configuration information cache is used to store the working mode configuration information (CAM mode or MAC mode), data layout parameters, etc. of each multi-functional array, ensuring that the array can quickly respond to mode switching commands and improve computational efficiency.
[0042] The delay unit works in conjunction with the delay registers of each multi-function array to achieve delay-based floating-point data alignment. Its core function is to identify the maximum exponent value of all data involved in the operation in the array currently operating in MAC mode, calculate the difference between each exponent and this maximum value and use it as the delay value, and send the delay value to the delay register of the corresponding array for storage, providing a basis for subsequent mantissa delay input and exponent alignment.
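As a small illustration of the delay-value calculation, the sketch below takes the exponent values of the operands in one MAC-mode array and returns, for each, its distance from the maximum. The optional `active` mask is an assumption added to model the case where only part of the array participates; entries that do not participate get `None`. The function name `compute_delays` is hypothetical.

```python
def compute_delays(exponents, active=None):
    """Return max(exponent) - exponent for every participating operand."""
    if active is None:
        active = [True] * len(exponents)
    participating = [e for e, a in zip(exponents, active) if a]
    if not participating:
        return [None] * len(exponents)
    e_max = max(participating)
    # The operand with the largest exponent gets delay 0; smaller
    # exponents are delayed by their distance from the maximum.
    return [e_max - e if a else None for e, a in zip(exponents, active)]


all_rows = compute_delays([5, 2, 5, 4])
matched_only = compute_delays([5, 2, 5, 4], active=[False, True, False, True])
```

With every operand participating the delays are `[0, 3, 0, 1]`; when only the second and fourth operands participate, the maximum drops to 4 and their delays become 2 and 0.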
[0043] The register file is specifically used to store the exponent value information of floating-point data, including the summation result of the exponents of sparse matrices and dense vectors. This facilitates the delay unit to quickly read the exponent data for calculation and provides data support for the exponent adjustment of the calculation result. The high-speed register design ensures that the read and write speed of the exponent data meets the requirements of high parallel computing.
[0044] In the embodiments of this application, a multi-functional array is an array that can be configured to perform either the multiply-accumulate function of a MAC array or the search function of a CAM array. In some embodiments, each multi-functional array includes a multi-functional crossbar, a delay register, an input cache, and a matching information cache. The multi-functional crossbar consists of memory cells arranged in rows and columns, and its structure directly determines how the array implements both functions. Each memory cell consists of an N-type metal-oxide-semiconductor (NMOS) transistor and a resistive random access memory (ReRAM) device. The ReRAM is binary and stores one bit of multi-bit data (either a value bit of the sparse matrix or an index bit). The NMOS transistor switches the corresponding memory cell on and off, enabling selective reading, writing, and operation on the data. The delay register works in conjunction with the delay unit within the PE and stores the delay values calculated by the delay unit. During floating-point multiply-accumulate operations, the input timing of each dense vector mantissa is controlled according to its delay value, so that the mantissa enters the MAC array after the corresponding number of clock cycles, achieving exponent alignment between different mantissas and ensuring the accuracy of floating-point operations. The input cache holds the data input to the multi-functional array, including the search keywords (row and column indices) in CAM mode and the dense vector mantissa data in MAC mode. This avoids computational errors caused by unstable data input and also buffers and synchronizes the data, so that the data input rate matches the array's computation rate. The matching information cache is enabled only when the array is configured in CAM mode. It stores the matching results after the CAM array performs an index search, including the index position of each successful match and the corresponding MAC array row number. This allows the matching results to be retrieved quickly in subsequent computation stages to determine which row cells of the MAC array take part in the computation, improving computational efficiency in sparse mode.
[0045] In some embodiments, the storage cells of the multi-functional array adopt a crossbar array structure, and the circuit connections and peripheral circuit design of each storage cell are the basis for dual-function switching. As shown in Figure 5, in each memory cell the gate of the NMOS transistor is connected to the data line (DL), the source is connected to the sense line (SL), and the drain is connected to one end of the binary ReRAM. The other end of the ReRAM is connected to the match line (ML). The connection between SL, ReRAM, and NMOS ensures that the computation result can be collected and sensed in the form of current. The peripheral circuitry includes a DL driver, an ML driver, a CAM sensitive amplifier, and a sample-and-hold circuit. The connections and functions of each peripheral circuit are as follows: The DL driver is connected to one end of the DL and provides a stable drive signal. It controls the timing and strength of data transmission on the DL, so that the search keyword in CAM mode and the control signal in MAC mode reach the NMOS gate of each memory cell accurately and quickly, achieving precise switching control of the memory cell. The ML driver is connected to one end of the ML and provides a drive signal. In CAM mode, it powers the transmission of the matching result; in MAC mode, it serves as the input path for the dense vector mantissa, controlling the timing of the bit-by-bit input so that only one bit of data is input per clock cycle, achieving time-division multiplexing of multi-bit data. The CAM sensitive amplifier is connected to the other end of the ML and amplifies and digitizes the matching results in CAM mode. Since the matching result signal transmitted by the ML in CAM mode is relatively weak, the CAM sensitive amplifier amplifies it and converts it into a digital signal (0 or 1) that indicates whether the index of the corresponding row matched. The digital signal is then transmitted to the matching information cache for storage. The sample-and-hold circuit is connected to the end of the SL and holds the analog calculation signal in MAC mode. Since the multiply-accumulate result in MAC mode converges on the SL as current, and the ADC conversion takes a certain amount of time, the sample-and-hold circuit holds the current signal (operation result) of each clock cycle stable until the ADC completes the analog-to-digital conversion, avoiding calculation errors caused by signal attenuation or distortion.
[0046] When the controller determines, based on the local sparsity of the sparse matrix, that a data block should be processed in sparse mode, the corresponding multi-function array is configured in CAM mode. The array then performs the index search as follows: The controller sends a CAM mode configuration command to the corresponding multi-function array, and the mode information is stored in the configuration information cache. At the same time, the controller shuts down the two ADCs in the PE (components related to multiply-accumulate) so that they do not interfere with the index search, and turns on the CAM-mode components such as the CAM sensitive amplifier, DL driver, and ML driver. Adjacent binary ReRAM devices in the same row constitute a CAM cell. Each CAM cell stores one bit of a sparse matrix index (row index or column index), and multiple CAM cells in the same row together store a complete multi-bit index value; that is, the bits of a single index are stored in the same row of the CAM array. The controller transmits the keyword to be searched (a row or column index) to the DL via the DL driver. The signal on the DL controls the NMOS gate of each memory cell, thereby controlling the conduction state of the ReRAM. Each CAM cell determines whether its stored bit matches the corresponding bit of the input keyword, and the result is transmitted to the ML via the ReRAM. The matching results of all CAM cells in a row are aggregated on the ML and transmitted to the CAM sensitive amplifier, which amplifies and digitizes the weak matching signal to obtain the matching result of the corresponding row (1 for a successful match and 0 for a failure). The matching result is transmitted to the matching information cache for storage, completing one index search.
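At the logical level, the index search can be sketched as a bitwise comparison of one keyword against every stored row. The sketch below assumes each row holds one index of a fixed bit width and that a row matches only when every bit position agrees. The electrical behavior of the ReRAM pairs and the sense amplifier is not modeled, and the function name `cam_search` is hypothetical.

```python
def cam_search(stored_indices, key, width):
    """Return one match bit per row: 1 if the stored index equals the key."""
    match_bits = []
    for stored in stored_indices:
        row_matches = True
        for pos in range(width):
            stored_bit = (stored >> pos) & 1   # bit held by one CAM cell
            key_bit = (key >> pos) & 1         # bit driven for this cell
            if stored_bit != key_bit:
                row_matches = False            # any mismatch fails the whole row
                break
        match_bits.append(1 if row_matches else 0)
    return match_bits


# Four rows holding 4-bit column indices; search for column 3.
match = cam_search([3, 7, 3, 12], key=3, width=4)
```

All rows are compared against the same keyword, which is why a single search returns the match status of the entire array at once.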
[0047] When the controller determines that the data in a certain area needs to be processed in dense mode, or that multiply-accumulate operations need to be performed in sparse mode, the corresponding multi-function array is configured in MAC mode. The array then performs the multiply-accumulate operation between the sparse matrix and the dense vector as follows: The controller sends a MAC mode configuration command to the corresponding multi-function array, and the mode information is stored in the configuration information cache. Simultaneously, the controller disables the CAM-mode components such as the CAM sensitive amplifier and enables the sample-and-hold circuit and the ADC of the corresponding precision (high-precision ADC for dense mode, low-precision ADC for sparse mode). Each ReRAM cell stores one bit of sparse matrix data (values) using the bit-slicing method: different bits of the same multi-bit value are stored at the same location in different multi-function arrays, which makes subsequent parallel operation possible. The mantissa data of the dense vector is transmitted to the ML driver and input one bit per clock cycle in a time-division multiplexed manner, so a multi-bit mantissa is input over multiple clock cycles. Each input mantissa bit is multiplied by the matrix value bit stored in the ReRAM cell, and the results converge on the SL in the form of current. The sample-and-hold circuit holds the current signal on the SL (the result of each clock cycle) and passes it to the corresponding ADC, which converts the analog current signal into a digital signal, completing one multiply-add operation. The digital signal is then transmitted as an intermediate result to the output buffer for temporary storage.
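The combination of bit slicing and bit-serial input can be checked with a small integer model. The sketch assumes unsigned integer operands, one slice per matrix-value bit, and an ideal ADC that returns the exact count of conducting cells per column and cycle; the digital shift-and-add that recombines the counts is also an assumption about how the partial results are weighted. The function name `bit_sliced_mac` is hypothetical.

```python
def bit_sliced_mac(weights, inputs, w_bits, x_bits):
    """Dot product of unsigned ints computed from 1-bit cells and 1-bit inputs."""
    # Slice j holds bit j of every matrix value (one slice per array).
    slices = [[(w >> j) & 1 for w in weights] for j in range(w_bits)]

    total = 0
    for t in range(x_bits):                      # one input bit per clock cycle
        in_bits = [(x >> t) & 1 for x in inputs]
        for j, cell_bits in enumerate(slices):
            # Column "current": number of cells where stored bit AND input bit are 1.
            count = sum(c & b for c, b in zip(cell_bits, in_bits))
            # Shift-accumulate with the combined bit weight 2**(j + t).
            total += count << (j + t)
    return total


weights = [5, 3, 7]
inputs = [2, 6, 1]
result = bit_sliced_mac(weights, inputs, w_bits=3, x_bits=3)
```

Because every cell and every input carries a single bit, each per-cycle column sum is a small integer, which is what allows a limited-resolution ADC to digitize it.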
[0048] In some embodiments, as shown in Figure 6, the accelerator in this application employs a time-delay-based floating-point data computation method, supporting SpMV operations on double-precision floating-point data and addressing the difficulty of floating-point exponent alignment and the resulting loss of computational accuracy. First, all double-precision floating-point data in the sparse matrix and dense vector are split into an exponent part and a mantissa part. The exponent part is used for data alignment, and the mantissa part is used for multiply-add operations. The splitting process is scheduled by the controller. The exponent of each floating-point element in the sparse matrix and the exponent of the corresponding element in the dense vector are summed using the adder in the shift-accumulator circuit in the PE. The summed exponent is scheduled by the controller and stored in the register file for fast retrieval by the delay unit. The delay unit reads all exponent sums stored in the register file and identifies the maximum exponent among the data participating in the operation in the array currently operating in MAC mode (in dense mode, all data in the MAC array participates, so the maximum over all exponents is taken; in sparse mode, only the MAC array data corresponding to matched indices participates, so the maximum over that subset is taken). The delay unit calculates the difference between each exponent sum and this maximum, uses the difference as the delay value, and sends it to the delay register of the corresponding MAC array for storage. Each delay value corresponds to the input timing of one mantissa. When performing floating-point multiply-accumulate operations, the controller delays each dense vector mantissa in the input buffer by the number of clock cycles given by the delay value stored in the delay register before inputting it into the MAC-mode array. For example, if the delay value corresponding to a certain mantissa is 3, this mantissa is input into the MAC array 3 clock cycles after the mantissas whose delay value is 0, thereby achieving exponent alignment of all mantissas participating in the operation. The aligned mantissa is then multiplied and accumulated with the mantissa of the sparse matrix stored in the MAC array to obtain the analog operation result, which is converted into a digital signal by the ADC. The controller then schedules the shift-accumulator circuit to shift the mantissa of the result to meet the mantissa width required by the IEEE-754 double-precision floating-point standard (52 stored mantissa bits). At the same time, the exponent of the result is adjusted according to the number of bits by which the mantissa was shifted (shifting by n bits increases or decreases the exponent by n). The adjusted floating-point result is stored in the output buffer and finally merged in the merging stage to output the complete SpMV operation result.
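A worked model may help show why delaying a bit-serial mantissa is equivalent to aligning exponents. The sketch makes several assumptions that the text does not state: mantissas are unsigned integers, vector mantissa bits are streamed most significant bit first, the accumulator doubles once per clock cycle, and the ADC is ideal. Under these assumptions a mantissa that starts d cycles late is doubled d fewer times, so its weight is reduced by a factor of 2^d. The function name `delayed_bit_serial_dot` and its parameters are hypothetical.

```python
import math


def delayed_bit_serial_dot(matrix_mantissas, vector_mantissas, exp_sums, bits):
    """Compute sum(mw * mx * 2**e) using delayed MSB-first bit-serial input."""
    e_max = max(exp_sums)
    delays = [e_max - e for e in exp_sums]       # delay value per mantissa
    d_max = max(delays)

    acc = 0
    for t in range(bits + d_max):                # total number of clock cycles
        column_sum = 0
        for mw, mx, d in zip(matrix_mantissas, vector_mantissas, delays):
            k = t - d                            # position in this mantissa's own stream
            if 0 <= k < bits:
                bit = (mx >> (bits - 1 - k)) & 1  # MSB first
                column_sum += mw * bit
        acc = (acc << 1) + column_sum            # shift-accumulate once per cycle
    # The accumulated integer carries a common scale of 2**(e_max - d_max).
    return math.ldexp(acc, e_max - d_max)


mw = [3, 5, 2]                  # matrix mantissas stored in the array
mx = [0b1011, 0b0110, 0b1111]   # 4-bit vector mantissas: 11, 6, 15
exps = [2, -1, 0]               # exponent sums, giving delays 0, 3 and 2
aligned = delayed_bit_serial_dot(mw, mx, exps, bits=4)
reference = sum(w * x * 2.0 ** e for w, x, e in zip(mw, mx, exps))
```

In this example the delayed computation reproduces the reference value of 177.0 exactly because no bits are discarded; a real array with a finite accumulator or ADC resolution would truncate low-order contributions.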
[0049] To verify the technical effectiveness of this application, the same sparse matrix dataset was used to compare the performance and energy efficiency of the MACAM accelerator proposed in this application with two existing mainstream technical solutions (fully dense mode accelerator and fully sparse mode accelerator). Using the same hardware parameters (array size 64×64, ReRAM storage density, ADC conversion speed, etc.), the same sparse matrix dataset (covering matrices with different local sparsity), and the same computational task (SpMV double-precision floating-point operations), the computation time (performance) and energy consumption (energy efficiency) of the three solutions were tested. Compared to the fully dense mode approach, this application automatically identifies irregular sparse data regions and uses sparse mode for storage and computation, eliminating the need to transfer this data to the GPU for computation and avoiding data migration overhead. This results in a 97.41x performance improvement and a 213.65x energy saving. Compared to the fully sparse mode approach, this application can use a dense mode in locally dense data regions, fully leveraging the high parallelism of the MAC array and avoiding the low parallelism of the fully sparse mode. This results in a 6.56x performance improvement and a 10.06x energy saving. Test results show that the MACAM accelerator of this application can balance high-parallelism computation with the storage and computation of irregular sparse data. Through adaptive mode switching, it significantly improves the performance and energy efficiency of SpMV operations, solving the core shortcomings of existing technologies and possessing extremely high practical value.
[0050] According to one embodiment of this application, a method for calculating the product of a sparse matrix and dense vectors based on the aforementioned accelerator is proposed, comprising: traversing the sparse matrix and calculating the local sparsity of each data block in the sparse matrix that matches a preset array size; determining that the corresponding data block is processed in a dense mode when the local sparsity exceeds a preset threshold, otherwise determining that the corresponding data block is processed in a sparse mode; writing the sparse matrix data blocks and dense vector data into a multi-functional array according to the corresponding mode; performing multiply-accumulate operations or index search using the multi-functional array, combined with a latency-based floating-point data alignment mechanism, to obtain intermediate calculation results; and merging the intermediate calculation results of different data blocks to obtain the final calculation result.
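The block classification step of the method can be sketched as follows. One point is an assumption rather than something stated in this section: the metric compared against the threshold is taken to be the fraction of non-zero elements in the block, so that a value above the threshold selects dense mode as the text specifies. The function name `classify_blocks`, the dictionary output, and the example threshold are hypothetical.

```python
def classify_blocks(matrix, block, threshold):
    """Map each block's top-left corner to 'dense' or 'sparse'."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    modes = {}
    for r0 in range(0, n_rows, block):
        for c0 in range(0, n_cols, block):
            cells = [
                matrix[r][c]
                for r in range(r0, min(r0 + block, n_rows))
                for c in range(c0, min(c0 + block, n_cols))
            ]
            # Assumed metric: fraction of non-zero entries in the block.
            nonzero_fraction = sum(1 for v in cells if v != 0) / len(cells)
            modes[(r0, c0)] = "dense" if nonzero_fraction > threshold else "sparse"
    return modes


matrix = [
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 9, 0],
    [6, 0, 7, 8],
]
modes = classify_blocks(matrix, block=2, threshold=0.5)
```

Here the top-left and bottom-right 2×2 blocks are classified as dense and the other two as sparse; in the accelerator the block size would match the preset array size rather than 2.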
[0051] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working process of the system and modules described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0052] The various embodiments in this application are described in a progressive manner, with each embodiment focusing on the differences from other embodiments or implementation methods. Similar or identical parts between the various embodiments of this application can be referred to mutually. The implementation principles and technical effects of the inventive concept can be mutually referenced, and will not be repeated here. Where there is no conflict, the various embodiments or implementation methods in this application can be combined with each other.
[0053] It should be noted that although the steps are described in a specific order above, it does not mean that the steps must be executed in the above specific order. In fact, some of these steps can be executed concurrently or even in a different order, as long as the required function can be achieved.
[0054] This application may be a system, method, and / or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of this application.
[0055] Computer-readable storage media can be tangible devices that hold and store instructions for use by an instruction execution device. Computer-readable storage media can include, for example, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination thereof.
[0056] This application uses specific embodiments to illustrate the principles and implementation methods of this application. The description of the above embodiments is only for the purpose of helping to understand the solution and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. An in-memory computing accelerator for dense vector multiplication of sparse matrices, the accelerator comprising a controller, multiple multi-function arrays, and a merging module, wherein: the controller is configured to traverse the sparse matrix and calculate the local sparsity of each data block in the sparse matrix that matches a preset array size; when the local sparsity exceeds a preset threshold, to determine that the corresponding data block is processed in a dense mode, and otherwise to determine that the corresponding data block is processed in a sparse mode; and to load the corresponding data block into the corresponding multi-function array according to the processing mode determined for that data block, so as to perform the multiplication calculation between the data block and the dense vector; each of the multiple multi-function arrays is configured to perform the functions of a MAC array or a CAM array in accordance with the processing mode determined for the corresponding data block; and the merging module is configured to merge the calculation results of different data blocks to obtain the final calculation result.
2. The accelerator according to claim 1, wherein, In dense mode, the multi-functional array configured to implement a MAC array is configured as follows: The corresponding dense matrix data blocks are stored using a bit-slicing method; Perform floating-point multiplication between each element of the stored corresponding data block and the corresponding element of the dense vector to obtain the calculation result.
3. The accelerator according to claim 1, wherein, In sparse mode, the multi-functional array configured to implement the CAM array is configured to: store the row index value and column index value corresponding to each non-zero value of the corresponding data block, and match the column index value and row index value of each non-zero value of the corresponding data block to be calculated based on the row index value of each non-zero value of the dense vector to obtain the matching result; The multi-functional array configured to implement the MAC array in sparse mode is configured to: store each non-zero value of the corresponding data block, associate each non-zero value of the corresponding data block and each non-zero value of the dense vector according to the matching result, and perform floating-point multiplication between each non-zero value of the associated stored corresponding data block and the non-zero value corresponding to the dense vector to obtain the calculation result.
4. The accelerator according to claim 3, wherein, In sparse mode, the bits of a single index are stored in the same row of the CAM array, and the value corresponding to the index in the Nth CAM array is stored in the Nth column of multiple MAC arrays in a bit slice manner, where N is an integer greater than zero.
5. The accelerator according to claim 1, further comprising a first analog-to-digital converter, a second analog-to-digital converter, a shift-accumulator circuit, an input buffer, an output buffer, a delay unit, a register file, and a configuration information buffer, wherein: the first analog-to-digital converter is used to convert analog signals into digital signals in dense mode; the second analog-to-digital converter is used to convert analog signals into digital signals in sparse mode; the shift-accumulator circuit is used to sum the exponents of the sparse matrix and the dense vector; the input buffer and the output buffer are used to buffer input data and output data, respectively; the delay unit is used to implement floating-point data alignment based on the delay value; the register file is used to store the exponent values of floating-point data; and the configuration information buffer is used to store mode configuration information for different arrays.
6. The accelerator according to claim 1, wherein, The multi-functional array consists of internal storage units and peripheral circuits; The memory cells are arranged in a horizontal and vertical manner. Each memory cell includes an N-type metal-oxide-semiconductor transistor and a resistive random access memory (RRAM). The N-type metal-oxide-semiconductor transistor is used to control the opening and closing of the corresponding memory cell. The RRAM is used to store one bit of multi-bit data. The gate of the N-type metal-oxide-semiconductor transistor is connected to the data line, its source is connected to the sensing line, its drain is connected to one end of the RRAM, and the other end of the RRAM is connected to the matching line. The peripheral circuitry includes a data line driver, a matching line driver, a CAM sensitive amplifier, and a sample-and-hold circuit. The two ends of the matching line are connected to the matching line driver and the CAM sensitive amplifier, respectively. The end of the data line is connected to the data line driver, and the end of the sensing line is connected to the sample-and-hold circuit.
7. The accelerator according to claim 6, wherein, When a multi-functional array is configured to perform the functions of a CAM array, adjacent resistive variable memories on the same row will form a CAM cell, the row and column indices will be input through the data lines, the matching results of each row will be transmitted through the matching lines and converted into digital signals by the CAM sensitive amplifier; When a multi-functional array is configured to implement the function of a MAC array, each resistive variable memory will store one bit of sparse matrix data, and dense vector data will be input bit by bit through the matching line, one bit per clock cycle. The calculation results are converged on the sensing line in the form of current, and the calculation results for each clock cycle are acquired and saved by the sample-and-hold circuit.
8. The accelerator according to claim 6, wherein, The accelerator uses a time-delay-based floating-point data calculation method to perform SpMV operations, where the floating-point data includes an exponent and a mantissa. The exponent is used for data alignment, and the mantissa is used for multiplication and addition operations. The exponents of the sparse matrix and the dense vector are summed by the adder in the shift-accumulator circuit, and the summation result is stored in the register file; The delay unit identifies the maximum exponent value corresponding to the data actually involved in the calculation in the MAC array, calculates the difference between each exponent and the maximum value, and stores the difference as the delay value in the delay register of the corresponding array. When performing floating-point multiply-add operations, each mantissa of a dense vector is delayed by a corresponding number of clock cycles before being input into the MAC array, according to its corresponding delay value, so as to achieve exponent alignment between different mantissas, and then perform multiply-add operations with the sparse matrix data stored in the array. The calculation result is processed by shifting and accumulating the mantissa through a shift-accumulator circuit, and the exponent of the result is adjusted according to the actual number of shifts. Finally, a floating-point calculation result conforming to the standard format is output.
9. A method for calculating the product of a sparse matrix and a dense vector based on the accelerator according to any one of claims 1-8, comprising: traversing the sparse matrix and calculating the local sparsity of each data block in the sparse matrix that matches the preset array size; when the local sparsity exceeds the preset threshold, determining that the corresponding data block is processed in dense mode, and otherwise determining that the corresponding data block is processed in sparse mode; writing the sparse matrix data blocks and dense vector data into the multi-functional array according to the corresponding mode; performing multiply-accumulate operations or index search using the multi-functional array, combined with a time-delay-based floating-point data calculation method, to obtain intermediate calculation results; and merging the intermediate calculation results of different data blocks to obtain the final calculation result.
10. A computer program product comprising a computer program that, when executed by a processor, implements the method of claim 9.