A memory-aware sparse matrix multiplication method suitable for edge embedded platforms
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-30
AI Technical Summary
Traditional sparse matrix multiplication suffers from high computational latency and high memory power consumption on edge embedded platforms, making it difficult to meet the low latency and low power consumption requirements of real-time attitude perception for drones, dynamic smoothing of motion trajectories for intelligent robots, and real-time denoising of physiological signals for wearable devices.
The sparse matrix is stored as column segments in column-major order, and the dense matrix and result matrix are stored as continuous row segments. Block parameters are chosen to fit the on-chip cache capacity of the edge embedded platform, and data reuse is achieved by preloading the row segment elements of the dense matrix. A six-level block loop structure performs the multiply-add operations and is optimized with multi-core parallelism and vectorized operation.
The method significantly improves cache hit rate, reduces memory access overhead, lowers computational latency and memory power consumption, and meets the low-latency, low-power, and lightweight deployment requirements of edge embedded platforms.
[Figure: CN122309907A abstract drawing]
Abstract
Description
Technical Field
[0001] This application relates to the field of embedded platform technology, and in particular to a memory-aware sparse matrix multiplication method suitable for edge embedded platforms.
Background Technology
[0002] In edge embedded scenarios such as drones, intelligent robots, and wearable devices, core data processing tasks such as attitude perception signal filtering, motion trajectory smoothing, and physiological signal denoising often employ filtering techniques based on polynomial approximation or recursive forms to achieve signal denoising, feature extraction, and data smoothing. The core computation of such filters can be abstracted as the repeated application of a sparse linear operator A to a multi-channel signal matrix B; that is, the sparse matrix-dense matrix multiplication (SpMM) C = A·B must be executed at high frequency. Here the sparse linear operator A is determined by the sparsity of the filter kernel, the multi-channel signal matrix B is a dense matrix, and the result C is the filtered target signal matrix. The execution efficiency of this operation directly determines the latency and real-time performance of filtering on the edge embedded device.
[0003] Block partitioning is a key technique for improving data locality and enabling data reuse in SpMM, and it is widely used for matrix operation optimization in high-performance computing. Although traditional SpMM implementations such as CSR-Vec, ASpT, and J-Stream attempt to improve computational efficiency through block partitioning and vectorization, none of them is designed specifically for the hardware characteristics of edge embedded platforms. Constrained by hardware cost and power consumption, these platforms generally have small on-chip caches, limited memory bandwidth and physical capacity, and strict power budgets. The data layouts and memory access patterns of traditional SpMM implementations are poorly matched to the cache hierarchy of embedded platforms. On the one hand, traditional solutions often store dense matrices in row-major order, and blocking along the dense dimension K is missing or unreasonable, resulting in discontinuous memory access within blocks. As the dimension K of the multi-channel signal matrix grows, cache occupancy readily exceeds the limited cache capacity of the embedded platform, the cache hit rate drops significantly, and large amounts of data are repeatedly exchanged between the cache and main memory, causing serious memory access overhead. On the other hand, traditional blocking strategies do not account for the multi-core resources of embedded platforms, resulting in low data reuse efficiency and unbalanced memory access pressure between the dense and sparse sides, which in turn leads to low resource utilization and limited parallel efficiency when multiple cores run in parallel.
[0004] These problems directly cause high computational latency and significant memory power consumption when traditional SpMM implementations perform filtering-related SpMM operations on edge embedded platforms. This makes it difficult to meet the core requirements of low latency, low power consumption, and lightweight deployment for filtering tasks in edge embedded scenarios such as real-time attitude perception of UAVs, dynamic smoothing of intelligent robot motion trajectories, and real-time denoising of physiological signals of wearable devices.
Summary of the Invention
[0005] Therefore, it is necessary to provide a memory-aware sparse matrix multiplication method suitable for edge embedded platforms that can reduce computational latency and memory power consumption, in order to address the above-mentioned technical problems.
[0006] A memory-aware sparse matrix multiplication method suitable for edge embedded platforms, the method being applied to polynomial filtering data processing tasks on edge embedded platforms, comprising:
Dividing the sparse matrix into row blocks and storing each row block as continuous column segments in column-major order to form a sparse matrix with column segment layout, where a column segment is the set of non-zero elements corresponding to one column within a row block.
Dividing the dense matrix and the result matrix to be generated into blocks along the preset column dimension, and laying out the partitioned dense matrix and result matrix as continuous row segments to form a row segment layout dense matrix and a row segment layout result matrix.
Determining, based on the on-chip cache capacity of the edge embedded platform, the row block partitioning parameter of the sparse matrix and the column dimension partitioning parameter of the dense matrix and the result matrix, so that a single block of the result matrix can reside in the on-chip cache.
Taking the row segment layout dense matrix as input, traversing each active column segment of the column segment layout sparse matrix in turn, where an active column segment is a column segment containing non-zero elements, and preloading the row segment elements of the dense matrix that match the current active column segment into the registers of the edge embedded platform so that these row segment elements are reused.
Performing multiply-add operations on the non-zero elements of the current active column segment and the preloaded dense matrix row segment elements, accumulating the intermediate results into the corresponding block of the row segment layout result matrix, and obtaining the final row segment layout result matrix after the multiply-add operations and accumulation have been completed for all active column segments.
Restoring the format of the final row segment layout result matrix and outputting the final sparse matrix-dense matrix multiplication result.
[0007] The above method achieves continuous sequential memory access within matrix blocks by using a column-major column segment layout for the sparse matrix and a continuous row segment layout for the dense matrix and the result matrix. Combined with block parameters adapted to the on-chip cache of edge embedded platforms, preloading dense matrix row segment elements enables data reuse, which significantly improves the cache hit rate and reduces memory access overhead. Together with multi-core parallelism and vectorized operation, the method improves SpMM efficiency, reduces computational latency and memory power consumption, and meets the low latency, low power consumption, and lightweight deployment requirements of polynomial filtering tasks on edge embedded platforms.
Attached Figure Description
[0008] Figure 1 is a flowchart of a memory-aware sparse matrix multiplication method suitable for edge embedded platforms in one embodiment.
Figure 2 is a flowchart of the MaSpMM algorithm corresponding to the method of this application in one embodiment.
Figure 3 is a flowchart of the MaSpMM algorithm in one embodiment.
Detailed Implementation
[0009] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0010] In one embodiment, as shown in Figure 1, a memory-aware sparse matrix multiplication method suitable for edge embedded platforms is provided, applied to polynomial filtering data processing tasks on edge embedded platforms, including the following steps:
Step 102: Divide the sparse matrix into row blocks and store them as continuous column segments in column-major order to form a sparse matrix with column segment layout. Each column segment is the set of non-zero elements corresponding to one column in the row block.
[0011] Column-major order storage breaks away from the traditional row-major order storage convention for sparse matrices. It divides the sparse matrix A into several row blocks using preset row block parameters. Non-zero elements within each row block are sorted by column dimension and stored contiguously. The set of non-zero elements corresponding to each column within a row block constitutes a column segment. This storage method allows non-zero elements sharing the same column to reuse the same row in the dense matrix B through SIMD-supported loading. Simultaneously, it enables contiguous storage of the column-segmented sparse matrix in physical memory, achieving continuous memory access within blocks. This solves the problems of irregular memory access and low cache reuse rate in traditional CSR-format sparse matrices, and is particularly well-suited to the on-chip caching characteristics of edge embedded platforms, reducing the overhead of memory data movement.
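The column segment layout described above can be illustrated with a minimal Python sketch. It assumes the sparse matrix arrives as coordinate triples; the function name `build_column_segments` and the nested-list representation are hypothetical choices for illustration, not the storage format of the application, which would use flat contiguous arrays.

```python
from collections import defaultdict


def build_column_segments(rows, cols, vals, num_rows, row_block):
    """Group non-zeros into row blocks, then into per-column segments.

    Returns one entry per row block. Each entry is a list of
    (col, [(local_row, value), ...]) tuples sorted by column index.
    Columns without a non-zero in the block are simply absent, so every
    listed segment is an active column segment.
    """
    num_blocks = (num_rows + row_block - 1) // row_block
    blocks = [defaultdict(list) for _ in range(num_blocks)]
    for r, c, v in zip(rows, cols, vals):
        blocks[r // row_block][c].append((r % row_block, v))
    # Sort by column so that non-zeros sharing a column sit next to each other.
    return [[(c, sorted(segs[c])) for c in sorted(segs)] for segs in blocks]


# 4 x 4 example with six non-zeros (illustrative data only).
rows = [0, 0, 1, 2, 3, 3]
cols = [0, 2, 2, 1, 1, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
layout = build_column_segments(rows, cols, vals, num_rows=4, row_block=2)
```

With a row block size of 2, the two non-zeros in column 2 of the first row block end up in the same segment, which is what allows them to share one row of B.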
[0012] Step 104: Divide the dense matrix and the result matrix to be generated into blocks along the preset column dimensions, and lay out the divided dense matrix and the result matrix into continuous row segments to form a row segment layout dense matrix and a row segment layout result matrix.
[0013] The preset column dimension is the column dimension K of the dense matrix B and the result matrix C. After the dense matrix B and the result matrix C to be generated are divided into blocks along this dimension, the elements in each block are stored continuously along the row dimension to form row segments, so that each block, whose extent is set by the block parameters (including the block size along column dimension K), is contiguous in physical memory. This row segment layout solves the problem of discontinuous memory access within blocks caused by storing dense matrices in row-major order in the traditional J-Stream and ASpT schemes, so that block accesses to dense matrix B and result matrix C are sequential, which improves spatial locality. At the same time, the row segment layout of result matrix C allows its blocks to remain cached during accumulation, greatly improving the cache reuse rate and adapting to the small on-chip cache capacity of edge embedded platforms.
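A small Python sketch may help picture the row segment layout. It assumes the dense matrix starts out as a row-major list of rows; the function name `pack_row_segments` is hypothetical, and the ordering shown (column blocks one after another, row segments contiguous inside each block) is one reading of the description rather than a confirmed memory map.

```python
def pack_row_segments(dense, col_block):
    """Flatten a row-major matrix into row segment layout.

    Column blocks are stored one after another. Inside a column block,
    the row segment of row 0 comes first, then row 1, and so on, so any
    run of consecutive rows within a column block is contiguous.
    """
    num_cols = len(dense[0])
    packed = []
    for k0 in range(0, num_cols, col_block):
        for row in dense:
            packed.extend(row[k0:k0 + col_block])
    return packed


B = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
packed = pack_row_segments(B, col_block=2)
```

In this example the first six packed values cover columns 0 and 1 of all three rows, and the remaining six cover columns 2 and 3.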
[0014] Step 106: Based on the on-chip cache capacity of the edge embedded platform, determine the row block partitioning parameters of the sparse matrix and the column dimension partitioning parameters of the dense matrix and the result matrix, so that a single block of the result matrix can reside in the on-chip cache.
[0015] The row block partitioning parameter of the sparse matrix is the row block size, and the column dimension partitioning parameter of the dense matrix and the result matrix is the column block size; both are core block parameters. They must be chosen to match the on-chip cache capacity of the edge embedded platform, so that a single block of the result matrix C can reside entirely in the on-chip cache until all multiply-add operations on that block are completed, avoiding frequent data exchange between the cache and main memory during computation. The choice of block parameters also takes into account the sparsity characteristics of the sparse matrix A: the optimal values are selected by modeling the total data movement cost, balancing cache utilization against data movement overhead and fully adapting to the cache hierarchy of edge embedded platforms.
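As a rough illustration of the residency requirement, the sketch below filters candidate block sizes with a deliberately simplified stand-in rule: one block of C (row block size times column block size elements) must fit in the cache. This rule is an assumption made for illustration only; the capacity constraint used by the application is stated separately and also involves the density and non-zero count of A. The function name, the candidate lists, and the 32 KiB cache size are all hypothetical.

```python
def feasible_block_sizes(cache_bytes, elem_bytes, row_candidates, col_candidates):
    """Return (row_block, col_block) pairs whose single C block fits in cache.

    ASSUMPTION: only the C block is counted; the real constraint is stricter.
    """
    return [(ti, tk)
            for ti in row_candidates
            for tk in col_candidates
            if ti * tk * elem_bytes <= cache_bytes]


# Hypothetical 32 KiB cache holding 4-byte floats.
pairs = feasible_block_sizes(32 * 1024, 4, [64, 128, 256], [16, 32, 64])
```

Only the largest combination (256 by 64) is rejected here, since it would need 64 KiB for the C block alone.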
[0016] Step 108: Using the dense matrix with row segment layout as input, traverse each active column segment of the sparse matrix with column segment layout in turn. The active column segment is the column segment containing non-zero elements. Preload the row segment elements in the dense matrix that match the current active column segment into the register of the edge embedded platform to realize the reuse of row segment elements in the dense matrix.
[0017] Active column segments are the column segments of the column segment layout sparse matrix that contain non-zero elements. Traversing only active column segments avoids useless accesses to zero elements, reducing computational and memory access overhead. During traversal, the column index of the current active column segment identifies the matching row segment in the dense matrix B, and the elements of that row segment are preloaded into the registers of the edge embedded platform. Multiple non-zero elements sharing the same column index can then read the row segment elements of B directly from the registers without reloading them from memory. This achieves efficient reuse of row segment elements in the dense matrix B, significantly reducing memory access frequency and adapting to the limited memory bandwidth of edge embedded platforms.
[0018] Step 110: Perform multiplication and addition operations on the non-zero elements in the currently active column segment and the preloaded dense matrix row segment elements. Accumulate the intermediate results of the operation to the corresponding blocks of the result matrix of the row segment layout. After completing the multiplication and addition operations and result accumulation of all active column segments, the final result matrix of the row segment layout is obtained. The format of the final result matrix of the row segment layout is restored, and the final sparse matrix-dense matrix multiplication operation result is output.
[0019] The multiply-accumulate operation is implemented using a six-level block loop structure combined with vectorized operations. During the operation, the SIMD registers of the edge embedded platform are used to perform vector fused multiply-add operations, improving throughput. Intermediate results are accumulated directly into the corresponding block of the result matrix C in the on-chip cache, avoiding the memory write-back overhead of intermediate results. After all active column segments have been processed, the row segment layout result matrix C is a collection of consecutive row segments. Format restoration concatenates them in storage order into a standard dense matrix, which can be used directly as the filtering result output of the polynomial filtering task on the edge embedded platform without further format conversion, achieving lightweight deployment.
[0020] The above method achieves high cache hit rate and low memory access overhead for SpMM operations by using a dual-end continuous storage design with column and row segment layouts, combined with block parameters and data reuse strategies adapted to edge embedded platform caches. At the same time, it improves computational efficiency through multi-core parallelism and vectorized operation optimization, effectively solving the problems of high computational latency and large memory power consumption in traditional SpMM implementations on edge embedded platforms, and meeting the low latency, low power consumption, and lightweight deployment requirements of polynomial filtering tasks.
[0021] In one embodiment, row block partitioning of the sparse matrix uses the row block size of the sparse matrix A as its partitioning parameter, and partitioning of the dense matrices B and C along the preset column dimension uses the column block size along column dimension K as its partitioning parameter. The row block size and the column block size together satisfy the capacity constraint of the on-chip cache of the edge embedded platform:
[formula image not reproduced]
[0022] where the constraint is expressed in terms of the density of the sparse matrix A, the number of non-zero elements in the sparse matrix A, and the on-chip cache capacity of the edge embedded platform.
[0023] Specifically, the block parameters determined by this capacity constraint can avoid cache overflow at the hardware resource level, solve the problems of low cache hit rate and frequent data exchange caused by the cache usage exceeding the cache capacity of the embedded platform in the traditional SpMM scheme, significantly reduce memory access overhead, and adapt to the inherent characteristics of small on-chip cache capacity of edge embedded platforms, laying the foundation for low-power computing.
[0024] In one embodiment, the block parameters (row block size and column block size) are determined with the optimization objective of minimizing the total data movement cost, which is calculated as:
[formula image not reproduced]
[0025] where the formula involves the number of active column segments of the sparse matrix A under the given block parameter, the number of rows M of the sparse matrix A, and the column dimension K of the dense matrices B and C.
[0026] Specifically, the total data movement cost formula is an analytical estimate of memory traffic after partitioning along the K dimension. Because constant terms in M and K are ignored, the minimization objective is driven mainly by the effect of the number of active column segments and the column block size on memory traffic; the active column segment count, as a function of the block parameter, captures the impact of the two-dimensional sparse structure of the sparse matrix A on memory traffic. When determining the block parameters, the cost is evaluated over a set of candidate block parameters that satisfy the on-chip cache capacity constraint, and the pair of row block size and column block size that minimizes the total data movement cost is selected as the optimal partitioning parameters. This optimization objective reduces the amount of data moved between memory and cache, further reducing memory bandwidth pressure and power consumption on the edge embedded platform. It also uses the two-dimensional sparsity characteristics of the sparse matrix to tailor the partitioning parameters to each matrix, adapting to sparse matrices with different sparsity patterns.
[0027] In one embodiment, the multiply-add operations are performed using a six-level block loop structure. The outer three loops are traversed in the following order: first, the blocks of the dense matrices B and C along column dimension K; then, the row blocks of the sparse matrix A; finally, a streaming loop along column dimension J, where the block size of the streaming loop is 1. The inner three loops are traversed in the following order: first, each active column segment in the current row block of the sparse matrix A; then, the row segment elements in the column block of the dense matrix B; and finally, each non-zero element in the current active column segment.
[0028] Specifically, the six-level block loop structure is an execution order redesigned from the loop framework of the TileSpMM algorithm, tailored to the hardware characteristics of edge embedded platforms and the segment storage layout of this application. Figure 2 shows the flowchart of the MaSpMM algorithm. The first loop is the K loop (line 1), which traverses the dense matrix blocks, and the second loop is the I loop (line 3), which traverses the row blocks of the sparse matrix A. This order is chosen because accessing dense column blocks usually incurs more memory pressure than traversing sparse row blocks of A, so MaSpMM prioritizes reuse of dense matrix column blocks. The third loop is a streaming loop in the J direction. These three loops form the outer block traversal of MaSpMM.
The fourth loop traverses each active column segment of the current row block to obtain its column index (lines 4 and 5). To reuse dense row segments of matrix B, the fifth loop traverses the row segment elements within the dense column block of B (line 6) and preloads them (line 7). The innermost loop traverses each element of the current active column segment (line 8), multiplies it by the preloaded row segment elements of B, and writes the result to the corresponding position in the C block (line 14).
In lines 11-14 of Algorithm 4, MaSpMM performs vectorized computation within each block: the non-zero element of A is first broadcast to a SIMD register (line 11), and the corresponding row segment of C is loaded into a vector register (line 12). A vector fused multiply-add then operates on the broadcast value and the preloaded row segment of B, accumulates into the corresponding row segment of the C block, and stores the vector (lines 13-14). Because the row segment of B is preloaded before the innermost loop runs (line 7), multiple non-zero elements sharing the same column index reuse that segment without reloading it from memory. The C block remains cached until computation on the row blocks of A and the column blocks of B is complete. This design reduces redundant access to dense matrices and makes full use of SIMD instructions to improve throughput.
Furthermore, as shown in Figure 3, this application adopts a row segment format that divides the dense matrix into column blocks, with the row segments within each column block stored continuously. Storing the sparse matrix as continuous column segments improves cache reuse of the output matrix C.
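The loop order described in this embodiment can be sketched in scalar Python as follows. This is an illustrative reading, not the MaSpMM algorithm of Figure 2: the function names, the nested-list representation of A, and the modeling of the streaming J loop as a single pass are all assumptions, and the SIMD broadcast and fused multiply-add are replaced by plain scalar arithmetic.

```python
def pack_row_segments(dense, col_block):
    """Row-major list of rows -> column blocks of contiguous row segments."""
    packed = []
    for k0 in range(0, len(dense[0]), col_block):
        for row in dense:
            packed.extend(row[k0:k0 + col_block])
    return packed


def maspmm_sketch(col_segments, b_packed, m, n, k_dim, row_block, col_block):
    """Six nested loops over a column-segment A and a row-segment B.

    col_segments[ib] lists (col, [(local_row, value), ...]) for row block ib.
    b_packed holds B (n x k_dim); the return value holds C (m x k_dim).
    Both use the row segment layout produced by pack_row_segments.
    """
    c_packed = [0.0] * (m * k_dim)
    for k0 in range(0, k_dim, col_block):              # loop 1: column blocks of B and C
        width = min(col_block, k_dim - k0)
        b_base, c_base = k0 * n, k0 * m                # start offsets of this column block
        for ib, segments in enumerate(col_segments):   # loop 2: row blocks of A
            row0 = ib * row_block
            for _ in range(1):                         # loop 3: streaming J loop (ASSUMED single pass)
                for col, nonzeros in segments:         # loop 4: active column segments
                    for kk in range(width):            # loop 5: elements of the B row segment
                        # Load the B element once; every non-zero below reuses it.
                        b_val = b_packed[b_base + col * width + kk]
                        for local_row, a_val in nonzeros:   # loop 6: non-zeros of the segment
                            idx = c_base + (row0 + local_row) * width + kk
                            c_packed[idx] += a_val * b_val
    return c_packed


# A is 4 x 4 with row_block = 2, written directly in column segment form.
col_segments = [
    [(0, [(0, 1.0)]), (2, [(0, 2.0), (1, 3.0)])],
    [(1, [(0, 4.0), (1, 5.0)]), (3, [(1, 6.0)])],
]
B = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
c_packed = maspmm_sketch(col_segments, pack_row_segments(B, 2),
                         m=4, n=4, k_dim=3, row_block=2, col_block=2)
```

The returned list is still in row segment layout, so the first eight values belong to columns 0 and 1 of C and the last four to column 2. A real implementation would replace loops 5 and 6 with vector loads and fused multiply-add instructions over whole row segments.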
[0029] In one embodiment, the multiply-add operations are performed in vectorized form. Each non-zero element in the current active column segment is broadcast to a SIMD register of the edge embedded platform, and the row segment elements of the corresponding block of the result matrix C are loaded into a vector register. A vector fused multiply-add completes the multiplication of the non-zero element with the row segment elements of the dense matrix B and the accumulation, and the result is stored directly into the row segment elements of the result matrix C held in the vector register.
[0030] Specifically, vectorized computation fully leverages the SIMD (Single Instruction Multiple Data) architecture of edge embedded platforms to achieve parallel computation of multiple data points with a single instruction, thereby improving computational throughput. During the computation process, the non-zero scalar elements in the active column segments are first broadcast to the SIMD register, matching the scalar value with the dimension of the row segment elements of the dense matrix B in the vector register. Then, the row segment elements of the corresponding block of the result matrix C are loaded into the vector register. These elements reside in the on-chip cache and do not need to be read from main memory. Subsequently, through a fused vector multiply-add (FMA) operation, the multiplication of the non-zero element with the row segment element of B and the addition with the row segment element of C are performed in one operation. The intermediate results are directly stored in the row segment element of C in the vector register without needing to be written back to memory. This vectorized computation method converts scalar operations into vector operations, significantly improving computational efficiency while avoiding the memory access overhead of intermediate results. It also adapts to the multi-core computing characteristics of edge embedded platforms, further reducing computational latency and power consumption.
[0031] In one embodiment, the sparse matrix with column segment layout, the dense matrix with row segment layout, and the result matrix are all stored contiguously in the physical memory of the edge embedded platform, making memory access within the block a continuous sequential access.
[0032] Specifically, this application employs a contiguous segment storage design for both sparse and dense matrices. The column segment layout of sparse matrix A ensures that the column segments within each row block are contiguous in physical memory. Similarly, the row segment layout of dense matrix B and the resulting matrix C ensures that the row segments within each column block are contiguous in physical memory. This achieves continuous sequential memory access within blocks from a storage perspective. This design solves the core problem of irregular memory access in traditional SpMM schemes, significantly improves the spatial locality of memory access, enables the CPU's cache prefetching mechanism to function effectively, and significantly improves cache hit rate. For edge embedded platforms, contiguous memory access reduces memory addressing overhead, lowers memory bandwidth requirements, and reduces pipeline stalls caused by cache misses, further improving computational efficiency and reducing power consumption. This is one of the key design features for achieving low-latency, low-power SpMM operations.
[0033] In one embodiment, with the column segment layout of the sparse matrix, the total data transfer volume of the sparse matrix-dense matrix multiplication satisfies:
Total = (memory access volume of C) + (memory access volume of A) + (memory access volume of B)
[0034] where the three terms correspond, in order, to the result matrix C, the sparse matrix A, and the dense matrix B.
[0035] Specifically, the total data transfer volume formula is a model that quantifies the memory access volume of SpMM under the column segment layout. Compared with the memory access volume formula of the traditional ASpT scheme, the design of this application significantly reduces the access volume of core data. When the sparse matrix is stored as column segments, the memory access volume of C is M × K, because C is reused in the cache without repeated reads and writes. The access volume of the sparse matrix A is 2 × NNZ, since each non-zero element carries both a value and a column index. The access volume of the dense matrix B depends only on the number of active column segments, which avoids useless accesses for inactive column segments. Compared with traditional solutions, this model prioritizes cache reuse of the result matrix C; caching C in blocks yields a larger benefit than caching B in blocks because C is both read and written. This design significantly reduces the total data transfer volume of SpMM, suits edge embedded platforms whose memory bandwidth and capacity are both limited, and reduces the time and power cost of memory access. Since the B traffic depends only on the number of active column segments of the sparse matrix, the data transfer volume can be reduced further by optimizing the block parameters.
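A short Python sketch can make the three-term estimate concrete. The C and A terms follow the quantities stated above (M × K and 2 × NNZ). The B term is an assumption: the text only says it depends on the number of active column segments, so the sketch charges one K-wide row of B per active segment. The function name and the example numbers are hypothetical.

```python
def estimate_total_volume(num_rows, k_dim, nnz, active_segments):
    """Estimate elements moved for C = A * B under the column segment layout."""
    c_volume = num_rows * k_dim         # C touched once: M x K
    a_volume = 2 * nnz                  # value + column index per non-zero
    b_volume = active_segments * k_dim  # ASSUMPTION: one row of B per active segment
    return c_volume + a_volume + b_volume


# Illustrative numbers: 4 x 4 sparse matrix, 6 non-zeros, K = 3, 4 active segments.
total = estimate_total_volume(num_rows=4, k_dim=3, nnz=6, active_segments=4)
```

Under this assumed B term, lowering the active segment count through a better row block size is the only lever on the estimate once M, K, and NNZ are fixed.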
[0036] In one embodiment, the edge embedded platform is a hardware platform based on x86 or ARM multi-core architecture, including embedded processing units for drones, intelligent robots, and wearable devices.
[0037] Specifically, the memory-aware sparse matrix multiplication method of this application is adapted for both x86 and ARM, the two mainstream multi-core architectures, without any architecture-specific dependencies, and can be directly deployed in edge embedded processing units based on these architectures. For typical application scenarios such as attitude perception signal filtering for UAVs, motion trajectory smoothing for intelligent robots, and physiological signal denoising for wearable devices, this method can fully utilize the multi-core computing resources of the device, achieving multi-core parallel execution through OpenMP. Simultaneously, it incorporates on-chip cache optimization, contiguous segment storage, and data reuse designs to adapt to the characteristics of small on-chip cache capacity and strict power budgets of devices. While achieving low latency in SpMM operations, it significantly reduces memory power consumption, meeting the lightweight and low-power deployment requirements of edge embedded devices. Furthermore, the preprocessing overhead of this method is extremely low, quantified as the equivalent number of parallel CSR-SpMM iterations, and will not occupy the valuable computing and memory resources of edge embedded devices.
[0038] In one embodiment, determining the row block parameters of the sparse matrix and the column dimension block parameters of the dense matrix and the resulting matrix includes: A matrix signature model is used to model the two-dimensional sparse structure of the sparse matrix A. The data transmission volume under different partitioning parameters is estimated by the matrix signature model, and the partitioning parameter with the smallest data transmission volume is selected as the optimal partitioning parameter.
[0039] Specifically, the matrix signature model is a parameter selection model borrowed from the J-Stream scheme and adapted for edge embedded platforms. The model abstracts the distribution of non-zero elements of the sparse matrix A into a one-dimensional function signature. This signature represents the number of active rows or columns under different block sizes, which allows the data transfer volume under different block parameters (row block size and column block size) to be estimated accurately. To determine the optimal block parameters, the method first models the two-dimensional sparse structure of the sparse matrix with the matrix signature model, establishing the correspondence between block parameters and data transfer volume. Then, from the set of candidate block parameters that satisfy the on-chip cache capacity constraint, the parameters that minimize data transfer volume are selected as the optimal block parameters. Compared with traditional empirical parameter selection, this model-driven approach matches the sparsity characteristics of the sparse matrix more accurately and minimizes data transfer volume. It also adapts to sparse matrices with different sparsity patterns, improving the generality of the method. Furthermore, the model has low computational overhead, making it suitable for execution on edge embedded platforms with limited computing resources.
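One way to picture such a signature is to count active column segments for each candidate row block size, as in the Python sketch below. This is a simplified stand-in, not the J-Stream signature model itself: the function names are hypothetical, the count is computed exactly from coordinates rather than from a compact signature, and the selection rule (fewest active segments, ties going to the earlier candidate) is an assumed proxy for the data transfer estimate.

```python
def active_column_segments(rows, cols, row_block):
    """Count distinct (row block, column) pairs that hold a non-zero."""
    return len({(r // row_block, c) for r, c in zip(rows, cols)})


def signature(rows, cols, candidates):
    """Map each candidate row block size to its active column segment count."""
    return {t: active_column_segments(rows, cols, t) for t in candidates}


# Coordinates of six non-zeros in a 4 x 4 matrix (illustrative data only).
rows = [0, 0, 1, 2, 3, 3]
cols = [0, 2, 2, 1, 1, 3]
sig = signature(rows, cols, candidates=[1, 2, 4])

# ASSUMED selection rule: fewest active segments; min() keeps the first
# candidate on ties, which here favors the smaller block.
best = min(sig, key=sig.get)
```

For this matrix, growing the row block from 1 to 2 merges the two non-zeros in column 2 and the two in column 1, while growing it further to 4 brings no additional merging.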
[0040] In one embodiment, restoring the format of the final row-segment-layout result matrix includes: concatenating the row-segment-layout result matrix C in the storage order of its contiguous row segments to restore the standard dense matrix format, which is used directly as the filtering result output of the polynomial filtering task on the edge embedded platform.
[0041] Specifically, the row-segment-layout result matrix C consists of the contiguous row segments of multiple blocks along the column dimension K, and the row segments are stored contiguously in physical memory in block traversal order. Format restoration requires no complex calculation: concatenating the contiguous row segments in storage order restores the standard dense matrix format required by the polynomial filtering task. The restoration has extremely low computational overhead and requires minimal computing resources on the edge embedded platform. Furthermore, the restored matrix can be output directly as the filtering result without additional format conversion or data processing steps. This integrates the SpMM operation seamlessly with the polynomial filtering task, meets the lightweight deployment requirements of filtering tasks on edge embedded platforms, and avoids the additional latency and power consumption of format conversion.
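The format restoration can be sketched in Python as follows. The sketch assumes one plausible reading of the row segment layout: the matrix is cut into column blocks of width `tk`, and within each block the rows are stored one after another as segments of up to `tk` elements. The function names and the flat-list representation are hypothetical.

```python
def to_row_segments(dense, tk):
    """Pack a row-major matrix into the assumed row segment layout:
    column blocks in order, and inside each block one segment per row."""
    n_rows, n_cols = len(dense), len(dense[0])
    flat = []
    for k0 in range(0, n_cols, tk):
        for i in range(n_rows):
            flat.extend(dense[i][k0:k0 + tk])
    return flat

def restore_dense(flat, n_rows, n_cols, tk):
    """Rebuild the standard row-major matrix by copying each stored
    row segment back to its position, in storage order."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    pos = 0
    for k0 in range(0, n_cols, tk):
        width = min(tk, n_cols - k0)  # the last block may be narrower
        for i in range(n_rows):
            dense[i][k0:k0 + width] = flat[pos:pos + width]
            pos += width
    return dense

original = [[1.0, 2.0, 3.0, 4.0, 5.0],
            [6.0, 7.0, 8.0, 9.0, 10.0]]
packed = to_row_segments(original, tk=2)
restored = restore_dense(packed, n_rows=2, n_cols=5, tk=2)
```

The restoration only copies contiguous slices and performs no arithmetic, which is consistent with the low overhead described above.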
[0042] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. At least some of the steps in Figure 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time and can be executed at different times; nor are they necessarily executed sequentially, as they can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0043] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0044] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these modifications and improvements all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A memory-aware sparse matrix multiplication method suitable for edge embedded platforms, characterized in that the method is applied to polynomial filtering data processing tasks on edge embedded platforms and includes:
dividing the sparse matrix into row blocks and storing each row block as contiguous column segments in column-major order to form a column-segment-layout sparse matrix, a column segment being the set of non-zero elements of one column within a row block;
dividing the dense matrix and the result matrix to be generated into blocks along a preset column dimension, and laying out the blocked dense matrix and result matrix as contiguous row segments to form a row-segment-layout dense matrix and a row-segment-layout result matrix;
determining, based on the on-chip cache capacity of the edge embedded platform, the row block partitioning parameter of the sparse matrix and the column dimension partitioning parameter of the dense matrix and the result matrix, so that a single block of the result matrix can reside in the on-chip cache;
taking the row-segment-layout dense matrix as input, sequentially traversing each active column segment of the column-segment-layout sparse matrix, an active column segment being a column segment that contains non-zero elements, and preloading the row segment elements of the dense matrix that match the current active column segment into the registers of the edge embedded platform so that the row segment elements of the dense matrix are reused;
performing multiply-add operations on the non-zero elements of the current active column segment and the preloaded row segment elements of the dense matrix, and accumulating the intermediate results into the corresponding block of the row-segment-layout result matrix;
after the multiply-add operations and result accumulation have been completed for all active column segments, obtaining the final row-segment-layout result matrix, restoring its format, and outputting the final result of the sparse matrix-dense matrix multiplication.
2. The method according to claim 1, characterized in that the partitioning parameter for dividing the sparse matrix A into row blocks is the row block size of the sparse matrix A; the partitioning parameter for dividing the dense matrix B and the result matrix C to be generated into blocks along the column dimension K is the column block size of B and C; and the two parameters together satisfy the on-chip cache capacity constraint of the edge embedded platform, which is expressed in terms of the sparse density of the sparse matrix A, the number of non-zero elements of the sparse matrix A, and the on-chip fast cache capacity of the edge embedded platform.
3. The method according to claim 2, characterized in that the method further includes: when determining the two block parameters, taking minimization of the total data movement cost as the optimization objective, the total data movement cost being expressed in terms of the number of active column segments of the sparse matrix A under the given row block parameter, the number of rows of the sparse matrix A, and the column dimension K of the dense matrices B and C.
4. The method according to claim 1, characterized in that the multiply-add operations are performed with a six-level blocked loop structure, wherein the outer three loops are traversed in the following order: first the blocks of the dense matrices B and C along the column dimension K, then the row blocks of the sparse matrix A, and finally a streaming loop along the column dimension J with a streaming block size of 1; and the inner three loops are traversed in the following order: first each active column segment in the row block of the sparse matrix A, then the row segment elements in the column block of the dense matrix B, and finally each non-zero element in the current active column segment.
5. The method according to claim 4, characterized in that the method further includes: performing the multiply-add operations in a vectorized manner, wherein the non-zero elements of the current active column segment are broadcast to a SIMD register of the edge embedded platform, the row segment elements of the corresponding block of the result matrix C are loaded into a vector register, the multiply-add of the non-zero elements and the row segment elements of the dense matrix B is completed by a vector fused multiply-add operation, and the result is written directly to the row segment elements of the result matrix C held in the vector register.
6. The method according to claim 1, characterized in that the column-segment-layout sparse matrix, the row-segment-layout dense matrix, and the row-segment-layout result matrix are all stored contiguously in the physical memory of the edge embedded platform, so that memory accesses within a block are contiguous and sequential.
7. The method according to claim 1, characterized in that, with the column-segment-layout sparse matrix, the total data transfer volume of the sparse matrix-dense matrix multiplication, denoted Total, is determined by three terms: the memory access volume of the result matrix C, the memory access volume of the sparse matrix A, and the memory access volume of the dense matrix B.
8. The method according to claim 1, characterized in that the edge embedded platform is a hardware platform based on an x86 or ARM multi-core architecture, including the embedded processing units of drones, intelligent robots, and wearable devices.
9. The method according to claim 1, characterized in that determining the row block parameter of the sparse matrix and the column dimension block parameter of the dense matrix and the result matrix includes: modeling the two-dimensional sparse structure of the sparse matrix A with a matrix signature model; estimating, with the matrix signature model, the data transfer volume under different block parameters; and selecting the block parameters with the smallest data transfer volume as the optimal block parameters.
10. The method according to claim 1, characterized in that restoring the format of the final row-segment-layout result matrix includes: concatenating the row-segment-layout result matrix C in the storage order of its contiguous row segments to restore the standard dense matrix format, which is used directly as the filtering result output of the polynomial filtering task on the edge embedded platform.
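As an illustration of the traversal order recited in claims 1 and 4, the following minimal Python sketch runs the six loop levels over a column-segment representation of A. It is a sketch under stated assumptions, not the claimed implementation: `ti`, `tk`, and the function names are hypothetical, the column segments are kept in dictionaries rather than in contiguous memory, B and C are plain row-major lists rather than row-segment layouts, and multi-core parallelism and SIMD are omitted.

```python
def build_column_segments(coo, n_rows, ti):
    """Group non-zeros (i, j, v) into row blocks of height ti. Each row block
    maps a column j to its column segment, a list of (row, value) pairs.
    Only active (non-empty) segments are stored."""
    blocks = [dict() for _ in range((n_rows + ti - 1) // ti)]
    for i, j, v in coo:
        blocks[i // ti].setdefault(j, []).append((i, v))
    return blocks

def spmm_column_segments(blocks, b, n_rows, j_dim, k_dim, tk):
    """Compute C = A * B with the six-level loop order of claim 4."""
    c = [[0.0] * k_dim for _ in range(n_rows)]
    for k0 in range(0, k_dim, tk):              # 1: column blocks of B and C
        k1 = min(k0 + tk, k_dim)
        for block in blocks:                    # 2: row blocks of A
            for j in range(j_dim):              # 3: streaming loop over J, step 1
                # 4: active column segments in this step (zero or one)
                for segment in ((block[j],) if j in block else ()):
                    for kk in range(k0, k1):    # 5: row segment elements of B
                        b_jk = b[j][kk]         # loaded once, reused below
                        for i, a_ij in segment: # 6: non-zeros of the segment
                            c[i][kk] += a_ij * b_jk
    return c

# 4 x 3 sparse matrix A as (row, column, value) triples and a 3 x 5 dense B.
coo = [(0, 0, 1.0), (1, 0, 2.0), (2, 1, 3.0), (3, 1, 4.0), (3, 2, 5.0)]
b = [[float(3 * r + col) for col in range(5)] for r in range(3)]
blocks = build_column_segments(coo, n_rows=4, ti=2)
c = spmm_column_segments(blocks, b, n_rows=4, j_dim=3, k_dim=5, tk=2)
```

In this loop order each element of B is read once per active column segment and then reused for every non-zero in that segment, which mirrors the register-level reuse described in claim 1.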