A sparse tensor processing method and device, electronic equipment and storage medium

By extracting dense data blocks from sparse tensors and performing zero-element matching and task allocation, the problem of low storage and computation efficiency in 3D sparse tensor processing is solved, achieving efficient utilization of hardware resources and data processing.

CN122242571A · Pending · Publication Date: 2026-06-19 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing sparse tensor processing methods struggle to effectively capture and utilize the locality of 3D space when processing sparse tensors using traditional flattening compression formats. This results in wasted storage bandwidth and cache space, low hardware utilization, and underutilization of computational resources.

Method used

By extracting dense data blocks from sparse tensors, changing the organization and storage of data, performing zero-element matching and dividing them into target task groups, and distributing them to the processing unit array for computation, load balancing and efficient computing are achieved.

Benefits of technology

It significantly reduces the storage and computational overhead of sparse tensors, improves hardware utilization and computing resource efficiency, and enhances the processing efficiency of multimodal input data.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a sparse tensor processing method, apparatus, electronic device, and storage medium, relating to the field of data processing technology. The method includes: acquiring a first sparse tensor and a second sparse tensor to be processed; extracting multiple dense data blocks from the first and second sparse tensors respectively; performing zero-element matching on the first and second sparse tensors based on the multiple dense data blocks to determine a non-zero data set to be processed; dividing the non-zero data set into multiple target task groups to be processed according to preset constraints, and distributing the multiple target task groups to a preset processing unit array for processing to obtain the processing results, so as to determine the processing result for multimodal input data based on the processing results. Utilizing dense data blocks and zero-element matching can reduce the storage overhead and computational overhead of sparse tensors.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a sparse tensor processing method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the development of artificial intelligence and big data analytics, models such as graph neural networks, which can handle complex relational data, have been widely applied. The core computations of these models often involve operations on the features and weights of large-scale, irregular graph structures. Their data is naturally highly sparse, meaning that the proportion of non-zero elements in the data tensor is extremely low. To process such sparse data efficiently, related technologies typically employ dedicated hardware accelerators, such as arrays composed of a large number of parallel processing units, supplemented by data compression formats such as compressed sparse row to reduce storage and transmission overhead.

[0003] However, when processing three-dimensional sparse tensors such as 3D feature maps and high-order adjacency matrices, the distribution of non-zero elements in these tensors is not only sparse but also exhibits irregular and non-uniform spatial clustering characteristics. Traditional flattening compression formats, such as the CSR (Compressed Sparse Row) format designed for two-dimensional matrices, struggle to effectively capture and utilize this three-dimensional spatial locality. This results in a significant amount of storage bandwidth and cache space being used to transmit and store zero values or redundant index information that have no practical computational value. Furthermore, the irregular data distribution leads to severe load imbalances in processing unit arrays, with many processing units idle while waiting for valid data, resulting in low overall hardware utilization.

Summary of the Invention

[0004] Therefore, it is necessary to provide a sparse tensor processing method, apparatus, electronic device, and storage medium to address at least one of the aforementioned technical problems.

[0005] In a first aspect, embodiments of this application provide a sparse tensor processing method, comprising: obtaining a first sparse tensor and a second sparse tensor to be processed;

[0006] extracting multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively, where each dense data block is a set of elements whose number of non-zero elements exceeds a preset density threshold; performing zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine the set of non-zero data to be processed; and dividing the non-zero data set into multiple target task groups to be processed according to preset constraints, and distributing the multiple target task groups to a preset processing unit array for processing to obtain processing results, so that the processing result of the multimodal input data is determined based on those processing results.

[0007] The first sparse tensor and the second sparse tensor are intermediate data in the form of three-dimensional matrices obtained by preprocessing the multimodal input data. The multimodal input data includes at least images, videos, text, speech, or point clouds.

[0008] In a second aspect, embodiments of this application provide a sparse tensor processing apparatus, comprising:
an acquisition module, used to acquire the first sparse tensor and the second sparse tensor to be computed, where the first sparse tensor and the second sparse tensor are intermediate data obtained by preprocessing the multimodal input data, which includes at least images, videos, text, speech, or point clouds;
an extraction module, used to extract multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively, where each dense data block is a set of elements in which the number of non-zero elements exceeds a preset density threshold;
a matching module, used to perform zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine the set of non-zero data to be operated on;
a grouping module, used to divide the non-zero data set into multiple target task groups to be processed according to preset constraints, and to distribute the multiple target task groups to a preset processing unit array for processing to obtain the processing results, so as to determine the processing result of the multimodal input data based on the processing results.

[0009] In a third aspect, embodiments of this application provide a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the sparse tensor processing method provided in any embodiment of the first aspect of this application.

[0010] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the sparse tensor processing method provided in any embodiment of the first aspect of this application.

[0011] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the sparse tensor processing method provided in any embodiment of the first aspect of this application.

[0012] This application changes how data is organized and stored by extracting dense data blocks from sparse tensors. For regions where non-zero elements cluster, it is no longer necessary to store the complete coordinates of each non-zero element; only the starting position and extent of the data block need to be stored. This significantly reduces the amount of metadata required to describe the spatial location of the data, thereby reducing the storage overhead of sparse tensors. By performing zero-element matching and determining the set of non-zero data to be computed in advance, invalid operations are eliminated before computation begins, so that hardware computing resources are applied only to data points that contribute to the final result. This avoids invalid computation cycles in which operands are multiplied by zero values, reducing overall computational overhead and power consumption. By dividing the filtered effective data set into target task groups that match the hardware scale and distributing them to the processing unit array for execution, load balancing of computational tasks on parallel hardware is achieved. This allows the processing unit array to maintain high utilization and reduces idle waiting, improving the utilization of computing resources and the overall system throughput, and further optimizing computational efficiency from the resource scheduling perspective. In scenarios involving artificial intelligence model computation tasks, this can improve the processing efficiency of multimodal input data such as images, videos, text, speech, or point clouds.

Attached Figure Description

[0013] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0014] Figure 1 is a schematic diagram of a graph structure provided in some embodiments;
Figure 2 is an application environment diagram of a sparse tensor processing method provided in the embodiments of this application;
Figure 3 is a flowchart illustrating a sparse tensor processing method provided in an embodiment of this application;
Figure 4 is a flowchart illustrating the block template in some embodiments;
Figure 5 is a schematic diagram of a block template in a three-dimensional matrix in some embodiments;
Figure 6 is a flowchart illustrating the block indexing process in some embodiments;
Figure 7 is a flowchart illustrating the sub-block indexing process in some embodiments;
Figure 8 is a flowchart illustrating non-zero data blocks in some embodiments;
Figure 9 is a flowchart illustrating discrete non-zero elements in some embodiments;
Figure 10 is a flowchart illustrating multiple division steps in some embodiments;
Figure 11 is a flowchart illustrating the steps involved in updating the partitioning strategy in some embodiments;
Figure 12 is a schematic diagram of reinforcement learning logic in some embodiments;
Figure 13 is a block diagram of a reinforcement learning system in some embodiments;
Figure 14 is a flowchart illustrating the number of non-zero elements in some embodiments;
Figure 15 is a flowchart illustrating the calculation of data sequences in some embodiments;
Figure 16 is a flowchart illustrating the output tensor in some embodiments;
Figure 17 is a structural block diagram of the sparse tensor processing device in some embodiments;
Figure 18 is a diagram showing the internal structure of a computer device in some embodiments.

Detailed Implementation

[0015] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.

[0016] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.

[0017] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0018] With the development of artificial intelligence, research on non-Euclidean topological data, i.e., graphs, which are widely present in nature, has received increasing attention. In order to effectively extract feature representations from graphs, a class of excellent models such as GNN (Graph Neural Network) have been proposed and are applicable to a wide range of application scenarios, such as social networks, recommendation systems, smart healthcare, and urban planning.

[0019] In social networks, vertices in a graph structure can represent people, and edges can represent social relationships between people; in recommendation systems, vertices can represent items, and edges can represent relationships between items; in graph neural networks, vertices are feature data, and edges are weight data. These application scenarios are inherently sparse, so the graph structure data representing them are mostly sparse matrices. When the matrix size is large, this can lead to huge storage and computational overhead.

[0020] In view of this, embodiments of this application provide a sparse tensor processing method. The application environment architecture or hardware architecture on which the execution of the method depends is described first.

[0021] The sparse tensor processing method can be applied, for example, in the application environment shown in Figure 2. Server 210 can communicate with terminal 220 via a network to process sparse tensors uploaded by terminal 220. Server 210 can be implemented as a standalone server or a server cluster consisting of multiple servers, and terminal 220 can be, but is not limited to, a personal computer, laptop, smartphone, tablet, or portable wearable device.

[0022] Server 210 may be implemented using at least one of the following hardware forms: programmable logic array (PLA), field-programmable gate array (FPGA), digital signal processor (DSP), application-specific integrated circuit (ASIC), general-purpose processor, or other programmable logic device.

[0023] Of course, the sparse tensor processing method provided in this disclosure can also be applied to more scenarios not shown.

[0024] Taking the application of the sparse tensor processing method to server 210 in Figure 2 as an example, in some embodiments, as shown in Figure 3, the sparse tensor processing method includes steps S310, S320, S330, and S340, which can be executed by server 210. Each step is described in detail below.

[0025] Step S310: Obtain the first sparse tensor and the second sparse tensor to be operated on.

[0026] The first sparse tensor and the second sparse tensor are intermediate data in the form of three-dimensional matrices obtained by preprocessing the multimodal input data. The multimodal input data includes at least images, videos, text, speech, or point clouds.

[0027] Tensors are a higher-dimensional generalization of the concepts of scalars, vectors, and matrices; they are mathematical abstractions of multidimensional arrays. The first and second sparse tensors can be one-dimensional arrays, two-dimensional matrices, three-dimensional matrices, and so on. In the field of artificial intelligence model computation, sparse tensors are often represented as two-dimensional or three-dimensional matrices.

[0028] In general artificial intelligence computing, raw multimodal input data, such as image data, text word sequences, and speech waveforms, undergo a series of standardized, domain-specific preprocessing steps to transform them into standardized tensor formats that can be directly processed by neural networks or other computational models. These preprocessing steps may include, but are not limited to: normalization, cropping, and scaling for images / videos; word segmentation and word embedding for text; Fourier transform and Mel-spectrum extraction for speech; and voxelization and feature extraction for point clouds. The tensors obtained after these preprocessing steps are typically sparse tensors, which can be one-dimensional vectors, two-dimensional matrices, or three-dimensional matrices.

[0029] For example, in an end-to-end autonomous driving environmental perception system, the system simultaneously receives camera images (image modality) and LiDAR point clouds (point cloud modality). After the shallow convolution layers of the backbone network, the image yields a sparse feature map, which can serve as the first sparse tensor. Simultaneously, the point cloud data is preprocessed by a point cloud feature extraction network (such as PointPillars), which transforms it into a sparse bird's-eye view feature grid that can serve as the second sparse tensor. Furthermore, the first or second sparse tensor can be a feature matrix obtained after preprocessing the multimodal input data, or it can be a pre-defined weight matrix or adjacency matrix in an artificial intelligence model; this application does not impose any limitations on this.

[0030] Similar methods can be used to process tensors of different dimensions. This application mainly uses sparse tensors as three-dimensional matrices for illustration.

[0031] Step S320: Extract multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively.

[0032] Each dense data block is a set of elements in which the number of non-zero elements exceeds a preset density threshold, and each dense data block is in the form of a three-dimensional matrix.

[0033] The preset density threshold is a configurable parameter used to dynamically define what constitutes density. Based on the statistical properties of the data itself, the dense data blocks extracted from the first and second sparse tensors are not fixed in geometric size and shape, thus more flexibly and closely matching the true distribution pattern of non-zero elements in the sparse tensor.

[0034] Step S330: Based on multiple dense data blocks, perform zero element matching for the first sparse tensor and the second sparse tensor to determine the set of non-zero data to be processed.

[0035] Zero-element matching refers to the systematic comparative analysis of the distribution of zero elements in the first sparse tensor and the second sparse tensor based on the multiple dense data blocks. Its purpose is to identify, by matching the positions of zero elements, the data points whose corresponding positions in both tensors are non-zero. The set of these data points is then used as the actual input to be processed by the processing unit array, avoiding invalid operations on zero values or on values that are non-zero in only one of the two tensors.
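The matching idea in step S330 can be sketched in a few lines of Python. This is a minimal illustration only, assuming an element-wise operation on corresponding positions and a hypothetical dictionary layout that stores non-zero entries by coordinate. The function name `match_nonzero` is not from the application, and the block-level matching the application describes (which works on dense data blocks rather than single elements) is not reproduced here.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]
SparseTensor = Dict[Coord, float]  # hypothetical layout: coordinate -> non-zero value


def match_nonzero(a: SparseTensor, b: SparseTensor) -> Dict[Coord, Tuple[float, float]]:
    """Keep only coordinates where both tensors hold a non-zero value."""
    # Iterate over the smaller tensor so the number of lookups tracks the sparser operand.
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    matched = {}
    for coord, value in small.items():
        other = large.get(coord, 0.0)
        if value != 0.0 and other != 0.0:
            matched[coord] = (a[coord], b[coord])
    return matched


first = {(0, 0, 0): 2.0, (0, 1, 0): 3.0, (2, 2, 1): 5.0}
second = {(0, 1, 0): 4.0, (1, 1, 1): 7.0, (2, 2, 1): 0.5}
pairs = match_nonzero(first, second)
```

Only the two coordinates present in both inputs survive; positions that are non-zero in just one tensor are dropped before any arithmetic is scheduled.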

[0036] Step S340: According to the preset constraints, the non-zero data set is divided into multiple target task groups to be processed, and the multiple target task groups are sent to the preset processing unit array for processing to obtain the processing results, so as to determine the processing result of the multimodal input data based on the processing results.

[0037] Modern high-performance computing hardware typically contains a large number of parallel processing units, which are organized in an array called a processing unit array. To fully utilize its parallel capabilities, the total computing task usually needs to be decomposed into multiple subtasks that are suitable for the array size and can balance the load.

[0038] Preset constraints can refer to the fact that the computational load of each target task group should be close to and not exceed the maximum capacity that a single processing unit array can process in parallel at one time. Distributing the divided target task groups to the processing unit array aims to achieve a balanced distribution of computational load among the processing units, thereby maximizing hardware utilization and overall computational throughput.
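One simple way to satisfy such a constraint is to split the matched work items into the smallest number of groups that respects the array capacity and then even out the group sizes. The sketch below assumes every work item costs the same and that `pe_capacity` (a hypothetical parameter name) is the number of items the array can process in one pass. It is not the partitioning strategy of the application, which may use more elaborate policies.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_task_groups(work_items: Sequence[T], pe_capacity: int) -> List[List[T]]:
    """Split items into groups of at most pe_capacity items, sized as evenly as possible."""
    if pe_capacity <= 0:
        raise ValueError("pe_capacity must be positive")
    n = len(work_items)
    if n == 0:
        return []
    num_groups = -(-n // pe_capacity)  # ceiling division: fewest groups that fit
    base, extra = divmod(n, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)  # the first `extra` groups take one more item
        groups.append(list(work_items[start:start + size]))
        start += size
    return groups


groups = split_into_task_groups(list(range(10)), pe_capacity=4)
sizes = [len(g) for g in groups]
```

With ten items and a capacity of four, the split is 4, 3, 3 rather than 4, 4, 2, so no pass leaves half of the array idle.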

[0039] In the fields of artificial intelligence and data processing, the process from raw input to final output is typically a multi-stage, hierarchical computational process. Sparse tensor operations are often an intermediate link in this long chain. This link itself does not directly produce processing results that are perceptible to the end user, but its output is a necessary and critical pre-computation for generating the final processing result.

[0040] The result obtained after the processing unit operation can be an intermediate result in a multi-stage calculation process or the final result of the last stage. Its ultimate purpose is to serve the high-level processing task of multimodal input data.

[0041] The processing result of multimodal input data can be a classification label, a detection box, a summary text, or a recommendation score.

[0042] For example, when sparse tensor processing methods are applied to scenarios such as smart cities and autonomous driving, the processing results of multimodal input data can be traffic flow prediction results, decision quality of autonomous vehicles, or state diagnosis results of industrial equipment.

[0043] When sparse tensor processing methods are applied to recommendation systems, risk control, and other applications, the processing results of multimodal input data can be user-item matching scores or financial transaction risk scores, etc.

[0044] By applying the sparse tensor processing method to the above scenarios, while obtaining the processing results and completing the artificial intelligence model calculation task, it can also reduce the invalid computing overhead of the processing unit array, reduce the amount of data transfer between the processing unit array and the memory, improve the energy efficiency of the entire computing system, and further improve the output efficiency of the processing results of multimodal input data.

[0045] By extracting dense data blocks from sparse tensors, the organization and storage of data are changed. For regions where non-zero elements cluster, it is no longer necessary to store the complete coordinates of each non-zero element; only the starting position and extent of the data block need to be stored. This significantly reduces the amount of metadata required to describe the spatial location of the data, thereby reducing the storage overhead of sparse tensors. By performing zero-element matching and determining the set of non-zero data to be computed in advance, invalid operations are eliminated before computation begins, so that hardware computing resources are applied only to data points that contribute to the final result. This avoids invalid computation cycles in which operands are multiplied by zero values, reducing overall computational overhead and power consumption. By dividing the filtered effective data set into target task groups that match the hardware scale and distributing them to the processing unit array for execution, load balancing of computational tasks on parallel hardware is achieved. This allows the processing unit array to maintain high utilization and reduces idle waiting, improving the utilization of computing resources and the overall system throughput, and further optimizing computational efficiency from the resource scheduling perspective. In scenarios involving artificial intelligence model computation tasks, this can improve the processing efficiency of multimodal input data such as images, videos, text, speech, or point clouds.
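The metadata saving can be made concrete with a small count of index words. The sketch below compares storing a full coordinate per non-zero element against storing one start coordinate and one extent per axis for a whole block. The block size and non-zero count are hypothetical figures chosen for illustration, and the helper names are not from the application.

```python
def coo_index_words(nonzeros: int, rank: int = 3) -> int:
    """Index words when every non-zero element stores its full coordinate."""
    return nonzeros * rank


def block_index_words(rank: int = 3) -> int:
    """Index words when a dense block stores one start coordinate plus one extent per axis."""
    return 2 * rank


# Hypothetical 4 x 4 x 4 block holding 58 non-zero elements (density about 0.91).
per_element = coo_index_words(58)
per_block = block_index_words()
```

Under these assumptions the index shrinks from 174 words to 6. The trade-off is that the block form also has to account for the 6 zero positions inside the block, which is what the anomalous sub-block handling described later addresses.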

[0046] In some embodiments, as shown in Figure 4, step S320 may also include steps S321 and S322.

[0047] Step S321: According to multiple preset block templates, perform spatial scanning on the first sparse tensor and the second sparse tensor respectively to identify the spatial clustering region of non-zero elements.

[0048] A block template is a scanning window structure defined in three-dimensional space with a specific geometric shape and size range. It is slid across the coordinate space of a sparse tensor to detect the spatial continuity of non-zero elements. A block template can be determined in advance by dividing multiple three-dimensional matrices into data blocks, then stored and retrieved by its template code when needed.

[0049] The block template is not limited to cubes or cuboids; it can also be a spherical neighborhood, an ellipsoidal neighborhood, an anisotropic columnar neighborhood, or a piecewise linearly connected region template. Its shape and size can be preset according to the spatial statistical characteristics of typical non-zero clusters in the multimodal input data.

[0050] During spatial scanning, each block template can cover all non-zero coordinates of the sparse tensor in a stepwise manner and record the set of non-zero elements falling within the coverage area of ​​the template. A spatial cluster region refers to a connected subset of coordinates that is completely covered by at least one block template and contains two or more non-zero elements, and its boundary is determined by the spatial extent defined by the block template in the current pose.

[0051] Step S322: Calculate the density of non-zero elements within each spatial cluster region, and determine the spatial cluster region with a non-zero element density greater than a preset density threshold as a dense data block.

[0052] The geometric dimensions of each dense data block are not fixed.

[0053] Non-zero element density refers to the ratio of the number of non-zero elements in a spatial cluster region to the total number of elements it contains. Its calculation does not depend on the overall dimensions of the sparse tensor; it reflects only the compactness of the local non-zero distribution. The non-zero element density takes values in (0,1], and typical values of the preset density threshold can be 0.7, 0.8, or 0.9. The threshold can be calibrated offline based on the spatial clustering intensity of non-zero elements in different modal data such as images, videos, text, speech, or point clouds.
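The density test of step S322 reduces to a ratio and a comparison. The following sketch assumes a cuboid region described by its extent along each axis; the function names and the default threshold of 0.7 are illustrative choices, not values fixed by the application.

```python
from typing import Tuple


def nonzero_density(nonzero_count: int, extent: Tuple[int, int, int]) -> float:
    """Ratio of non-zero elements to all elements covered by the region."""
    total = extent[0] * extent[1] * extent[2]
    if total <= 0:
        raise ValueError("region must contain at least one element")
    return nonzero_count / total


def is_dense_block(nonzero_count: int, extent: Tuple[int, int, int],
                   density_threshold: float = 0.7) -> bool:
    """A region qualifies as a dense data block when its density exceeds the threshold."""
    return nonzero_density(nonzero_count, extent) > density_threshold


density = nonzero_density(3, (2, 1, 2))  # 3 non-zeros in a 2 x 1 x 2 region
```

A region with three non-zero elements out of four has density 0.75, so it counts as dense under a threshold of 0.7 but not under 0.8.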

[0054] The non-fixed geometric size means that the extent of dense data blocks in different dimensional spaces (such as the span along the x, y, and z axes in three-dimensional space) is not set uniformly in advance, but is dynamically determined by the actual coordinate distribution of the corresponding spatial clusters. This allows dense data blocks to adaptively fit the real spatial shape of non-zero elements, covering both slender point cloud edge structures and spherical image feature response clusters.

[0055] For example, taking a three-dimensional sparse tensor as an example, the non-fixed geometric dimensions are reflected in the fact that different dense data blocks in the same sparse tensor are different in length, width and height in three-dimensional scale; the size distribution pattern of dense data blocks in different sparse tensors can also be different.

[0056] As shown in Figure 5, ○ represents the coordinate position of each element in the 3D sparse matrix, whose value may be zero or non-zero; the dashed boxes represent block templates, within which × represents non-zero data in the actual matrix and △ represents individual zero data in the actual matrix. The shapes and dimensions (length, width, height) of the block templates differ from one another.

[0057] By spatially scanning the first and second sparse tensors using multiple preset block templates, spatial clustering regions of non-zero elements can be identified. The diversity of template morphology can enhance the perception of heterogeneous non-zero distribution patterns in multimodal data. Furthermore, the geometric dimensions of each dense data block are not fixed, allowing it to closely follow the actual non-zero clustering pattern. In addition, by pre-storing the block templates and calling them directly during the usage phase, the workload of repeatedly defining templates can be reduced.

[0058] In some embodiments, as shown in Figure 6, the sparse tensor processing method may also include steps S610, S620 and S630.

[0059] Step S610: Generate a block index based on the spatial location and geometric dimensions corresponding to each dense data block.

[0060] Taking a three-dimensional sparse matrix as an example, the spatial position refers to the starting position of a dense data block in the three-dimensional coordinate system within the corresponding sparse tensor, usually the coordinates of the first non-zero element in the dense data block; the geometric size refers to the set of lengths covered by the dense data block along each tensor dimension, representing the spatial extension range of the dense data block; the block index is the encoded information used to identify a dense data block. It does not directly record all coordinates and size values, but rather encodes the spatial position and geometric size together into a compact integer code through a mapping relationship.
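One possible mapping from position and size to a compact integer code is plain bit packing. The sketch below is an assumption for illustration: it gives each of the six fields (three start coordinates, three extents) a fixed width of 10 bits, so every field must be below 1024. The application does not specify the mapping, and the names `pack_block_index` and `unpack_block_index` are hypothetical.

```python
from typing import Tuple

Coord = Tuple[int, int, int]
BITS = 10  # assumed field width: coordinates and extents must be below 1024


def pack_block_index(origin: Coord, extent: Coord, bits: int = BITS) -> int:
    """Encode start position and per-axis extent into a single integer code."""
    code = 0
    for field in (*origin, *extent):
        if not 0 <= field < (1 << bits):
            raise ValueError(f"field {field} does not fit in {bits} bits")
        code = (code << bits) | field  # append the field to the low end of the code
    return code


def unpack_block_index(code: int, bits: int = BITS) -> Tuple[Coord, Coord]:
    """Recover (origin, extent) from a code produced by pack_block_index."""
    mask = (1 << bits) - 1
    fields = []
    for _ in range(6):
        fields.append(code & mask)  # fields come out last-to-first
        code >>= bits
    fields.reverse()
    return tuple(fields[:3]), tuple(fields[3:])


code = pack_block_index((5, 0, 12), (4, 2, 3))
decoded = unpack_block_index(code)
```

With these widths a block index fits in 60 bits, so a single 64-bit word describes the location and size of an entire dense data block.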

[0061] For example, Table 1 records a block index encoding method for dense data blocks:

Table 1

[0062] Table 1 illustrates an example of an encoding method that maps certain block size types to specific numeric codes. Block templates of different block size types can be saved to a reusable template library and quickly retrieved when needed using the corresponding numeric codes. When there are multiple dense data blocks of the same size in a sparse matrix, the amount of stored metadata can also be reduced by simply annotating the numeric codes of the template types being retrieved.

[0063] Step S620: For multiple dense data blocks, extract multiple anomalous sub-blocks with clustered zero elements, and generate a sub-block index based on the spatial position and geometric size of each anomalous sub-block relative to its respective dense data block.

[0064] Each abnormal sub-block is in the form of a three-dimensional matrix.

[0065] An anomalous sub-block refers to a region of zero-element clusters that appears within a dense data block; its spatial position relative to its parent dense data block refers to the relative coordinate offset of the anomalous sub-block within its parent dense data block, reflecting the local orientation of the anomalous sub-block within the block; its geometric dimensions also refer to the set of lengths covered by the anomalous sub-block in each dimension, characterizing the spatial scale it occupies within the block.
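The relative-offset idea can be illustrated by locating zero clusters inside one dense data block and describing each by its offset from the block origin. The sketch below is a simplification under stated assumptions: clusters are found by a flood fill over face-adjacent zero positions, each cluster is summarized by its bounding box (which may itself enclose some non-zero elements), and clusters smaller than `min_zeros` are ignored. None of these choices, nor the function name, comes from the application.

```python
from collections import deque
from itertools import product
from typing import List, Set, Tuple

Coord = Tuple[int, int, int]


def find_anomalous_subblocks(nonzero: Set[Coord], origin: Coord, extent: Coord,
                             min_zeros: int = 2) -> List[Tuple[Coord, Coord]]:
    """Return (relative offset, extent) for each zero cluster inside one dense block."""
    # Zero positions, expressed relative to the block origin.
    zeros = {
        rel for rel in product(*(range(e) for e in extent))
        if tuple(o + r for o, r in zip(origin, rel)) not in nonzero
    }
    subblocks = []
    while zeros:
        seed = zeros.pop()
        cluster, queue = [seed], deque([seed])
        while queue:  # flood fill over face-adjacent zero positions
            x, y, z = queue.popleft()
            for nb in ((x + 1, y, z), (x - 1, y, z), (x, y + 1, z),
                       (x, y - 1, z), (x, y, z + 1), (x, y, z - 1)):
                if nb in zeros:
                    zeros.remove(nb)
                    cluster.append(nb)
                    queue.append(nb)
        if len(cluster) >= min_zeros:
            lo = tuple(min(c[d] for c in cluster) for d in range(3))
            hi = tuple(max(c[d] for c in cluster) for d in range(3))
            subblocks.append((lo, tuple(h - l + 1 for l, h in zip(lo, hi))))
    return sorted(subblocks)


origin, extent = (10, 20, 30), (3, 3, 1)
holes = {(11, 21, 30), (12, 21, 30)}
nonzero = {(origin[0] + i, origin[1] + j, origin[2])
           for i in range(3) for j in range(3)} - holes
subblocks = find_anomalous_subblocks(nonzero, origin, extent)
```

The two adjacent zero positions are reported as one sub-block at relative offset (1, 1, 0) with extent (2, 1, 1), independent of where the parent block sits in the tensor.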

[0066] The sub-block index is the encoded information used to identify an abnormal sub-block. It takes relative position and geometric dimensions as input and generates an independent encoding system that is different from the block index.

[0067] For example, as shown in Table 2, Table 2 records a sub-block index encoding method for an abnormal sub-block.

[0068] Table 2

[0069] Table 2 illustrates an example of an encoding method that maps some sub-block size types to specific codes. Block templates of different sub-block size types can also be saved to a reusable template library and quickly called up when needed using the corresponding codes.

[0070] Step S630: Compress and store the first sparse tensor and the second sparse tensor according to the block index of multiple dense data blocks, the sub-block index of multiple abnormal sub-blocks, and the discrete index corresponding to the discrete elements in the first sparse tensor and the second sparse tensor.

[0071] Discrete indexes refer to the coordinate information of isolated non-zero elements and zero elements in the original sparse tensor that are not included in any dense data block or anomalous sub-block. Their distribution is characterized by randomness and low density.

[0072] Compressed storage refers to organizing and serializing the above three types of index information in a structured format and writing it into the storage medium.

[0073] In some optional embodiments, the compressed storage method may be to arrange the block index sequence, sub-block index sequence and discrete index sequence consecutively, supplemented by a length field and a type identifier header, to form a three-segment encoding structure.
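One possible byte layout for such a three-segment structure is sketched below. The type identifier values, the 1-byte type field, the 4-byte length field and the 32-bit code width are all assumptions made for this example; the application does not fix them.

```python
import struct

# Hypothetical type identifiers for the three segments (assumed values).
SEG_BLOCK, SEG_SUBBLOCK, SEG_DISCRETE = 0, 1, 2


def pack_segment(seg_type, codes):
    """Header (type identifier, length) followed by 32-bit unsigned codes."""
    return struct.pack(f"<BI{len(codes)}I", seg_type, len(codes), *codes)


def pack_tensor(block_idx, subblock_idx, discrete_idx):
    """Concatenate the three index sequences into one byte string."""
    return (pack_segment(SEG_BLOCK, block_idx)
            + pack_segment(SEG_SUBBLOCK, subblock_idx)
            + pack_segment(SEG_DISCRETE, discrete_idx))


def unpack_tensor(buf):
    """Read the segments back as {type identifier: list of codes}."""
    out, pos = {}, 0
    while pos < len(buf):
        seg_type, n = struct.unpack_from("<BI", buf, pos)
        pos += 5                                   # 1-byte type + 4-byte length
        out[seg_type] = list(struct.unpack_from(f"<{n}I", buf, pos))
        pos += 4 * n
    return out


buf = pack_tensor([7, 9], [3], [100, 101, 102])
```

The length field lets a reader skip a whole segment without decoding it, which is the reason for placing it in the header.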

[0074] For example, as shown in Table 3, Table 3 records the information format of a matrix compressed storage.

[0075] Table 3

[0076] In this structure, dense data block 1-offset records the spatial position of the first dense data block, dense data block 1-type records the specific numerical code of the block template corresponding to the first dense data block, and abnormal sub-block 1-offset records the spatial position of the first abnormal sub-block (which can simultaneously record its spatial position relative to the matrix and its relative position relative to its own dense data block), and abnormal sub-block 1-type records the specific code of the template corresponding to the first abnormal sub-block. Dense data block 2 and abnormal sub-block 2 follow the same logic.

[0077] In addition, the discrete element index of dense data block 1 records the discrete indices of the discrete elements in the first dense data block that are not included in any abnormal sub-block; the discrete element index of the matrix records the discrete indices of the discrete elements that are not included in any dense data block or abnormal sub-block.

[0078] In some optional embodiments, the discrete element index may only record the indexes of discrete non-zero elements, while the discrete indices that are not recorded in the matrix are all defaulted to zero elements, thereby further reducing storage overhead.

[0079] The block index efficiently abstracts dense data blocks, replacing element-by-element coordinate recording and reducing metadata redundancy in the mainstream non-zero areas. The sub-block index is used to encode the zero-value clusters within the block, avoiding the need to allocate an index for each zero value. Combined with discrete indexes to cover residual isolated non-zero points, the integrity of tensor reconstruction is ensured. The three types of indexes work together to form a hierarchical compression framework, making the overall storage structure adaptable to the spatial clustering characteristics of sparse tensors.

[0080] In some embodiments, the sparse tensor processing method may further include the following steps: compressing and storing the discrete indices corresponding to the discrete elements in the first sparse tensor and the second sparse tensor using a coordinate list-based sparse tensor compression format.

[0081] The coordinate list-based sparse tensor compression format can be the COO (Coordinate list) format, a standard compression encoding format for representing sparse tensors. This format can record the multidimensional spatial coordinates and corresponding values of arbitrary non-zero elements, without relying on the spatial continuity or clustering of the elements.

[0082] Using the COO format to compress and store discrete elements can further reduce the storage overhead caused by recording the coordinates of multiple discrete elements one by one.
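As a small illustration of the COO format, the following sketch converts a three-dimensional tensor held as nested lists into a coordinate list and a value list. The function name to_coo and the nested-list input are assumptions made for this example.

```python
def to_coo(dense):
    """Convert a nested-list 3D tensor into COO form: (coords, values)."""
    coords, values = [], []
    for x, plane in enumerate(dense):
        for y, row in enumerate(plane):
            for z, v in enumerate(row):
                if v != 0:                 # only non-zero elements are recorded
                    coords.append((x, y, z))
                    values.append(v)
    return coords, values


dense = [[[0, 0], [5, 0]], [[0, 0], [0, 7]]]
coords, values = to_coo(dense)
```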

[0083] In some alternative embodiments, discrete elements can also be compressed and stored using formats such as CSR (Compressed Sparse Row) or CSC (Compressed Sparse Column).

[0084] By employing a coordinate list-based sparse tensor compression format specifically for handling discrete elements, outliers that cannot be covered by the compression of dense data blocks and anomalous sub-blocks are handled using the most direct and faithful coordinate recording method, ensuring the complete preservation of all valid information in the original sparse tensor. Simultaneously, a hybrid hierarchical compression strategy is provided, ensuring a high overall compression ratio while also considering the feasibility of handling extreme sparsity cases.

[0085] In some embodiments, as shown in Figure 7, the sparse tensor processing method may also include steps S710, S720 and S730.

[0086] Step S710: Determine the spatial location of each dense data block based on the coordinates of the first element in each dense data block.

[0087] The first element can be the non-zero element with the smallest index number within the dense data block under a preset traversal order (e.g., a linearized index order based on row priority, column priority, or depth priority). In a three-dimensional sparse matrix, this coordinate is a three-dimensional spatial coordinate, representing the absolute spatial anchor point of the dense data block in the original sparse tensor.
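The selection of the first element can be sketched as follows, assuming a row-major (row priority) linearization over coordinates (x, y, z). The helper names linear_index and block_anchor are hypothetical.

```python
def linear_index(coord, shape):
    """Row-major linearization of a 3D coordinate (x, y, z)."""
    x, y, z = coord
    _, ny, nz = shape
    return (x * ny + y) * nz + z


def block_anchor(nonzero_coords, shape):
    """Coordinate of the non-zero element with the smallest linear index."""
    return min(nonzero_coords, key=lambda c: linear_index(c, shape))


anchor = block_anchor([(2, 1, 3), (1, 3, 0), (1, 2, 5)], shape=(4, 4, 8))
```

Here the three candidates linearize to 75, 56 and 53, so (1, 2, 5) becomes the anchor of the block.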

[0088] Step S720: Based on the mapping relationship between the spatial location and geometric size of dense data blocks and the preset block code, determine the block index corresponding to each dense data block.

[0089] The mapping relationship for preset block encoding can be a pre-constructed lookup table, whose input is a combination of normalized or quantized spatial location coordinates and geometric dimensions, and whose output is a fixed-width integer encoded value. Block indexes can be used to identify the global location and shape characteristics of a dense data block.

[0090] In some alternative embodiments, the method for determining the block index based on the mapping relationship between spatial location and geometric dimensions and the preset block code can be: quantizing the spatial location coordinates and geometric dimensions respectively to obtain finite-precision integer representations, concatenating them into a joint key value, and then mapping it to the block code space through a hash function.
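The quantize, concatenate and hash steps can be sketched as follows. The bit width per field, the size of the code space and the use of a modular hash are assumptions made for this example; the application does not name a specific hash function.

```python
BITS = 10          # assumed bits per quantized field (values 0..1023)
CODE_SPACE = 4096  # assumed size of the block code space


def joint_key(position, size, bits=BITS):
    """Concatenate six quantized integers into one integer key."""
    key = 0
    for field in (*position, *size):
        if not 0 <= field < (1 << bits):
            raise ValueError("field does not fit the assumed bit width")
        key = (key << bits) | field
    return key


def block_code(position, size):
    """Map the joint key into the code space with a simple modular hash."""
    return joint_key(position, size) % CODE_SPACE
```

A modular hash can map two different keys to the same code, so a practical scheme would need collision handling (or a code space as large as the key space) to keep the index unique.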

[0091] Step S730: Determine the sub-block index of each abnormal sub-block based on the mapping relationship between the spatial position and geometric size of the abnormal sub-block relative to its dense data block and the preset sub-block code.

[0092] The spatial position of an abnormal sub-block relative to its dense data block refers to the offset coordinate of the upper left front corner (or other agreed reference angle) of the abnormal sub-block in the local coordinate system of its dense data block. Its value range is constrained within the geometric dimensions of the corresponding dense data block.

[0093] The mapping relationship of the preset sub-block encoding is another independently constructed lookup table. Its input is the combination of the above relative position and geometric dimensions, and the output is a short-width integer encoding. This sub-block index does not contain global coordinate information, but only expresses local structural relationships, which significantly compresses the index expression length of abnormal sub-blocks.

[0094] In some optional embodiments, the method for determining the spatial location based on the coordinates of the first element in each dense data block may be: traversing all non-zero element indices in the dense data block according to a preset tensor linearization order, and selecting the three-dimensional coordinates corresponding to the first accessed index as the spatial location of the dense data block.

[0095] The method for determining the sub-block index based on the mapping relationship between the spatial position and geometric size of the abnormal sub-block relative to its dense data block and the preset sub-block code can also be: quantize the spatial position coordinates and geometric size respectively to obtain integer identifiers with finite precision, concatenate them into a joint key value, and then map them to the sub-block code space through another hash function.

[0096] By jointly mapping spatial location and geometric dimensions to preset block codes or sub-block codes, the information density and hardware friendliness of sparse tensor compression coding can be significantly improved while ensuring index uniqueness and decoding accuracy. This provides a high-precision, low-overhead structured index foundation for subsequent zero-element matching and task distribution.

[0097] In some embodiments, as shown in Figure 8, performing zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine the non-zero data set to be operated on may further include steps S331 and S332.

[0098] Step S331: Based on the block index of each dense data block, determine multiple candidate block pairs with overlapping positions.

[0099] Specifically, for two dense data blocks taken from the first sparse tensor and the second sparse tensor respectively, the spatial coordinates and geometric dimensions encoded in their block indexes are extracted. The intersection of the two coordinate intervals is then calculated dimension by dimension (e.g., length, width, height). If the intersection is non-empty in every dimension, the two blocks overlap in position; they are taken as a candidate block pair, and the intersection is their overlapping area.
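The per-dimension interval test can be sketched as follows, assuming each block is given as an (origin, size) pair and covers the half-open interval [origin, origin + size) along every dimension. The function name overlap is hypothetical.

```python
def overlap(block_a, block_b):
    """Return the overlapping region of two blocks as (origin, size), or None.

    Each block is (origin, size) with 3D integer tuples.
    """
    origin, size = [], []
    for (oa, sa), (ob, sb) in zip(zip(*block_a), zip(*block_b)):
        lo, hi = max(oa, ob), min(oa + sa, ob + sb)
        if lo >= hi:              # empty intersection in this dimension
            return None
        origin.append(lo)
        size.append(hi - lo)
    return tuple(origin), tuple(size)
```

Two blocks that merely touch along a face share no element under the half-open convention and therefore do not form a candidate pair.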

[0100] Step S332: For each candidate block pair, if any candidate block contains a target abnormal sub-block and / or a target zero element, then delete the index of the corresponding position of the target abnormal sub-block and / or the target zero element in the candidate block pair to obtain multiple non-zero data blocks in the non-zero data set.

[0101] The target zero element refers to the discrete zero value in a dense data block that is not included in any abnormal sub-block, and its index can be recorded independently in the form of a coordinate list.

[0102] If either candidate block contains an abnormal sub-block or a zero element at a given position, the computation at that position is invalid, and the element index at the corresponding position in the candidate block pair is deleted.

[0103] The index deletion method can be as follows: for each pair of candidate blocks, first retrieve their respective associated element index sets, and then find the corresponding index based on the spatial coordinates of the target abnormal sub-block and / or the target zero element, and mark it as logically invalid.
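A minimal sketch of this index deletion is given below. It assumes that the overlapping area and the abnormal sub-blocks are expressed as (origin, size) boxes in absolute coordinates, and it models "marking as logically invalid" with a set difference; the helper names are hypothetical.

```python
from itertools import product


def region_coords(origin, size):
    """All coordinates inside the box [origin, origin + size)."""
    return set(product(*(range(o, o + s) for o, s in zip(origin, size))))


def valid_coords(overlap_box, zero_boxes, zero_points):
    """Coordinates of the overlap that survive zero-element matching.

    zero_boxes: abnormal sub-blocks (absolute origin, size) from either block.
    zero_points: discrete zero coordinates from either block.
    """
    coords = region_coords(*overlap_box)
    for box in zero_boxes:
        coords -= region_coords(*box)      # drop whole zero clusters
    return coords - set(zero_points)       # drop isolated zeros


remaining = valid_coords(((0, 0, 0), (2, 2, 2)),
                         [((0, 0, 0), (1, 2, 2))],
                         [(1, 1, 1)])
```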

[0104] By coordinating the participation of three types of metadata—block index, abnormal sub-block index, and discrete zero element index—in the matching process, hierarchical, domain-specific, and structured removal of zero elements is achieved: the block index supports coarse-grained alignment, the abnormal sub-block index supports medium-grained local cleanup, and the discrete zero element index supports fine-grained detail cleaning. The three work together to accurately shrink the range of data to be operated on without expanding all non-zero elements, significantly reducing the number of invalid multiplication and addition operations and memory access bandwidth consumption.

[0105] In some embodiments, as shown in Figure 9, the sparse tensor processing method may also include steps S333 and S334.

[0106] Step S333: Based on the discrete index of each discrete element, determine multiple pairs of discrete elements.

[0107] Discrete elements refer to isolated non-zero or zero elements that are not included in any dense data block and can be independently recorded in coordinate form in both the first and second sparse tensors.

[0108] A discrete index is the coordinate information used to uniquely identify the spatial location of a discrete element in the corresponding sparse tensor. It can be a three-dimensional coordinate (x, y, z) or a high-dimensional tensor coordinate.

[0109] A discrete element pair is a pairing relationship formed by a discrete element in a first sparse tensor and a discrete element in a second sparse tensor based on the spatial consistency of their discrete indices. This pairing relationship can be established based on the following: when the discrete indices of two discrete elements are exactly the same, they are considered to be in the same spatial location, thus forming a discrete element pair.

[0110] For example, the discrete index set of all discrete elements in the first sparse tensor can be traversed. For each discrete index, an exact match search is performed in the discrete index set of the second sparse tensor. If the same index is found, the corresponding two discrete elements are combined into a pair of discrete elements.

[0111] Step S334: For each set of discrete element pairs, if there is a discrete zero element in any set of target discrete element pairs, delete the index of the target discrete element pair to obtain multiple discrete non-zero elements in the non-zero data set.

[0112] Specifically, for each pair of discrete elements, the corresponding data values in the first sparse tensor and the second sparse tensor can be read respectively; if either data value is zero, both discrete indices of the discrete element pair are removed simultaneously (e.g., marked as logically invalid).
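Steps S333 and S334 can be sketched together as follows, assuming the discrete elements of each tensor are held in a dictionary that maps a discrete index (x, y, z) to its value. The function name match_discrete is hypothetical.

```python
def match_discrete(first, second):
    """first, second: dicts mapping a discrete index (x, y, z) to its value.

    Returns the discrete element pairs whose two values are both non-zero.
    """
    pairs = {}
    for idx, a in first.items():
        if idx in second:              # exact index match forms a pair
            b = second[idx]
            if a != 0 and b != 0:      # drop the pair if either value is zero
                pairs[idx] = (a, b)
    return pairs


first = {(0, 0, 0): 3, (1, 2, 3): 0, (4, 4, 4): 5}
second = {(0, 0, 0): 2, (1, 2, 3): 9, (7, 7, 7): 1}
pairs = match_discrete(first, second)
```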

[0113] By using discrete indices as the pairing criterion, isolated zero values are located precisely, and zero values are removed at fine granularity by checking whether either element of a discrete element pair is zero. On this basis, the filtered discrete non-zero elements are incorporated into the non-zero data set; together with the non-zero data blocks obtained by the block-level zero-element matching described above, they constitute a complete and non-redundant non-zero data set. Thus, without increasing hardware memory access overhead, all data involved in the sparse tensor operation has actual semantic value and numerical validity, which significantly reduces the proportion of invalid calculations and improves the overall energy efficiency of the processing unit array.

[0114] It should be noted that steps S331 to S332 and steps S333 to S334 are all explanations of step S330 above, that is, the specific method for determining the non-zero data set. However, there is no fixed order between steps S331 to S332 and steps S333 to S334. They can be executed in any order or at the same time. This application does not impose any restrictions on this.

[0115] In some embodiments, as shown in Figure 10, the sparse tensor processing method may also include steps S1010 and S1020.

[0116] Step S1010: Partition the non-zero data set multiple times, and calculate the effective utilization rate of the processing unit array for each group after each partitioning.

[0117] Partitioning a non-zero dataset multiple times refers to combining and grouping multiple non-zero data blocks and discrete non-zero elements using different partitioning strategies without changing the overall composition of the dataset, resulting in various candidate grouping schemes. Partitioning strategies include, but are not limited to, equal distribution based on the number of non-zero data blocks, balanced distribution based on the total number of non-zero elements, clustering based on spatial proximity, or iterative adjustment after random initialization based on a preset number of groups. Each partitioning strategy generates an independent set of grouping results, and each grouping result contains at least one non-zero data block or at least one discrete non-zero element.
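One of the listed strategies, balanced distribution based on the total number of non-zero elements, can be sketched with a greedy heuristic that places the largest item into the currently lightest group. The greedy rule itself is an assumption chosen for this example; the application does not prescribe a particular balancing algorithm.

```python
import heapq


def balance_by_nonzeros(counts, num_groups):
    """Greedy balanced partition: largest item first into the lightest group.

    counts: non-zero element count of each item (a discrete element counts 1).
    Returns a list of groups, each a list of item positions in `counts`.
    """
    heap = [(0, g) for g in range(num_groups)]    # (current load, group id)
    groups = [[] for _ in range(num_groups)]
    for item in sorted(range(len(counts)), key=lambda i: -counts[i]):
        load, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (load + counts[item], g))
    return groups


groups = balance_by_nonzeros([8, 7, 6, 5, 4], num_groups=2)
```

With at least as many items as groups, every group receives at least one item, which matches the requirement that no group is empty.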

[0118] The effective utilization rate of the processing unit array can be defined as the ratio of the total number of non-zero elements in a given candidate group to the physical size of the corresponding allocated processing unit array. The physical size of the processing unit array is the total number of processing units in the array capable of parallel execution, and its value is fixed and known. This ratio represents the proportion of processing units actually activated and participating in computation after the group is scheduled to the processing unit array.

[0119] For example, the effective utilization rate of the processing unit array of each group after each partition can be calculated as follows: for the g-th group in a candidate grouping scheme, the sum of the number of non-zero elements in all of its non-zero data blocks and the number of its discrete non-zero elements is denoted as Ng; Ng is divided by the total size of the processing unit array M×M to obtain the effective utilization rate σg = Ng / (M×M); the σg values of all G groups are calculated in turn and collected into an effective utilization rate vector [σ1, σ2, …, σG].
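The calculation of the effective utilization rate vector can be sketched as follows. The input layout (a list of per-block non-zero counts plus a discrete element count for each group) and the function name are assumptions made for this example.

```python
def utilization_vector(groups, m):
    """Return [sigma_1, ..., sigma_G] with sigma_g = N_g / (M * M).

    groups: per group, (list of block non-zero counts, number of discrete
    non-zero elements).
    """
    capacity = m * m
    return [(sum(block_counts) + discrete) / capacity
            for block_counts, discrete in groups]


sigma = utilization_vector([([30, 20], 6), ([40, 8], 0)], m=8)
```

For an 8×8 array, the two groups hold 56 and 48 non-zero elements, giving utilization rates of 0.875 and 0.75.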

[0120] Step S1020: When the effective utilization rate of the processing unit array in each group meets the constraints, determine multiple target task groups.

[0121] Constraints refer to a set of technical criteria used to determine whether a candidate grouping scheme is suitable for hardware execution requirements. They may include: the effective utilization rate of the processing unit array of each group is not lower than a preset utilization rate threshold, and the difference in the effective utilization rate of the processing unit array between each group does not exceed a preset error threshold.

[0122] The preset utilization threshold ensures that a single task group has sufficient computing density, avoiding idle hardware resources; the preset error threshold limits the load deviation between task groups, preventing some processing unit arrays from being overloaded while the rest are idle.

[0123] By partitioning the non-zero data set multiple times and dynamically evaluating the effective utilization of the processing unit array in each group, various potential task load distribution patterns can be covered. Based on this, pre-defined constraints are used to screen the consistency of each partitioning result, ensuring that the final target task group meets both the minimum computational density requirements and maintains a relatively balanced load between groups. This allows each task group subsequently distributed to the processing unit array to drive a near-peak number of processing units to work collaboratively, significantly reducing idle waiting time caused by load imbalance and improving the overall throughput of sparse tensor operations and hardware resource utilization efficiency.

[0124] In some embodiments, as shown in Figure 11, partitioning the non-zero data set multiple times may further include steps S1011 and S1012.

[0125] Step S1011: Generate the current grouping result according to the current partitioning strategy, and determine the performance index of the current grouping result.

[0126] The performance indicators include at least the effective utilization rate of the processing unit array for the current grouping result.

[0127] The current partitioning strategy refers to a set of rules or logical configurations used to divide a non-zero data set into multiple task groups. It may include grouping granularity control parameters, combined weights of blocks and discrete elements, upper limit of the number of groups, and constraint on the maximum number of non-zero elements in a single group.

[0128] The current grouping result refers to a specific partitioning scheme generated for the non-zero data set based on the current partitioning strategy. Each group contains several non-zero data blocks and / or discrete non-zero elements.

[0129] Performance metrics are quantitative parameters used to characterize the overall performance of the current grouping results in terms of hardware resource utilization. They can typically include the effective utilization rate of the processing unit array, which measures the resource filling degree of the processing unit array under the grouping task. The higher the value, the fewer idle computing units and the lower the resource waste.

[0130] In some alternative embodiments, the performance metrics can be further extended to a weighted combination form, for example, by introducing a variance term for effective utilization among groups to ensure balance, or by introducing memory access bandwidth utilization to jointly assess storage pressure.

[0131] Step S1012: Update the partitioning strategy based on the degree of performance optimization of the current grouping result compared to the previous historical grouping result.

[0132] The previous historical grouping result refers to the grouping result generated in the most recent execution of the same process before this iteration. The corresponding historical partitioning strategy has been archived and can be used for comparative analysis.

[0133] The degree of optimization of performance indicators refers to the improvement or trend of the current grouping results relative to the historical grouping results in at least one performance indicator. It can be expressed as absolute difference, relative growth rate, sign change (such as from negative to positive), etc.

[0134] The degree of performance metric optimization is used to drive the adaptive adjustment of partitioning strategy parameters. For example, when the effective utilization rate of the processing unit array increases beyond a preset threshold, the block merging tendency coefficient in the next round of partitioning is increased; or, when the effective utilization variance decreases, the spatial locality priority weight is enhanced.

[0135] In some optional embodiments, the updated partitioning strategy may specifically adopt the following steps: if the standard deviation of the effective utilization rate of each group processing unit array in the current grouping result decreases by δ compared with the historical result, and δ≥0.1, then in the next iteration, the probability of mixed grouping of discrete non-zero elements and non-zero data blocks is increased to improve the resource adaptation accuracy of small-scale non-zero units.
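The update rule of this embodiment can be sketched as follows. The population standard deviation, the increment `step` and the cap at 1.0 are assumptions made for this example; only the condition δ≥0.1 comes from the text.

```python
from statistics import pstdev


def update_mix_probability(prev_sigma, curr_sigma, mix_prob, step=0.05):
    """Raise the mixed-grouping probability when the spread of utilization
    across groups dropped by at least 0.1 compared with the previous result.

    `step` and the cap at 1.0 are assumptions made for this sketch.
    """
    delta = pstdev(prev_sigma) - pstdev(curr_sigma)
    if delta >= 0.1:
        return min(1.0, mix_prob + step)
    return mix_prob
```

For example, moving from utilization rates [0.2, 0.8] to [0.5, 0.5] reduces the standard deviation by about 0.3, so the probability is raised; a reduction of 0.01 leaves it unchanged.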

[0136] In other alternative embodiments, the update strategy can also involve constructing a strategy update model that takes the difference between the current and historical performance metrics, the current strategy parameter vector, and the statistical characteristics of the task data distribution as input, outputs the incremental correction amount of the next version of the partition strategy parameters, and superimposes it onto the original strategy vector to complete the update. Reinforcement learning can be used to improve the model's performance through trial and error and rewards.

[0137] For example, assume that the processing unit array size is M×M, the number of groups (combinations) is G, the total number of non-zero data blocks is Btotal, and the total number of discrete non-zero elements is Qtotal. Let the numbers of non-zero data blocks in the groups be GB1, GB2, …, GBg, and the numbers of discrete non-zero elements in the groups be GQ1, GQ2, …, GQg. Let the numbers of non-zero elements in the non-zero data blocks of group 1 be N11, N12, …, and those of group 2 be N21, N22, …. Let the effective utilization rates of the PE array for group 1, group 2, … be σg1, σg2, …, and let γ be the effective utilization rate threshold to be achieved. The following constraints then apply:
GB1 + GB2 + … + GBg = Btotal;
GQ1 + GQ2 + … + GQg = Qtotal;
N11 + N12 + … + GQ1 ≤ M×M;
N21 + N22 + … + GQ2 ≤ M×M;
σg1 = (N11 + N12 + … + GQ1) / (M×M), σg1 ≥ γ;
σg2 = (N21 + N22 + … + GQ2) / (M×M), σg2 ≥ γ;
σg1 = σg2 = … = σgg.

[0138] Under the above constraints, as shown in Figure 12, starting from the preset initial partitioning strategy, the current grouping result obtained by each partitioning strategy is tested to obtain the effective utilization rate of the processing unit array corresponding to that grouping result. The test results are fed back to the Agent as a reward, guiding the Agent to learn and take new actions. By adjusting the combination of non-zero data blocks and discrete non-zero elements in each group and then testing again, the effective utilization rate of the processing unit array is continuously improved until the optimal performance index is reached.

[0139] The reinforcement learning system is shown in Figure 13. Under the constraints, the agent adjusts the combination of non-zero data blocks and discrete non-zero elements according to the partitioning strategy, producing multiple grouping results. The effective utilization rate of the processing unit array corresponding to each grouping result serves as a reward, prompting the agent to take a new action. The reward is an immediate numerical feedback that measures the effectiveness of the current action. It can be positive (encouraging the current action), negative (penalizing the current action), or zero (no impact), enabling the agent to gradually adjust its partitioning strategy to maximize the long-term cumulative reward.

[0140] In some optional embodiments, in addition to feeding back the effective utilization rate of the processing unit array as a reward to the Agent, the computation time of each group can also be calculated and fed back to the Agent as part of the reward. The Agent thereby learns to produce a final grouping result in which the computation times of the groups are nearly equal, avoiding the situation where a processing unit finishes its group quickly and is then forced to wait for the other processing units.

[0141] By designing the partitioning process of non-zero data sets as an iterative optimization process with a feedback mechanism, each partition not only relies on static rules but also dynamically adjusts strategy parameters based on the actual performance of previous results. On this basis, the partitioning strategy is directionally modified according to the degree of optimization, thereby gradually approaching the grouping scheme that satisfies the constraints and has the optimal resource utilization. This mechanism does not rely on prior assumptions about the distribution of input data and can adapt to the changes in multimodal tensor structures under different sparse patterns, significantly improving the robustness and adaptability of the sparse tensor processing system.

[0142] In some embodiments, as shown in Figure 14, the sparse tensor processing method may also include steps S1410 and S1420.

[0143] Step S1410: Determine the effective utilization rate of the processing unit array for each group based on the number of non-zero elements in each group and the size of the processing unit array.

[0144] The number of non-zero elements in each group refers to the total number of non-zero elements contained in the non-zero data blocks, plus the number of discrete non-zero elements, within each target task group obtained from the current partitioning, after zero-element matching and removal. This number represents the effective amount of data that actually needs to be computed in that target task group.

[0145] The size of a processing unit array refers to the total number of basic processing units that can perform operations in parallel within the array. It can be a fixed-structure two-dimensional array (e.g., M×M size), or a configurable one-dimensional linear array or a three-dimensional topological array. The total number of units is denoted as NPE.

[0146] Step S1420: If the effective utilization rate of the processing unit array of each group is not less than the preset utilization rate threshold, and the difference between the effective utilization rates of the processing unit arrays of each group is not greater than the preset error threshold, then the constraint condition is satisfied.

[0147] The preset utilization threshold is the minimum effective utilization limit set to ensure the basic operating efficiency of the processing unit array. Its value reflects the system's technical requirements for resource idle tolerance. The preset error threshold is the maximum allowable utilization deviation set to ensure load balance among groups. Its value reflects the system's technical requirements for computational latency consistency.

[0148] The difference between the effective utilization rates of the processing unit arrays of each group can be the difference between the maximum and minimum effective utilization rates of the processing unit arrays corresponding to all target task groups, i.e., max(σ1,σ2,…,σg)-min(σ1,σ2,…,σg); the smaller the difference, the more consistent the utilization level of the processing unit arrays of each group is.

[0149] Specifically, iterate through the effective utilization rates of the processing unit arrays of all G target task groups to confirm whether all of them satisfy σg≥γ (γ is a preset utilization threshold). At the same time, calculate Δσ=max(σg)-min(σg) and determine whether Δσ≤ε (ε is a preset error threshold). When both conditions are met, it is determined that the constraint conditions are satisfied.
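The two-part check can be sketched in a few lines. The function name satisfies_constraints is hypothetical; gamma and epsilon stand for the preset utilization threshold γ and the preset error threshold ε.

```python
def satisfies_constraints(sigma, gamma, epsilon):
    """True if every sigma_g >= gamma and max(sigma) - min(sigma) <= epsilon."""
    return min(sigma) >= gamma and max(sigma) - min(sigma) <= epsilon
```

For utilization rates [0.875, 0.75], the check passes with γ = 0.7 and ε = 0.2, but fails if γ is raised to 0.8 or ε is tightened to 0.1.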

[0150] By leveraging the combined effects of effective utilization threshold and error threshold, an absolute efficiency baseline is set to prevent resource waste, while a relative balance upper limit is set to suppress load skew. This enables a quantitative assessment of the quality of the target task group partitioning. The assessment result directly drives the partitioning strategy update and can constitute a key reward signal source in the reinforcement learning framework, thereby supporting the stable convergence of the aforementioned partitioning strategy iterative optimization process.

[0151] In some embodiments, as shown in Figure 15, before distributing multiple target task groups to a preset processing unit array for computation, the sparse tensor processing method may further include steps S1510 to S1530.

[0152] Step S1510: For each target task group, extract the first element index set from the first sparse tensor and the second element index set from the second sparse tensor.

[0153] The first element index set refers to the set consisting of the coordinate indices of all non-zero data blocks and discrete non-zero elements contained in the target task group in the first sparse tensor. Each index item represents the position of a non-zero element in the first sparse tensor space. The second element index set refers to the set consisting of the coordinate indices of all non-zero data blocks and discrete non-zero elements contained in the same target task group in the second sparse tensor. Each index item represents the position of a non-zero element in the second sparse tensor space.

[0154] Neither the first nor the second element index set contains zero element indices, and the index format is consistent with the tensor dimension structure formed after preprocessing the multimodal input data.

[0155] Specifically, the element index set can be extracted in the following way: based on the block index, sub-block index and discrete index recorded by each target task group, the complete coordinate set corresponding to the original tensor space can be restored by table lookup mapping or coordinate expansion.
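A minimal sketch of the coordinate-expansion variant is given below. It assumes that each block index has already been decoded into an origin and a size, that each sub-block index is an origin relative to its block plus a size, and that sub-blocks mark zero regions to be excluded. These record layouts and the names `expand_box` and `restore_coordinates` are hypothetical and are not specified in the text.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

Coord = Tuple[int, int, int]
SubBlock = Tuple[Coord, Coord]               # (origin relative to block, size)
Block = Tuple[Coord, Coord, List[SubBlock]]  # (origin, size, zero sub-blocks)


def expand_box(origin: Coord, size: Coord) -> Set[Coord]:
    """Enumerate every coordinate inside an axis-aligned 3D box."""
    ranges = [range(o, o + s) for o, s in zip(origin, size)]
    return set(product(*ranges))


def restore_coordinates(blocks: Iterable[Block], discrete: Iterable[Coord]) -> Set[Coord]:
    """Rebuild the element index set of one task group in the original tensor space."""
    coords: Set[Coord] = set(discrete)       # discrete indices are already global
    for origin, size, sub_blocks in blocks:
        block_coords = expand_box(origin, size)
        for rel_origin, sub_size in sub_blocks:
            # Convert the block-relative origin into a global origin.
            abs_origin = tuple(o + r for o, r in zip(origin, rel_origin))
            block_coords -= expand_box(abs_origin, sub_size)
        coords |= block_coords
    return coords


if __name__ == "__main__":
    blocks = [((0, 0, 0), (2, 2, 2), [((1, 1, 0), (1, 1, 2))])]
    print(sorted(restore_coordinates(blocks, [(5, 5, 5)])))
```

Here a 2×2×2 block with one 1×1×2 zero sub-block contributes six coordinates, and the single discrete element contributes a seventh.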

[0156] Step S1520: Determine the index pairs that exist simultaneously in the first element index set and the second element index set.

[0157] An index pair consists of an index in the first element index set and an index in the second element index set that have the same coordinate value, indicating that both sparse tensors hold a non-zero element at the same spatial location. Index pairs constitute the basic alignment unit for subsequent operations, ensuring the semantic validity of binary operations such as multiplication or addition.

[0158] Specifically, index pairs can be determined as follows: load the first element index set and the second element index set into a hash table structure respectively, and identify all intersection indices by traversing either set and performing a search operation in the other set.
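The hash-based lookup described above might look like the following sketch, which uses Python's built-in `set` as the hash table. The function name `index_pairs`, the choice to iterate over the smaller set, and the sorted output are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Coord = Tuple[int, int, int]


def index_pairs(first: Iterable[Coord], second: Iterable[Coord]) -> List[Coord]:
    """Return the coordinates present in both element index sets."""
    small, large = set(first), set(second)   # both sets are hash tables
    if len(small) > len(large):
        small, large = large, small          # traverse the smaller set
    # Each membership test is an average O(1) hash lookup.
    return sorted(c for c in small if c in large)


if __name__ == "__main__":
    a = [(0, 0, 0), (1, 2, 3), (4, 4, 4)]
    b = [(1, 2, 3), (4, 4, 4), (9, 9, 9)]
    print(index_pairs(a, b))
```

Traversing the smaller set keeps the number of lookups at min(|A|, |B|); sorting the result is optional and only makes the order deterministic.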

[0159] Step S1530: Based on the index pairs, extract the corresponding data element pairs from the first sparse tensor and the second sparse tensor to obtain the computational data sequence to be processed.

[0160] A data element pair is an ordered pair of two non-zero values addressed by the same index, one from the first sparse tensor and the other from the second sparse tensor.

[0161] The computational data sequence is the sequence of all data element pairs that meet the above conditions, ordered by index or by task schedule.

[0162] Specifically, the storage structures of the first sparse tensor and the second sparse tensor can be accessed in the order of the index pairs, the corresponding data values can be obtained through index addressing, and the values can be assembled in order into contiguous memory blocks.
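The gathering step can be sketched as below, assuming for illustration that each sparse tensor is held as a coordinate-to-value dictionary and that the contiguous memory blocks are modeled with `array.array`. The name `gather_pairs` and this storage layout are hypothetical.

```python
from array import array
from typing import Dict, Sequence, Tuple

Coord = Tuple[int, int, int]


def gather_pairs(
    pairs: Sequence[Coord],
    first_values: Dict[Coord, float],
    second_values: Dict[Coord, float],
) -> Tuple[array, array]:
    """Fetch both operands of every index pair into two contiguous buffers."""
    lhs = array("d", (first_values[c] for c in pairs))   # operands from tensor 1
    rhs = array("d", (second_values[c] for c in pairs))  # operands from tensor 2
    return lhs, rhs


if __name__ == "__main__":
    first = {(0, 0, 0): 1.5, (1, 2, 3): 2.0, (4, 4, 4): -1.0}
    second = {(1, 2, 3): 0.5, (4, 4, 4): 3.0, (9, 9, 9): 7.0}
    lhs, rhs = gather_pairs([(1, 2, 3), (4, 4, 4)], first, second)
    print(lhs.tolist(), rhs.tolist())
```

Element i of `lhs` and element i of `rhs` form the i-th data element pair, so a processing unit array can stream both buffers in lockstep.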

[0163] In some optional embodiments, a data offset address field can be preset for each index pair in the sparse tensor compressed storage format, so that each data element pair can be read with a single indirect addressing operation during execution.
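One way to picture the offset-field variant is the sketch below. The record type `PairRecord`, its field names, and the flat value arrays are assumptions for illustration; the text does not fix the layout of the compressed storage format.

```python
from typing import List, NamedTuple, Sequence, Tuple


class PairRecord(NamedTuple):
    """Hypothetical index-pair record carrying precomputed data offsets."""
    coord: Tuple[int, int, int]
    offset_first: int    # position of the value in the first tensor's value array
    offset_second: int   # position of the value in the second tensor's value array


def read_pairs(
    records: Sequence[PairRecord],
    first_values: Sequence[float],
    second_values: Sequence[float],
) -> List[Tuple[float, float]]:
    """Read each data element pair with one indirect access per operand."""
    return [
        (first_values[r.offset_first], second_values[r.offset_second])
        for r in records
    ]


if __name__ == "__main__":
    first_values = [1.5, 2.0, -1.0]
    second_values = [0.5, 3.0, 7.0]
    records = [PairRecord((1, 2, 3), 1, 0), PairRecord((4, 4, 4), 2, 1)]
    print(read_pairs(records, first_values, second_values))
```

Because the offsets are stored with the index pair, no coordinate lookup is needed at execution time.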

[0164] By performing fine-grained index intersection determination at the target task group granularity, it is ensured that only the data elements corresponding to the valid indexes that actually participate in the operation are included in the computation process, avoiding invalid memory access and idle computation caused by missing indexes on one side. Moreover, it enables the data stream received by the processing unit array to have high spatial locality and structural consistency, thereby improving bandwidth utilization and computation pipeline throughput efficiency.

[0165] In some embodiments, as shown in Figure 16, the sparse tensor processing method may further include steps S1610 to S1620.

[0166] Step S1610: Collect the calculation results of each target task group according to the calculation order of the processing unit array corresponding to each target task group.

[0167] The computation order of the processing unit arrays corresponding to each target task group refers to the temporal order or preset scheduling sequence of the results generated by each array in completing its assigned task when multiple processing unit arrays execute computation tasks in parallel. This order can be an ascending order of array numbers, or a time sequence formed by dynamic feedback based on actual completion time.

[0168] Specifically, after each processing unit array completes the calculation of the current target task group, it can send a completion interrupt signal to the preset coordination unit, along with the unique identifier and timestamp of the target task group. The preset coordination unit queues and buffers the calculation results according to the order of the received interrupt signals and the preset scheduling sequence number.
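The coordination unit's queuing behavior can be sketched in software as a priority queue. In this sketch, results are ordered first by the preset scheduling sequence number and then by the reported timestamp; that tie-breaking rule and the names `Completion` and `Coordinator` are assumptions, and a hardware coordination unit would differ in detail.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List


@dataclass(order=True)
class Completion:
    """Payload delivered with a completion interrupt (hypothetical layout)."""
    sequence_number: int                   # preset scheduling sequence number
    timestamp: float                       # completion time reported by the array
    group_id: str = field(compare=False)   # unique identifier of the task group
    result: Any = field(compare=False)     # computation result of the task group


class Coordinator:
    """Buffers completions and releases them in scheduling order."""

    def __init__(self) -> None:
        self._queue: List[Completion] = []

    def on_interrupt(self, completion: Completion) -> None:
        heapq.heappush(self._queue, completion)

    def drain(self) -> List[Completion]:
        ordered = []
        while self._queue:
            ordered.append(heapq.heappop(self._queue))
        return ordered


if __name__ == "__main__":
    unit = Coordinator()
    unit.on_interrupt(Completion(2, 0.10, "g2", [3.0]))
    unit.on_interrupt(Completion(0, 0.30, "g0", [1.0]))
    unit.on_interrupt(Completion(1, 0.20, "g1", [2.0]))
    print([c.group_id for c in unit.drain()])
```

Although the arrays finish out of order, the drained list follows the scheduling sequence numbers.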

[0169] Step S1620: Based on the position information of the original sparse tensors corresponding to each target task group, reorganize the calculation results of each target task group into an output tensor.

[0170] The position information of the original sparse tensor corresponding to each target task group refers to the spatial coordinate range covered by the data of the target task group in the first or second sparse tensor, which may include the starting coordinates, dimension size, and modality attribution identifier. This information does not contain specific numerical content and is only used to indicate the area of the output tensor into which the operation results are written.

[0171] Specifically, a sparse index mapping table can be constructed for the output tensor, where each entry records a coordinate (e.g., x, y, z) and its corresponding data value. For each target task group, each data element in its computation result is traversed and, combined with the position information of the target task group, the global coordinates of the data element in the original sparse tensor are calculated. These coordinates and data values are then written into the mapping table, and finally a dense or sparse output tensor is generated from the mapping table.
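The reorganization step can be sketched as follows, assuming each task group reports its starting coordinates together with results keyed by group-local coordinates. The dictionary-based mapping table and the names `assemble_output` and `to_dense` are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Coord = Tuple[int, int, int]
GroupResult = Tuple[Coord, Dict[Coord, float]]  # (starting coordinates, local results)


def assemble_output(groups: Iterable[GroupResult]) -> Dict[Coord, float]:
    """Build the sparse index mapping table: global coordinate -> value."""
    table: Dict[Coord, float] = {}
    for origin, local_results in groups:
        for local, value in local_results.items():
            # Global coordinate = group starting coordinate + local offset.
            global_coord = tuple(o + l for o, l in zip(origin, local))
            table[global_coord] = value
    return table


def to_dense(table: Dict[Coord, float], shape: Coord) -> List[List[List[float]]]:
    """Materialize the mapping table as a dense nested-list tensor."""
    dense = [[[0.0] * shape[2] for _ in range(shape[1])] for _ in range(shape[0])]
    for (x, y, z), value in table.items():
        dense[x][y][z] = value
    return dense


if __name__ == "__main__":
    groups = [((0, 0, 0), {(0, 1, 0): 2.0}), ((2, 0, 0), {(1, 1, 1): 3.0})]
    table = assemble_output(groups)
    print(table)
    print(to_dense(table, (4, 2, 2))[3][1][1])
```

The mapping table itself serves as the sparse form of the output; `to_dense` is only needed when a dense output tensor is requested.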

[0172] By using the computation order of each target task group as the temporal basis for result collection and its position information in the original sparse tensor as the spatial basis for writing the results, the reverse reconstruction of local computation results to the global output tensor is realized. This enables the asynchronous computation results scattered on different processing unit arrays to be accurately returned to their positions, thereby fully preserving the spatial semantic structure of the original sparse tensor and the topological correlation between multimodal data without introducing additional interpolation or approximation operations.

[0173] It should be understood that while the steps in the flowcharts of the accompanying drawings are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise expressly stated herein, the steps shown in the flowcharts and other embodiments are not subject to strict order restrictions and can be executed in other orders. Furthermore, at least some steps in the foregoing embodiments may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the sub-steps or stages of other steps.

[0174] In a second aspect, embodiments of this application provide a sparse tensor processing apparatus. As shown in Figure 17, the sparse tensor processing apparatus 1700 includes: an acquisition module 1710, an extraction module 1720, a matching module 1730, and a grouping module 1740.

[0175] The acquisition module 1710 is used to acquire the first sparse tensor and the second sparse tensor to be processed; the first sparse tensor and the second sparse tensor are intermediate data in the form of three-dimensional matrices obtained by preprocessing the multimodal input data, and the multimodal input data includes at least images, videos, text, speech or point clouds. The extraction module 1720 is used to extract multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively; each dense data block is a set of elements in which the number of non-zero elements exceeds a preset density threshold. The matching module 1730 is used to perform zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine the non-zero data set to be processed. The grouping module 1740 is used to divide the non-zero data set into multiple target task groups to be processed according to preset constraints, and to distribute the multiple target task groups to a preset processing unit array for processing to obtain the processing results, so as to determine the processing result for the multimodal input data based on the processing results.

[0176] For further specific limitations regarding the sparse tensor processing apparatus, please refer to the limitations of the sparse tensor processing method above. The sparse tensor processing apparatus can also be used to perform further steps of the sparse tensor processing method in the embodiments of this application, which will not be repeated here. Each module in the above-described sparse tensor processing apparatus can be implemented entirely or partially through software, hardware, or a combination thereof. Each module can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0177] In a third aspect, embodiments of this application provide a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the sparse tensor processing method provided in any embodiment of the first aspect of this application.

[0178] In some embodiments, the computer device may be a server, and its internal structure may be as shown in Figure 18. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database stores data such as sparse tensors. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements the sparse tensor processing method in any embodiment of this document.

[0179] Those skilled in the art will understand that the structure shown in Figure 18 is merely a block diagram of part of the structure related to the embodiments of this application and does not constitute a limitation on the computer devices to which the embodiments of this application are applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0180] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the sparse tensor processing method provided in any embodiment of the first aspect of this application.

[0181] The computer-readable storage medium may be the non-volatile storage medium in the computer device shown in Figure 18.

[0182] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, databases, or other media used in the embodiments of this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0183] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0184] The above embodiments merely illustrate several implementation methods of this application, and their descriptions are quite specific and detailed, but they should not be construed as limiting the scope of protection of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the scope of protection of this application.

Claims

1. A sparse tensor processing method, characterized in that the method comprises: obtaining a first sparse tensor and a second sparse tensor to be processed, wherein the first sparse tensor and the second sparse tensor are intermediate data in the form of three-dimensional matrices obtained by preprocessing multimodal input data, and the multimodal input data includes at least images, videos, text, speech or point clouds; extracting multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively, wherein each dense data block is a set of elements in which the number of non-zero elements exceeds a preset density threshold, and each dense data block is in the form of a three-dimensional matrix; performing zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine a non-zero data set to be processed; and dividing the non-zero data set into multiple target task groups to be processed according to preset constraints, and distributing the multiple target task groups to a preset processing unit array for processing to obtain processing results, so as to determine a processing result for the multimodal input data based on the processing results.

2. The sparse tensor processing method according to claim 1, characterized in that the step of extracting multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively comprises: spatially scanning the first sparse tensor and the second sparse tensor respectively according to multiple preset block templates to identify spatial clustering regions of non-zero elements; and for each spatial clustering region, calculating the density of non-zero elements within it, and determining each spatial clustering region whose non-zero element density is greater than the preset density threshold to be a dense data block, wherein the geometric size of each dense data block is not fixed.

3. The sparse tensor processing method according to claim 1, characterized in that the method further comprises: generating a block index based on the spatial location and geometric size of each dense data block; extracting, from the multiple dense data blocks, multiple abnormal sub-blocks in which zero elements are clustered, and generating a sub-block index based on the spatial location and geometric size of each abnormal sub-block relative to the dense data block to which it belongs, wherein each abnormal sub-block is in the form of a three-dimensional matrix; and compressing and storing the first sparse tensor and the second sparse tensor based on the block indexes of the multiple dense data blocks, the sub-block indexes of the multiple abnormal sub-blocks, and the discrete indexes corresponding to the discrete elements in the first sparse tensor and the second sparse tensor.

4. The sparse tensor processing method according to claim 3, characterized in that the method further comprises: compressing and storing the discrete indexes corresponding to the discrete elements in the first sparse tensor and the second sparse tensor using a sparse tensor compression format based on coordinate lists.

5. The sparse tensor processing method according to claim 3, characterized in that the method further comprises: determining the spatial location of each dense data block from the coordinates of the first element in that dense data block; determining the block index corresponding to each dense data block based on the mapping relationship between the spatial location and geometric size of dense data blocks and preset block codes; and determining the sub-block index of each abnormal sub-block based on the mapping relationship between the spatial location and geometric size of an abnormal sub-block relative to its dense data block and preset sub-block codes.

6. The sparse tensor processing method according to claim 3, characterized in that the step of performing zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine the non-zero data set to be processed comprises: identifying multiple candidate block pairs with overlapping positions based on the block index of each dense data block; and for each candidate block pair, if either candidate block contains a target abnormal sub-block and/or a target zero element, deleting the index of the position corresponding to the target abnormal sub-block and/or the target zero element from the candidate block pair, so as to obtain multiple non-zero data blocks in the non-zero data set.

7. The sparse tensor processing method according to claim 6, characterized in that the method further comprises: determining multiple discrete element pairs based on the discrete index of each discrete element; and if any target discrete element pair among the discrete element pairs contains a discrete zero element, deleting the index of that target discrete element pair, so as to obtain multiple discrete non-zero elements in the non-zero data set.

8. The sparse tensor processing method according to any one of claims 1 to 7, characterized in that the step of dividing the non-zero data set into multiple target task groups to be processed according to preset constraints comprises: dividing the non-zero data set multiple times, and calculating the effective utilization rate of the processing unit array for each group after each division; and determining the multiple target task groups when the effective utilization rate of the processing unit array for each group satisfies the constraints.

9. The sparse tensor processing method according to claim 8, characterized in that the step of dividing the non-zero data set multiple times comprises: generating a current grouping result based on a current partitioning strategy, and determining performance indicators of the current grouping result, wherein the performance indicators include at least the effective utilization rate of the processing unit array for the current grouping result; and updating the partitioning strategy based on the degree of performance improvement of the current grouping result over the previous historical grouping result.

10. The sparse tensor processing method according to claim 8, characterized in that the method further comprises: determining the effective utilization rate of the processing unit array for each group based on the number of non-zero elements in the group and the size of the processing unit array; and if the effective utilization rate of the processing unit array for each group is not less than a preset utilization threshold, and the difference between the effective utilization rates of the processing unit arrays for the groups is not greater than a preset error threshold, determining that the constraints are satisfied.

11. The sparse tensor processing method according to claim 1, characterized in that the method further comprises: before distributing the multiple target task groups to the preset processing unit array for computation, extracting, for each target task group, a first element index set from the first sparse tensor and a second element index set from the second sparse tensor; determining the index pairs that exist simultaneously in both the first element index set and the second element index set; and extracting corresponding data element pairs from the first sparse tensor and the second sparse tensor based on the index pairs to obtain the computational data sequence to be processed.

12. The sparse tensor processing method according to claim 1, characterized in that the method further comprises: collecting the computation results of each target task group according to the computation order of the processing unit arrays corresponding to each target task group; and reorganizing the computation results of each target task group into an output tensor based on the position information of the original sparse tensors corresponding to each target task group.

13. A sparse tensor processing device, characterized in that the device comprises: an acquisition module, used to acquire a first sparse tensor and a second sparse tensor to be processed, wherein the first sparse tensor and the second sparse tensor are intermediate data in the form of three-dimensional matrices obtained by preprocessing multimodal input data, and the multimodal input data includes at least images, videos, text, speech, or point clouds; an extraction module, used to extract multiple dense data blocks from the first sparse tensor and the second sparse tensor respectively, wherein each dense data block is a set of elements in which the number of non-zero elements exceeds a preset density threshold; a matching module, used to perform zero-element matching on the first sparse tensor and the second sparse tensor based on the multiple dense data blocks to determine a non-zero data set to be processed; and a grouping module, used to divide the non-zero data set into multiple target task groups to be processed according to preset constraints, and to distribute the multiple target task groups to a preset processing unit array for processing to obtain processing results, so as to determine a processing result for the multimodal input data based on the processing results.

14. An electronic device, characterized in that the electronic device comprises: a memory, used to store a computer program; and a processor, configured to implement the steps of the sparse tensor processing method according to any one of claims 1 to 12 when executing the computer program.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the sparse tensor processing method according to any one of claims 1 to 12.