A computing and storage method and system for large language model sparse inference

By partitioning and dynamically scheduling the sparse matrix, and optimizing thread allocation, the problems of unbalanced load and memory access bottleneck in sparse matrix computation of large language models are solved, thereby improving the performance of parallel inference.

CN122195684A · Pending · Publication Date: 2026-06-12 · ZHONGNAN INFORMATION TECH (SHENZHEN) CO LTD +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ZHONGNAN INFORMATION TECH (SHENZHEN) CO LTD
Filing Date: 2026-05-15
Publication Date: 2026-06-12

Application Information


CN122195684A

IPC: G06F9/50; G06F9/48; G06N5/04

AI Tagging

Application Domain

Program initiation/switching; Resource allocation


AI Technical Summary

Technical Problem

Existing technologies suffer from uneven computational load and memory congestion when processing sparse matrices of large language models. They also lack a dynamic matching mechanism between the sparse data topology and the physical state of the hardware, leading to thread divergence and memory access bottlenecks.

Method used

Sparse activation matrix data is acquired and divided into micro data blocks according to a block step size parameter. The spatial span between adjacent non-zero elements in each block is extracted to obtain a non-zero aggregation index. This index is combined with the idle resource amount of the target thread bundle and the block's theoretical computing resource requirement to obtain a dynamic scheduling impedance, from which the optimal thread allocation is derived and used to allocate tasks.

Benefits of technology

Dynamic balanced configuration of sparse matrices is achieved, eliminating thread divergence and memory access congestion, and improving the parallel inference performance of large language models.



Abstract

The application belongs to the technical field of resource allocation, and particularly relates to a computing and storage method and system for sparse inference of a large language model, which comprises the following steps: obtaining sparse activation matrix data of the large language model, and cutting the sparse activation matrix data into multiple micro data blocks according to a block step parameter; extracting the spatial span of adjacent non-zero elements in the micro data blocks to obtain a non-zero aggregation index; obtaining the thread bundle idle resource amount of a target thread bundle and the theoretical computing resource demand amount of the micro data blocks, and combining the non-zero aggregation index to obtain a dynamic scheduling impedance; obtaining a hardware maximum thread quota, and obtaining an optimal thread allocation amount based on the dynamic scheduling impedance, the hardware maximum thread quota and the non-zero aggregation index; and generating a control instruction set based on the optimal thread allocation amount to perform a multiplication and addition operation. The application eliminates thread divergence and memory congestion caused by static allocation, realizes dynamic matching of computing power, and improves parallel inference performance.


Description

Technical Field

[0001] This invention relates to the field of resource allocation technology. More specifically, it relates to a computing and storage method and system for sparse inference of large language models.

Background Technology

[0002] In practical large language model inference, network structures with non-linear activation characteristics cause the outputs of each layer to exhibit extremely high sparsity, meaning that the matrices contain a large number of zero-valued elements. To reduce storage and memory bandwidth overhead, the industry typically stores and computes sparse matrices in compressed sparse row (CSR) format. Traditional task scheduling mechanisms based on compressed sparse rows usually allocate matrix elements row by row to different parallel computing threads or thread bundles (warps). Specifically, the task scheduler reads the row pointer array, determines the number of non-zero elements in each row, and then sequentially pushes the row computation tasks into the hardware execution queue, so that each thread processes the inner product operation of one or more rows.

[0003] However, these existing technologies have significant limitations when dealing with the highly unstructured sparse activation matrices of large language models. On the one hand, because the number of non-zero elements differs greatly from row to row, direct allocation based on static row indices distributes the computational load very unevenly across threads. In a single-instruction-multiple-data (SIMD) hardware architecture, threads within a thread bundle that finish rows with fewer non-zero elements early must remain idle until the most heavily loaded thread finishes its computation. This thread divergence, caused by the uneven distribution of non-zero elements across rows, severely wastes the computational resources of the underlying processor.
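The idle-lane effect described above can be illustrated with a small Python sketch. It assumes a thread bundle width of 32 and a one-row-per-thread mapping; the function names and the example matrix are hypothetical and do not come from the application.

```python
WARP_SIZE = 32  # assumed thread bundle width; not specified in the source


def row_nnz(row_ptr):
    """Non-zero count of each row, read from a CSR row pointer array."""
    return [hi - lo for lo, hi in zip(row_ptr, row_ptr[1:])]


def lane_utilization(row_ptr, warp_size=WARP_SIZE):
    """Fraction of lane-steps doing useful work when each thread takes one row.

    A thread bundle runs for as many steps as its heaviest row, so threads
    holding lighter rows sit idle for the remainder.
    """
    nnz = row_nnz(row_ptr)
    useful = sum(nnz)
    occupied = 0
    for start in range(0, len(nnz), warp_size):
        occupied += max(nnz[start:start + warp_size]) * warp_size
    return useful / occupied if occupied else 1.0


# Hypothetical matrix: one row with 64 non-zeros, 31 rows with 2 each.
counts = [64] + [2] * 31
row_ptr = [0]
for c in counts:
    row_ptr.append(row_ptr[-1] + c)

utilization = lane_utilization(row_ptr)
print(f"{utilization:.3f}")  # 126 useful lane-steps out of 2048
```

In this example a single heavy row keeps the other 31 lanes waiting, which is the imbalance the static row-wise allocation cannot avoid.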

[0004] On the other hand, traditional task scheduling mechanisms do not consider the hit rate of multi-level caches or the continuity of physical memory accesses in the underlying hardware. When concurrent threads read column index arrays to obtain the corresponding values, a large number of non-contiguous, discrete memory access requests are easily generated, leading to a sharp drop in memory bus bandwidth utilization and severe pipeline stalls. In summary, existing technologies lack a dynamic matching mechanism between the sparse data topology and the real-time physical load status of the hardware. The scheduler can only passively dispatch tasks according to the static memory layout of the data. This rigid static scheduling mode inevitably causes severe computational load imbalance and memory access congestion when performing irregular sparse matrix multiplications, ultimately weakening the parallel inference performance of large language models significantly.

Summary of the Invention

[0005] To address the technical problem of the lack of a dynamic matching mechanism between sparse data topology and hardware physical state in the existing technology, which leads to unbalanced computing load and memory access congestion, the present invention provides solutions in the following aspects.

[0006] In a first aspect, the present invention provides a computation and storage method for sparse inference of large language models, comprising: acquiring sparse activation matrix data of a large language model; dividing the sparse activation matrix data into multiple micro-data blocks according to a preset block step size parameter, extracting the spatial span of adjacent non-zero elements within the micro-data blocks in the physical memory address, and acquiring the non-zero aggregation index of the micro-data blocks based on the spatial span; acquiring the amount of idle resources of the target thread bundle in the target computing core, and acquiring the theoretical computing resource requirement corresponding to the micro-data blocks, acquiring the dynamic scheduling impedance when the micro-data blocks are mapped to the target thread bundle based on the ratio of the theoretical computing resource requirement to the idle resources of the thread bundle, and the non-zero aggregation index of the micro-data blocks; scaling the maximum thread quota of the underlying hardware based on the dynamic scheduling impedance, and performing a secondary correction based on the non-zero aggregation index to obtain the optimal thread allocation; generating a control instruction set for the hardware task distribution engine based on the optimal thread allocation, so that the underlying physical computing unit performs the sparse activation matrix multiplication and addition operation of the large language model.

[0007] This invention obtains sparse activation matrix data of a large language model and cuts it into micro-data blocks according to the block step size parameter. It extracts the spatial span of adjacent non-zero elements to obtain a non-zero aggregation index. At the same time, it obtains dynamic scheduling impedance by combining the idle resource amount of the target thread bundle and the theoretical computational resource requirement of the micro-data blocks. Finally, it obtains the optimal thread allocation and generates a control instruction set based on the dynamic scheduling impedance, the maximum hardware thread quota, and the non-zero aggregation index. By dividing the large and irregular sparse matrix and extracting spatial locality features, and combining the real-time physical load status of the underlying hardware for adaptive matching of computing resources, this invention eliminates the thread divergence and memory congestion problems caused by uneven local computational difficulty and discrete memory access in the traditional static allocation mode. It achieves dynamic balanced allocation of computing power and improves the overall performance of parallel inference of large language models.

[0008] Preferably, the step of dividing the sparse activation matrix data into multiple micro-data blocks according to a preset block step size parameter, and extracting the spatial span of adjacent non-zero elements in the physical memory address within the micro-data block, includes: reading sparse activation matrix data stored in a compressed sparse row format from the physical video memory space, wherein the sparse activation matrix data includes a row pointer array, a column index array, and a non-zero value array; dividing the sparse activation matrix data into multiple micro-data blocks in a two-dimensional space according to the block step size parameter; parsing the column index array corresponding to the micro-data block; scanning the set of non-zero elements in the micro-data block; and extracting the spatial span of two adjacent non-zero elements in the physical memory address.
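A minimal Python sketch of this partition-and-scan step follows. The application does not state how an element's physical address is derived, so the sketch assumes a dense row-major offset `(row * n_cols + col) * elem_bytes`; the function name, the 4-byte element size and the example matrix are likewise assumptions.

```python
from collections import defaultdict


def block_spans(row_ptr, col_idx, n_cols, block_step, elem_bytes=4):
    """Group CSR non-zeros into block_step x block_step tiles and return,
    per tile, the byte spans between consecutive non-zeros.

    Assumption: an element's address is its dense row-major offset,
    (row * n_cols + col) * elem_bytes. The source does not define this.
    """
    addresses = defaultdict(list)
    n_rows = len(row_ptr) - 1
    for r in range(n_rows):
        for p in range(row_ptr[r], row_ptr[r + 1]):
            c = col_idx[p]
            tile = (r // block_step, c // block_step)
            addresses[tile].append((r * n_cols + c) * elem_bytes)

    spans = {}
    for tile, addrs in addresses.items():
        addrs.sort()
        spans[tile] = [hi - lo for lo, hi in zip(addrs, addrs[1:])]
    return spans


# Hypothetical 4x4 matrix, 2x2 tiles, 4-byte elements.
row_ptr = [0, 2, 3, 3, 5]
col_idx = [0, 1, 3, 2, 3]
spans = block_spans(row_ptr, col_idx, n_cols=4, block_step=2)
print(spans)
```

Tiles that contain no non-zero elements do not appear in the result, and a tile with a single non-zero element yields an empty span list.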

[0009] Preferably, the non-zero aggregation index of the micro-data block is obtained based on the spatial span, and it satisfies the following expression:

$$\rho_k = \frac{1}{N_k - 1} \sum_{i=1}^{N_k - 1} \exp\left(-\frac{d_i}{L_c}\right)$$

In the formula, $\rho_k$ is the non-zero aggregation index of micro data block $B_k$; $N_k$ is the total number of non-zero elements contained in $B_k$; $i$ is the sorted index of the adjacent non-zero element pairs within $B_k$; $d_i$ is the spatial span between the $i$-th pair of adjacent non-zero elements; $\exp(\cdot)$ is the exponential function with the natural constant as its base; and $L_c$ is the base cache line step size of the underlying processor.

[0010] This invention obtains the non-zero aggregation index of micro data blocks by constructing an exponential decay operation based on the spatial span and the base cache line step size. The operation follows the principle of spatial locality in computer architecture and uses the monotonic decay of the exponential function to represent the physical continuity of memory access. When elements are scattered, the negative exponential term drops rapidly; when elements are densely arranged, it approaches its maximum value. This transforms the discrete topological state of physical space into a state quantity that characterizes hardware cache continuity, which indicates how likely coalesced memory access is when a micro data block enters the underlying multi-level cache and gives the scheduling decision a more rigorous basis.
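As a worked illustration of this decay, assume each adjacent pair contributes a term of the form $\exp(-d/L_c)$, where $d$ is the span in bytes and $L_c$ is the cache line step size. The symbols and the 64-byte cache line are assumptions chosen for the example.

$$
\begin{aligned}
d = 4 &: \quad \exp(-4/64) \approx 0.939 \\
d = 64 &: \quad \exp(-64/64) \approx 0.368 \\
d = 1024 &: \quad \exp(-1024/64) \approx 1.1 \times 10^{-7}
\end{aligned}
$$

Neighbours within the same cache line contribute close to 1, a span of exactly one cache line contributes about 0.37, and spans of many cache lines contribute almost nothing.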

[0011] Preferably, obtaining the amount of idle resources of the target thread bundle in the target computing core includes: reading the physical resource status of each thread bundle in the target computing core in real time through the hardware performance counter of the underlying processor; extracting the current amount of idle resources of the target thread bundle, wherein the amount of idle resources represents the number of idle arithmetic logic unit slots in the target computing core that are currently available for allocation to the target thread bundle.

[0012] Preferably, obtaining the theoretical computational resource requirement corresponding to the micro data block includes: extracting the total number of non-zero elements contained within the micro data block; obtaining the basic clock cycles required for hardware to perform floating-point multiply-accumulate operations; combining the total number of non-zero elements with the basic clock cycles to reflect the total time workload required to complete the multiply-accumulate operation of all valid elements within the micro data block; introducing the throughput conversion rate of the underlying architecture to map the total time workload into hardware carrying requirements based on the spatial capacity dimension, and calculating the theoretical computational resource requirement.

[0013] Preferably, the dynamic scheduling impedance when the micro data block is mapped to the target thread bundle is obtained based on the ratio of the theoretical computing resource requirement to the idle resource amount of the thread bundle, together with the non-zero aggregation index of the micro data block, and it satisfies the following expression:

$$Z_k = \frac{1 / \rho_k}{\left( R_k / F_w \right)^2}$$

In the formula, $Z_k$ is the dynamic scheduling impedance when micro data block $B_k$ is mapped to the target thread bundle; $\rho_k$ is the non-zero aggregation index of micro data block $B_k$; $R_k$ is the theoretical computing resource requirement of micro data block $B_k$; and $F_w$ is the current idle resource amount of the target thread bundle.

[0014] This invention uses the reciprocal of the non-zero aggregation index as the base term and combines it with the square of the ratio of the theoretical computing resource requirement to the idle resource amount of the thread bundle, which forms a dynamic barrier, to obtain the dynamic scheduling impedance when a micro data block is mapped to the target thread bundle. When discrete data is processed or the system load is too heavy, the calculated impedance value automatically acts as a penalty scaling factor for computing power allocation. Through direct multiplication, it suppresses over-dispatch to the current task queue, forming a physical flow control mechanism that adaptively predicts and intercepts scheduling behavior that would cause load imbalance.

[0015] Preferably, the method for obtaining the maximum hardware thread quota of the underlying hardware includes: calling the application programming interface of the underlying computing platform architecture to query device attributes; and extracting the maximum hardware thread quota allocated to a single task block.

[0016] Preferably, the maximum thread quota of the underlying hardware is scaled based on the dynamic scheduling impedance and then further corrected according to the non-zero aggregation index to obtain the optimal thread allocation, which satisfies the following expression:

$$T_k = \left\lfloor \frac{Z_k \cdot T_{\max}}{\sqrt{\rho_k}} \right\rfloor$$

In the formula, $T_k$ is the calculated optimal thread allocation; $\lfloor \cdot \rfloor$ is the floor function operator; $Z_k$ is the dynamic scheduling impedance of micro data block $B_k$; $T_{\max}$ is the maximum hardware thread quota of the underlying hardware; and $\rho_k$ is the non-zero aggregation index of micro data block $B_k$.

[0017] This invention multiplies the maximum hardware thread quota by the dynamic scheduling impedance as a scaling factor, and divides the result by the square root of the non-zero aggregation index of the micro data block as a secondary damping correction, to finally obtain the optimal thread allocation. Following the resource contention and memory access coalescing rules of parallel computing, it actively reduces the number of independent threads allocated to highly continuous data blocks, so that a small number of threads complete coalesced memory accesses in a continuous batch manner. This releases throughput potential and avoids the bus access conflicts and wasted clock cycles caused by too many independent threads preempting each other, thereby making full use of the underlying physical parallel computing units.

[0018] Preferably, generating a control instruction set for the hardware task distribution engine based on the optimal thread allocation, so that the underlying physical computing unit performs the sparse activation matrix multiplication and addition operation of the large language model, includes: generating the control instruction set for the hardware task distribution engine based on the optimal thread allocation, wherein the control instruction set includes the starting physical address of the micro data block in video memory, the line read offset, and the corresponding optimal thread allocation; the task distribution engine sends the control instruction set to the underlying physical computing unit, and the allocated thread bundle executes the sparse activation matrix multiplication and addition operation of the large language model in parallel according to the instruction parameters to complete the inference advancement of the current layer.
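The three fields that the control instruction set is said to carry can be pictured as a simple record. The Python sketch below is illustrative only: the class name, field names and the rule of skipping blocks with fewer than one thread are assumptions, and a real task distribution engine would consume a hardware-specific encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockDispatch:
    """One control entry for a micro data block (hypothetical layout)."""
    start_address: int      # starting physical address of the block in video memory
    line_read_offset: int   # line read offset, as named in the source
    thread_count: int       # optimal thread allocation for the block


def build_instruction_set(blocks):
    """Build control entries from (start_address, line_read_offset, thread_count).

    Assumption: blocks that were allocated fewer than one thread are skipped.
    """
    entries = []
    for start_address, line_read_offset, thread_count in blocks:
        if thread_count >= 1:
            entries.append(
                BlockDispatch(start_address, line_read_offset, thread_count)
            )
    return entries


# Hypothetical addresses and thread counts.
instruction_set = build_instruction_set([
    (0x1000, 0, 256),
    (0x2000, 64, 0),
    (0x3000, 128, 96),
])
for entry in instruction_set:
    print(entry)
```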

[0019] Based on the optimal thread allocation, this invention generates a control instruction set for the hardware task dispatch engine that contains parameters such as the starting physical address in video memory and the line read offset. The allocated thread bundles strictly follow this instruction set to execute multiplication and addition operations in parallel. The adaptive scheduling strategy obtained from the preceding inverse solution is thereby put into effect, so that the amount of computing resources injected matches the physical complexity of the current task. This eliminates, at the level of physical execution, the thread divergence that arises in a single-instruction-multiple-data architecture, and effectively alleviates the severe pipeline stalls caused by load imbalance.

[0020] In a second aspect, the present invention provides a computation and storage system for sparse inference of large language models, including a processor and a memory, wherein the memory stores computer program instructions, and when the computer program instructions are executed by the processor, the aforementioned computation and storage method for sparse inference of large language models is implemented.

[0021] By adopting the above technical solution, a computer program is generated from the above-mentioned computation and storage method for sparse inference of large language models and stored in the memory so that it can be loaded and executed by the processor. In this way, a terminal device can be made based on the memory and the processor for convenient use.

[0022] The beneficial effects of this invention are as follows: This invention obtains sparse activation matrix data of a large language model and cuts it into micro-data blocks according to the block step size parameter. It extracts the spatial span of adjacent non-zero elements to obtain a non-zero aggregation index. At the same time, it obtains dynamic scheduling impedance by combining the idle resource amount of the target thread bundle and the theoretical computational resource requirement of the micro-data blocks. Finally, it obtains the optimal thread allocation and generates a control instruction set based on the dynamic scheduling impedance, the maximum hardware thread quota, and the non-zero aggregation index. By dividing the large and irregular sparse matrix and extracting spatial locality features, and combining the real-time physical load status of the underlying hardware for adaptive matching of computing resources, this invention eliminates the thread divergence and memory congestion problems caused by uneven local computational difficulty and discrete memory access in the traditional static allocation mode. It achieves dynamic balanced allocation of computing power and improves the overall performance of parallel inference of large language models.

Attached Figure Description

[0023] Figure 1 is a flowchart of a computation and storage method for sparse inference of large language models according to this invention; Figure 2 is a schematic diagram comparing the thread allocation of the present invention with that of conventional solutions; Figure 3 is a schematic diagram comparing the pipeline stall overhead of the present invention with that of conventional solutions.

Detailed Implementation

[0024] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0025] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0026] This invention discloses a computation and storage method for sparse inference of large language models. Referring to Figure 1, the method includes steps S1-S3:

S1: Read the sparse activation matrix data of the large language model from the physical memory space, cut the sparse activation matrix data into multiple micro data blocks according to the block step size parameter, extract the physical memory span of the non-zero elements in each micro data block, and calculate the non-zero aggregation index of each micro data block.

[0027] It should be noted that, since existing technologies rely solely on the statistical information of row pointers and ignore the dispersion of non-zero elements in the physical memory space, subsequent memory access behaviors are very likely to break the continuity of the cache. Therefore, this invention, by dividing the large sparse matrix into micro data blocks and examining the spatial span between adjacent non-zero elements within the micro data blocks, can reflect the spatial locality characteristics of the data block when entering the underlying cache, providing a data structure basis for eliminating non-merged memory accesses.

[0028] Specifically, the sparse activation matrix data of the large language model is read from the physical memory space of the runtime environment. The sparse activation matrix data is stored in a compressed sparse row format and includes an array of row pointers, an array of column indices, and an array of non-zero values.

[0029] Furthermore, a block step size parameter $b$ is set. According to $b$, the sparse activation matrix data is divided in two-dimensional space into a grid of micro data blocks. For any micro data block $B_k$, the column index array corresponding to $B_k$ is parsed, the set of non-zero elements in the block is scanned, and the spatial span $d_i$ between each two adjacent non-zero elements in physical memory address is extracted. The spatial span is measured in bytes.

[0030] The block step size parameter $b$ is a two-dimensional scale parameter that controls the granularity of matrix feature extraction. If it is set too small, the number of micro data blocks surges, greatly increasing the traversal overhead of the controller and the feature calculation latency, so that the scheduling system itself becomes a computational bottleneck. If it is set too large, a single micro data block spans several locally dense and locally sparse regions at once, which smooths out its internal features so that local fluctuations in memory dispersion can no longer be captured and the subsequent scheduling strategy fails. Its value range is therefore set to 16 to 128, and in this embodiment it is set to 64, so that the sparse boundaries of the local activation vectors of the large language model are delineated without adding significant scheduling latency. In other embodiments, implementers can set it according to the upper limit of the shared memory capacity of the underlying processor.
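The first half of this trade-off, the growth in block count as the step shrinks, can be made concrete with a short Python sketch. The 4096 x 4096 matrix size is a hypothetical example, and the helper name is not taken from the application.

```python
import math


def micro_block_count(n_rows, n_cols, block_step):
    """Number of block_step x block_step tiles needed to cover the matrix."""
    return math.ceil(n_rows / block_step) * math.ceil(n_cols / block_step)


# Hypothetical 4096 x 4096 activation matrix at the two ends of the stated
# range and at the value used in the embodiment.
for step in (16, 64, 128):
    print(step, micro_block_count(4096, 4096, step))
```

Going from a step of 64 to a step of 16 multiplies the number of blocks the scheduler must traverse by 16, which is the overhead the paragraph warns about.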

[0031] Furthermore, based on the spatial span $d_i$, the non-zero aggregation index $\rho_k$ of micro data block $B_k$ is calculated. The non-zero aggregation index $\rho_k$ satisfies the expression:

$$\rho_k = \frac{1}{N_k - 1} \sum_{i=1}^{N_k - 1} \exp\left(-\frac{d_i}{L_c}\right)$$

[0032] In the formula, $\rho_k$ is the non-zero aggregation index of micro data block $B_k$; $N_k$ is the total number of non-zero elements contained in $B_k$, so $N_k - 1$ is the total number of adjacent non-zero element pairs contained in $B_k$, used for average normalization; $i$ is the sorted index of the adjacent non-zero element pairs within $B_k$; $d_i$ is the spatial span between the $i$-th pair of adjacent non-zero elements; $\exp(\cdot)$ is the exponential function with the natural constant as its base; and $L_c$ is the base cache line step size of the underlying processor, measured in bytes. This parameter is a fixed physical constant read from the hardware configuration description file.

[0033] It should be noted that when the total number of non-zero elements contained within a micro data block is less than or equal to 1, the merged memory access advantage cannot be formed. In this case, the non-zero aggregation exponent is directly assigned a minimum value of 0.

[0034] The expression is based on the principle of spatial locality in computer architecture and uses the monotonic decay of the exponential function to represent the physical continuity of memory access. When the spatial span $d_i$ between adjacent non-zero elements is much larger than the base cache line step size $L_c$, the elements are scattered and cache misses are likely. As the span grows, the magnitude of the exponent increases, the negative exponential term rapidly approaches zero, and the overall non-zero aggregation index $\rho_k$ decreases accordingly. Conversely, when elements are densely packed, the span between adjacent elements shrinks, the negative exponential term increases, and $\rho_k$ approaches its maximum value of 1. Through this operation, the topological distribution of the matrix in physical space is transformed into a state quantity that characterizes the continuity of the hardware cache.
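The behaviour described in this step can be sketched in Python as the mean of a negative exponential over the adjacent-pair spans. The sketch assumes each pair contributes a term `exp(-d / L)`; the function name and the 64-byte default cache line are assumptions. The rule that blocks with at most one non-zero element receive the value 0 is taken from the text.

```python
import math


def aggregation_index(spans, cache_line_bytes=64):
    """Mean of exp(-d / L) over the byte spans between adjacent non-zeros.

    A block with N non-zeros has N - 1 spans, so an empty list corresponds
    to N <= 1 and is assigned the minimum value 0, as stated in the text.
    """
    if not spans:
        return 0.0
    total = sum(math.exp(-d / cache_line_bytes) for d in spans)
    return total / len(spans)


dense = aggregation_index([4, 4, 4])            # neighbours share a cache line
scattered = aggregation_index([1024, 2048, 4096])  # neighbours far apart
single = aggregation_index([])                  # block with one non-zero

print(f"{dense:.3f} {scattered:.2e} {single}")
```

Densely packed blocks score close to 1, scattered blocks score close to 0, and the result never leaves the interval from 0 to 1.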

[0035] S2: Obtain the real-time physical resource status of each thread bundle in the underlying hardware, and calculate the dynamic scheduling impedance when mapping the micro data block to the target thread bundle by combining the non-zero aggregation index of the micro data block.

[0036] It should be noted that, in order to avoid multiple dense data blocks flooding into the same thread bundle at the same time, causing local computing power overload while other thread bundles are idle, this invention obtains the real-time idle resource quantity of the underlying hardware and couples the aggregation characteristics of data blocks with the load status of the thread bundle to construct a dynamic scheduling impedance that characterizes the difficulty of task distribution, and predicts and intercepts scheduling behaviors that cause system load imbalance in advance.

[0037] Specifically, the physical resource status of each thread bundle in the target computing core is read in real time through the hardware performance counters of the underlying processor, and the idle resource amount of the target thread bundle is extracted. The idle resource amount $F_w$ of the thread bundle is the number of free arithmetic logic unit slots in the target computing core that are currently available for allocation to the target thread bundle.

[0038] At the same time, the total number $N_k$ of non-zero elements contained in micro data block $B_k$ is extracted, and the base clock cycles required for the hardware to perform a floating-point multiply-accumulate operation are obtained. Combining the two gives the total time workload required to complete the multiply-accumulate calculation of all valid elements within the micro data block. The throughput conversion rate of the underlying architecture is then introduced as a conversion bridge to map this time-based workload into a hardware carrying requirement expressed in terms of spatial capacity, thereby yielding the theoretical computing resource requirement $R_k$ of the micro data block.
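The text does not give a closed-form expression for this conversion, so the Python sketch below is one plausible reading: the time workload is taken as the product of the non-zero count and the cycles per multiply-accumulate, and the throughput conversion rate is applied as a simple multiplier. The function name, parameter names and example values are assumptions.

```python
def theoretical_demand(nnz, mac_cycles, throughput_rate):
    """Estimate the resource requirement of one micro data block.

    Assumptions (not stated as formulas in the source):
      - time workload = nnz * mac_cycles, in clock cycles
      - throughput_rate converts cycles into arithmetic logic unit slots
    """
    time_workload = nnz * mac_cycles
    return time_workload * throughput_rate


# Hypothetical block: 256 non-zeros, 4 cycles per multiply-accumulate,
# and one slot per 32 cycles of work.
demand = theoretical_demand(nnz=256, mac_cycles=4, throughput_rate=1 / 32)
print(demand)
```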

[0039] Furthermore, based on the non-zero aggregation index $\rho_k$, the theoretical computing resource requirement $R_k$ and the idle resource amount $F_w$ of the thread bundle, the dynamic scheduling impedance $Z_k$ when micro data block $B_k$ is mapped to the target thread bundle is calculated. The dynamic scheduling impedance $Z_k$ satisfies the expression:

$$Z_k = \frac{1 / \rho_k}{\left( R_k / F_w \right)^2}$$

[0040] In the formula, $Z_k$ is the dynamic scheduling impedance when micro data block $B_k$ is mapped to the target thread bundle; $\rho_k$ is the non-zero aggregation index of micro data block $B_k$; $R_k$ is the theoretical computing resource requirement of micro data block $B_k$; and $F_w$ is the current idle resource amount of the target thread bundle.

[0041] The above formula uses the reciprocal of the non-zero aggregation index as its base term. The smaller the non-zero aggregation index $\rho_k$, that is, the more discrete the data, the larger its reciprocal and the higher the numerator, indicating that processing this discrete data block incurs a higher basic scheduling cost. The denominator constitutes an adjustment mechanism based on hardware state, using the square of the ratio of resource demand to idle resources to form a dynamic barrier: when the theoretical computing resource requirement $R_k$ is greater than the idle resource amount $F_w$ of the thread bundle, the system is overloaded, the square of the ratio grows rapidly, and the overall dynamic scheduling impedance $Z_k$ decreases toward a small value. In the subsequent calculation, this smaller impedance value acts as a penalty scaling factor for computing power allocation and, through direct multiplication, suppresses over-dispatch to the current task queue, forming a physical flow control mechanism.
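The interplay of numerator and denominator described here can be sketched in Python. The sketch reads the description literally, with the reciprocal of the aggregation index in the numerator and the squared demand-to-idle ratio in the denominator; the exact expression in the application may contain further terms. Returning 0 for non-positive inputs is an added assumption to avoid division by zero.

```python
def scheduling_impedance(rho, demand, idle):
    """Reciprocal of the aggregation index divided by (demand / idle) squared.

    Assumption: blocks with rho <= 0, no demand, or no idle resources are
    given an impedance of 0 so that they receive no threads downstream.
    """
    if rho <= 0 or demand <= 0 or idle <= 0:
        return 0.0
    ratio = demand / idle
    return (1.0 / rho) / (ratio * ratio)


# Hypothetical values: the same block mapped to a thread bundle whose idle
# capacity matches the demand, then to one with only half that capacity.
balanced = scheduling_impedance(rho=0.5, demand=16, idle=16)
overloaded = scheduling_impedance(rho=0.5, demand=32, idle=16)
print(balanced, overloaded)
```

When demand doubles relative to idle capacity, the impedance falls by a factor of four, which is the rapid suppression the paragraph attributes to the squared ratio.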

[0042] S3: Based on dynamic scheduling impedance, perform parallel resource inverse solution on micro data blocks, calculate the optimal thread allocation amount to be injected into the target micro data block, and issue scheduling instructions to the hardware task queue to perform sparse inference.

[0043] It should be noted that, since traditional scheduling uses a fixed parallel granularity, it cannot adapt to the computational difficulty within a data block, which easily leads to some threads within the same thread bundle prematurely ending their computation and entering an idle waiting state. Therefore, this invention, based on the pre-calculated dynamic scheduling impedance, inversely solves for the number of threads that match the complexity of the current task and the hardware capacity, so that the allocation of computing resources adaptively matches each region of the sparse matrix, eliminating the thread divergence phenomenon caused by static allocation.

[0044] Specifically, the method calls the application programming interface of the underlying computing platform, queries the device properties, and extracts the maximum hardware thread quota that can be allocated to a single task block.

[0045] Furthermore, based on the dynamic scheduling impedance, the maximum hardware thread quota, and the non-zero aggregation index, the optimal thread allocation of the micro data block within the current scheduling cycle is calculated. The optimal thread allocation satisfies the expression:

[0046] In the formula, the result is the calculated optimal thread allocation; the round-up operator ensures that the output is a valid whole number of physical threads; the dynamic scheduling impedance of the micro data block is used as a dimensionless penalty scaling factor in the calculation; the remaining terms are the maximum thread quota set by the underlying hardware and the non-zero aggregation index of the micro data block.

[0047] The calculation formula combines resource scaling with a quadratic damping correction in the form of a Euclidean norm. The dimensionless dynamic scheduling impedance serves as the weighting benchmark for quota adjustment: it is multiplied directly, as a scaling factor, into the physically constrained maximum hardware thread quota, so that the initially allocated computing scale conforms to the current dynamic load constraints of the hardware. The square root term in the denominator likewise acts as a dimensionless secondary correction factor. This mechanism is consistent with the rules of resource contention and memory access coalescing in parallel computing: when the non-zero aggregation index of the micro data block is high, i.e., when the data is highly contiguous, the secondary correction factor increases accordingly, so the formula reduces rather than increases the final optimal thread allocation. The reason is that, in physical execution, highly contiguous data contended for by too many independent threads causes bus access conflicts. Reducing the number of threads lets a small number of threads complete coalesced memory accesses in contiguous batches, which releases throughput potential, avoids the clock cycles wasted by excess independent threads, and makes full use of the hardware parallel computing units.
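The scaling and secondary correction described above can be sketched as follows. The expression itself is not reproduced in this passage, so the form ceil(Z * T_max / sqrt(1 + A^2)) is an assumption that merely matches the description (impedance multiplied into the quota, a Euclidean-norm style square root in the denominator, a round-up operator). The function and parameter names are hypothetical, and the final cap at the hardware quota is an added safeguard, not something the text states. The quota value 1024 is a placeholder; on a CUDA device it would typically come from the `maxThreadsPerBlock` device property, which is likewise an assumption about the platform.

```python
import math


def optimal_thread_allocation(impedance: float,
                              max_threads: int,
                              aggregation_index: float) -> int:
    """Illustrative allocation, ASSUMED form: ceil(Z * T_max / sqrt(1 + A ** 2))."""
    # Secondary correction: grows with the aggregation index, so highly
    # contiguous blocks receive fewer threads.
    damping = math.sqrt(1.0 + aggregation_index ** 2)
    raw = impedance * max_threads / damping
    # Round up to a whole number of threads; capping at the hardware quota
    # is an assumption added for safety in this sketch.
    return min(max_threads, math.ceil(raw))


scattered_block = optimal_thread_allocation(0.2, max_threads=1024, aggregation_index=0.5)
contiguous_block = optimal_thread_allocation(0.2, max_threads=1024, aggregation_index=1.0)
```

With the same impedance, the block with the higher aggregation index receives the smaller allocation, matching the stated effect of the secondary correction factor.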

[0048] Finally, based on the calculated optimal thread allocation, the system generates a control instruction set for the hardware task dispatch engine. The control instruction set contains the starting physical address of the micro data block in video memory, the row read offset, and the corresponding optimal thread allocation. The task dispatch engine sends the control instruction set to the underlying physical computing units, and the allocated thread bundles execute the sparse activation matrix multiply-add operations of the large language model in parallel, strictly according to the instruction parameters, thereby completing the inference step for the current layer.
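The three fields of a control instruction can be pictured with a small data structure. This is only a sketch: the class name, field names, and builder function are hypothetical, and a real dispatch engine would consume a hardware-specific descriptor rather than a Python object.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ControlInstruction:
    """One entry of the control instruction set (field names are hypothetical)."""
    start_address: int   # starting physical address of the micro data block
    read_offset: int     # read offset within the block's stored data
    thread_count: int    # optimal thread allocation for this block


def build_control_instructions(
        blocks: Iterable[Tuple[int, int, int]]) -> List[ControlInstruction]:
    """Build one instruction per (start_address, read_offset, thread_count) tuple."""
    instructions = []
    for start_address, read_offset, thread_count in blocks:
        if thread_count < 0:
            raise ValueError("thread_count must be non-negative")
        instructions.append(
            ControlInstruction(start_address, read_offset, thread_count))
    return instructions


instruction_set = build_control_instructions([(0x1000, 0, 184), (0x2000, 64, 32)])
```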

[0049] For example, Figure 2 compares the present invention with the traditional solution in terms of thread allocation. The horizontal axis represents the micro data block number, and the vertical axis represents the number of parallel threads allocated. The figure shows the bars for the traditional static allocation strategy, the bars for the adaptive allocation strategy of the present invention, and a baseline representing the maximum hardware thread quota. As the figure shows, the traditional static allocation strategy assigns threads only linearly according to the number of non-zero elements; its bars frequently touch or even attempt to exceed the maximum hardware quota baseline, exhibiting blind over-allocation. In contrast, the bars for the adaptive allocation strategy of the present invention scale dynamically. In particular, for highly scattered micro data blocks under scarce system resources, the bar values shrink significantly, and they remain strictly below the maximum hardware thread quota baseline in all cases. This indicates that the present invention implements an adaptive flow truncation mechanism that limits over-allocation based on the dynamic scheduling impedance.

[0050] For example, Figure 3 compares the present invention with the conventional solution in terms of suppressing pipeline stall cost. The horizontal axis represents the micro data block number, and the vertical axis represents the estimated pipeline stall cost on a logarithmic scale. The figure shows a curve representing memory access congestion in the conventional solution and a curve representing dynamic congestion control in the present invention. As the figure shows, the curve for the conventional solution spikes sharply at multiple micro data blocks, reflecting the bus bottleneck and severe cache misses triggered when many threads concurrently access large amounts of scattered data. In contrast, the curve for the present invention is highly robust: regardless of how the dispersion of the micro data blocks fluctuates, it stays within the low range at the bottom of the chart. This indicates that the present invention, by cutting off the indiscriminate injection of threads into scattered data, leads a small number of threads to complete coalesced memory accesses in contiguous batches, which eliminates at its root the thread divergence that arises under single-instruction multiple-data execution and alleviates the severe pipeline stalls caused by unbalanced load.

[0051] The present invention also discloses a computing and storage system for sparse inference of large language models, including a processor and a memory. The memory stores computer program instructions which, when executed by the processor, implement the computing and storage method for sparse inference of large language models according to the present invention.

[0052] The system also includes other components well known to those skilled in the art, such as communication buses and communication interfaces, the settings and functions of which are known in the art and will not be described in detail here.

Claims

1. A computing and storage method for sparse inference of large language models, characterized in that it comprises: obtaining sparse activation matrix data of a large language model; dividing the sparse activation matrix data into a grid according to a preset block step size parameter to obtain multiple micro data blocks; extracting the spatial span, in physical memory address, between adjacent non-zero elements within each micro data block, and obtaining the non-zero aggregation index of the micro data block based on the spatial span; obtaining the amount of idle resources of the target thread bundle in the target computing core, and obtaining the theoretical computing resource requirement corresponding to the micro data block; based on the ratio of the theoretical computing resource requirement to the amount of idle resources of the thread bundle, and on the non-zero aggregation index of the micro data block, obtaining the dynamic scheduling impedance when the micro data block is mapped to the target thread bundle; scaling the maximum thread quota of the underlying hardware based on the dynamic scheduling impedance, and performing a secondary correction according to the non-zero aggregation index, to obtain the optimal thread allocation; and, based on the optimal thread allocation, generating a control instruction set for the hardware task dispatch engine so that the underlying physical computing units perform the sparse activation matrix multiply-add operations of the large language model.

2. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that the step of dividing the sparse activation matrix data into multiple micro data blocks according to the preset block step size parameter, and extracting the spatial span in physical memory address between adjacent non-zero elements within each micro data block, comprises: reading, from the physical memory space, the sparse activation matrix data stored in compressed sparse row format, the sparse activation matrix data including a row pointer array, a column index array, and a non-zero value array; dividing the sparse activation matrix data into multiple micro data blocks in two-dimensional space according to the block step size parameter; and parsing the column index array corresponding to each micro data block, scanning the set of non-zero elements in the micro data block, and extracting the spatial span in physical memory address between each two adjacent non-zero elements.

3. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that the non-zero aggregation index of the micro data block, obtained based on the spatial span, satisfies the following expression: In the formula, the terms are, in order: the non-zero aggregation index of the micro data block; the total number of non-zero elements contained in the micro data block; the sorted index of the adjacent non-zero element pairs within the micro data block; the spatial span between the adjacent non-zero elements of the pair with that index; the exponential function with the natural constant as its base; and the base cache line step size of the underlying processor.

4. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that obtaining the amount of idle resources of the target thread bundle in the target computing core comprises: reading, in real time through the hardware performance counters of the underlying processor, the physical resource status of each thread bundle in the target computing core; and extracting the current amount of idle resources of the target thread bundle, where the amount of idle resources represents the number of idle arithmetic logic unit slots in the target computing core that are currently available for allocation to the target thread bundle.

5. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that obtaining the theoretical computing resource requirement corresponding to the micro data block comprises: extracting the total number of non-zero elements contained in the micro data block; obtaining the base clock cycles required for the hardware to perform a floating-point multiply-accumulate operation; combining the total number of non-zero elements with the base clock cycles to obtain the total time workload required to complete the multiply-add calculation of all valid elements in the micro data block; and introducing the throughput conversion rate of the underlying architecture to map the total time workload into a space-based hardware capacity requirement, thereby calculating the theoretical computing resource requirement.

6. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that the dynamic scheduling impedance when the micro data block is mapped to the target thread bundle, obtained based on the ratio of the theoretical computing resource requirement to the amount of idle resources of the thread bundle and on the non-zero aggregation index of the micro data block, satisfies the following expression: In the formula, the terms are, in order: the dynamic scheduling impedance when the micro data block is mapped to the target thread bundle; the non-zero aggregation index of the micro data block; the theoretical computing resource requirement of the micro data block; and the current amount of idle resources of the target thread bundle.

7. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that obtaining the maximum thread quota of the underlying hardware comprises: calling the application programming interface of the underlying computing platform to query the device properties; and extracting the maximum hardware thread quota that can be allocated to a single task block.

8. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that the optimal thread allocation, obtained by scaling the maximum thread quota of the underlying hardware based on the dynamic scheduling impedance and performing a secondary correction according to the non-zero aggregation index, satisfies the following expression: In the formula, the terms are, in order: the calculated optimal thread allocation; the floor function operator; the dynamic scheduling impedance of the micro data block; the maximum thread quota set by the underlying hardware; and the non-zero aggregation index of the micro data block.

9. The computing and storage method for sparse inference of large language models according to claim 1, characterized in that generating, based on the optimal thread allocation, a control instruction set for the hardware task dispatch engine so that the underlying physical computing units perform the sparse activation matrix multiply-add operations of the large language model comprises: generating the control instruction set for the hardware task dispatch engine based on the optimal thread allocation, the control instruction set including the starting physical address of the micro data block in video memory, the row read offset, and the corresponding optimal thread allocation; and sending, by the task dispatch engine, the control instruction set to the underlying physical computing units, where the allocated thread bundles execute the sparse activation matrix multiply-add operations of the large language model in parallel according to the instruction parameters to complete the inference step for the current layer.

10. A computing and storage system for sparse inference of large language models, characterized in that it comprises: a processor and a memory, wherein the memory stores computer program instructions that, when executed by the processor, implement the computing and storage method for sparse inference of large language models according to any one of claims 1-9.
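The span extraction recited in claim 2 can be illustrated with a minimal sketch over a matrix in compressed sparse row format. Several points are assumptions made for illustration only: the span is measured in column-index units between adjacent non-zero elements of the same row (a physical address span would additionally depend on the element size and storage layout), the block is square with side `block_step`, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def block_spans(row_ptr: Sequence[int],
                col_idx: Sequence[int],
                row_start: int,
                col_start: int,
                block_step: int) -> List[int]:
    """Return the gaps between adjacent non-zeros inside one micro data block.

    The block covers rows [row_start, row_start + block_step) and columns
    [col_start, col_start + block_step) of a CSR matrix.
    """
    spans: List[int] = []
    n_rows = len(row_ptr) - 1
    for row in range(row_start, min(row_start + block_step, n_rows)):
        # Column indices of this row's non-zeros that fall inside the block.
        cols = sorted(c for c in col_idx[row_ptr[row]:row_ptr[row + 1]]
                      if col_start <= c < col_start + block_step)
        # Span between each pair of adjacent non-zeros in the row.
        spans.extend(b - a for a, b in zip(cols, cols[1:]))
    return spans


# 4x4 example: row 0 has non-zeros at columns 0 and 3, row 1 at column 1,
# row 2 at columns 0, 1, 2, and row 3 is empty.
row_ptr = [0, 2, 3, 6, 6]
col_idx = [0, 3, 1, 0, 1, 2]
whole_matrix_spans = block_spans(row_ptr, col_idx, 0, 0, 4)
```

Smaller spans correspond to more tightly clustered non-zero elements, which is the property the non-zero aggregation index is described as capturing.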