Instruction-level parallel optimization method for micro-matrix operations based on heterogeneous many-core processor systems
By optimizing the computation of tiny matrices with an SoA layout and register-level instruction refactoring on heterogeneous many-core processor systems, the method addresses insufficient instruction-level parallelism, coalesces memory accesses, and keeps the computation pipeline well utilized, thereby improving computational efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANDONG COMP SCI CENTNAT SUPERCOMP CENT IN JINAN
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
In heterogeneous many-core processor systems, the computation of small matrices suffers from insufficient instruction-level parallelism, resulting in low computational efficiency. In particular, the overhead of numerous loop control and boundary check instructions is too high, which fails to fully leverage the parallel computing advantages of heterogeneous many-core processors. Furthermore, traditional methods lead to discontinuous memory access, resulting in wasted bandwidth.
A linear address mapping mechanism reorganizes the correspondence between physical memory addresses and the GPU thread index space into an SoA layout. Combined with a direct register loading mechanism based on explicitly defined local arrays, explicit static reconstruction of register-level instructions transforms tiny matrix operations into hardware-optimal instruction sequences, achieving coalesced data access and maximized instruction concurrency.
It improves program execution efficiency, enhances data access continuity, improves memory bandwidth utilization, eliminates bottlenecks in memory access and computation pipelines, and improves overall computation efficiency by approximately 20%.
Figure CN122308923A_ABST
Description
Technical Field
[0001] This invention belongs to the technical field of electronic information, and specifically relates to an instruction-level parallel optimization method for micro-matrix operations based on heterogeneous many-core processor systems.
Background Technology
[0002] With the advent of the information technology and big data era, computing demands have surged dramatically, making it difficult for single-type processors to meet performance requirements. This has led to the rapid rise of heterogeneous many-core processor systems. In this process, general-purpose processors (CPUs) and graphics processing units (GPUs) have undergone profound architectural evolution: CPUs have evolved from single-core to multi-core, and then to many-core, establishing the crucial role of multi-threaded parallel processing in improving system performance. Simultaneously, the application boundaries of GPUs have greatly expanded. Leveraging their superior floating-point computing capabilities, they have transcended the traditional realm of graphics rendering, becoming the core of computing power in fields such as scientific computing and deep learning. To adapt to the paradigm shift from serial computing to large-scale parallel computing, programming languages and supporting development tools for GPUs, such as CUDA and OpenCL, have emerged. In this heterogeneous architecture, the CPU, as the host, focuses on task scheduling and complex logic control, while the GPU, as a co-processor, undertakes large-scale data parallel computing. This collaborative model based on "multi-core CPU + many-core GPU" not only effectively breaks through the computing power ceiling of a single architecture but also achieves optimal resource allocation between power consumption and efficiency.
[0003] However, existing linear algebra libraries for heterogeneous many-core processors such as NVIDIA GPUs (e.g., cuBLAS, cuSOLVER) are primarily optimized for large-scale dense matrices. When these general-purpose libraries are applied to workloads consisting of massive numbers of small matrices, each library call requires CPU-GPU interaction and a kernel launch. For millions of tiny operations, the launch time can far exceed the actual computation time, which makes this approach unsuitable. Furthermore, conventional methods compute small matrices with loops that contain numerous loop control and boundary check instructions. These auxiliary instructions account for a disproportionately high share of the total instruction count and hinder the full exploitation of instruction-level parallelism (ILP). This not only fails to leverage the parallel computing advantages of heterogeneous many-core processors but also severely restricts overall computational efficiency.
[0004] In heterogeneous many-core processor systems, hotspot functions are identified and their parallelizable portions are migrated to the GPU, where they are accelerated using the CUDA parallel computing framework. During this optimization, however, it was found that the ported code contains a large number of tiny matrix calculations with extremely short core computation loops. In the compiled GPU instructions, the number of loop control instructions (such as conditional comparisons, branch jumps, and index increments) may be comparable to or even greater than the number of actual floating-point instructions (such as multiply-accumulate). This leads to insufficient instruction-level parallelism and to two prominent problems: first, a large number of "bubbles" appear in the instruction stream, so the instruction pipeline cannot be kept full; second, the long-latency, deeply pipelined operations that GPUs optimize for high-performance computing are hard to exploit, and in particular the throughput potential of the fused multiply-add (FMA) pipeline cannot be fully realized. In addition, traditional physics simulation code often uses an array-of-structures (AoS) layout. Under the GPU's SIMT (Single Instruction, Multiple Threads) architecture, warps then perform uncoalesced accesses to global memory, which wastes a large share of the memory bandwidth.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides an instruction-level parallel optimization method for micro-matrix operations based on heterogeneous many-core processor systems. The invention proposes a collaborative optimization method in three parts. First, it proposes a linear isomorphic mapping between physical memory addresses and the GPU thread index space, reorganizing the array-of-structures (AoS) layout commonly used in traditional physics simulations into a structure-of-arrays (SoA) layout. By redefining the order in which data is arranged in GPU memory, it establishes an addressing rule with the particle index as the primary dimension, so that the same matrix element of different particles is stored at contiguous physical addresses, which yields coalesced memory access. Second, building on this efficient data loading, it explicitly defines local variables so that data is loaded directly from GPU global memory into the private registers of each GPU thread, establishing a register-level cache within the thread. Finally, in this zero-latency data environment, it performs static reconstruction of register-level instructions based on orthogonal computation flows. By transforming tiny matrix operations into explicit fused multiply-add recursive formulas, it exploits the data orthogonality between different computation paths to maximize concurrent instruction issue. In summary, this invention achieves an end-to-end performance improvement from memory access to computation through a coherent optimization path of "linear address mapping → direct register loading → static instruction reconstruction."
[0006] In heterogeneous many-core processor systems, tiny matrices are typically calculated using nested loops, which contain numerous loop control and boundary check instructions, hindering instruction-level parallelism. This invention addresses this issue by optimizing the process, effectively improving program efficiency. Furthermore, this invention includes an automated interface, AIIP (Address Insufficient Instruction-Level Parallelism), for convenient direct invocation by programmers.
[0007] The technical solution of this invention is as follows: an instruction-level parallel optimization method for micro-matrix operations based on heterogeneous many-core processor systems includes:
Step 1: Particle-first linear address mapping architecture; the multidimensional tensor space of the physics problem is mapped to the one-dimensional linear physical address space of the GPU.
Step 2: Construct a direct register loading mechanism based on explicitly defined local arrays; building on the contiguous address layout, a fixed-size local array is explicitly defined in the kernel, and the compiler's register allocation mechanism is used to build a direct data path from global memory to registers.
Step 3: Register-level instruction static reconstruction based on orthogonal computation flows; an explicit code refactoring strategy is employed to transform tiny tensor operations into hardware-optimal instruction sequences.
[0008] According to a preferred embodiment of the present invention, the specific implementation process of step 1 includes: the multidimensional tensor space $v \in \mathbb{R}^{N \times R \times C}$ of the physics problem is mapped to the one-dimensional linear physical address space $A$ of the GPU by a mapping function $\Phi: (p, r, c) \to A$, where $p$ is the particle index, $r$ is the row index, and $c$ is the column index; $N$ is the total number of particles and serves as the stride multiplier between rows and columns; $R$ is the number of rows of the matrix and $C$ is the number of columns.
Let the index vector be $i = [p, r, c]^{\top}$. The mapping function $\Phi_{SoA}$ corresponding to the SoA data layout is defined as:
$$\Phi_{SoA}(i) = Base + \left( \sum_{k=1}^{3} i_k \cdot S_k \right) \times sizeof(Type)$$
where Base is the base address of the data in global memory and sizeof(Type) is the byte width of the data type; $i_k$ is the coordinate value in the k-th dimension, with $i_1 = p$, $i_2 = r$, $i_3 = c$; $S_k$ is the stride of the k-th dimension, i.e. the number of elements skipped in the linear address space when the k-th coordinate is increased by 1; $k = 1, 2, 3$ indexes the three components of the index vector $i = [p, r, c]^{\top}$.
The corresponding stride vector $S$ is defined as:
$$S = [S_1, S_2, S_3] = [1, N, N \times R]$$
where $S_1 = 1$ is the unit stride, $S_2 = N$ is the row stride, and $S_3 = N \times R$ is the column stride. The expanded linear address of the SoA layout is therefore:
$$\Phi_{SoA}(p, r, c) = Base + (p + r \times N + c \times N \times R) \times sizeof(Type)$$
[0009] According to a preferred embodiment of the present invention, step 2 includes the following specific implementation process:
Step 2.1: Explicitly define a local array; in the GPU kernel, a fixed-size local array is defined for each thread, and this local array is allocated directly to registers.
Step 2.2: Construct a direct data path; the compiler's register allocation mechanism is used to load data directly from global memory into registers, including:
Step 2.2.1: Loop unrolling and index constantization; the loop logic that accesses the local array is physically unrolled, so that array element accesses change from dynamic variable indices to fixed register operands.
Step 2.2.2: Explicit local variable mapping; the compiler generates direct memory access instructions based on the unrolled constant indices. The destination operand of these instructions points directly to a pre-allocated physical register, so data read from global memory is delivered straight to the input register of the execution unit without passing through shared memory or the L1 cache.
[0010] According to a preferred embodiment of the present invention, step 3 specifically includes the following steps:
Step 3.1: Reconstruct the recursive formula based on the FMA operator; iterative matrix multiplication is transformed into an explicit sequence of linear operators, and the calculation of each output element is reconstructed into a set of nested double-precision fused multiply-add recursive formulas. Let the target computation be $y = A x$. For any component $y_i$ of the output vector, define the k-th level cumulative state operator $S_i^{(k)}$:
$$S_i^{(k)} = DFMA\left(A_{i,k},\ x_k,\ S_i^{(k-1)}\right), \quad k = 1, 2, \ldots, C$$
where $S_i^{(0)} = 0$, $C$ is the number of columns of the matrix, and DFMA(a, b, c) corresponds to the GPU hardware instruction a×b + c, so that $y_i = S_i^{(C)}$. $A_{i,k}$ is the element in the i-th row and k-th column of the input matrix $A$; these elements are stored in global memory in the SoA layout and loaded into registers to take part in the calculation. $x_k$ is the k-th component of the input vector $x$; it is stored according to the SoA mapping function, so that warps can read it with coalesced accesses, and is then loaded through the direct register mechanism. $A_{i,k}$ and $x_k$ are delivered directly to the thread's private registers. Finally, the computation unit reads these two values from registers and completes the iterative accumulation of $y = A x$ with the DFMA instruction.
Step 3.2: Maximize instruction issue based on data orthogonality; the row independence of tiny matrix operations is used to construct instruction blocks free of data hazards. By mathematically proving the orthogonality of the different computation paths, the GPU's instruction scheduler is driven to issue multiple instructions concurrently, filling the pipeline slots of the execution units. For different components $y_i$ and $y_j$ of the output vector, $i \neq j$, the computation paths $S_i$ and $S_j$ satisfy the data-dependency orthogonality condition [formula not reproduced in the source text].
[0011] An automated interface AIIP is provided to implement the aforementioned micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems.
[0012] A computer device includes a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described micro-matrix operation instruction-level parallel optimization method based on a heterogeneous many-core processor system.
[0013] A computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the above-described micro-matrix operation instruction-level parallel optimization method based on a heterogeneous many-core processor system.
[0014] The beneficial effects of this invention are as follows:
1. Deep alignment of data layout with hardware characteristics: the SoA mapping function solves the scattered memory access problem of large numbers of fine-grained matrices in parallel computing and achieves coalesced memory access.
2. Data transfer path compression: the direct register loading mechanism bypasses the shared memory level and uses the high bandwidth of registers to eliminate the performance bottleneck on the memory access side.
3. Instruction-level refactoring of computational logic: by refactoring complex tensor operations into recursive instruction chains based on FMA operators, software-level loop control overhead is eliminated and the problem of excessive pipeline bubbles is solved.
4. Improved overall program efficiency: in tests with particle counts of different scales, the method of this invention improves efficiency by about 20% compared with conventional optimization, demonstrating its advantage in addressing insufficient instruction-level parallelism.
Description of the Drawings
[0015] Figure 1 is a diagram illustrating GPU memory access modes. Figure 2 is a diagram illustrating GPU direct register loading. Figure 3 is a diagram comparing instruction pipeline efficiency. Figure 4 is a data flow path diagram.
Detailed Implementation
[0016] The present invention will be further defined below with reference to the accompanying drawings and embodiments, but is not limited thereto.
[0017] Terminology Explanation: SoA is an abbreviation for Structure of Arrays. It is a layout method for storing data in memory, in contrast to the traditional AoS (Array of Structures). Its core feature is to physically separate the attribute fields of a composite object and arrange data elements with the same attribute consecutively in memory space.
[0018] Embodiment 1: An instruction-level parallel optimization method for micro-matrix operations based on heterogeneous many-core processor systems includes:
Step 1: Particle-first linear address mapping architecture; the multidimensional tensor space of the physics problem is mapped to the one-dimensional linear physical address space of the GPU.
Step 2: Construct a direct register loading mechanism based on explicitly defined local arrays; GPU optimization typically uses shared memory as a data cache. However, frequent access to global or shared memory can stall the pipeline and easily leads to bank conflicts and synchronization overhead. To further improve performance, frequent direct reads and writes to global memory inside the kernel should be avoided as far as possible. Instead, local variables are explicitly defined and a thread-private cache is established in registers to reduce access latency and improve execution efficiency.
[0019] Building on the contiguous address layout, fixed-size local arrays are explicitly defined in the kernel, and the compiler's register allocation mechanism is used to construct a direct data path from global memory to registers. This path avoids shared memory as an intermediary and therefore the bank conflicts that can arise when multiple threads access shared memory. It also exploits registers, the fastest and highest-bandwidth storage level, so that data resides in that level with zero overhead, creating a low-latency, high-bandwidth data environment for subsequent computations. GPU direct register loading is shown in Figure 2, which takes 4 threads as an example; R0, R1, R2, and R3 denote registers.
[0020] Step 3: Register-level instruction static reconfiguration method based on orthogonal computation flow; While general-purpose compilers can perform loop unrolling, they cannot predict the constancy of matrix dimensions in the physical model, nor can they guarantee strict adherence to orthogonal scheduling policies after unrolling. This method does not rely on the compiler's default optimization options but instead employs an explicit code refactoring strategy to transform tiny tensor operations into hardware-optimal instruction sequences.
[0021] Embodiment 2: This embodiment differs from the method described in Embodiment 1 as follows. The specific implementation process of step 1 includes: the multidimensional tensor space $v \in \mathbb{R}^{N \times R \times C}$ of the physics problem is mapped to the one-dimensional linear physical address space $A$ of the GPU by a mapping function $\Phi: (p, r, c) \to A$, where $p$ is the particle index, $r$ is the row index, and $c$ is the column index; $N$ is the total number of particles and serves as the stride multiplier between rows and columns; $R$ is the number of rows of the matrix and $C$ is the number of columns. In the existing AoS approach, the mapping function $\Phi_{AoS}$ is usually defined as [formula not reproduced in the source text], where Base is the base address of the data in global memory, $R$ is the number of rows of the matrix, $C$ is the number of columns, and sizeof(Type) is the byte width of the data type.
[0022] This mapping results in a stride of R×C elements between adjacent threads p and p+1, so the addresses they access are not contiguous. In addition, the original CPU code reuses the same small array across serial loop iterations and therefore lacks thread independence; parallelizing it directly would cause data races and overwriting. To parallelize the code, the particle index is uniformly placed in the first (primary) dimension, giving a layout specification of the form A(d_i_ptcl_loc, …), and all particle attributes and states are restructured as SoA. Since Fortran uses column-major storage, this layout ensures that when a GPU warp (typically 32 threads) accesses the same matrix element of different particles in parallel, those elements are contiguous in physical memory. This triggers the GPU's coalesced memory access mechanism, so the data is read with a minimal number of memory transactions. The change removes the memory bandwidth waste caused by uncoalesced access and provides a continuous, efficient data supply to the compute cores.
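The effect of the two layouts on one warp can be estimated with a small host-side model. The sketch below is only illustrative: it counts how many 128-byte segments a warp of 32 threads touches when every thread reads the same matrix element of its own particle. The 128-byte transaction granularity, the zero base address, and the row-major order inside each AoS matrix are assumptions made for this example, not details taken from the source.

```python
# Illustrative host-side model, not GPU code.
WARP_SIZE = 32        # threads per warp (the text gives 32 as typical)
SEGMENT_BYTES = 128   # assumed memory-transaction granularity
ELEM_BYTES = 8        # double precision


def aos_offset(p, r, c, R, C):
    """Element offset in an AoS layout: each particle's R x C matrix is contiguous.
    The row-major order inside one matrix is an assumption."""
    return p * R * C + r * C + c


def soa_offset(p, r, c, N, R):
    """Element offset in the particle-first SoA layout (zero-based indices)."""
    return p + r * N + c * N * R


def segments_touched(offsets):
    """Number of distinct 128-byte segments covered by a list of element offsets."""
    return len({(off * ELEM_BYTES) // SEGMENT_BYTES for off in offsets})


N, R, C = 1024, 4, 4
r, c = 1, 2  # every thread of the warp reads the same matrix element
aos = [aos_offset(p, r, c, R, C) for p in range(WARP_SIZE)]
soa = [soa_offset(p, r, c, N, R) for p in range(WARP_SIZE)]
print(segments_touched(aos), segments_touched(soa))  # prints: 32 2
```

Under these assumptions the AoS layout needs one segment per thread, while the SoA layout needs only the two segments that hold 32 consecutive doubles.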
[0023] Let the index vector be $i = [p, r, c]^{\top}$. The mapping function $\Phi_{SoA}$ corresponding to the SoA data layout, i.e. the mapping function $\Phi: (p, r, c) \to A$, is defined as follows: the logical three-dimensional coordinates (p, r, c) are converted into a unique one-dimensional linear offset in memory, which is then combined with the base address Base to obtain the final physical address $A$.
[0024] $$\Phi_{SoA}(i) = Base + \left( \sum_{k=1}^{3} i_k \cdot S_k \right) \times sizeof(Type)$$
where Base is the base address of the data in global memory and sizeof(Type) is the byte width of the data type; $i_k$ is the coordinate value in the k-th dimension, with $i_1 = p$, $i_2 = r$, $i_3 = c$; $S_k$ is the stride of the k-th dimension, i.e. the number of elements skipped in the linear address space when the k-th coordinate is increased by 1; $k = 1, 2, 3$ indexes the three components of the index vector $i = [p, r, c]^{\top}$.
The corresponding stride vector $S$ is defined as:
$$S = [S_1, S_2, S_3] = [1, N, N \times R]$$
where $S_1 = 1$ is the unit stride, meaning that the same component of adjacent particles (p and p+1) is strictly contiguous in memory; $S_2 = N$ is the row stride, i.e. the N particles that must be traversed to skip one row of data; and $S_3 = N \times R$ is the column stride, since skipping one column of data requires traversing N particles and all R of their rows. The expanded linear address of the SoA layout is therefore:
$$\Phi_{SoA}(p, r, c) = Base + (p + r \times N + c \times N \times R) \times sizeof(Type)$$
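The SoA mapping described above can be written as a short host-side function. This is a minimal sketch under stated assumptions: indices are zero-based, the function names `soa_strides` and `phi_soa` are chosen for illustration, and the defaults `base=0` and `elem_bytes=8` (double precision) are not taken from the source.

```python
def soa_strides(N, R):
    """Stride vector S = (S1, S2, S3) in elements: particle, row, column."""
    return (1, N, N * R)


def phi_soa(p, r, c, N, R, base=0, elem_bytes=8):
    """Byte address of element (r, c) of particle p in the particle-first SoA layout.

    base and elem_bytes stand in for Base and sizeof(Type); both defaults are
    illustrative assumptions.
    """
    s1, s2, s3 = soa_strides(N, R)
    return base + (p * s1 + r * s2 + c * s3) * elem_bytes


# Adjacent particles are one element apart; rows are N apart; columns N*R apart.
N, R = 1000, 4
step_particle = phi_soa(1, 0, 0, N, R) - phi_soa(0, 0, 0, N, R)
step_row = phi_soa(0, 1, 0, N, R) - phi_soa(0, 0, 0, N, R)
step_col = phi_soa(0, 0, 1, N, R) - phi_soa(0, 0, 0, N, R)
print(step_particle, step_row, step_col)
```

Because the particle index carries the unit stride, consecutive threads (one per particle) read consecutive elements, which is the property the coalescing argument relies on.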
[0025] GPU memory access modes are shown in Figure 1, where T0, T1, T2, and T3 denote GPU threads (T0 being the first thread, and so on); P0, P1, P2, and P3 denote the data of different particles, for example P0_Dat is the data of particle 0; Row is the row coordinate of the particle in the grid, Col is the column coordinate, and Particle denotes the particle attributes.
[0026] The specific implementation process of step 2 includes: The register direct loading mechanism based on explicitly defined local arrays is an advanced optimization technique that bypasses GPU shared memory and directly loads data from global video memory into thread-private registers.
[0027] Step 2.1: Explicitly define a local array; In the GPU kernel, a fixed-size local array is defined for each thread; since its size is known at compile time, the compiler tends to allocate the local array directly to a register rather than placing it in slower local memory. Step 2.2: Construct a direct data pathway; By utilizing the compiler's register allocation mechanism, direct loading from global video memory into registers can be achieved; including: Step 2.2.1: Loop Unrolling and Index Constantization; The loop logic involving local array access is physically unrolled, and the array element access after physical unrolling is transformed from dynamic variable indexes to fixed register operands; thereby eliminating the overhead of calculating array offsets at runtime.
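Loop unrolling and index constantization can be illustrated away from the GPU by generating the straight-line form explicitly. The sketch below is an emulation only: Python has no registers, so plain local variables stand in for them, and the code generator stands in for the unrolling that the source attributes to the compiler. The names `loop_load` and `make_unrolled_load` are hypothetical.

```python
def loop_load(g, p, N, R, C):
    """Reference version: dynamic indices, offsets computed inside the loops."""
    local = [0.0] * (R * C)
    for c in range(C):
        for r in range(R):
            local[r + c * R] = g[p + r * N + c * N * R]
    return local


def make_unrolled_load(R, C):
    """Emit and compile a straight-line loader whose index coefficients are literals.

    Since p + r*N + c*N*R == p + (r + c*R)*N, each load needs only one constant.
    Variable names a<r><c> are unique as long as R and C are at most 10.
    """
    names = [f"a{r}{c}" for c in range(C) for r in range(R)]
    body = [f"    a{r}{c} = g[p + {r + c * R} * N]"
            for c in range(C) for r in range(R)]
    src = ("def load(g, p, N):\n" + "\n".join(body)
           + "\n    return [" + ", ".join(names) + "]\n")
    scope = {}
    exec(src, scope)  # compile the generated straight-line function
    return scope["load"], src


load_4x4, source_4x4 = make_unrolled_load(4, 4)
print(source_4x4)
```

The generated source contains no loop and no index variable, only sixteen loads with fixed coefficients, which is the shape that lets a compiler keep a small local array in registers.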
[0028] Step 2.2.2: Explicit local variable mapping; the compiler generates direct memory access instructions based on the unrolled constant indices. The destination operand of these instructions points directly to a pre-allocated physical register, so data read from global memory is delivered straight to the input register of the execution unit without passing through shared memory or the L1 cache. For example, suppose 1000 particles are processed, each with a 4×4 matrix, and thread 5 needs to load the element in row 1, column 2 of its matrix (r = 1, c = 2). The mapping function $\Phi_{SoA}$ gives an element offset of 9005. The direct load is then executed: thread 5 issues an instruction that reads the data at offset 9005 in global memory and stores it in its private register variable.
[0029] Eliminating intermediate-layer overhead: because data no longer passes through shared memory, the bank conflicts that may occur when multiple threads access shared memory are avoided and thread synchronization waits are reduced. Exploiting the top of the storage hierarchy: registers are the fastest and highest-bandwidth storage level in a GPU. Through this mechanism, data achieves "zero-overhead residency" in the fastest storage level, creating a low-latency environment for subsequent computations.
[0030] First, the mapping function $\Phi_{SoA}$ converts the multidimensional tensor data to an SoA layout, which ensures address contiguity in the global memory space. Only under an SoA layout is the GPU's coalesced access mechanism triggered when data is loaded directly from global memory into registers. If an AoS layout is used, direct loading leads to numerous uncoalesced accesses, negating the performance advantage of registers. Then, combined with the explicitly defined local arrays in the kernel function, the compiler maps data from contiguous global addresses directly to thread-private registers, avoiding bank conflicts in shared memory and moving data directly from high-latency global memory to the zero-latency register level.
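The offset quoted in this example can be checked by hand from the SoA strides 1, N, and N×R. The working below assumes N = 1000, R = 4, and zero-based indices p = 5, r = 1, c = 2, which is the reading that reproduces the stated value:

```latex
\begin{aligned}
\text{offset} &= p \cdot 1 + r \cdot N + c \cdot N \cdot R \\
              &= 5 + 1 \cdot 1000 + 2 \cdot 1000 \cdot 4 \\
              &= 5 + 1000 + 8000 = 9005
\end{aligned}
```

The result is an element offset; the byte address follows by multiplying by sizeof(Type) and adding Base.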
[0031] The specific implementation process of step 3 includes:
Step 3.1: Reconstruct the recursive formula based on the FMA operator; iterative matrix multiplication is transformed into an explicit sequence of linear operators. The conventional approach to matrix multiplication uses summation logic based on row and column traversal. However, that formulation carries implicit temporal dependencies in its mathematical expression and handles small matrices poorly. To address this, this method reconstructs the computation of each output element into a set of nested double-precision fused multiply-add (DFMA) recursive formulas. Let the target computation be $y = A x$. For any component $y_i$ of the output vector, define the k-th level cumulative state operator $S_i^{(k)}$:
$$S_i^{(k)} = DFMA\left(A_{i,k},\ x_k,\ S_i^{(k-1)}\right), \quad k = 1, 2, \ldots, C$$
where $S_i^{(0)} = 0$, $C$ is the number of columns of the matrix, and DFMA(a, b, c) corresponds to the GPU hardware instruction a×b + c, so that $y_i = S_i^{(C)}$. $A_{i,k}$ is the element in the i-th row and k-th column of the input matrix $A$; these elements are stored in global memory in the SoA layout and loaded into registers to take part in the calculation. $x_k$ is the k-th component of the input vector $x$; it is stored according to the SoA mapping function, so that warps can read it with coalesced accesses, and is then loaded through the direct register mechanism. $A_{i,k}$ and $x_k$ are delivered directly to the thread's private registers. Finally, the computation unit reads these two values from registers and completes the iterative accumulation of $y = A x$ with the DFMA instruction. This refactoring fixes the computation as a pure arithmetic instruction chain of depth C and eliminates all register overhead caused by loop index variables (such as i and j).
[0032] Conventional code often focuses only on the correctness of the result. This method instead expresses the calculation as a recursive operator and, by exploiting the independence of that operator across outputs, deliberately designs the instruction issue sequence, thereby solving the core problem of excessive pipeline bubbles when heterogeneous many-core processor systems process small tasks. Figure 3 compares the instruction pipeline efficiency of the traditional loop approach with that of the FMA recursive refactoring, taking a loop of 4 iterations and a 4×4 matrix A as an example.
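The recursion for the cumulative state operator can be emulated on the host to show the shape of the refactored code. The sketch below is an illustration under stated assumptions: `dfma` is a plain Python stand-in for the hardware instruction (it rounds twice, whereas a fused instruction rounds once), indices are zero-based, and the function name `matvec_4x4_unrolled` is hypothetical.

```python
def dfma(a, b, c):
    """Stand-in for the DFMA instruction: returns a*b + c.
    Unlike a hardware FMA, this Python expression rounds after the multiply."""
    return a * b + c


def matvec_4x4_unrolled(A, x):
    """y = A*x for a 4x4 matrix, written as four explicit DFMA chains of depth 4.

    There is no loop and no index variable: each chain starts from 0 and
    feeds its own previous state into the next DFMA.
    """
    y0 = dfma(A[0][3], x[3], dfma(A[0][2], x[2], dfma(A[0][1], x[1], dfma(A[0][0], x[0], 0.0))))
    y1 = dfma(A[1][3], x[3], dfma(A[1][2], x[2], dfma(A[1][1], x[1], dfma(A[1][0], x[0], 0.0))))
    y2 = dfma(A[2][3], x[3], dfma(A[2][2], x[2], dfma(A[2][1], x[1], dfma(A[2][0], x[0], 0.0))))
    y3 = dfma(A[3][3], x[3], dfma(A[3][2], x[2], dfma(A[3][1], x[1], dfma(A[3][0], x[0], 0.0))))
    return [y0, y1, y2, y3]


A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
x = [1.0, 0.0, 2.0, 1.0]
print(matvec_4x4_unrolled(A, x))  # prints: [11.0, 27.0, 43.0, 59.0]
```

Each of the four lines is one chain of depth C = 4, and no line reads a value produced by another, which is the property the scheduling step relies on.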
[0033] Step 3.2: Maximize instruction issue based on data orthogonality; the row independence of tiny matrix operations is used to construct instruction blocks free of data hazards. By mathematically proving the orthogonality of the different computation paths, the GPU's instruction scheduler is driven to issue multiple instructions concurrently, filling the pipeline slots of the execution units. For different components $y_i$ and $y_j$ of the output vector, $i \neq j$, the computation paths $S_i$ and $S_j$ satisfy the data-dependency orthogonality condition [formula not reproduced in the source text].
[0034] This formula shows that no intermediate state in the computation of $y_i$ depends on $y_j$, so the GPU can issue both into the pipeline at the same time instead of finishing one before starting the next. Based on this property, the method interleaves the R independent S sequences: when $S_1^{(k)}$ is waiting for global memory data or incurring execution latency, the hardware scheduler immediately issues the instructions of $S_2^{(k)}$.
[0035] Combining steps 1 and 2, the data flow path is shown in Figure 4. As Figure 4 shows, there are no L1 cache or shared memory reads or writes anywhere in the process, so the intermediate storage overhead is zero.
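The interleaving of independent chains can be sketched as an issue order on the host. This is a toy model, not a description of a real GPU scheduler: each entry (i, k) stands for the instruction that advances chain i by one DFMA, indices are zero-based, and the function names are hypothetical. The model measures how many issue slots separate two dependent instructions of the same chain.

```python
def interleaved_schedule(R, C):
    """Issue order that round-robins over the R independent chains."""
    return [(i, k) for k in range(C) for i in range(R)]


def sequential_schedule(R, C):
    """Issue order of a row-by-row loop: one chain is finished before the next starts."""
    return [(i, k) for i in range(R) for k in range(C)]


def run_schedule(A, x, schedule):
    """Execute the instructions in order; each reads and writes only its own accumulator."""
    acc = [0.0] * len(A)
    for i, k in schedule:
        acc[i] = A[i][k] * x[k] + acc[i]
    return acc


def min_reuse_distance(schedule):
    """Smallest number of issue slots between two instructions of the same chain."""
    last_slot = {}
    best = None
    for slot, (i, _) in enumerate(schedule):
        if i in last_slot:
            gap = slot - last_slot[i]
            best = gap if best is None else min(best, gap)
        last_slot[i] = slot
    return best


print(min_reuse_distance(sequential_schedule(4, 4)),
      min_reuse_distance(interleaved_schedule(4, 4)))  # prints: 1 4
```

In the sequential order every instruction waits on the one issued immediately before it, while in the interleaved order three independent instructions sit between any two dependent ones, and both orders produce the same result because the chains share no state.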
[0036] In summary, this invention addresses the bandwidth bottleneck of data transmission through linear address mapping, serving as the physical foundation; it overcomes the latency bottleneck at the memory level through direct register loading, acting as a connecting bridge; and finally, it solves the core problem of excessive control flow overhead in small-scale computations through static instruction refactoring, thereby improving the overall execution efficiency of the program. These three aspects work together to address the problem of insufficient instruction-level parallelism in general tensor computation methods on heterogeneous many-core processor systems.
[0037] This invention designs and implements the Automation Interface (AIIP), which aims to address the insufficient instruction-level parallelism in small matrix computations on heterogeneous many-core processor systems. The implementation of this interface reduces programming complexity, allowing programmers to directly call it to solve problems. Interface specifications are shown in Table 1.
[0038] Table 1 AIIP Interface Description
The following is a sample call to the AIIP automation interface:
    ! Call the AIIP interface on the host (CPU)
    program main
    use AIIP
    … ! Other operations omitted
    ! Initialize the interface
    call AIIP_Init()
    ! Parameter assignment
    N = 43063
    R = 4
    C = 4
    ! Call the interface functions
    call AIIP_Load_to_Registers(d_GlobalA, idx, reg_A)
    call AIIP_Static_Reconfiguration(reg_A, reg_x, reg_y)
    ! Close the interface
    call AIIP_Finalize()
    … ! Other operations omitted
    end program main
With the above method implemented, the invention was tested on a heterogeneous many-core processor system whose computing environment is based on an NVIDIA A100 GPU.
[0039] This test uses GYCAVA, a charged-particle orbit simulation program based on the guiding-center equations, as an example. With identical parameters, the run time of a CPU serial program, a conventionally optimized GPU program, and the program optimized by this invention was measured for particle counts of 5,382,875, 10,765,750, and 21,708,288. The CPU serial program executes instructions sequentially on a single CPU core, relying entirely on the CPU's logic control capabilities and caching mechanisms, without multi-core or many-core acceleration. The conventionally optimized GPU program is a version that has been ported to CUDA and uses multi-threaded parallelism, but its memory access pattern is still relatively traditional and has not been specifically adapted to the GPU hardware architecture (for example, coalesced access and the instruction pipeline). The invention-optimized program fully applies the methods described in this invention and the AIIP interface.
[0040] The experiment compares the running time and speedup of the CPU serial program, the generally optimized GPU program, and the invention-optimized program. The results show that the method described in this invention achieves a significant speedup over both the CPU serial program and the generally optimized GPU program. Specific results and comparisons are given in the tables below.
[0041] Table 2 shows the running time and speedup ratio of the test program before and after optimization for 5,382,875 particles.
Table 2. Comparison of speedup ratios for 5,382,875 particles
Table 3 shows the running time and speedup ratio of the test program before and after optimization for 10,765,750 particles.
Table 3. Comparison of speedup ratios for 10,765,750 particles
Table 4 shows the running time and speedup ratio of the test program before and after optimization for 21,708,288 particles.
Table 4. Comparison of speedup ratios for 21,708,288 particles
Example 3
An automated interface, AIIP (Address Insufficient Instruction-Level Parallelism), is used to implement the micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems described in Embodiment 1 or 2.
[0042] Example 4
A computer device includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps of the micro-matrix operation instruction-level parallel optimization method based on a heterogeneous many-core processor system as described in Embodiment 1 or 2.
[0043] Example 5
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the micro-matrix operation instruction-level parallel optimization method based on a heterogeneous many-core processor system as described in Embodiment 1 or 2.
Claims
1. A micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems, characterized in that it comprises:
Step 1: a particle-first linear address mapping architecture, in which the multidimensional tensor space of the physics problem is mapped to the one-dimensional linear physical address space of the GPU;
Step 2: constructing a register direct loading mechanism based on explicitly defined local arrays, in which, combined with the contiguous address layout, a fixed-size local array is explicitly defined in the kernel and the compiler's register allocation mechanism is used to build a direct data path from global video memory to registers;
Step 3: a register-level instruction static reconfiguration method based on orthogonal computation flow, in which an explicit code refactoring strategy transforms tiny tensor operations into hardware-optimal instruction sequences.
2. The micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems according to claim 1, characterized in that the specific implementation process of step 1 includes:
The multidimensional tensor space $v \in \mathbb{R}^{N \times R \times C}$ of the physics problem is mapped to the one-dimensional linear physical address space A of the GPU, and the mapping function is defined as $\Phi: (p, r, c) \to A$, where p is the particle index, r is the row index, and c is the column index; N is the total number of particles and serves as the span multiplier between rows and columns; R is the number of rows of the matrix and C is the number of columns of the matrix;
Let the index vector be $i = [p, r, c]^{\top}$. The mapping function $\Phi_{SoA}$ corresponding to the SoA data layout, that is, the mapping function $\Phi: (p, r, c) \to A$, is defined as:
$$\Phi_{SoA}(i) = Base + \left( \sum_{k=1}^{3} i_k \cdot S_k \right) \times sizeof(Type)$$
where Base is the base address of the data in global video memory and sizeof(Type) is the byte width of the data type; $i_k$ is the coordinate value in the k-th dimension, with $i_1 = p$, $i_2 = r$, $i_3 = c$; $S_k$ is the dimensional span, that is, the number of elements skipped in the linear address space when the k-th coordinate increases by 1; k is the dimension index, and k = 1, 2, 3 correspond to the three components of the index vector $i = [p, r, c]^{\top}$;
The corresponding span vector S is defined as:
$$S = [S_1, S_2, S_3]^{\top}$$
where $S_1$ is the unit span, $S_2$ is the row span, and $S_3$ is the column span;
Therefore, the expanded SoA linear address is:
$$\Phi_{SoA}(p, r, c) = Base + (p \cdot S_1 + r \cdot S_2 + c \cdot S_3) \times sizeof(Type)$$
3. The micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems according to claim 1, characterized in that the specific implementation process of step 2 includes:
Step 2.1: Explicitly define a local array. In the GPU kernel, a fixed-size local array is defined for each thread; this local array is allocated directly into registers.
Step 2.2: Construct a direct data path. Using the compiler's register allocation mechanism, data is loaded directly from global video memory into registers, including:
Step 2.2.1: Loop unrolling and index constantization. The loop logic that accesses the local array is physically unrolled, so that each array element access changes from a dynamic variable index to a fixed register operand;
Step 2.2.2: Explicit local variable mapping. The compiler generates direct memory access instructions from the unrolled constant indices. The destination operand of each such instruction points directly to a pre-allocated physical register, so that data read from global video memory is delivered directly to the input register of the instruction execution unit without going through shared memory or the L1 cache.
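The address mapping of claim 2 and the unrolled per-thread load of claim 3 can be illustrated with a small host-side emulation. This is only a sketch under stated assumptions, not the patented kernel: the span values S = (1, N·C, N) assume row-major element order (the source does not preserve the numeric span vector), the base address 0 and the helper names `soa_strides`, `soa_address`, `aos_address` and `load_to_registers` are hypothetical, and a Python tuple merely stands in for a thread's registers. N, R and C are taken from the sample call in the description.

```python
from itertools import product


def soa_strides(n, rows, cols):
    # ASSUMPTION: row-major element order, so S = (S1, S2, S3) = (1, N*C, N) in elements.
    return (1, n * cols, n)


def soa_address(base, index, strides, type_size):
    # Phi_SoA(i) = Base + (i1*S1 + i2*S2 + i3*S3) * sizeof(Type)
    return base + sum(i_k * s_k for i_k, s_k in zip(index, strides)) * type_size


def aos_address(base, index, rows, cols, type_size):
    # Conventional per-particle (array-of-structures) layout, for comparison only.
    p, r, c = index
    return base + ((p * rows + r) * cols + c) * type_size


def load_to_registers(flat, p, strides, rows, cols):
    # Emulates the unrolled load: the element offsets are constants fixed by R and C,
    # and only the particle index p varies from thread to thread.
    offsets = tuple(r * strides[1] + c * strides[2]
                    for r, c in product(range(rows), range(cols)))
    return tuple(flat[p * strides[0] + off] for off in offsets)


N, R, C = 43063, 4, 4          # values from the sample call
SIZEOF_DOUBLE = 8              # bytes per double-precision element
S = soa_strides(N, R, C)

# Addresses touched by 32 consecutive threads reading the same element (r=1, c=2).
soa = [soa_address(0, (p, 1, 2), S, SIZEOF_DOUBLE) for p in range(32)]
aos = [aos_address(0, (p, 1, 2), R, C, SIZEOF_DOUBLE) for p in range(32)]
soa_steps = {b - a for a, b in zip(soa, soa[1:])}   # byte stride between neighbours
aos_steps = {b - a for a, b in zip(aos, aos[1:])}

# Small store to check that the mapping fills every slot exactly once.
n_small = 5
s_small = soa_strides(n_small, R, C)
flat = [None] * (n_small * R * C)
for p, r, c in product(range(n_small), range(R), range(C)):
    flat[p * s_small[0] + r * s_small[1] + c * s_small[2]] = (p, r, c)
regs = load_to_registers(flat, 3, s_small, R, C)
```

Under these assumptions, neighbouring threads in the SoA layout read addresses one element apart, which is the access pattern a GPU can coalesce, whereas the per-particle layout spaces them R·C elements apart.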
4. The micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems according to claim 1, characterized in that the specific implementation process of step 3 includes:
Step 3.1: Reconstruct the recursive formula based on the FMA operator. The iterative matrix multiplication is transformed into an explicit sequence of linear operators, and the calculation of each output element is reconstructed as a set of nested double-precision fused multiply-add recursive formulas;
Let the target computation be y = A*x. For any component $y_i$ of the output vector, define the k-th level cumulative state operator $S_i^{(k)}$:
$$S_i^{(k)} = \mathrm{DFMA}\left(A_{i,k},\, x_k,\, S_i^{(k-1)}\right), \quad k = 1, \dots, C$$
where $S_i^{(0)} = 0$, C is the number of columns of the matrix, the final state gives $y_i = S_i^{(C)}$, and DFMA(a,b,c) corresponds to the GPU hardware instruction a×b + c; $A_{i,k}$ is the element in the i-th row and k-th column of the input matrix A, stored in global video memory in the SoA layout and loaded into registers to participate in the calculation; $x_k$ is the k-th component of the input vector x, stored according to the SoA mapping function so that a thread bundle (warp) can read it with merged (coalesced) accesses, and then loaded directly through the register mechanism. The values of $A_{i,k}$ and $x_k$ are sent directly to the thread's private registers. Finally, the computation unit takes these two values directly from the registers and completes the iterative accumulation of y = A*x using the DFMA instruction.
Step 3.2: Maximize instruction issue based on data orthogonality. Using the row independence of tiny matrix operations, instruction blocks free of data hazards are constructed; because the computation paths of different rows are provably orthogonal, the GPU's instruction scheduler can issue multiple instructions concurrently, filling the pipeline slots of the execution unit;
For different components $y_i$ and $y_j$ of the output vector, i≠j, the computation paths $S_i$ and $S_j$ satisfy the data dependency orthogonality condition, that is, no cumulative state of $S_i$ depends on any cumulative state of $S_j$.
5. An automated interface AIIP, characterized in that it is used to implement the micro-matrix operation instruction-level parallel optimization method based on heterogeneous many-core processor systems as described in any one of claims 1-4.
6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the micro-matrix operation instruction-level parallel optimization method based on a heterogeneous many-core processor system as described in any one of claims 1-4.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the micro-matrix operation instruction-level parallel optimization method based on a heterogeneous many-core processor system as described in any one of claims 1-4.
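The fused multiply-add recurrence and the row independence described in claim 4 can be illustrated with a short host-side sketch. It is an illustration under stated assumptions rather than the claimed GPU code: `dfma` is a hypothetical stand-in that evaluates a*b + c in ordinary Python floats (two roundings, whereas a hardware FMA rounds once), `matvec_fma` and `row_order` are names introduced here, and the 4×4 input is arbitrary test data. On a GPU the two loops would be fully unrolled into straight-line instructions; the loops are kept here only for brevity.

```python
def dfma(a, b, c):
    # Stand-in for the DFMA instruction: returns a*b + c.
    # ASSUMPTION: plain float arithmetic, so rounding differs from a true fused operation.
    return a * b + c


def matvec_fma(A, x, row_order=None):
    # Computes y = A*x with S_i^(0) = 0 and S_i^(k) = DFMA(A[i][k], x[k], S_i^(k-1)).
    # row_order sets the order in which the independent row chains are advanced
    # at each level k; it should not affect the result.
    rows, cols = len(A), len(x)
    order = list(range(rows)) if row_order is None else list(row_order)
    acc = [0.0] * rows                 # one accumulator per output component y_i
    for k in range(cols):              # recurrence level
        for i in order:                # chains touch only their own accumulator
            acc[i] = dfma(A[i][k], x[k], acc[i])
    return acc


A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
x = [1.0, 0.0, -1.0, 2.0]

y = matvec_fma(A, x)                            # rows advanced in order 0, 1, 2, 3
y_rev = matvec_fma(A, x, row_order=[3, 2, 1, 0])  # same chains, opposite issue order
```

Because each chain reads and writes only its own accumulator, reordering the rows within a level leaves every `y[i]` unchanged, which is the property that lets a scheduler overlap the chains.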