Hardware abstraction based cross-architecture compute-intensive fusion operator generation method

By using coordinate-driven unified semantic abstraction and cross-hardware normalization mechanism, the semantics of operators are decoupled from hardware execution, solving the problem of porting computationally intensive operators across different architectures and achieving high efficiency in cross-platform performance and development.

CN122240123A · Pending · Publication Date: 2026-06-19 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SOUTH CHINA UNIV OF TECH
Filing Date
2026-03-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In the existing technology, computationally intensive operators are difficult to port efficiently between different computing architectures, resulting in low development efficiency. Furthermore, the semantics of the operators are deeply coupled with the hardware execution, making it difficult to support complex broadcast semantics and multiple auxiliary tensor inputs.

Method used

By introducing coordinate-driven unified semantic abstraction and cross-hardware normalization mechanism, the differences in register layout, thread organization and instruction set of heterogeneous hardware are shielded, thereby decoupling the mathematical semantics from hardware execution and generating high-performance fused code across architectures.

Benefits of technology

Without introducing explicit branches and additional runtime overhead, it generates high-performance fused code, significantly improving overall computing performance and development efficiency, and supporting efficient cross-platform execution.

✦ Generated by Eureka AI based on patent content.

Figure CN122240123A_ABST

Abstract

This application relates to a method for generating cross-architecture computationally intensive fusion operators based on hardware abstraction. The method checks a captured operator subgraph against preset legality rules, establishes a unified semantic abstraction that defines the fusion operator as a coordinate-driven pure function, and encapsulates auxiliary-tensor memory access handles and layout geometry information in a runtime context. At compile time, a backend normalization layer semantically reinterprets the accumulator layouts and thread organizations of different hardware vendors, and cross-platform vectorization primitives are invoked to hide the differences between backend instruction set architectures. Arithmetic logic is lowered to an instruction sequence executed within a scalar scope: the abstract vector view is first decomposed into independent scalar elements, arithmetic instructions are generated in the scalar dimension, and the results are then reassembled into a vector register. Finally, address expressions are constructed from derived effective strides to handle broadcast and indexing logic, generating fused code that can be integrated with the underlying vendor's computationally intensive kernels, thus improving development efficiency.

Description

Technical Field

[0001] This application relates to the fields of artificial intelligence technology and operator optimization, and in particular to a method for generating cross-architecture computationally intensive fusion operators based on hardware abstraction.

Background Technology

[0002] In deep learning model deployment, computationally intensive operators such as matrix multiplication and convolution are key to performance optimization. The industry commonly employs operator fusion techniques, cascading these operators with subsequent element-wise operations to reduce redundant memory access overhead across storage layers. However, due to the severe fragmentation of underlying hardware architectures and the deep coupling between operator semantics and specific execution logic, existing fusion techniques often rely heavily on hand-written customizations for specific architectures. This lack of a unified hardware abstraction leads to low development efficiency and makes it difficult to efficiently port high-performance fusion operators across different computing architectures.

Summary of the Invention

[0003] Therefore, it is necessary to provide a hardware-abstraction-based method, system, computer device, and computer-readable storage medium for generating cross-architecture computationally intensive fusion operators that can improve development efficiency and address the above problems.

[0004] The first aspect of this application provides a method for generating cross-architecture computationally intensive fusion operators based on hardware abstraction, including: capturing, from the compiler front end, operator subgraphs that depend on the output of a computationally intensive matrix multiplication operator, and checking the operator subgraphs against preset legality rules, where the legality rules require that the subgraph operators be element-wise, that shapes and broadcasts be consistent, and that every access to an auxiliary tensor be affinely indexable from the output coordinates of the computationally intensive operator; establishing a unified semantic abstraction, defining the fusion operator as a coordinate-driven pure function, and encapsulating auxiliary-tensor memory access handles and layout geometry information in a runtime context, so that computational semantics are decoupled from hardware execution; using a backend normalization layer to semantically reinterpret, at compile time, the accumulator layouts and thread organizations of hardware from different vendors, normalizing them into a unified abstract vector view and global logical coordinates, and calling cross-platform vectorization primitives to hide the differences between backend instruction set architectures; and lowering arithmetic logic to an instruction sequence executed within a scalar scope, in which the abstract vector view is first decomposed into independent scalar elements, arithmetic instructions are generated in the scalar dimension, the results are reassembled into the vector register, and finally address expressions are constructed from derived effective strides to handle broadcast and indexing logic, generating fused code that can be integrated with the underlying vendor's computationally intensive kernel.

[0005] In one embodiment, capturing the operator subgraph from the compiler front end, which depends on the output of matrix multiplication computation-intensive operators, includes: identifying candidate operators that satisfy element-wise semantics and whose data flow starts from the matrix multiplication output, and transforming the operator subgraph into a preset canonical form through algebraic regularization and chain flattening.

[0006] In one embodiment, the constraint determination of the operator subgraph according to the preset legality rules includes: verifying, through shape consistency check, that all operators in the subgraph can be interpreted as point-to-point transformations of the matrix multiplication output tensor; and checking, through affine indexability verification, whether the memory access offset of the auxiliary tensor in the subgraph can be mapped to an affine function with respect to the global logical coordinates of the matrix multiplication output.

[0007] In one embodiment, establishing a unified semantic abstraction, defining the fusion operator as a coordinate-driven pure function, and encapsulating auxiliary-tensor memory access handles and layout geometry information in a runtime context to decouple computational semantics from hardware execution includes: constructing a runtime context container that encapsulates the base address pointers of the auxiliary tensors and the layout strides of the matrix multiplication output, and that resolves the global logical coordinates into the physical memory access offset of each auxiliary tensor at execution time; constructing a function interface based on accumulators and coordinates that maps the hardware thread organization of different backends into normalized three-dimensional logical coordinates with batch, row, and column dimensions, combined with validity predicates for boundary protection and computation task distribution; and, during the execution phase, adopting a scalarization strategy that decomposes the vector data in the register accumulator into a static scalar loop and, within the scalar domain, sequentially executes a pipelined instruction sequence that loads operands from the auxiliary tensors, performs arithmetic and logical transformations, and writes the results back into the vector register, thereby transforming abstract vector operations into a scalar instruction stream that the underlying compiler can register-allocate and re-vectorize.

[0008] In one embodiment, using the backend normalization layer to semantically reinterpret, at compile time, the accumulator layouts and thread organizations of hardware from different vendors and to normalize them into a unified abstract vector view and global logical coordinates includes: constructing an accumulator view adapter through template specialization, so that the opaque memory fragments or register layouts specific to each backend are semantically reinterpreted at compile time as a unified abstract vector view, achieving type alignment with zero runtime overhead; using a local-to-global coordinate mapping mechanism to project backend-specific thread-level indices into a unified three-dimensional logical coordinate system of batch, row, and column; and computing validity predicates from the kernel's block (tile) configuration and propagating them to the fusion operator interface, so that out-of-bounds accesses are masked across architectures without introducing explicit branch instructions.

[0009] In one embodiment, invoking cross-platform vectorization primitives to hide the differences between backend instruction set architectures includes: using an abstract vector container as a portable wrapper around hardware-native vectors, which decouples the logical vector length from the physical register bit width; invoking a predicated vector load primitive that receives global logical coordinates and validity predicates and dispatches to the corresponding native load instructions to perform masked memory accesses according to the instruction set characteristics of the target architecture; and using vector extraction and compile-time static expansion primitives to map a unified fusion description to a scalarized execution flow, so that backend compilers can perform re-vectorization and register allocation optimization on different single instruction multiple data (SIMD) or single instruction multiple thread (SIMT) architectures.

[0010] In one embodiment, lowering arithmetic logic to an instruction sequence executed within a scalar scope, in which the abstract vector view is first decomposed into independent scalar elements and arithmetic instructions are generated in the scalar dimension, includes: analyzing the shape relationship between each auxiliary tensor and the matrix multiplication output to derive the effective stride of each dimension, with the stride of every broadcast dimension set to zero, so as to construct an affine memory access offset expression based on logical coordinates; and, after topological linearization of the operator graph, mapping it to the scalar scope and generating a static single-assignment instruction sequence using a pipelined pattern of vector decomposition, scalar computation, and vector backfilling.

[0011] A second aspect of this application provides a cross-architecture computationally intensive fusion operator generation system based on hardware abstraction, comprising: a subgraph capture and legality check module, used to capture, from the compiler front end, operator subgraphs that depend on the output of a computationally intensive matrix multiplication operator and to check the operator subgraphs against preset legality rules, where the legality rules require that the subgraph operators be element-wise, that shapes and broadcasts be consistent, and that every access to an auxiliary tensor be affinely indexable from the output coordinates of the computationally intensive operator; a unified semantic abstraction module, used to establish a unified semantic abstraction, define the fusion operator as a coordinate-driven pure function, and encapsulate auxiliary-tensor memory access handles and layout geometry information in a runtime context, so that computational semantics are decoupled from hardware execution; a cross-hardware abstract mapping module, used to semantically reinterpret, through the backend normalization layer and at compile time, the accumulator layouts and thread organizations of different hardware vendors, normalize them into a unified abstract vector view and global logical coordinates, and call cross-platform vectorization primitives to hide the differences between backend instruction set architectures; and a code generation module, used to lower arithmetic logic to an instruction sequence executed within a scalar scope, which first decomposes the abstract vector view into independent scalar elements, generates arithmetic instructions in the scalar dimension, then reassembles the results into the vector register, and finally constructs address expressions from derived effective strides to handle broadcast and indexing logic, generating fused code that can be integrated with the underlying vendor's computationally intensive kernel.

[0012] A third aspect of this application provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the above-described method.

[0013] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method.

[0014] The aforementioned method, system, computer device, and computer-readable storage medium for generating cross-architecture computationally intensive fusion operators based on hardware abstraction capture, from the compiler front end, operator subgraphs that depend on the output of a computationally intensive matrix multiplication operator, and check the operator subgraphs against preset legality rules. A unified semantic abstraction is established that defines fusion operators as coordinate-driven pure functions and encapsulates auxiliary-tensor memory access handles and layout geometry information in a runtime context, thereby decoupling operator semantics from hardware execution. A backend normalization layer semantically reinterprets, at compile time, the accumulator layouts and thread organizations of hardware from different vendors, normalizing them into a unified abstract vector view and global logical coordinates, and cross-platform vectorization primitives are called to hide the differences between backend instruction set architectures. Arithmetic logic is lowered to an instruction sequence executed within a scalar scope: the abstract vector view is first decomposed into independent scalar elements, arithmetic instructions are generated in the scalar dimension, and the results are reassembled into a vector register. Finally, address expressions are constructed from derived effective strides to handle broadcast and indexing logic, generating fused code that can be integrated with the underlying vendor's computationally intensive kernel. This decouples the operator logic from the underlying architecture and generates high-performance fused code without introducing explicit branches or additional runtime overhead, improving both overall computing performance and development efficiency.

Attached Figure Description

[0015] Figure 1 is a flowchart of a cross-architecture computationally intensive fusion operator generation method based on hardware abstraction in one embodiment; Figure 2 is a block diagram of a cross-architecture computationally intensive fusion operator generation system based on hardware abstraction in one embodiment; Figure 3 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0016] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0017] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

[0018] When used herein, the singular forms of “a,” “an,” and “the” may also include the plural forms unless the context clearly indicates otherwise. It should also be understood that the terms “comprising / including” or “having,” etc., specify the presence of the stated features, wholes, steps, operations, components, parts, or combinations thereof, but do not preclude the possibility of the presence or addition of one or more other features, wholes, steps, operations, components, parts, or combinations thereof. Meanwhile, the term “and / or” as used in this specification includes any and all combinations of the associated listed items.

[0019] In the fields of high-performance computing (HPC) and artificial intelligence (AI), with the explosive growth in the scale of deep neural networks such as large language models (LLMs), the tension between computing power requirements and memory access bandwidth has become increasingly prominent. Computationally intensive operators, such as matrix multiplication and convolution, account for the main load of model inference and training, and their execution efficiency directly determines end-to-end performance. To alleviate hardware bandwidth bottlenecks, the industry widely adopts operator fusion, which merges computationally intensive operators with their subsequent element-wise operations (such as bias addition and activation functions) into the same kernel, thereby reducing the redundant data transfer between global memory and on-chip cache. However, existing operator fusion technologies face the following challenges. Severe hardware fragmentation: hardware from different manufacturers differs greatly in register accumulator layout, thread organization model, and instruction set architecture (ISA), so the fused kernel often has to be handwritten for specific hardware and is difficult to port.

[0020] The deep coupling between arithmetic semantics and hardware execution: Traditional fusion logic usually operates directly on specific vector registers or thread indices, lacking a unified abstraction layer, resulting in low development efficiency and difficulty in supporting complex broadcast semantics and multiple auxiliary tensor inputs.

[0021] In view of this, this application proposes a hardware-abstraction-based method for generating cross-architecture computationally intensive fusion operators for heterogeneous backends (such as GPUs, NPUs, and CPUs), addressing the technical problem in existing technologies that high-performance operators are bound to specific hardware architectures and vendor kernels, which results in low development efficiency and poor cross-platform portability. The core idea of this method is to decouple operator semantics from hardware execution. By introducing a coordinate-driven unified semantic abstraction and a cross-hardware normalization mechanism, the differences in register layout, thread organization, and instruction sets among heterogeneous hardware are hidden, enabling efficient cross-platform execution on various architectures such as domestically produced NPUs and general-purpose GPUs.

[0022] In one embodiment, as shown in Figure 1, a method for generating cross-architecture computationally intensive fusion operators based on hardware abstraction is provided, including: Step S110: Capture, from the compiler front end, the operator subgraph that depends on the output of a computationally intensive matrix multiplication operator, and check the operator subgraph against the preset legality rules.

[0023] The legality rules require that subgraph operators be element-wise, that shapes and broadcasts be consistent, and that every access to an auxiliary tensor be affinely indexable from the output coordinates of the computationally intensive operator. In this embodiment, operator subgraphs can be captured by using pattern matching to identify candidate operators that satisfy element-wise semantics and whose data flow starts at the matrix multiplication output, and then transforming the operator subgraph into a preset canonical form through algebraic regularization and chain flattening.

[0024] First, the system traverses the intermediate representation (IR) or computation graph provided by the compiler front end and locates computation nodes such as matrix multiplication and convolution as starting nodes. Then, following data-flow dependencies, it walks forward to all downstream operator nodes that consume the output tensor of the starting node. During this process, a predefined pattern matching algorithm filters candidate operators and identifies operator sequences with element-wise computation characteristics. If a downstream node contains an operator that breaks the element-wise correspondence, such as one that causes data rearrangement (e.g., Reduce), the search along the current path stops. After the candidate operators are identified and the operator subgraph is obtained, the subgraph is transformed into a preset canonical form through algebraic regularization and chain flattening. The canonical form is not unique and can be chosen according to actual needs.
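The forward walk over consumers described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the graph encoding, the `ANCHOR_OPS` and `ELEMENTWISE_OPS` sets, and the function name `capture_epilogue_subgraph` are all assumptions introduced for the example, and canonicalization (algebraic regularization, chain flattening) is not shown.

```python
from collections import deque

# Hypothetical op categories; a real compiler IR would define its own.
ANCHOR_OPS = {"matmul", "conv"}
ELEMENTWISE_OPS = {"add", "mul", "relu", "gelu", "sigmoid"}


def capture_epilogue_subgraph(nodes, anchor):
    """Collect element-wise consumers reachable from `anchor`.

    `nodes` maps a node name to (op_type, [input names]).
    The walk along a path stops at the first non-element-wise op.
    """
    if nodes[anchor][0] not in ANCHOR_OPS:
        raise ValueError(f"{anchor} is not a compute-intensive anchor")

    # Build the producer -> consumers adjacency once.
    consumers = {name: [] for name in nodes}
    for name, (_, inputs) in nodes.items():
        for src in inputs:
            consumers[src].append(name)

    captured, queue, seen = [], deque([anchor]), {anchor}
    while queue:
        current = queue.popleft()
        for user in consumers[current]:
            if user in seen:
                continue
            seen.add(user)
            if nodes[user][0] in ELEMENTWISE_OPS:
                captured.append(user)
                queue.append(user)
            # Any other op (e.g. "reduce_sum") ends the search on this path.
    return captured


graph = {
    "A": ("input", []),
    "B": ("input", []),
    "bias": ("input", []),
    "mm": ("matmul", ["A", "B"]),
    "biased": ("add", ["mm", "bias"]),
    "act": ("relu", ["biased"]),
    "total": ("reduce_sum", ["act"]),
    "scaled": ("mul", ["total", "bias"]),
}
subgraph = capture_epilogue_subgraph(graph, "mm")
print(subgraph)
```

In this toy graph the reduction node ends the path, so the element-wise `mul` that follows it is not captured even though it is itself element-wise.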

[0025] Further, in step S110, the operator subgraph is constrained according to preset legality rules, including: verifying, through shape consistency checks, that all operators within the subgraph can be interpreted as point-to-point transformations of the matrix multiplication output tensor; and verifying, through affine indexability checks, that the memory access offsets of auxiliary tensors within the subgraph can be mapped to affine functions about the global logical coordinates of the matrix multiplication output. If both the shape consistency check and the affine indexability check pass, the operator subgraph satisfies the constraint determination and can proceed to subsequent steps.

[0026] Specifically, the system checks whether all operators within the subgraph strictly adhere to point-to-point transformation semantics: each operator must depend only on the input value at the current position, and not on data from neighboring pixels or other positions. This ensures that the operator logic can be interpreted as an independent function transformation performed on each element of the matrix multiplication output tensor, i.e., a relation of the form $\mathrm{out}[b,m,n] = f(\mathrm{ci\_out}[b,m,n],\ \mathrm{aux\_bcast}[b,m,n])$, where ci_out is the output of the computationally intensive operator and aux_bcast is the auxiliary tensor broadcast to the output coordinates of the computationally intensive operator. Affine indexability verification is then performed: the system checks the memory access patterns of all tensors within the subgraph to ensure that each memory access offset (Offset) can be expressed as a linear affine function of the global output coordinates (b, m, n) of the computationally intensive operator, of the form $\mathrm{Offset}(b,m,n) = s_b\,b + s_m\,m + s_n\,n + c$, where $s_b$, $s_m$, $s_n$ are constant per-dimension strides and $c$ is a constant base offset.
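As a hedged illustration of the two checks, the sketch below derives per-dimension strides for an auxiliary tensor aligned to the output shape, assigning stride zero to broadcast dimensions, and then evaluates the affine offset at a logical coordinate. The row-major layout, the right-aligned broadcast convention, and the names `effective_strides` and `affine_offset` are assumptions made for the example.

```python
def effective_strides(aux_shape, out_shape):
    """Element strides of a row-major auxiliary tensor aligned to `out_shape`.

    Broadcast dimensions (extent 1 or missing) get stride 0.
    Raises ValueError when the shapes are not broadcast-compatible.
    """
    if len(aux_shape) > len(out_shape):
        raise ValueError("auxiliary tensor has more dimensions than the output")
    # Left-pad with 1s so both shapes have the same rank.
    padded = (1,) * (len(out_shape) - len(aux_shape)) + tuple(aux_shape)

    strides, running = [], 1
    for aux_dim, out_dim in zip(reversed(padded), reversed(out_shape)):
        if aux_dim == 1:
            strides.append(0)          # broadcast: every index maps to element 0
        elif aux_dim == out_dim:
            strides.append(running)    # ordinary row-major stride
        else:
            raise ValueError(f"cannot broadcast {aux_shape} to {out_shape}")
        running *= aux_dim
    return tuple(reversed(strides))


def affine_offset(strides, coord, base=0):
    """Offset = base + sum(stride_i * coord_i), an affine function of coord."""
    return base + sum(s * c for s, c in zip(strides, coord))


out_shape = (2, 4, 8)                                   # (batch, rows, cols)
bias_strides = effective_strides((8,), out_shape)        # per-column bias
row_scale_strides = effective_strides((4, 1), out_shape)  # per-row scale
residual_strides = effective_strides(out_shape, out_shape)

coord = (1, 3, 5)
print(bias_strides, affine_offset(bias_strides, coord))
print(row_scale_strides, affine_offset(row_scale_strides, coord))
print(residual_strides, affine_offset(residual_strides, coord))
```

Because a broadcast dimension contributes a zero coefficient, the same offset expression serves full tensors, per-column vectors, and per-row vectors without any conditional logic in the generated address computation.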

[0027] Step S120: Establish a unified semantic abstraction, define the fusion operator as a coordinate-driven pure function, and encapsulate the auxiliary tensor memory access handle and layout geometry information through runtime context to achieve decoupling of the arithmetic semantics and hardware execution.

[0028] This involves establishing a unified Epilogue semantic abstraction, defining fusion operators as coordinate-driven pure functions whose execution semantics are based on global logical coordinates. By introducing a runtime Epilogue context to encapsulate auxiliary tensor memory access handles and layout geometry information, and by using a unified interface to explicitly receive register accumulator vectors, global coordinates, and validity predicates, the decoupling of computational semantics from specific hardware execution is achieved.

[0029] In one embodiment, step S120 includes: constructing a runtime context container that encapsulates the base address pointers of the auxiliary tensors and the layout strides of the matrix multiplication output, and that resolves the global logical coordinates into the physical memory access offset of each auxiliary tensor during execution; constructing a function interface based on accumulators and coordinates that maps the hardware thread organization of different backends into normalized three-dimensional logical coordinates with batch, row, and column dimensions, combined with validity predicates for boundary protection and computation task distribution; and adopting a scalarization strategy during the execution phase that decomposes the vector data in the register accumulator into a static scalar loop and sequentially executes, in the scalar domain, a pipelined instruction sequence that loads operands from the auxiliary tensors, performs arithmetic and logical transformations, and writes the results back into the vector register, thereby converting abstract vector operations into a scalar instruction stream that the underlying compiler can register-allocate and re-vectorize.

[0030] Specifically, the system constructs a lightweight parameter container as the runtime context, which passes the base address pointers (data pointers) and layout strides of the auxiliary tensors at kernel launch. The fusion operator is then defined as a function with no side effects, of the form $\mathrm{Out} = F(\mathrm{Acc}, \mathrm{Context}, \mathrm{Coord})$, where Acc is the original accumulator vector calculated by the computationally intensive operator; Context is the context container used to uniformly manage the memory access handles, tensor shapes, and layout metadata of the auxiliary tensors; and Coord denotes the normalized logical coordinates (Batch, Row, Col, Predicate), in which Predicate is a Boolean mask used to identify whether the current data is out of bounds.
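A coordinate-driven, side-effect-free fusion function of this kind might look like the sketch below, shown for a bias-add followed by ReLU. All class and function names (`AuxTensor`, `EpilogueContext`, `Coord`, `bias_relu_epilogue`) are hypothetical, a flat Python list stands in for the base pointer, and the accumulator lanes are assumed to lie along the column dimension. The Python conditional only models the predicate; a real backend would use predicated instructions instead of a branch.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class AuxTensor:
    data: Sequence[float]            # flat storage, stands in for a base pointer
    strides: Tuple[int, int, int]    # (batch, row, col) strides; 0 means broadcast


@dataclass(frozen=True)
class EpilogueContext:
    bias: AuxTensor                  # the only auxiliary tensor in this example


@dataclass(frozen=True)
class Coord:
    batch: int
    row: int
    col: int                         # column of the first accumulator lane
    valid: Tuple[bool, ...]          # one validity predicate per lane


def bias_relu_epilogue(acc, ctx, coord):
    """Pure function: returns a new vector and mutates none of its inputs."""
    sb, sm, sn = ctx.bias.strides
    out = []
    for lane, value in enumerate(acc):
        if not coord.valid[lane]:
            out.append(0.0)          # masked lane: no auxiliary load is issued
            continue
        offset = sb * coord.batch + sm * coord.row + sn * (coord.col + lane)
        out.append(max(value + ctx.bias.data[offset], 0.0))
    return tuple(out)


ctx = EpilogueContext(bias=AuxTensor(data=[0.5, -1.0, 2.0, 0.0], strides=(0, 0, 1)))
acc = (1.0, 2.0, -3.0, 4.0)
coord = Coord(batch=0, row=2, col=0, valid=(True, True, True, False))
result = bias_relu_epilogue(acc, ctx, coord)
print(result)
```

Nothing in the function refers to thread indices or register layouts; the backend only has to supply the accumulator values, the context, and the logical coordinate.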

[0031] Step S120 provides a unified Epilogue semantic abstraction based on coordinate-driven, three-stage execution, transforming Epilogue fusion operators from hardware-specific thread programs into mathematically pure functions. Under this mechanism, the system constructs a runtime Epilogue context container to uniformly manage the memory access handles, tensor shapes, and layout metadata of auxiliary tensors. Unlike traditional implementations based on physical registers or specific SIMD instructions, this application defines a coordinate-aware function interface whose execution semantics are entirely based on global logical coordinates (a three-dimensional coordinate system composed of batch, row, and column indices). In this way, the operator logic is no longer aware of the underlying hardware scheduling mechanisms (such as block swizzling in CUDA or multi-level nested loops on CPUs) and only describes the data transformation at each point of the logical space. Meanwhile, this application employs a "scalar execution" strategy at the semantic level, coupled with a strict Load-Compute-Store three-stage structure. During lowering, complex vector arithmetic logic is decomposed into a static scalar instruction sequence. Because the computational logic is described within the scalar domain, arithmetic definitions are fully decoupled from the management of physical vector registers. This allows backend compilers (such as LLVM and NVCC) to allocate registers, re-vectorize, and schedule instructions on their own, based on the register bit width and instruction set characteristics of the target hardware. This abstraction not only enhances code portability but also relies on the optimization capabilities of modern compilers to ensure that high-performance low-level code is still generated when lowering to different hardware (such as the SIMT mode of a GPU or the SIMD mode of a CPU).
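To make the decompose, compute, and backfill pattern concrete, the following sketch emits a fully unrolled, single-assignment pseudo-instruction list for a chain of element-wise operations. The textual instruction names (`extract`, `load_pred`, `insert`, `undef`) and the function `lower_to_scalar_ssa` are invented for illustration and do not correspond to any particular compiler IR.

```python
def lower_to_scalar_ssa(ops, vector_width):
    """Emit straight-line SSA-style pseudo-instructions.

    `ops` is a topologically ordered chain of (opcode, aux_name or None)
    applied to every accumulator element.
    """
    lines = ["%out0 = undef"]
    for lane in range(vector_width):             # static, fully unrolled loop
        current = f"%x{lane}_0"
        lines.append(f"{current} = extract %acc, {lane}")        # decompose
        for step, (opcode, aux) in enumerate(ops, start=1):
            result = f"%x{lane}_{step}"
            if aux is None:
                lines.append(f"{result} = {opcode} {current}")   # compute
            else:
                operand = f"%{aux}{lane}_{step}"
                lines.append(f"{operand} = load_pred {aux}, %coord, {lane}")  # load
                lines.append(f"{result} = {opcode} {current}, {operand}")     # compute
            current = result
        # Backfill: each insert defines a new version of the output vector.
        lines.append(f"%out{lane + 1} = insert %out{lane}, {current}, {lane}")
    return lines


program = lower_to_scalar_ssa([("add", "bias"), ("relu", None)], vector_width=2)
print("\n".join(program))
```

Every value is defined exactly once and the loop bounds are compile-time constants, which is the property that lets a backend compiler re-vectorize the stream and allocate registers freely.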

[0032] Step S130: Utilize the backend normalization layer to semantically reinterpret the accumulator layout and thread organization of different hardware vendors during compilation, normalize them into a unified abstract vector view and global logical coordinates, and call cross-platform vectorization primitives to shield the differences between different backend instruction set architectures.

[0033] Specifically, the backend normalization layer performs semantic reinterpretation of accumulator layouts, thread organization, and register mapping patterns from different vendors' computationally intensive operator implementations during compile time, normalizing them into a unified abstract vector view and global logical coordinates, and automatically generating validity predicates to handle boundary alignment issues. Simultaneously, a set of cross-platform vectorization primitives, including abstract vector containers, predicate-based vector loading, scalar extraction, and static expansion primitives, masks the differences between different backend instruction set architectures (ISAs).

[0034] In step S130, the backend normalization layer semantically reinterprets, at compile time, the accumulator layouts and thread organizations of different hardware vendors and normalizes them into a unified abstract vector view and global logical coordinates. This can be achieved by constructing an accumulator view adapter through template specialization, which semantically reinterprets the opaque memory fragments or register layouts specific to each backend as a unified abstract vector view at compile time, thus achieving type alignment with zero runtime overhead. A local-to-global coordinate mapping mechanism projects the backend-specific thread-level indices into unified three-dimensional logical coordinates of batch, row, and column. Validity predicates are computed from the kernel's block (tile) configuration and propagated to the fusion operator interface, so that out-of-bounds accesses are masked across architectures without introducing explicit branch instructions.

[0035] Furthermore, in step S130, calling cross-platform vectorization primitives to hide the differences between backend instruction set architectures can be achieved by using abstract vector containers as a portable wrapper around hardware-native vectors, thus decoupling the logical vector length from the physical register bit width. A predicated vector load primitive is called, which receives global logical coordinates and validity predicates and dispatches to the corresponding native load instructions to execute masked memory accesses according to the instruction set characteristics of the target architecture. Through vector extraction and compile-time static expansion primitives, a unified fusion description is mapped to a scalarized execution flow, so that backend compilers can perform re-vectorization and register allocation optimization on different Single Instruction Multiple Data (SIMD) or Single Instruction Multiple Thread (SIMT) architectures.

[0036] Specifically, based on C++ template metaprogramming techniques, a specialized accumulator view adapter can be built for each target backend (such as CUTLASS or CK). At compile time, this adapter semantically reinterprets backend-specific opaque storage objects (such as Array<half, 8> or a private register array) as the unified abstract vector view defined by the system. Simultaneously, the adapter establishes a local-to-global coordinate mapping mechanism, projecting hardware-specific thread indices (such as threadIdx.x or laneId) into the global logical space of the output matrix of the computationally intensive operator. For example, on the NVIDIA GPU architecture, the adapter inverts the swizzled thread ID back into logical coordinates containing batch, row, and column dimensions. Furthermore, to handle boundary tiles when the matrix size is not evenly divisible by the tile size, the adapter automatically computes the validity of the elements held by the current thread from the kernel's block or tile configuration and generates a mask. This mask is explicitly passed to the fusion operator interface, so that illegal accesses inside the fusion operator are masked with predicated instructions and inefficient explicit if-else branching is avoided.
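As a rough illustration of the adapter idea, the following C++ sketch specializes one adapter template per backend storage type. FragmentA and FragmentB are stand-ins invented for this example (they only imitate vendor-specific opaque storage and are not CUTLASS or CK types), and VecView, AccumulatorViewAdapter, and scale_accumulator are likewise hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Unified abstract vector view: a non-owning window over N accumulator lanes.
template <typename T, std::size_t N>
struct VecView {
    using value_type = T;
    T* data;
    static constexpr std::size_t lanes() { return N; }
    T& operator[](std::size_t i) const {
        assert(i < N);
        return data[i];
    }
};

// Two made-up backend accumulator types with different element types and sizes.
struct FragmentA { float storage[8]; };
struct FragmentB { double regs[4]; };

// The primary template is left undefined, so an unsupported backend is
// rejected at compile time instead of at run time.
template <typename Fragment>
struct AccumulatorViewAdapter;

template <>
struct AccumulatorViewAdapter<FragmentA> {
    using View = VecView<float, 8>;
    static View view(FragmentA& f) { return View{f.storage}; }  // no copy
};

template <>
struct AccumulatorViewAdapter<FragmentB> {
    using View = VecView<double, 4>;
    static View view(FragmentB& f) { return View{f.regs}; }     // no copy
};

// An epilogue step written once against the unified view.
template <typename Fragment>
void scale_accumulator(Fragment& frag, double alpha) {
    using View = typename AccumulatorViewAdapter<Fragment>::View;
    using T = typename View::value_type;
    View v = AccumulatorViewAdapter<Fragment>::view(frag);
    for (std::size_t i = 0; i < View::lanes(); ++i) {
        v[i] = static_cast<T>(v[i] * alpha);
    }
}
```

The view only wraps a pointer to the existing storage, so selecting the adapter is resolved by the type system and the accumulator data is never moved, which is the sense in which the reinterpretation is free at run time.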

[0037] Step S130 provides a two-layer cross-hardware abstraction mechanism consisting of a backend normalization layer and a unified set of vectorization primitives. The backend normalization layer acts as a compile-time semantic converter and is responsible for masking the physical differences in accumulator layout and thread organization between different hardware operator libraries (such as CUTLASS and CK). To address the inconsistent distribution of accumulator data across registers on different architectures, this layer implements an accumulator view adapter. The adapter uses template specialization to semantically reinterpret the opaque physical storage at compile time, mapping it to the unified abstract vector view defined in this application. This process relies entirely on type inference and generates no additional data movement or runtime overhead. Furthermore, the backend normalization layer normalizes hardware-specific thread indices into global logical coordinates through local-to-global coordinate projection and generates validity predicates at the same time. This mechanism allows boundary handling logic (padding) to propagate in the form of predicate masks, which avoids inserting explicit conditional branches into the generated operator code and significantly improves the execution efficiency of the instruction pipeline. Above the backend normalization layer, this application defines a set of cross-platform vectorization primitives as a bridge between operator logic and the instruction set architecture (ISA). These primitives include, but are not limited to: abstract vector containers (used to decouple logical length from physical bit width), predicated vector load primitives (supporting masked accesses at boundaries), and static expansion and extraction primitives. Through this set of primitives, the same epilogue description can be automatically dispatched at compile time to the corresponding underlying instructions of the target architecture.
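The coordinate projection and predicate generation described above can be sketched as follows. The thread-to-element layout used here (threads laid out row by row inside a tile, each owning a few contiguous columns) is an assumption made purely for illustration; real backends use their own, often swizzled, layouts. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Coord3 { std::int64_t batch, row, col; };

// Hypothetical tile configuration of the compute-intensive kernel.
struct TileConfig {
    std::int64_t tile_m, tile_n;     // tile extent in rows and columns
    std::int64_t threads_per_row;    // threads laid out along one tile row
    std::int64_t elems_per_thread;   // contiguous columns owned by a thread
};

struct Problem { std::int64_t batch, m, n; };  // logical output extent

// Project (block indices, thread id, element index) to global coordinates.
inline Coord3 local_to_global(const TileConfig& c, std::int64_t block_b,
                              std::int64_t block_m, std::int64_t block_n,
                              std::int64_t tid, std::int64_t elem) {
    assert(elem >= 0 && elem < c.elems_per_thread);
    const std::int64_t local_row = tid / c.threads_per_row;
    const std::int64_t local_col =
        (tid % c.threads_per_row) * c.elems_per_thread + elem;
    return Coord3{block_b, block_m * c.tile_m + local_row,
                  block_n * c.tile_n + local_col};
}

// Validity predicate: true only for elements inside the logical output.
// The comparisons are combined with bitwise AND, so the source contains no
// short-circuit control flow.
inline bool is_valid(const Problem& p, const Coord3& g) {
    return (g.batch < p.batch) & (g.row < p.m) & (g.col < p.n);
}
```

With a 4 x 8 tile over a 6 x 10 output, the second tile in each dimension is only partially covered, so some threads of that tile receive a false predicate and their lanes are masked rather than branched around.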

[0038] Step S140: Lower the arithmetic logic to a sequence of instructions executed within the scalar scope. First, decompose the abstract vector view into independent scalar elements and generate arithmetic instructions in the scalar dimension; then reassemble the computation results into the vector register; finally, derive the effective stride (step size) of each dimension and construct the address expressions that handle broadcast and indexing logic, generating fused code that can be integrated into the underlying vendor's computationally intensive kernel.

[0039] Specifically, once the high-level arithmetic logic has been lowered to a sequence of instructions executed within the scalar scope, the abstract vector view is first decomposed into independent scalar elements, arithmetic instructions are generated in the scalar dimension, and the computation results are reassembled into the vector register. Finally, address expressions that handle broadcast and indexing logic are constructed from the derived effective strides, generating fused code that can be integrated into the underlying vendor's computationally intensive kernel, that is, connected to the vendor's matrix multiplication kernel.

[0040] In one embodiment, the lowering in step S140, in which the arithmetic logic is reduced to a sequence of instructions executed within the scalar scope, the abstract vector view is decomposed into independent scalar elements, and arithmetic instructions are generated in the scalar dimension, can be achieved as follows. The shape relationship between each auxiliary tensor and the matrix multiplication output is analyzed to derive the effective stride of each dimension, and the stride of every broadcast dimension is set to zero, so that the memory access offset can be expressed as an affine function of the logical coordinates. Then, under a scalar execution model, the operator graph is topologically linearized and mapped to the scalar scope, and an instruction sequence in Static Single Assignment (SSA) form is generated using the pipelined pattern of "vector decomposition, scalar computation, vector backfilling".
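The stride derivation can be made concrete with a small C++ sketch. It assumes, for illustration only, that each auxiliary tensor is stored densely in row-major order with a three-dimensional shape (batch, rows, cols) that has already been padded with 1s to the rank of the output; the names Shape3, effective_strides, and affine_offset are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Shape3 = std::array<std::int64_t, 3>;  // {batch, rows, cols}

// Effective strides of an auxiliary tensor relative to the output shape.
// A dimension where the auxiliary extent is 1 but the output extent is
// larger is a broadcast dimension and receives stride 0.
inline Shape3 effective_strides(const Shape3& aux, const Shape3& out) {
    Shape3 s{};
    std::int64_t running = 1;  // dense row-major stride of the current dim
    for (int d = 2; d >= 0; --d) {
        assert(aux[d] == out[d] || aux[d] == 1);  // broadcast-compatible
        s[d] = (aux[d] == 1 && out[d] != 1) ? 0 : running;
        running *= aux[d];
    }
    return s;
}

// Affine memory access offset driven by the global logical coordinates.
inline std::int64_t affine_offset(const Shape3& stride, std::int64_t b,
                                  std::int64_t r, std::int64_t c) {
    return b * stride[0] + r * stride[1] + c * stride[2];
}
```

For an output of shape (2, 4, 8), a per-column bias of shape (1, 1, 8) gets strides (0, 0, 1), so every batch and row reads the same 8 values, while a full-size residual gets the ordinary dense strides (32, 8, 1). No special case for broadcasting is needed at the point of access.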

[0041] Specifically, during the code generation phase, the system first analyzes the shape relationship between each auxiliary tensor and the output tensor of the computationally intensive operator, derives the effective stride of each dimension, and forces the stride of every broadcast dimension to zero. It then constructs the memory access address expression from the global logical coordinates obtained in step S130. Subsequently, the PredicatedLoad primitive is called to load the data. This primitive automatically dispatches to underlying instructions according to the characteristics of the target architecture (e.g., generating _mm256_maskload_ps on the x86 architecture and cutlass::global_load on the CUDA architecture) and stores the loaded data in an abstract vector container. For instruction sequence generation, the system adopts the execution pattern of "vector decomposition, scalar computation, vector backfilling": the accumulator vector and the loaded auxiliary vector are expanded into a scalar loop, the corresponding arithmetic instructions are generated within the scalar scope in static single assignment (SSA) form, and the computation results are then backfilled into the accumulator register. Finally, the generated fused code snippet is embedded at the tail of the main loop of the underlying vendor's computationally intensive operator kernel and compiled into a high-performance binary.
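A hand-written approximation of what such generated code might look like is sketched below for one illustrative epilogue, relu(acc * alpha + aux). The choice of epilogue, the function name fused_epilogue, and the use of std::array in place of real accumulator registers are all assumptions of this example rather than details from the application.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fused epilogue over N accumulator lanes: acc = relu(acc * alpha + aux).
// Lanes whose validity predicate is false keep their original value.
template <std::size_t N>
void fused_epilogue(std::array<float, N>& acc,
                    const std::array<float, N>& aux,
                    const std::array<bool, N>& valid, float alpha) {
    assert(alpha == alpha);  // precondition: alpha is not NaN
    // The trip count N is a compile-time constant, so a compiler can fully
    // unroll this loop and re-vectorize the scalar statements.
    for (std::size_t i = 0; i < N; ++i) {
        // 1) Vector decomposition: extract one scalar per operand.
        const float a0 = acc[i];
        const float b0 = aux[i];
        // 2) Scalar computation in SSA style: each value is defined once.
        const float t0 = a0 * alpha;
        const float t1 = t0 + b0;
        const float t2 = t1 > 0.0f ? t1 : 0.0f;
        // 3) Vector backfilling: write the result back under the predicate.
        acc[i] = valid[i] ? t2 : a0;
    }
}
```

Every intermediate is a const local that is assigned exactly once, which mirrors the SSA form mentioned in the text and leaves register allocation entirely to the backend compiler.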

[0042] The above hardware-abstraction-based method for generating cross-architecture computationally intensive fusion operators addresses the hardware lock-in, low development efficiency, and poor cross-platform portability that result from the coupling between operator semantics and hardware execution in existing operator fusion technologies. This application establishes a coordinate-driven unified semantic abstraction and a scalarized execution strategy, combined with a backend normalization layer that semantically reinterprets the accumulator layouts and thread organizations of heterogeneous hardware at compile time, replacing the architecture-specific hand-written optimization approach. Using cross-platform vectorization primitives and a predicate propagation mechanism, the method decouples operator logic from the underlying architecture and generates high-performance fused code without introducing explicit branches or additional runtime overhead, significantly improving overall computational performance and development efficiency.

[0043] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0044] Based on the same inventive concept, this application also provides a hardware-abstraction-based system for generating cross-architecture computationally intensive fusion operators, which implements the method described above. The solution provided by this system is similar to the implementation described for the method. Therefore, for the specific limitations of the one or more system embodiments provided below, reference may be made to the limitations of the method described above, which are not repeated here.

[0045] In one embodiment, as shown in Figure 2, a hardware-abstraction-based system for generating cross-architecture computationally intensive fusion operators is provided, including the following modules. The subgraph capture and legality check module 110 is used to capture, from the compiler front end, operator subgraphs that depend on the output of a matrix multiplication computationally intensive operator, and to perform constraint checking on the operator subgraphs according to preset legality rules. The legality rules require that the operators in the subgraph be element-wise, that shapes and broadcasting be consistent, and that every access to an auxiliary tensor be affinely indexable from the output coordinates of the computationally intensive operator.

[0046] The unified semantic abstraction module 120 is used to establish a unified semantic abstraction, define the fusion operator as a coordinate-driven pure function, and encapsulate the auxiliary tensor memory access handle and layout geometry information through runtime context to achieve decoupling of the algorithm semantics and hardware execution.

[0047] The cross-hardware abstraction mapping module 130 is used to semantically reinterpret, by means of the backend normalization layer, the accumulator layouts and thread organizations of different vendors' hardware at compile time, normalize them into a unified abstract vector view and global logical coordinates, and call cross-platform vectorization primitives to mask the differences between backend instruction set architectures.

[0048] The code generation module 140 is used to lower the arithmetic logic to a sequence of instructions executed within the scalar scope. It first decomposes the abstract vector view into independent scalar elements and generates arithmetic instructions in the scalar dimension, then reassembles the computation results into the vector register, and finally derives the effective strides and constructs the address expressions that handle broadcast and indexing logic, generating fused code that can be integrated into the underlying vendor's computationally intensive kernel.

[0049] In one embodiment, the subgraph capture and legality check module 110 is used to identify, based on pattern matching, candidate operators that satisfy element-wise semantics and whose data flow starts from the matrix multiplication output, and to transform the operator subgraph into a preset canonical form through algebraic regularization and chain flattening.

[0050] In one embodiment, the subgraph capture and legality check module 110 is used to verify, through shape consistency check, that all operators in the subgraph can be interpreted as point-to-point transformations of the matrix multiplication output tensor; and through affine indexability verification, to check whether the memory access offsets of auxiliary tensors in the subgraph can be mapped to affine functions with respect to the global logical coordinates of the matrix multiplication output.
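The two checks can be illustrated with a simplified C++ sketch. A compiler would normally inspect the symbolic index expression; the version below instead treats the offset as a black-box function, fits an affine form from four probe points, and verifies it by brute force over all output coordinates, which is an assumption made only to keep the example self-contained. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

using Shape = std::array<std::int64_t, 3>;  // {batch, rows, cols}
using OffsetFn =
    std::function<std::int64_t(std::int64_t, std::int64_t, std::int64_t)>;

// Shape consistency: every auxiliary extent equals the output extent or is 1.
inline bool broadcast_compatible(const Shape& aux, const Shape& out) {
    for (int d = 0; d < 3; ++d) {
        if (aux[d] != out[d] && aux[d] != 1) return false;
    }
    return true;
}

struct Affine { std::int64_t base, sb, sr, sc; };

// Affine indexability: fit offset = base + sb*b + sr*r + sc*c from probes at
// the origin and the unit coordinates, then verify it on every coordinate.
inline bool fit_affine(const OffsetFn& offset, std::int64_t B, std::int64_t M,
                       std::int64_t N, Affine& fit) {
    assert(B > 0 && M > 0 && N > 0);
    fit.base = offset(0, 0, 0);
    fit.sb = B > 1 ? offset(1, 0, 0) - fit.base : 0;
    fit.sr = M > 1 ? offset(0, 1, 0) - fit.base : 0;
    fit.sc = N > 1 ? offset(0, 0, 1) - fit.base : 0;
    for (std::int64_t b = 0; b < B; ++b)
        for (std::int64_t r = 0; r < M; ++r)
            for (std::int64_t c = 0; c < N; ++c)
                if (offset(b, r, c) !=
                    fit.base + fit.sb * b + fit.sr * r + fit.sc * c)
                    return false;
    return true;
}
```

A column-bias access whose offset is simply c passes with coefficients (0, 0, 1), whereas an access whose offset is r * r is rejected, so a subgraph containing it would not be fused under these rules.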

[0051] In one embodiment, the unified semantic abstraction module 120 is used to: construct a runtime context container that encapsulates the base address pointers of the auxiliary tensors and the layout strides of the matrix multiplication output, so that the global logical coordinates are dynamically resolved into the physical memory access offset of each auxiliary tensor during execution; construct a function interface based on accumulators and coordinates, in which the hardware thread organizations of different backends are mapped into normalized three-dimensional logical coordinates containing batch, row, and column dimensions and combined with validity predicates for boundary protection and computation task distribution; and adopt a scalarization strategy during the execution phase, in which the vector data in the register accumulator is decomposed into a static scalar loop and, within the scalar domain, a pipelined instruction sequence is executed in order that loads operands from the auxiliary tensors, performs arithmetic and logical transformations, and fills the results back into the vector register. This converts the abstract vector operations into a scalar instruction stream on which the underlying compiler can perform register allocation and re-vectorization.
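One possible shape for such a context container and function interface is sketched below in C++. The names AuxHandle, EpilogueContext, and fused_op, the fixed count of two auxiliary tensors, and the particular epilogue relu(acc + bias + residual) are illustrative assumptions, not details taken from the application.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Coord { std::int64_t batch, row, col; };

// Memory access handle of one auxiliary tensor: a base pointer plus its
// effective strides (0 on broadcast dimensions).
struct AuxHandle {
    const float* base;
    std::array<std::int64_t, 3> stride;
};

// Runtime context container: resolves logical coordinates to physical
// offsets for each auxiliary tensor at execution time.
template <std::size_t NumAux>
struct EpilogueContext {
    std::array<AuxHandle, NumAux> aux;

    float load(std::size_t k, const Coord& g, bool valid) const {
        assert(k < NumAux);
        const AuxHandle& h = aux[k];
        const std::int64_t off = g.batch * h.stride[0] +
                                 g.row * h.stride[1] + g.col * h.stride[2];
        return valid ? h.base[off] : 0.0f;  // masked lanes read nothing
    }
};

// Coordinate-driven pure function: the result depends only on the
// accumulator value, the logical coordinate, the predicate, and the context.
inline float fused_op(float acc, const Coord& g, bool valid,
                      const EpilogueContext<2>& ctx) {
    const float bias = ctx.load(0, g, valid);      // broadcast over rows
    const float residual = ctx.load(1, g, valid);  // full-size tensor
    const float t = acc + bias + residual;
    return t > 0.0f ? t : 0.0f;
}
```

Nothing in fused_op mentions threads, registers, or a vendor library, so the same function body could in principle be instantiated behind any backend that supplies the accumulator value, the coordinate, and the predicate.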

[0052] In one embodiment, the cross-hardware abstraction mapping module 130 is used to: construct an accumulator view adapter through template specialization, which semantically reinterprets the opaque memory fragments or register layouts specific to each backend as a unified abstract vector view at compile time, achieving type alignment with zero runtime overhead; use a local-to-global coordinate mapping mechanism to project backend-specific thread-level indices into unified three-dimensional logical coordinates of batch, row, and column; and compute validity predicates from the kernel's tiling configuration and propagate them to the fusion operator interface, so that illegal out-of-bounds accesses are masked on every architecture without introducing explicit branch instructions.

[0053] In one embodiment, the cross-hardware abstraction mapping module 130 is used to: use an abstract vector container as a portable wrapper around hardware-native vectors, decoupling the logical vector length from the physical register bit width; call a predicated vector load primitive that receives the global logical coordinates and validity predicates and, according to the instruction set characteristics of the target architecture, automatically dispatches to the corresponding native load instruction to perform masked memory accesses; and, through vector extraction and compile-time static expansion primitives, map the unified fusion description to a scalarized execution flow, which allows the backend compiler to perform re-vectorization and register allocation optimization on different single instruction multiple data or single instruction multiple thread architectures.

[0054] In one embodiment, the code generation module 140 is used to derive the effective step size for each dimension by analyzing the shape relationship between the auxiliary tensor and the matrix multiplication output, and set the step size of the broadcast dimension to zero in order to construct an affine expression for memory access offset based on logical coordinates; adopting a scalar execution model, the operator graph is topologically linearized and mapped to the scalar scope, and the instruction sequence in static single assignment form is generated using the pipeline mode of "vector decomposition - scalar computation - vector backfilling".
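Topological linearization into single-assignment form can be sketched as follows. The textual "%t" instruction format, the Node structure, and the use of Kahn's algorithm are choices made for this illustration; the application does not specify a particular ordering algorithm or output syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// One operator of the fused subgraph; inputs refer to producer nodes by index.
struct Node {
    std::string op;
    std::vector<int> inputs;
};

// Kahn's algorithm: returns node indices so that producers precede users.
inline std::vector<int> topo_order(const std::vector<Node>& g) {
    const int n = static_cast<int>(g.size());
    std::vector<int> indeg(n, 0);
    std::vector<std::vector<int>> users(n);
    for (int v = 0; v < n; ++v) {
        for (int u : g[v].inputs) {
            assert(u >= 0 && u < n);
            users[u].push_back(v);
            ++indeg[v];
        }
    }
    std::queue<int> ready;
    for (int v = 0; v < n; ++v) {
        if (indeg[v] == 0) ready.push(v);
    }
    std::vector<int> order;
    while (!ready.empty()) {
        const int u = ready.front();
        ready.pop();
        order.push_back(u);
        for (int v : users[u]) {
            if (--indeg[v] == 0) ready.push(v);
        }
    }
    assert(order.size() == g.size());  // fails if the graph has a cycle
    return order;
}

// Emit one line per node; every node defines exactly one fresh value %t<k>.
inline std::vector<std::string> emit_ssa(const std::vector<Node>& g) {
    const std::vector<int> order = topo_order(g);
    std::vector<int> name(g.size(), -1);
    std::vector<std::string> lines;
    int next = 0;
    for (int v : order) {
        name[v] = next++;
        std::string line = "%t" + std::to_string(name[v]) + " = " + g[v].op;
        for (std::size_t i = 0; i < g[v].inputs.size(); ++i) {
            line += (i == 0 ? " %t" : ", %t") +
                    std::to_string(name[g[v].inputs[i]]);
        }
        lines.push_back(line);
    }
    return lines;
}
```

Because values are numbered in emission order, each line refers only to values defined on earlier lines, which is the property the scalar scope needs before the per-lane statements are generated.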

[0055] The modules in the above hardware-abstraction-based system for generating cross-architecture computationally intensive fusion operators can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in, or be independent of, the processor of a computer device in hardware form, or be stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.

[0056] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 3. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are connected to the system bus via the input/output interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements a hardware-abstraction-based method for generating cross-architecture computationally intensive fusion operators. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0057] Those skilled in the art will understand that the structure shown in Figure 3 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.

[0058] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method described above.

[0059] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method.

[0060] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the method described above.

[0061] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The processors involved in the embodiments provided in this application can be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited thereto.

[0062] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0063] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A method for generating cross-architecture computationally intensive fusion operators based on hardware abstraction, characterized in that the method comprises: capturing, from the compiler front end, operator subgraphs that depend on the output of a matrix multiplication computationally intensive operator, and performing constraint checking on the operator subgraphs according to preset legality rules, wherein the legality rules require that the operators in the subgraph be element-wise, that shapes and broadcasting be consistent, and that every access to an auxiliary tensor be affinely indexable from the output coordinates of the computationally intensive operator; establishing a unified semantic abstraction, defining the fusion operator as a coordinate-driven pure function, and encapsulating auxiliary tensor memory access handles and layout geometry information through a runtime context, so as to decouple the arithmetic semantics from hardware execution; using a backend normalization layer to semantically reinterpret, at compile time, the accumulator layouts and thread organizations of hardware from different vendors, normalizing them into a unified abstract vector view and global logical coordinates, and calling cross-platform vectorization primitives to mask the differences between backend instruction set architectures; and lowering the arithmetic logic to a sequence of instructions executed within a scalar scope, by first decomposing the abstract vector view into independent scalar elements and generating arithmetic instructions in the scalar dimension, then reassembling the computation results into the vector register, and finally deriving the effective strides and constructing address expressions that handle broadcast and indexing logic, thereby generating fused code that can be integrated into the underlying vendor's computationally intensive kernel.

2. The method according to claim 1, characterized in that capturing, from the compiler front end, operator subgraphs that depend on the output of a matrix multiplication computationally intensive operator comprises: identifying candidate operators that satisfy element-wise semantics and whose data flow starts from the matrix multiplication output, and transforming the operator subgraph into a preset canonical form through algebraic regularization and chain flattening.

3. The method according to claim 1, characterized in that performing constraint checking on the operator subgraphs according to the preset legality rules comprises: verifying, through a shape consistency check, that all operators in the subgraph can be interpreted as point-to-point transformations of the matrix multiplication output tensor; and checking, through affine indexability verification, whether the memory access offsets of the auxiliary tensors in the subgraph can be expressed as affine functions of the global logical coordinates of the matrix multiplication output.

4. The method according to claim 1, characterized in that establishing the unified semantic abstraction, defining the fusion operator as a coordinate-driven pure function, and encapsulating auxiliary tensor memory access handles and layout geometry information through a runtime context so as to decouple the arithmetic semantics from hardware execution comprises: constructing a runtime context container that encapsulates the base address pointers of the auxiliary tensors and the layout strides of the matrix multiplication output, and dynamically resolves the global logical coordinates into the physical memory access offset of each auxiliary tensor at execution time; constructing a function interface based on accumulators and coordinates, in which the hardware thread organizations of different backends are mapped into normalized three-dimensional logical coordinates containing batch, row, and column dimensions and combined with validity predicates for boundary protection and computation task distribution; and adopting a scalarization strategy during the execution phase, in which the vector data in the register accumulator is decomposed into a static scalar loop and, within the scalar domain, a pipelined instruction sequence is executed in order that loads operands from the auxiliary tensors, performs arithmetic and logical transformations, and fills the results back into the vector register, thereby transforming the abstract vector operations into a scalar instruction stream on which the underlying compiler can perform register allocation and re-vectorization.

5. The method according to claim 1, characterized in that using the backend normalization layer to semantically reinterpret, at compile time, the accumulator layouts and thread organizations of hardware from different vendors and normalize them into a unified abstract vector view and global logical coordinates comprises: constructing an accumulator view adapter through template specialization, which semantically reinterprets the opaque memory fragments or register layouts specific to each backend as a unified abstract vector view at compile time, so as to achieve type alignment with zero runtime overhead; projecting, through a local-to-global coordinate mapping mechanism, backend-specific thread-level indices into unified three-dimensional logical coordinates of batch, row, and column; and computing validity predicates from the kernel's tiling configuration and propagating them to the fusion operator interface, so that illegal out-of-bounds accesses are masked on every architecture without introducing explicit branch instructions.

6. The method according to claim 1, characterized in that calling the cross-platform vectorization primitives to mask the differences between backend instruction set architectures comprises: using an abstract vector container as a portable wrapper around hardware-native vectors, so as to decouple the logical vector length from the physical register bit width; calling a predicated vector load primitive that receives the global logical coordinates and validity predicates and, according to the instruction set characteristics of the target architecture, automatically dispatches to the corresponding native load instruction to perform masked memory accesses; and mapping, through vector extraction and compile-time static expansion primitives, the unified fusion description to a scalarized execution flow, thereby enabling the backend compiler to perform re-vectorization and register allocation optimization on different single instruction multiple data or single instruction multiple thread architectures.

7. The method according to claim 1, characterized in that lowering the arithmetic logic to a sequence of instructions executed within a scalar scope, by first decomposing the abstract vector view into independent scalar elements and generating arithmetic instructions in the scalar dimension, comprises: analyzing the shape relationship between each auxiliary tensor and the matrix multiplication output, deriving the effective stride of each dimension, and setting the stride of every broadcast dimension to zero, so as to construct an affine expression for the memory access offset based on the logical coordinates; and, after topological linearization of the operator graph, mapping it to the scalar scope and generating an instruction sequence in static single assignment form using the pipelined pattern of vector decomposition, scalar computation, and vector backfilling.

8. A system for generating cross-architecture computationally intensive fusion operators based on hardware abstraction, characterized in that the system comprises: a subgraph capture and legality check module, used to capture, from the compiler front end, operator subgraphs that depend on the output of a matrix multiplication computationally intensive operator, and to perform constraint checking on the operator subgraphs according to preset legality rules, wherein the legality rules require that the operators in the subgraph be element-wise, that shapes and broadcasting be consistent, and that every access to an auxiliary tensor be affinely indexable from the output coordinates of the computationally intensive operator; a unified semantic abstraction module, used to establish a unified semantic abstraction, define the fusion operator as a coordinate-driven pure function, and encapsulate auxiliary tensor memory access handles and layout geometry information through a runtime context, so as to decouple the arithmetic semantics from hardware execution; a cross-hardware abstraction mapping module, used to semantically reinterpret, by means of a backend normalization layer, the accumulator layouts and thread organizations of hardware from different vendors at compile time, normalize them into a unified abstract vector view and global logical coordinates, and call cross-platform vectorization primitives to mask the differences between backend instruction set architectures; and a code generation module, used to lower the arithmetic logic to a sequence of instructions executed within a scalar scope, by first decomposing the abstract vector view into independent scalar elements and generating arithmetic instructions in the scalar dimension, then reassembling the computation results into the vector register, and finally deriving the effective strides and constructing address expressions that handle broadcast and indexing logic, thereby generating fused code that can be integrated into the underlying vendor's computationally intensive kernel.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 7 are implemented.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.