Method and system for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment

This invention intercepts and parses graphics instruction streams in a heterogeneous computing environment, generates instruction streams executable by the target hardware, and executes them with hardware acceleration. This addresses the limited graphics rendering performance and quality in heterogeneous computing environments and achieves efficient graphics rendering.

CN122288971A · Pending · Publication Date: 2026-06-26 · XINCHUANGQIAO (CHENGDU) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: XINCHUANGQIAO (CHENGDU) TECH CO LTD
Filing Date: 2026-05-22
Publication Date: 2026-06-26

Figure CN122288971A_ABST

Abstract

This invention provides a method and system for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment, relating to the field of computer technology. First, an initial graphics instruction stream that is generated by a target application and conforms to a first graphics interface specification is intercepted at a source computing node. Next, the operating system architecture environment type and graphics hardware capability description information of a target computing node are obtained, and the initial graphics instruction stream is semantically parsed to obtain an instruction stream semantic unit sequence. Based on this sequence, a graphics pipeline intermediate representation structure is constructed and translated to generate a target hardware executable graphics instruction stream, which is transmitted to the target computing node through a cross-node transmission channel. At the target computing node, the graphics processing hardware is invoked to execute the target hardware executable graphics instruction stream and obtain graphics rendering result frame data, which is returned to the source computing node through a return channel. The invention thereby achieves graphics interface redirection and hardware acceleration in a heterogeneous environment.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and more specifically, to a method and system for redirecting and accelerating application graphics interfaces in a heterogeneous computing environment.

Background Technology

[0002] Heterogeneous computing environments are becoming increasingly common. They improve computing performance and efficiency by integrating different types of computing resources, such as central processing units (CPUs) and graphics processing units (GPUs), to meet complex and diverse application requirements. In graphics processing, different operating system architectures and graphics processing hardware sets each have their own graphics interface specifications and instruction sets.

[0003] Traditional graphics processing methods are typically limited to a single computing node: applications interact directly with the local graphics processing hardware to generate and execute graphics instruction streams that complete graphics rendering tasks. In heterogeneous computing environments, however, different computing nodes may run different operating system architectures and be equipped with graphics processing hardware of different types and capabilities. Consequently, an initial graphics instruction stream generated by the target application in conformance with a specific graphics interface specification cannot be executed directly on other computing nodes that have different operating system architectures and graphics processing hardware.

[0004] Existing cross-node graphics processing technologies have several limitations. Some technologies only transfer graphics data and cannot effectively convert and adapt instruction streams between different graphics interface specifications. As a result, graphics instructions cannot be executed correctly on the target computing node, and the acceleration capabilities of the hardware in the heterogeneous computing environment go unused. Other technologies attempt instruction conversion but lack precise analysis of, and adaptation to, differing hardware capabilities, so the converted instruction stream may not use the graphics processing hardware of the target computing node efficiently, degrading graphics rendering performance and quality.

Summary of the Invention

[0005] In view of the aforementioned problems, according to a first aspect, embodiments of the present invention provide a method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment, the method comprising: intercepting, at a source computing node, an initial graphics instruction stream generated by a target application, where the initial graphics instruction stream includes a sequence of graphics drawing commands and a set of graphics resource descriptors that conform to a first graphics interface specification, and the source computing node runs in a first operating system architecture environment and is configured with a first graphics processing hardware set; obtaining the second operating system architecture environment type of a target computing node and the hardware capability description information of its second graphics processing hardware set, where the target computing node and the source computing node are in the same physical network or virtual network layer and the target computing node runs in the second operating system architecture environment; performing, based on the second operating system architecture environment type and the hardware capability description information of the second graphics processing hardware set, instruction stream semantic parsing processing on the initial graphics instruction stream to obtain an instruction stream semantic unit sequence; and constructing, based on the instruction stream semantic unit sequence, a graphics pipeline intermediate representation structure containing a drawing state machine description block and a resource binding relationship description block.
Based on the hardware instruction encoding rules of the second graphics processing hardware set, hardware instruction translation processing is performed on the graphics pipeline intermediate representation structure to generate a target hardware executable graphics instruction stream. The target hardware executable graphics instruction stream is transmitted from the source computing node to the target computing node through a cross-node transmission channel, which is jointly constructed from a network transmission protocol stack and a shared memory mapping space and supports binary data exchange between the first operating system architecture environment and the second operating system architecture environment. The second graphics processing hardware set is invoked on the target computing node to perform hardware-accelerated execution of the target hardware executable graphics instruction stream to obtain graphics rendering result frame data, and the graphics rendering result frame data is returned from the target computing node through a return channel to the frame buffer display area of the source computing node.

[0006] Furthermore, embodiments of the present invention also provide an application graphics interface redirection and hardware acceleration system for heterogeneous computing environments, comprising: a processor; and a machine-readable storage medium storing machine-executable instructions for the processor; wherein the processor is configured to perform the above-described application graphics interface redirection and hardware acceleration method in a heterogeneous computing environment by executing the machine-executable instructions.

[0007] Based on the above, the initial graphics instruction stream generated by the target application and conforming to the first graphics interface specification is first intercepted at the source computing node. The operating system architecture environment type of the target computing node and the hardware capability description information of its graphics processing hardware set are obtained, and instruction stream semantic parsing processing is performed on the initial graphics instruction stream so that its semantics are resolved into an instruction stream semantic unit sequence. Based on this sequence, a graphics pipeline intermediate representation structure is constructed, and hardware instruction translation processing is performed according to the hardware instruction encoding rules of the target computing node's graphics processing hardware set to generate the target hardware executable graphics instruction stream. Because this conversion from source graphics instructions to target hardware executable instructions takes the characteristics and instruction encoding rules of the target hardware into account, the generated instruction stream can execute efficiently on the target computing node and make full use of the target hardware's acceleration capabilities. The target hardware executable graphics instruction stream is then transmitted from the source computing node to the target computing node through the cross-node transmission channel. This channel is jointly constructed from a network transmission protocol stack and a shared memory mapping space and supports binary data exchange between different operating system architecture environments, providing stable and efficient transmission of the instruction stream between computing nodes.
Finally, the graphics processing hardware set of the target computing node is invoked to perform hardware-accelerated execution of the target hardware executable graphics instruction stream, producing graphics rendering result frame data that is returned to the frame buffer display area of the source computing node through the return channel. This realizes application graphics interface redirection and hardware acceleration in a heterogeneous computing environment and improves graphics rendering performance and quality.

Attached Figure Description

[0008] Figure 1 is a schematic diagram of the execution flow of the application graphics interface redirection and hardware acceleration method in a heterogeneous computing environment provided in the embodiments of the present invention.

[0009] Figure 2 is a schematic diagram of exemplary hardware and software components of the application graphics interface redirection and hardware acceleration system in a heterogeneous computing environment provided in the embodiments of the present invention.

Detailed Implementation

[0010] Figure 1 is a flowchart of the application graphics interface redirection and hardware acceleration method in a heterogeneous computing environment provided by an embodiment of the present invention. The steps are described in detail below.

[0011] Step S110: Intercept the initial graphics instruction stream generated by the target application at the source computing node. The initial graphics instruction stream includes a sequence of graphics drawing commands and a set of graphics resource descriptors that conform to the first graphics interface specification. The source computing node runs on a first operating system architecture environment and is configured with a first set of graphics processing hardware.

[0012] In a heterogeneous computing system deployed in a hybrid cloud environment, the source computing node runs in a first operating system architecture environment. This environment can be a first operating system based on a first instruction set architecture, such as a first operating system running on a Reduced Instruction Set Computing (RISC) CPU. The source computing node is configured with a first graphics processing hardware set, such as a graphics card equipped with a graphics processor of a first brand and model. Step S110 intercepts the initial graphics instruction stream generated by the target application on the source computing node. This initial graphics instruction stream contains a sequence of graphics drawing commands and a set of graphics resource descriptors that conform to a first graphics interface specification, such as a first graphics library application programming interface specification. The core feature of this heterogeneous computing environment is that the source computing node and the target computing node (described later) can run in different operating system architecture environments (e.g., the source node runs a first operating system while the target node runs a second operating system), can be configured with CPUs of different instruction set architectures (e.g., a RISC CPU on the source node and a Complex Instruction Set Computing (CISC) CPU on the target node), and can be equipped with graphics processing hardware of different brands or models.

[0013] Step S111: Implant an instruction stream interception intermediate layer between the graphics interface runtime library and the system graphics driver layer of the source computing node. The instruction stream interception intermediate layer is embedded in the software stack of the source computing node as either an operating system kernel module or a user-mode dynamic link library.

[0014] In the software stack of the source computing node, an instruction stream interception intermediate layer is implanted between the graphics interface runtime library and the system graphics driver layer. This layer is either implemented as an operating system kernel module that is dynamically inserted into the kernel address space with a kernel module loading command, or implemented as a user-mode dynamic link library together with a preloading environment variable that ensures it is loaded first when the target application starts. During initialization, the instruction stream interception intermediate layer obtains the addresses of all graphics application programming interface functions in the graphics interface runtime library and replaces them with the addresses of the corresponding stub functions in the interception layer, thereby establishing interception points for graphics application programming interface calls.
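The address-replacement step can be modeled in a few lines. The following is a minimal Python sketch, not the patent's implementation: it assumes the runtime library's exported functions can be represented as a dictionary from API name to callable. A real layer would patch native function addresses, for example from a shared library preloaded through `LD_PRELOAD` on Linux, or from a kernel module. `InterceptionLayer` and `install` are hypothetical names.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for the runtime library's export table:
# API name -> function "address" (modeled here as a Python callable).
RuntimeTable = Dict[str, Callable[..., Any]]


class InterceptionLayer:
    """Sketch of the interception layer's initialization phase."""

    def __init__(self) -> None:
        self.originals: RuntimeTable = {}          # saved original entries
        self.calls: List[Tuple[str, tuple]] = []   # every intercepted call, in order

    def _make_stub(self, name: str) -> Callable[..., Any]:
        def stub(*args: Any) -> Any:
            self.calls.append((name, args))        # interception point
            return self.originals[name](*args)     # forward to the real function
        return stub

    def install(self, table: RuntimeTable) -> None:
        # Save every original entry, then overwrite it with the stub.
        for name, fn in list(table.items()):
            self.originals[name] = fn
            table[name] = self._make_stub(name)
```

After `install`, callers that look functions up in the table reach the stub transparently, which is the effect the preloaded library achieves at the symbol-resolution level.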

[0015] Step S112: When the target application calls the graphics application programming interface function of the first graphics interface specification, the instruction stream interception intermediate layer captures the input parameter data structure and the output return value data structure of the graphics application programming interface function. The input parameter data structure includes the drawing command opcode, vertex data buffer pointer, index data buffer pointer, texture sampler handle and shader program handle.

[0016] When the target application calls a graphics application programming interface function conforming to the first graphics interface specification, execution is redirected to the stub function installed in step S111. The stub function saves the current CPU register state and extracts the input parameter data structure from the call stack according to the application binary interface layout defined by the first graphics interface specification. The input parameter data structure contains a drawing command opcode identifying the type of the call, such as drawing a triangle mesh or drawing indexed primitives, as well as the vertex data buffer pointer, the index data buffer pointer, the texture sampler handle, and the shader program handle. The stub function also reserves storage space for the function's return value and captures the output return value data structure after the original function has executed.
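A sketch of the captured record follows, under the assumption that the stub receives the parameters already decoded into a structure; a native stub would instead read them from registers and the stack according to the ABI. `InputParams`, `CapturedCall`, and `capturing_stub` are illustrative names, and the field set mirrors the five items listed in step S112.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class InputParams:
    """Hypothetical layout of the captured input parameter data structure."""
    opcode: int            # drawing command opcode
    vertex_buf_ptr: int    # vertex data buffer pointer
    index_buf_ptr: int     # index data buffer pointer
    sampler_handle: int    # texture sampler handle
    shader_handle: int     # shader program handle


@dataclass
class CapturedCall:
    params: InputParams
    return_value: Optional[Any] = None   # filled in after the original returns


def capturing_stub(original: Callable[[InputParams], Any],
                   log: List[CapturedCall]) -> Callable[[InputParams], Any]:
    def stub(params: InputParams) -> Any:
        record = CapturedCall(params)    # reserve the return-value slot first
        log.append(record)
        record.return_value = original(params)
        return record.return_value
    return stub
```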

[0017] Step S113: Store the drawing command operation codes contained in the input parameter data structure as a graphics drawing command sequence unit according to the calling sequence. Each drawing command element in the graphics drawing command sequence unit retains the original timing relationship and command dependency relationship when the target application initiates the call.

[0018] The instruction stream interception intermediate layer maintains a first-in-first-out (FIFO) queue in memory. For each intercepted graphics application programming interface (API) call, the drawing command opcode is extracted from the input parameter data structure. This opcode, along with the current call sequence number counter and the dependency identifier of the preceding drawing command opcode, is encapsulated into a drawing command element. This drawing command element is then appended to the FIFO queue in the order of the calls, forming a graphics drawing command sequence unit. The sequence number counter records the original timing relationship when the target application initiates the call, and the dependency identifier records the framebuffer read / write dependencies between different drawing command opcodes.
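The FIFO of drawing command elements might look like the following Python sketch. The dependency rule is a simplifying assumption: a command flagged as reading the framebuffer is taken to depend on the immediately preceding command. The text does not specify how framebuffer read/write dependencies are detected, and `CommandSequenceUnit` is a hypothetical name.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class DrawCommandElement:
    seq: int                   # call sequence number (original timing)
    opcode: int                # drawing command opcode
    depends_on: Optional[int]  # seq of the preceding command it depends on, if any


class CommandSequenceUnit:
    def __init__(self) -> None:
        self.queue: Deque[DrawCommandElement] = deque()
        self._counter = 0          # call sequence number counter

    def append(self, opcode: int, reads_framebuffer: bool) -> DrawCommandElement:
        # Assumption: a framebuffer read depends on the previous command,
        # which may have written the framebuffer.
        prev = self.queue[-1].seq if (self.queue and reads_framebuffer) else None
        elem = DrawCommandElement(self._counter, opcode, prev)
        self._counter += 1
        self.queue.append(elem)    # FIFO: append at the tail in call order
        return elem
```

Consumers take elements from the head with `queue.popleft()`, so processing order matches call order.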

[0019] Step S114: Extract the vertex data memory block pointed to by the vertex data buffer pointer, the index data memory block pointed to by the index data buffer pointer, the texture image data block associated with the texture sampler handle, and the shader binary code block associated with the shader program handle contained in the input parameter data structure, and encapsulate the vertex data memory block, the index data memory block, the texture image data block, and the shader binary code block into the graphics resource descriptor set.

[0020] Using the vertex data buffer pointer, the instruction stream interception intermediate layer calls the cross-process memory access function provided by the operating system kernel to read the vertex data memory block from the target application's process address space. It reads the index data memory block in the same way using the index data buffer pointer. Using the texture sampler handle, it obtains the memory address and data size of the texture image data through the texture object query interface of the graphics interface runtime library and reads the texture image data block. Using the shader program handle, it reads the shader binary code block through the shader program binary retrieval interface of the graphics interface runtime library. The four data blocks, together with their respective data type identifiers, data sizes, and memory layout format information, are encapsulated into the graphics resource descriptor set data structure.
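A minimal sketch of packing the copied blocks into a descriptor set follows. It assumes the target process's address space can be modeled as a single `bytes` object; on Linux a real implementation could use a cross-process read such as `process_vm_readv`. `ResourceType`, `ResourceDescriptor`, `read_block`, `build_descriptor_set`, and the layout strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class ResourceType(Enum):
    VERTEX = "vertex"
    INDEX = "index"
    TEXTURE = "texture"
    SHADER = "shader"


@dataclass
class ResourceDescriptor:
    rtype: ResourceType   # data type identifier
    address: int          # starting address in the target application's process
    data: bytes           # copied contents of the block
    layout: str           # memory layout format, e.g. "f32x3" (illustrative)

    @property
    def size(self) -> int:
        return len(self.data)


def read_block(memory: bytes, address: int, size: int) -> bytes:
    # Stand-in for the kernel's cross-process memory read; the target process
    # address space is modeled as a single bytes object starting at address 0.
    if address < 0 or address + size > len(memory):
        raise ValueError("read outside the process address space")
    return memory[address:address + size]


def build_descriptor_set(memory: bytes,
                         requests: List[Tuple[ResourceType, int, int, str]]
                         ) -> Dict[int, ResourceDescriptor]:
    """requests holds (type, address, size, layout); returns descriptor id -> entry."""
    return {
        i: ResourceDescriptor(rtype, addr, read_block(memory, addr, size), layout)
        for i, (rtype, addr, size, layout) in enumerate(requests)
    }
```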

[0021] Step S115: Establish a resource reference binding relationship between each drawing command element in the graphics drawing command sequence unit and the corresponding resource descriptor entry in the graphics resource descriptor set. The resource reference binding relationship is used to maintain the consistency of association between drawing commands and graphics resources in the subsequent instruction stream processing stage.

[0022] Iterate through each drawing command element in the graphics drawing command sequence unit. For each drawing command element, parse its opcode and parameters to identify the vertex data memory block, index data memory block, texture image data block, and shader binary code block referenced by the drawing command element. Search the graphics resource descriptor set for the resource descriptor entry corresponding to the aforementioned memory block or code block. Add a resource reference binding record field to the drawing command element, which contains a pointer or index identifier to the corresponding resource descriptor entry. After all bindings are completed, each drawing command element is associated with its dependent graphics resource descriptor entry, forming a resource reference binding relationship.
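The binding pass amounts to a lookup from each referenced block to a descriptor index. The sketch below assumes each drawing command records which resource kinds it uses together with the address or handle of each, and that descriptors are identified by the pair (kind, address); both are illustrative assumptions, and `bind_resources` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ResourceDescriptor:
    kind: str      # "vertex", "index", "texture" or "shader"
    address: int   # block start address, or the handle for textures/shaders


@dataclass
class DrawCommandElement:
    opcode: int
    referenced: Dict[str, int]                          # kind -> address/handle used by the call
    bindings: List[int] = field(default_factory=list)  # resource reference binding record


def bind_resources(commands: List[DrawCommandElement],
                   descriptors: List[ResourceDescriptor]) -> None:
    # Index descriptors by (kind, address) so each lookup is O(1).
    lookup = {(d.kind, d.address): i for i, d in enumerate(descriptors)}
    for cmd in commands:
        for kind, addr in cmd.referenced.items():
            try:
                cmd.bindings.append(lookup[(kind, addr)])
            except KeyError:
                raise LookupError(f"no descriptor for {kind} at {addr:#x}") from None
```

Storing indices rather than object references keeps the binding record serializable, which matters once the stream is written into the encapsulation container in step S116.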

[0023] Step S116: Perform data integrity encapsulation processing on the graphics drawing command sequence unit and the set of graphics resource descriptors carrying the resource reference binding relationship, and attach a metadata header containing an interception timestamp, an application process identifier, and a source computing node hardware architecture identifier to generate a complete initial graphics instruction stream. The initial graphics instruction stream is temporarily stored in the memory buffer queue of the source computing node. The memory buffer queue adopts a circular queue management strategy to coordinate the timing difference between the graphics instruction generation rate of the target application and the processing rate of subsequent instruction stream semantic parsing processing.

[0024] Create a data encapsulation container. Serialize the graphics drawing command sequence unit into a binary data block and write it to the container. Serialize the graphics resource descriptor set carrying the resource reference binding relationship into another binary data block and write it to the container. Generate a metadata header containing an interception timestamp that records the current system clock tick count, a process identifier that uniquely identifies the target application, and a source computing node hardware architecture identifier that identifies the CPU architecture type of the source computing node (e.g., a Reduced Instruction Set Computing architecture or a Complex Instruction Set Computing architecture). Prepend the metadata header to the container to form the complete initial graphics instruction stream. Allocate a circular queue buffer containing a fixed number of buffer slots in the memory of the source computing node. Write the complete initial graphics instruction stream to the buffer slot currently pointed to by the write pointer, then advance the write pointer to the next slot; when the write pointer reaches the end of the buffer, it wraps around to the beginning. When the write pointer catches up with the read pointer, the buffer is full; at that point, either graphics instruction generation in the target application is blocked or the oldest unprocessed instruction stream is discarded. This reconciles the rate at which the target application generates graphics instructions with the rate at which subsequent instruction stream semantic parsing consumes them.
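The metadata header and circular queue could be sketched as follows. The header layout (`<QIB`: 64-bit timestamp, 32-bit process ID, 8-bit architecture ID, little-endian, no padding), the length-prefixed payload, and the architecture ID values are assumptions for illustration. The ring buffer keeps an explicit element count so that a full buffer (write pointer has caught up with the read pointer) can be told apart from an empty one, and it supports both policies named in the text: refuse the write so the producer blocks, or drop the oldest stream.

```python
import struct
import time
from typing import List, Optional

# Hypothetical header layout: u64 timestamp, u32 process id, u8 architecture id.
HEADER = struct.Struct("<QIB")
ARCH_RISC, ARCH_CISC = 1, 2   # illustrative architecture identifiers


def encapsulate(commands: bytes, descriptors: bytes, pid: int, arch: int) -> bytes:
    header = HEADER.pack(time.monotonic_ns(), pid, arch)
    # Length-prefix both serialized blocks so a parser can split them again.
    sizes = struct.pack("<II", len(commands), len(descriptors))
    return header + sizes + commands + descriptors


class RingBuffer:
    def __init__(self, slots: int, drop_oldest: bool = False) -> None:
        self.buf: List[Optional[bytes]] = [None] * slots
        self.read = self.write = self.count = 0
        self.drop_oldest = drop_oldest

    def push(self, stream: bytes) -> bool:
        if self.count == len(self.buf):      # write pointer caught up with read pointer
            if not self.drop_oldest:
                return False                 # caller should block and retry
            self.pop()                       # discard the oldest unprocessed stream
        self.buf[self.write] = stream
        self.write = (self.write + 1) % len(self.buf)   # wrap around at the end
        self.count += 1
        return True

    def pop(self) -> Optional[bytes]:
        if self.count == 0:
            return None
        stream, self.buf[self.read] = self.buf[self.read], None
        self.read = (self.read + 1) % len(self.buf)
        self.count -= 1
        return stream
```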

[0025] Step S120: Obtain the second operating system architecture environment type and the hardware capability description information of the second graphics processing hardware set of the target computing node, and, based on this information, perform instruction stream semantic parsing processing on the initial graphics instruction stream to obtain the instruction stream semantic unit sequence. The target computing node and the source computing node are in the same physical network or virtual network layer, and the target computing node runs in the second operating system architecture environment.

[0026] Step S120 obtains the second operating system architecture environment type and the hardware capability description information of the second graphics processing hardware set of the target computing node. The target computing node and the source computing node are located in the same physical network or virtual network layer, and the target computing node runs in a second operating system architecture environment. This environment can be a second operating system based on a second instruction set architecture, such as a second operating system running on a Complex Instruction Set Computing (CISC) CPU. The first and second operating system architecture environments can differ in both operating system and CPU instruction set architecture. Based on the obtained environment type and hardware capability description information, step S120 performs instruction stream semantic parsing processing on the initial graphics instruction stream generated in step S110 to obtain an instruction stream semantic unit sequence.

[0027] Step S121: Receive in advance, from the target computing node, the second operating system architecture environment type identifier and the hardware capability description information set of the second graphics processing hardware set. The hardware capability description information set includes the graphics processor core architecture code name, the highest supported graphics interface specification version number, the maximum number of texture samplers, the number of available unified computing architecture cores, and the dedicated graphics memory capacity.

[0028] During the initial connection establishment phase between the source and target compute nodes, the source compute node sends a capability query request message to the target compute node. Upon receiving this request message, the target compute node invokes its local Hardware Abstraction Layer (HAL) interface to obtain a second operating system architecture environment type identifier. This identifier identifies the type of operating system (e.g., a second operating system or a variant of the second operating system) and the CPU instruction set architecture type (e.g., Complex Instruction Set Architecture or Reduced Instruction Set Computing Machine Architecture). The target compute node also obtains a set of hardware capability description information for the second graphics processing hardware set by invoking the graphics driver interface. This set of hardware capability description information includes the graphics processor core architecture codename (identifying the generation of the graphics processor microarchitecture), the highest supported graphics library version number, the highest supported graphics interface specification version number, the maximum number of texture samplers that a single texture sampler unit can bind simultaneously, the number of unified computing architecture cores available for general-purpose computing tasks in the graphics processor, and the dedicated graphics memory capacity (identifying the onboard dedicated video memory capacity of the graphics processor). The target compute node encapsulates this information into a capability response message and returns it to the source compute node via a Transmission Control Protocol (TCP) connection.
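One possible encoding of the capability response message is shown below. The text states only that the message travels over a TCP connection; the JSON body, the 4-byte big-endian length prefix used to frame it on the byte stream, and all field names are assumptions.

```python
import json
import struct
from dataclasses import asdict, dataclass


@dataclass
class CapabilityResponse:
    os_type: str                       # second OS architecture environment type
    cpu_isa: str                       # e.g. "cisc" or "risc"
    gpu_arch_codename: str             # graphics processor microarchitecture generation
    max_graphics_library_version: str
    max_graphics_api_version: str
    max_texture_samplers: int
    compute_cores: int                 # unified computing architecture cores
    vram_bytes: int                    # dedicated graphics memory capacity

    def encode(self) -> bytes:
        body = json.dumps(asdict(self)).encode("utf-8")
        # A 4-byte network-order length prefix frames the message on the TCP stream.
        return struct.pack("!I", len(body)) + body

    @classmethod
    def decode(cls, frame: bytes) -> "CapabilityResponse":
        (length,) = struct.unpack_from("!I", frame, 0)
        body = frame[4:4 + length]
        if len(body) != length:
            raise ValueError("truncated capability response")
        return cls(**json.loads(body.decode("utf-8")))
```

Framing is needed because TCP delivers a byte stream without message boundaries; the receiver reads the prefix first, then exactly that many bytes.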

[0029] Step S122: Load an instruction stream semantic parser instance that matches the second operating system architecture environment type according to the second operating system architecture environment type identifier. The instruction stream semantic parser instance is pre-set with memory alignment rule sets, data byte order conversion rule sets and system call convention rule sets corresponding to different operating system architecture environments.

[0030] After receiving the capability response message returned in step S121, the source computing node extracts the second operating system architecture environment type identifier. The source computing node internally maintains a registry of instruction stream semantic parser instances that maps operating system architecture environment type identifiers to the corresponding parser instances. Using the second operating system architecture environment type identifier, the source computing node looks up the registry and loads the matching instruction stream semantic parser instance. This parser instance is preconfigured with memory alignment rule sets, data byte order conversion rule sets, and system call convention rule sets for different operating system architecture environments. The memory alignment rule set defines, for each operating system architecture environment, the byte boundary to which the starting address of each basic data type (e.g., integers, floating-point numbers, pointers) must be aligned in memory. The data byte order conversion rule set defines, for each environment, whether the most significant byte of multi-byte data is stored at the lower or the higher address. The system call convention rule set defines whether function parameters are passed in registers or on the stack, and the stack frame layout.
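The registry can be pictured as a map from (operating system type, instruction set) to a bundle of rule sets. In this sketch the three rule sets are reduced to an alignment table, a byte order tag, and a count of register-passed arguments; these fields, the example values, and all names are hypothetical simplifications of what the text describes.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ParserRules:
    alignment: Dict[str, int]   # basic type -> required start-address alignment (bytes)
    byte_order: str             # "little" or "big"
    args_in_registers: int      # leading integer args passed in registers; rest on the stack


class ParserRegistry:
    def __init__(self) -> None:
        self._parsers: Dict[Tuple[str, str], ParserRules] = {}

    def register(self, os_type: str, isa: str, rules: ParserRules) -> None:
        self._parsers[(os_type, isa)] = rules

    def load(self, os_type: str, isa: str) -> ParserRules:
        try:
            return self._parsers[(os_type, isa)]
        except KeyError:
            raise LookupError(f"no semantic parser for {os_type}/{isa}") from None
```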

[0031] Step S123: Perform byte order conversion processing on all binary data fields in the initial graphics instruction stream according to the data byte order conversion rule set configured in the instruction stream semantic parser instance. If the first operating system architecture environment and the second operating system architecture environment adopt different data byte order representation methods, convert the multi-byte data fields in the initial graphics instruction stream from the first byte order representation format to the second byte order representation format.

[0032] The instruction stream semantic parser instance reads the source compute node hardware architecture identifier from the metadata header of the initial graphics instruction stream. Combined with the second operating system architecture environment type identifier, it determines whether the first and second operating system architecture environments use the same data byte order representation. If they use different data byte order representations—for example, the first operating system architecture environment uses little-endian byte order with the least significant byte stored at the lowest address, while the second operating system architecture environment uses big-endian byte order with the most significant byte stored at the lowest address—then the instruction stream semantic parser instance traverses all binary data fields in the initial graphics instruction stream. For each multi-byte data field, according to the byte order conversion algorithm in the data byte order conversion rule set, the byte order of the multi-byte data field is reversed from the first byte order representation format and rearranged into the second byte order representation format. For single-byte data fields, the original value remains unchanged.
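Byte order conversion for a stream with a known field layout is shown below as a minimal sketch. It assumes the list of field sizes is available from the stream's format description; multi-byte fields are reversed and single-byte fields are copied unchanged, as in the text, and nothing is converted when both environments use the same byte order. `swap_u32_array` shows the bulk case for a homogeneous array of 32-bit values (for example a 32-bit index buffer) using the standard `struct` module. All function names are hypothetical.

```python
import struct
from typing import List


def swap_fields(data: bytes, field_sizes: List[int]) -> bytes:
    """Reverse the byte order of every multi-byte field; keep single-byte fields."""
    if sum(field_sizes) != len(data):
        raise ValueError("field layout does not cover the buffer exactly")
    out = bytearray()
    offset = 0
    for size in field_sizes:
        chunk = data[offset:offset + size]
        out += chunk[::-1] if size > 1 else chunk
        offset += size
    return bytes(out)


def convert(data: bytes, field_sizes: List[int], src_order: str, dst_order: str) -> bytes:
    # Only convert when the two environments use different byte orders.
    return data if src_order == dst_order else swap_fields(data, field_sizes)


def swap_u32_array(data: bytes) -> bytes:
    """Little-endian uint32 array -> big-endian uint32 array."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of 4")
    n = len(data) // 4
    return struct.pack(f">{n}I", *struct.unpack(f"<{n}I", data))
```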

[0033] Step S124: Based on the memory alignment rule set configured in the instruction stream semantic parser instance, perform memory alignment rearrangement processing on the initial graphics instruction stream after byte order conversion, and adjust the starting memory addresses of the vertex data memory block, index data memory block and texture image data block contained in the initial graphics instruction stream to aligned memory addresses that meet the memory alignment granularity requirements of the second operating system architecture environment.

[0034] The instruction stream semantic parser instance obtains the memory alignment granularity value of the second operating system architecture environment from the memory alignment rule set. For the initial graphics instruction stream after the byte order conversion processing in step S123, it parses the starting memory address and data size of the vertex data memory block, index data memory block, and texture image data block recorded in its graphics resource descriptor set. If the starting memory address of a data block is not an integer multiple of the memory alignment granularity value, the number of bytes to be offset is calculated to make the new starting memory address an integer multiple of the memory alignment granularity value. The corresponding number of padding bytes are filled before the data block, and the data block content is shifted backward. At the same time, the starting memory address field and memory layout format information field of the corresponding resource descriptor entry in the graphics resource descriptor set are updated. After traversing all data blocks and completing the above alignment adjustment, the initial graphics instruction stream after memory alignment rearrangement is output.
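The padding computation reduces to rounding each block's start up to the next multiple of the alignment granularity g, which inserts (-p) mod g zero bytes before the block when the current write offset is p. The sketch assumes the stream consists solely of the listed data blocks, so it rebuilds the buffer from them in address order and returns updated descriptors; `align_up`, `realign`, and `BlockDescriptor` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlockDescriptor:
    name: str
    address: int   # starting offset of the block in the stream
    size: int


def align_up(addr: int, granularity: int) -> int:
    # Smallest multiple of `granularity` that is >= addr.
    return -(-addr // granularity) * granularity


def realign(stream: bytes, blocks: List[BlockDescriptor],
            granularity: int) -> Tuple[bytes, List[BlockDescriptor]]:
    """Rebuild the stream so every block starts at a multiple of `granularity`,
    inserting zero padding before blocks and updating their descriptors."""
    out = bytearray()
    new_blocks = []
    for b in sorted(blocks, key=lambda blk: blk.address):
        start = align_up(len(out), granularity)
        out += bytes(start - len(out))                  # padding bytes before the block
        out += stream[b.address:b.address + b.size]     # block content shifted backward
        new_blocks.append(BlockDescriptor(b.name, start, b.size))
    return bytes(out), new_blocks
```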

[0035] Step S125: Perform command sequence destructuring on the graphics drawing command sequence units in the initial graphics instruction stream after memory alignment and rearrangement, and extract the drawing command opcode, associated vertex data range descriptor, associated index data range descriptor and associated texture state binding descriptor of each drawing command element in the graphics drawing command sequence unit.

[0036] The instruction stream semantic parser instance traverses the graphics drawing command sequence units in the initial graphics instruction stream produced by the memory alignment rearrangement in step S124. For each drawing command element in a graphics drawing command sequence unit, it parses the element's binary representation and extracts the drawing command opcode according to the command format defined in the first graphics interface specification. It also follows the pointer or index identifier in the element's resource reference binding record field to the corresponding resource descriptor entries in the graphics resource descriptor set. The associated vertex data range descriptor is extracted from the resource descriptor entry for the vertex data memory block and contains the block's starting address, vertex count, and vertex stride. The associated index data range descriptor is extracted from the resource descriptor entry for the index data memory block and contains the block's starting address, index count, and index type. The associated texture state binding descriptor is extracted from the resource descriptor entry for the texture image data block and contains the texture sampler identifier, texture addressing mode, and texture filtering method.
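
To illustrate the destructuring step, the sketch below decodes one drawing command element and follows its index identifiers into the graphics resource descriptor set. The 8-byte record layout (`DRAW_RECORD`: an opcode followed by three 16-bit descriptor indices) and the descriptor tuple types are assumptions made for illustration. The first graphics interface specification defines the real format.

```python
import struct
from typing import NamedTuple

class VertexRange(NamedTuple):
    start: int          # starting address of the vertex data memory block
    vertex_count: int
    stride: int         # bytes between consecutive vertices

class IndexRange(NamedTuple):
    start: int
    index_count: int
    index_type: str     # e.g. "u16" or "u32"

class TextureBinding(NamedTuple):
    sampler_id: int
    addressing_mode: str
    filter_mode: str

# Hypothetical record: opcode, then the descriptor-set indices of the vertex,
# index, and texture resources (four little-endian unsigned 16-bit values).
DRAW_RECORD = struct.Struct("<4H")

def destructure(record: bytes, descriptor_set: list):
    """Return (opcode, vertex range, index range, texture binding)."""
    opcode, v_idx, i_idx, t_idx = DRAW_RECORD.unpack(record)
    vertex = descriptor_set[v_idx]
    index = descriptor_set[i_idx]
    texture = descriptor_set[t_idx]
    if not (isinstance(vertex, VertexRange) and isinstance(index, IndexRange)
            and isinstance(texture, TextureBinding)):
        raise TypeError("resource reference points at the wrong descriptor kind")
    return opcode, vertex, index, texture
```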

[0037] Step S126: Encapsulate the drawing command opcode, the associated vertex data range descriptor, the associated index data range descriptor, and the associated texture state binding descriptor into an independent instruction stream semantic unit, and arrange the instruction stream semantic unit sequentially according to the original timing relationship in the graphics drawing command sequence unit to form the instruction stream semantic unit sequence.

[0038] For each drawing command element, the drawing command opcode, associated vertex data range descriptor, associated index data range descriptor, and associated texture state binding descriptor extracted in step S125 are combined and encapsulated into an independent data structure, the instruction stream semantic unit. The instruction stream semantic unit also retains the sequence counter value of the drawing command element in the original graphics drawing command sequence unit. After all drawing command elements have been processed, the resulting instruction stream semantic units are sorted in ascending order of their stored sequence counter values, forming an ordered instruction stream semantic unit sequence.
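
The encapsulation and ordering of step S126 amount to building one record per drawing command element and sorting by the retained sequence counter. A minimal sketch, with a hypothetical `SemanticUnit` dataclass whose fields mirror the four extracted items:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SemanticUnit:
    seq: int                # sequence counter value from the original command sequence
    opcode: int             # drawing command opcode
    vertex_range: tuple     # (start, vertex count, stride)
    index_range: tuple      # (start, index count, index type)
    texture_binding: tuple  # (sampler id, addressing mode, filter mode)

def build_sequence(units):
    """Order the semantic units by ascending sequence counter value."""
    return sorted(units, key=lambda u: u.seq)
```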

[0039] Step S127: Based on the highest supported graphics interface specification version number, the maximum number of texture samplers, and the number of available unified computing architecture cores in the hardware capability description information set, perform hardware capability adaptation and pruning on the instruction stream semantic unit sequence, removing or replacing opcode elements and resource binding elements that exceed the hardware capability range of the second graphics processing hardware set.

[0040] The instruction stream semantic parser instance reads the highest supported graphics interface specification version number, the maximum number of texture samplers, and the number of available unified computing architecture cores from the hardware capability description information set received in step S121, and iterates through each instruction stream semantic unit in the instruction stream semantic unit sequence. For each drawing command opcode, it checks whether the lowest graphics interface specification version required by the corresponding graphics function is less than or equal to the highest supported version number. If the required version is higher, the opcode is marked as exceeding the hardware capability range and, according to a preset degradation strategy, is either replaced with an equivalent sequence of opcodes or removed together with its instruction stream semantic unit. For the associated texture state binding descriptor, the parser checks whether the number of simultaneously bound texture samplers is less than or equal to the maximum number of texture samplers. If the limit is exceeded, the binding is split across multiple instruction stream semantic units, each binding no more than the maximum number of texture samplers. For drawing command opcodes involving general-purpose computing, it assesses, based on the number of available unified computing architecture cores, whether the granularity of the computing task needs to be adjusted. After this pruning is complete, the adapted instruction stream semantic unit sequence is output.
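
A sketch of the version check, degradation, and sampler splitting, under stated assumptions: the `MIN_VERSION` and `DEGRADE` tables are hypothetical examples of the preset degradation strategy, versions are compared as (major, minor) tuples, and the core-count granularity adjustment is omitted.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Unit:
    seq: int
    opcode: str
    samplers: tuple  # texture sampler identifiers bound by this unit

# Hypothetical capability tables: minimum interface version per opcode, and a
# degradation strategy mapping unsupported opcodes to equivalent sequences.
MIN_VERSION = {"draw": (1, 0), "set_state": (1, 0), "draw_indirect": (4, 3), "mesh_draw": (4, 6)}
DEGRADE = {"draw_indirect": ("set_state", "draw")}

def prune(units, max_version, max_samplers):
    out = []
    for u in units:
        if MIN_VERSION[u.opcode] <= max_version:
            candidates = [u]
        elif u.opcode in DEGRADE:
            candidates = [replace(u, opcode=op) for op in DEGRADE[u.opcode]]
        else:
            continue  # no equivalent sequence: drop the unit
        for c in candidates:
            # Split sampler bindings into chunks no larger than the hardware limit.
            chunks = [c.samplers[k:k + max_samplers]
                      for k in range(0, len(c.samplers), max_samplers)] or [()]
            out.extend(replace(c, samplers=chunk) for chunk in chunks)
    return out
```

With a version limit of (3, 3) and at most two samplers, a draw binding five samplers becomes three units, `draw_indirect` is degraded to `set_state` plus `draw`, and `mesh_draw` is dropped.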

[0041] Step S130: Construct a graphics pipeline intermediate representation structure containing a drawing state machine description block and a resource binding relationship description block based on the instruction stream semantic unit sequence, and perform hardware instruction translation processing on the graphics pipeline intermediate representation structure based on the hardware instruction encoding rules of the second graphics processing hardware set to generate a target hardware executable graphics instruction stream.

[0042] Step S130 constructs an intermediate representation structure for the graphics pipeline based on the instruction stream semantic unit sequence obtained in step S120. The intermediate representation structure for the graphics pipeline includes a drawing state machine description block and a resource binding relationship description block. Then, step S130 performs hardware instruction translation processing on the intermediate representation structure for the graphics pipeline based on the hardware instruction encoding rules of the second graphics processing hardware set to generate a target hardware executable graphics instruction stream.

[0043] Step S131: Create a blank graphics pipeline intermediate representation structure container, which consists of a state machine description block storage area, a resource binding relationship description block storage area, and a pipeline stage dependency relationship graph storage area.

[0044] A blank graphics pipeline intermediate representation structure container is allocated in the memory of the source compute node. This blank graphics pipeline intermediate representation structure container contains three independent memory areas: a drawing state machine description block storage area, used to store a series of drawing state machine description blocks describing the state transitions of the graphics pipeline; a resource binding relationship description block storage area, used to store resource binding relationship description blocks describing the binding relationships between graphics resources and pipeline stages; and a pipeline stage dependency graph storage area, used to store directed graph structure data, which represents the data dependencies and execution order between the processing stages in the graphics pipeline.

[0045] Step S132: Starting from the first instruction stream semantic unit, traverse the instruction stream semantic unit sequence in order and perform a drawing state machine tracking and update operation on each instruction stream semantic unit. When the drawing command opcode in the instruction stream semantic unit is a state setting type opcode, update the corresponding state register entry in the drawing state machine description block storage area. When the drawing command opcode is a drawing call type opcode, encapsulate the complete state register snapshot currently held in the drawing state machine description block storage area, together with the drawing command opcode, into a drawing state machine description block.

[0046] Initialize a simulated drawing state machine that internally maintains a set of state register entries. Each state register entry corresponds to a configurable state parameter in the graphics pipeline, such as a depth test enable bit, a blending mode enumeration value, or the coefficients of the viewport transformation matrix. Starting from the first instruction stream semantic unit in the instruction stream semantic unit sequence, traverse the units sequentially. For the current instruction stream semantic unit, extract its drawing command opcode and determine whether it is a state setting type or a drawing call type. If it is a state setting type opcode, parse the parameter data in the instruction stream semantic unit and update the value of the corresponding state register entry in the simulated drawing state machine. If it is a drawing call type opcode, read the current values of all state register entries in the simulated drawing state machine and combine them into a complete state register snapshot data structure. Encapsulate this snapshot, the drawing command opcode, and the associated vertex data range descriptor, associated index data range descriptor, and associated texture state binding descriptor carried in the instruction stream semantic unit into a drawing state machine description block, and append that block to the drawing state machine description block storage area of the graphics pipeline intermediate representation structure container. Continue with the next instruction stream semantic unit, repeating the above steps until all instruction stream semantic units have been processed.
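
The state tracking loop can be sketched as below. Each semantic unit is modeled as a hypothetical `(opcode, params, resources)` tuple, and the opcode names and the three state registers are illustrative assumptions. The key point is that each drawing call captures a copy of the registers, so later state changes do not alter earlier snapshots.

```python
from dataclasses import dataclass

# Hypothetical mapping from state-setting opcodes to the register they update.
STATE_REGISTER = {
    "set_depth_test": "depth_test",
    "set_blend_mode": "blend_mode",
    "set_viewport": "viewport",
}
DRAW_CALLS = {"draw", "draw_indexed"}

@dataclass
class DrawStateBlock:
    snapshot: dict    # full copy of the state registers at the time of the call
    opcode: str
    resources: tuple  # (vertex range, index range, texture binding)

def track_states(units):
    registers = {"depth_test": False, "blend_mode": "opaque", "viewport": (0, 0, 0, 0)}
    blocks = []
    for opcode, params, resources in units:
        if opcode in STATE_REGISTER:
            registers[STATE_REGISTER[opcode]] = params
        elif opcode in DRAW_CALLS:
            # dict(...) copies the registers so the snapshot is frozen at this point
            blocks.append(DrawStateBlock(dict(registers), opcode, resources))
    return blocks
```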

[0047] Step S133: Perform resource handle parsing processing on the associated vertex data range descriptor, associated index data range descriptor, and associated texture state binding descriptor in each instruction stream semantic unit. Map the virtual resource handle identifier defined in the first graphics interface specification to the unified resource identifier defined in the intermediate representation structure of the graphics pipeline. Construct the resource binding relationship description block based on the unified resource identifier. The resource binding relationship description block includes the original memory address information of the graphics resource pointed to by each unified resource identifier in the memory space of the source computing node, the target memory layout information of the graphics resource in the target computing node, and the resource access permission attribute information of the graphics resource.

[0048] Iterate through each instruction stream semantic unit in the instruction stream semantic unit sequence. For each unit, extract the virtual resource handle identifiers defined in the first graphics interface specification from the associated vertex data range descriptor, associated index data range descriptor, and associated texture state binding descriptor, and query a handle mapping table that records the correspondence between virtual resource handle identifiers and the unified resource identifiers defined in the graphics pipeline intermediate representation structure. If the query finds a match, use the corresponding unified resource identifier directly. Otherwise, allocate a new unified resource identifier and record the mapping in the handle mapping table. For each unified resource identifier, collect the original memory address information of the graphics resource it points to in the source compute node's memory space, the target memory layout information calculated from the memory layout preferences of the second graphics processing hardware set, and the resource access permission attribute information inherited from the first graphics interface specification. Encapsulate this information into a resource binding relationship description block and append it to the resource binding relationship description block storage area of the graphics pipeline intermediate representation structure container.
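
The handle mapping table behaves like a memoizing allocator: a lookup either returns the existing unified resource identifier or allocates and records a new one. A minimal sketch in which `HandleMapper`, `ResourceBinding`, and `describe` are hypothetical names and identifiers are simply allocated from a counter starting at 1:

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceBinding:
    uid: int             # unified resource identifier
    source_address: int  # original address in the source node's memory
    target_layout: str   # layout preferred by the target hardware
    access: str          # permission inherited from the source interface

class HandleMapper:
    def __init__(self):
        self._table = {}                    # virtual handle -> unified identifier
        self._counter = itertools.count(1)

    def resolve(self, handle):
        if handle not in self._table:       # miss: allocate and record a new identifier
            self._table[handle] = next(self._counter)
        return self._table[handle]

def describe(mapper, handle, source_address, target_layout, access):
    """Build a resource binding relationship record for one handle."""
    return ResourceBinding(mapper.resolve(handle), source_address, target_layout, access)
```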

[0049] Step S134: Obtain the hardware instruction encoding rule description table of the second graphics processing hardware set. The hardware instruction encoding rule description table defines the opcode bit field layout format, operand addressing mode encoding format and instruction pipeline delay slot filling rules of the machine-level graphics instructions supported by the second graphics processing hardware set.

[0050] During the capability exchange in step S121, the source compute node obtains the model information of the second graphics processing hardware set from the target compute node. Using this model information, the source compute node searches a local or remote hardware instruction database for the corresponding hardware instruction encoding rule description table. The table contains three kinds of information. The opcode bit field layout format defines the position and width of the bits occupied by the opcode field in machine-level graphics instructions. The operand addressing mode encoding format defines the encoding values and operand field layout for the different addressing modes, such as immediate addressing, register direct addressing, and memory indirect addressing. The instruction pipeline delay slot filling rules define how many idle (no-operation) instructions must follow a branch or jump instruction and the encoding format of those idle instructions.
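
To show what an opcode bit field layout and an addressing mode encoding mean in practice, the sketch below packs fields into a 32-bit instruction word. The layout (opcode in bits 31 to 24, addressing mode in bits 23 to 22, operand in bits 21 to 0) and the mode values are invented for illustration and do not describe any real GPU instruction set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BitField:
    shift: int  # bit position of the field's least significant bit
    width: int  # field width in bits

    def encode(self, value: int) -> int:
        if not 0 <= value < (1 << self.width):
            raise ValueError(f"{value} does not fit in {self.width} bits")
        return value << self.shift

# Hypothetical 32-bit instruction format (not a real GPU ISA).
OPCODE = BitField(shift=24, width=8)
ADDR_MODE = BitField(shift=22, width=2)
OPERAND = BitField(shift=0, width=22)
ADDR_MODES = {"immediate": 0, "register": 1, "memory_indirect": 2}

def encode_instruction(opcode: int, mode: str, operand: int) -> int:
    """OR the encoded fields together into one instruction word."""
    return OPCODE.encode(opcode) | ADDR_MODE.encode(ADDR_MODES[mode]) | OPERAND.encode(operand)
```

For example, `encode_instruction(0x12, "register", 5)` yields `0x12400005`.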

[0051] Step S135: Perform an instruction template matching operation on each of the drawing state machine description blocks in the drawing state machine description block storage area. Search the hardware instruction encoding rule description table for a target machine-level graphics instruction template sequence that matches the complete state register snapshot and the drawing command opcode in the drawing state machine description block. According to the operand addressing mode encoding format in the target machine-level graphics instruction template sequence, fill the original memory address information and the target memory layout information in the resource binding relationship description block associated with the drawing state machine description block into the operand field of the target machine-level graphics instruction template sequence to generate a hardware instruction binary code sequence in the target hardware executable graphics instruction stream optimized for the second graphics processing hardware set.

[0052] Traverse each drawing state machine description block in the drawing state machine description block storage area. For the current block, extract the complete state register snapshot and the drawing command opcode, then perform instruction template matching in the hardware instruction encoding rule description table. First, find the basic instruction template corresponding to the drawing command opcode. Then, based on the value of each state parameter in the complete state register snapshot, select from that basic template's variants those matching the current state combination, forming a target machine-level graphics instruction template sequence. Each template in this sequence specifies an opcode bit field layout and an operand addressing mode encoding format. For each template, determine the operand types to be filled from its operand addressing mode encoding format. Extract the original memory address information and target memory layout information from the resource binding relationship description block associated with the drawing state machine description block, rearrange the address values in the original memory address information according to the target memory layout information, and encode them into the template's operand fields according to the operand addressing mode encoding format. Operand fields that are not used are filled with default values or reserved bits. The filled templates are then concatenated according to the instruction pipeline delay slot filling rules to generate a hardware instruction binary code sequence, which is appended to the end of the target hardware executable graphics instruction stream.
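
Template matching with state-dependent variants can be sketched as an ordered list of variants per opcode, most specific first, where the first variant whose required state values all match the snapshot wins. The `TEMPLATES` table, the mnemonics, and the rule that only `DRAW` receives the resource address operand are hypothetical.

```python
# Hypothetical template table: for each drawing command opcode, variants are
# listed most specific first; each pairs required state values with a
# sequence of instruction mnemonics.
TEMPLATES = {
    "draw": [
        ({"depth_test": True, "blend_mode": "alpha"}, ["DEPTH_ON", "BLEND_ALPHA", "DRAW"]),
        ({"depth_test": True}, ["DEPTH_ON", "DRAW"]),
        ({}, ["DRAW"]),  # fallback variant with no state requirements
    ],
}

def match_template(opcode, snapshot):
    """Return the first variant whose required state values all match."""
    for required, sequence in TEMPLATES[opcode]:
        if all(snapshot.get(key) == value for key, value in required.items()):
            return sequence
    raise LookupError(f"no template variant for {opcode!r}")

def fill_operands(sequence, address, default=0):
    """Attach the resource address to DRAW; other operands get a default value."""
    return [(mnemonic, address if mnemonic == "DRAW" else default) for mnemonic in sequence]
```

Ordering variants from most to least specific keeps the selection deterministic when several variants would match.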

[0053] Step S136: The hardware instruction binary code sequences are spliced and arranged according to the original timing relationship in the instruction stream semantic unit sequence, and delay slot filling instructions that conform to the instruction pipeline delay slot filling rules are inserted between adjacent hardware instruction binary code sequences to generate a complete target hardware executable graphics instruction stream.

[0054] Each hardware instruction binary code sequence obtained in step S135 corresponds to one drawing state machine description block, and each drawing state machine description block corresponds to one or more instruction stream semantic units in the instruction stream semantic unit sequence. The execution order of the hardware instruction binary code sequences is therefore determined by the original timing relationship in the instruction stream semantic unit sequence. Using this execution order as the concatenation order, each hardware instruction binary code sequence is written sequentially into the output buffer. For each pair of adjacent sequences, the instruction pipeline delay slot filling rules defined in the hardware instruction encoding rule description table are consulted. If a rule requires delay slot filling instructions after the last instruction of the preceding sequence, one or more delay slot filling instructions are generated in the idle instruction encoding format specified by the rule and inserted between the two sequences. Once every adjacent position has been processed, the contents of the output buffer form the complete target hardware executable graphics instruction stream.
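
The splicing step can be sketched as concatenation with conditional padding between neighbors. Here `needs_delay` is a hypothetical predicate standing in for the delay slot filling rule, and `NOP = 0` is an assumed idle instruction encoding.

```python
NOP = 0x00000000  # hypothetical idle-instruction encoding

def splice(sequences, needs_delay, slots, nop=NOP):
    """Concatenate per-block instruction sequences in their original order.

    Between two adjacent sequences, `slots` idle instructions are inserted
    when the last instruction of the preceding sequence requires delay slots.
    """
    out = []
    for i, seq in enumerate(sequences):
        out.extend(seq)
        is_last = i == len(sequences) - 1
        if not is_last and seq and needs_delay(seq[-1]):
            out.extend([nop] * slots)
    return out
```

No filler is appended after the final sequence, because the rule applies only between adjacent sequences.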

[0055] Step S140: The target hardware executable graphics instruction stream is transmitted from the source computing node to the target computing node through a cross-node transmission channel. The cross-node transmission channel is constructed by the network transmission protocol stack and the shared memory mapping space, and the cross-node transmission channel supports binary data exchange between the first operating system architecture environment and the second operating system architecture environment.

[0056] Step S140 transmits the target hardware-executable graphics instruction stream generated in step S130 from the source computing node to the target computing node via a cross-node transmission channel. This cross-node transmission channel is constructed collaboratively by a network transmission protocol stack and a shared memory mapping space, and supports binary data exchange between a first operating system architecture environment and a second operating system architecture environment. That is, it can handle the differences in data representation between different operating systems (e.g., the first operating system and the second operating system) and different central processing unit instruction set architectures (e.g., RISC-MA and Complex Instruction Set Architecture).

[0057] Step S141: Register a physical memory region in the operating system kernel space of the source computing node as the source memory range of the shared memory mapping space, and set the memory page table entry permissions of the source memory range to allow remote direct memory access read operations.

[0058] The source compute node calls the physical memory registration interface provided by the operating system kernel. It passes the starting physical address and region length parameters of the physical memory region to be registered to this interface. After verifying the validity of the physical memory region, the operating system kernel creates a descriptor for the region in its kernel memory management data structure. For each memory page within the region, it retrieves the corresponding page table entry and sets the permission flag in the page table entry to allow remote direct memory access read operations. After setting the permissions for all page table entries, it marks the physical memory region as registered and returns a registration identifier to the caller. This registered physical memory region is the source memory region of the shared memory mapping space.

[0059] Step S142: Register a physical memory region in the operating system kernel space of the target computing node as the target memory region of the shared memory mapping space, and set the memory page table entry permissions of the target memory region to allow remote direct memory access write operations.

[0060] The target compute node performs the operation symmetrical to step S141. It calls the physical memory registration interface provided by the target compute node's operating system kernel, passing the starting physical address and region length parameters of the physical memory region to be registered. After verifying the physical memory region, the target compute node's operating system kernel creates a region descriptor in its kernel memory management data structure. For each memory page within this region, the permission flag in the page table entry is set to allow remote direct memory access write operations. After registration is complete, a registration identifier is returned. This registered physical memory region is the target memory region of the shared memory mapping space.

[0061] Step S143: Establish a reliable connection session between the source computing node and the target computing node through the remote direct memory access control protocol of the network transmission protocol stack. The reliable connection session includes queue pair identifiers, protection domain identifiers, and a set of memory region access keys.

[0062] Both the source and target compute nodes initialize their remote direct memory access (RDMA) network interface cards. The source compute node calls the RDMA application programming interface to create a protection domain, which isolates memory access permissions between different applications. Within the protection domain, it creates a queue pair, consisting of a send queue and a receive queue used to submit RDMA operation requests, and assigns the queue pair an identifier. The source compute node then sends a connection request message to the target compute node over a Transmission Control Protocol (TCP) connection, carrying its queue pair identifier, protection domain identifier, and the access key for the source memory region. On receiving the connection request, the target compute node creates its own protection domain and queue pair and returns an acknowledgment message containing the target's queue pair identifier, protection domain identifier, and memory region access key set. Once this information has been exchanged, the reliable connection session is established.

[0063] Step S144: Encapsulate the physical memory address information, memory region length information, and source access key from the memory region access key set of the source computing node into a memory region descriptor message, and transmit the memory region descriptor message from the source computing node to the target computing node through the network transmission protocol stack. After the target computing node receives the memory region descriptor message, it uses the physical memory address information, memory region length information, and source access key carried in the memory region descriptor message to establish a one-way memory mapping channel pointing to the source memory region in the remote direct memory access network interface card of the target computing node.

[0064] The source compute node encapsulates the physical memory address information and memory region length information of the source memory region registered in step S141, together with the source access key obtained in step S143, into a memory region descriptor message and sends it to the target compute node via the network transmission protocol stack. On receiving the message, the target compute node extracts the physical memory address information, memory region length information, and source access key, and calls the driver of its remote direct memory access network interface card to write this information into the card's memory region registry. Based on this information, the network interface card creates an entry in its internal memory translation table that maps the virtual addresses used in subsequent remote direct memory access requests to the physical memory addresses of the source memory region. Once this entry is created, a one-way memory mapping channel pointing to the source memory region has been formed.
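
A sketch of the memory region descriptor message and of the bounds and key check that a translation table entry implies. The wire format (`!QQI`: 64-bit address, 64-bit length, and 32-bit key in network byte order) and the `TranslationTable` class are assumptions. A real deployment would rely on the remote direct memory access stack's own registration and key checking rather than this Python model.

```python
import struct

# Hypothetical wire format: 64-bit address, 64-bit length, 32-bit access key,
# in network (big-endian) byte order so both architectures decode it alike.
MR_DESCRIPTOR = struct.Struct("!QQI")

def pack_descriptor(address: int, length: int, key: int) -> bytes:
    return MR_DESCRIPTOR.pack(address, length, key)

class TranslationTable:
    """Toy model of the card's translation table: accepts a request only if it
    presents the right key and falls entirely inside a registered region."""

    def __init__(self):
        self.entries = []

    def add_from_message(self, message: bytes):
        self.entries.append(MR_DESCRIPTOR.unpack(message))

    def allows(self, address: int, size: int, key: int) -> bool:
        return any(k == key and base <= address and address + size <= base + length
                   for base, length, k in self.entries)
```

Encoding the message in a fixed byte order sidesteps the endianness mismatch between the two operating system architecture environments discussed in step S123.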

[0065] Step S145: Copy the target hardware executable graphics instruction stream from the memory buffer queue of the source computing node to the registered source memory region, and write the transmission completion flag value into the completion flag field at the end of the source memory region after the copy is completed.

[0066] The source compute node reads the data block of the target hardware executable graphics instruction stream from the memory buffer queue maintained in step S116 and calls the direct memory access copy function to copy the data block from the virtual address space of the memory buffer queue to the virtual address mapping region of the source memory region registered in step S141. After the copy completes, it writes a preset completion flag value, such as a specific 32-bit unsigned integer, to the completion flag field located at a predefined offset at the end of the source memory region.
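
A minimal sketch of staging the stream and publishing the completion flag, modeling the registered region as a Python `bytearray`. The flag value `0xC0FFEE01` and its little-endian encoding are hypothetical. The data is copied before the flag is written, so a reader that sees the flag can assume the data is present. On real hardware this ordering would typically also require a memory barrier, which the sketch does not model.

```python
import struct

COMPLETION_FLAG = 0xC0FFEE01   # hypothetical preset 32-bit completion value
FLAG = struct.Struct("<I")     # flag stored as a little-endian 32-bit unsigned int

def stage_stream(region: bytearray, stream: bytes) -> None:
    flag_offset = len(region) - FLAG.size   # flag occupies the last 4 bytes
    if len(stream) > flag_offset:
        raise ValueError("stream does not fit in the registered region")
    region[:len(stream)] = stream                          # 1. copy the data
    FLAG.pack_into(region, flag_offset, COMPLETION_FLAG)   # 2. then publish the flag

def is_complete(region: bytearray) -> bool:
    return FLAG.unpack_from(region, len(region) - FLAG.size)[0] == COMPLETION_FLAG
```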

[0067] Step S146: Initiate a one-sided write operation through the remote direct memory access network interface card of the source computing node, and use the one-way memory mapping channel to directly write the target hardware executable graphics instruction stream in the source memory range into the target memory range of the target computing node. The one-sided write operation bypasses the central processing unit scheduling of the target computing node.

[0068] The source compute node constructs a remote direct memory access one-sided write operation request, which includes the following fields: the operation type field is set to a one-sided write opcode, the source address field is set to the starting virtual address of the source memory region, the destination address field is set to the starting virtual address of the target memory region registered in step S142, the data length field is set to the data block size of the target hardware executable graphics instruction stream, and the access key field is set to the source access key. The source compute node submits this request to the send queue of the queue pair created in step S143. The remote direct memory access network interface card of the source compute node retrieves the request from the send queue and, based on the destination address and access key in the request, initiates a memory write transaction directly to the target memory region of the target compute node through the one-way memory mapping channel established in step S144. This write transaction bypasses the central processing unit scheduling of the target compute node, that is, the central processing unit of the target compute node is unaware of the initiation and completion process of the write operation, and the data is directly transmitted between the network interface cards of the two nodes and written to the target memory region.

[0069] Step S147: After the remote direct memory access network interface card of the target computing node completes the one-sided write operation, a hardware interrupt signal is triggered to the operating system kernel of the target computing node to notify the graphics driver layer of the target computing node to read the target hardware executable graphics instruction stream from the target memory area in preparation for hardware accelerated execution processing.

[0070] After the remote direct memory access network interface card of the target compute node completes the one-sided write operation of step S146, it checks the completion status of the operation. If the write succeeded, the network interface card generates a hardware interrupt signal and sends it to the interrupt controller of the target compute node over the peripheral component interconnect bus. The interrupt controller routes the signal to the operating system kernel's interrupt handling routine. Once the kernel's interrupt service routine recognizes that the interrupt came from the remote direct memory access network interface card, it wakes or notifies the graphics driver layer thread waiting for this event. The woken graphics driver layer reads the target hardware executable graphics instruction stream data from the target memory region registered in step S142, copies it into a working buffer inside the graphics driver layer, and prepares it for subsequent hardware-accelerated execution.

[0071] Step S150: In the target computing node, the second graphics processing hardware set is invoked to perform hardware acceleration execution processing on the target hardware executable graphics instruction stream to obtain graphics rendering result frame data, and the graphics rendering result frame data is returned from the target computing node to the frame buffer display area of the source computing node through the return channel.

[0072] In step S150, the second graphics processing hardware set is invoked in the target computing node to perform hardware-accelerated execution of the target hardware-executable graphics instruction stream transmitted in step S140, thereby obtaining graphics rendering result frame data. Then, the graphics rendering result frame data is returned from the target computing node to the frame buffer display area of the source computing node via the return channel.

[0073] Step S151: Create, in the graphics driver layer of the target computing node, a hardware acceleration instruction submission queue dedicated to executing the target hardware executable graphics instruction stream. The hardware acceleration instruction submission queue is mapped to the hardware command ring buffer of the second graphics processing hardware set.

[0074] The graphics driver layer of the target computing node calls the input/output memory management unit (IOMMU) interface provided by the operating system kernel to allocate a contiguous, zero-initialized memory region as storage for the hardware acceleration instruction submission queue. The graphics driver layer writes the physical address and length of this region into the base address register and length register of the hardware command ring buffer of the second graphics processing hardware set, thereby mapping the submission queue onto the hardware command ring buffer. The graphics driver layer also initializes the read and write pointers of the submission queue so that both point to the beginning of the queue.

[0075] Step S152: Read the target hardware-executable graphics instruction stream from the target memory region of the target computing node into the hardware acceleration instruction submission queue, and notify the second graphics processing hardware set that instructions are ready for execution by updating the tail pointer register of the hardware command ring buffer.

[0076] Following the data format of the target hardware-executable graphics instruction stream, the graphics driver layer parses the binary code sequence of each hardware instruction from the target memory region written in step S147 and writes these sequences one by one into the hardware acceleration instruction submission queue created in step S151. After each instruction is written, the write pointer of the submission queue advances by that instruction's length. Once all instructions have been written, the graphics driver layer writes the updated write pointer value to the tail pointer register of the hardware command ring buffer of the second graphics processing hardware set through a memory-mapped input/output operation. When the instruction fetch unit of the second graphics processing hardware set detects the change in the tail pointer register, it determines that new instructions have been submitted.
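As an illustration of steps S151 and S152, the following Python sketch models the submission queue as a byte ring whose tail pointer register is updated once, after all instructions are written. It is a software model under stated assumptions, not driver code: the class name `SubmissionRing`, the convention of leaving one byte unused to distinguish a full ring from an empty one, and the `tail_register` attribute standing in for the memory-mapped tail pointer register are all hypothetical.

```python
class SubmissionRing:
    """Hypothetical software model of the hardware acceleration instruction
    submission queue mapped onto a hardware command ring buffer."""

    def __init__(self, size: int) -> None:
        self.buf = bytearray(size)   # stands in for the zero-initialized queue storage
        self.size = size
        self.write_ptr = 0           # driver-side write pointer, initially at queue start
        self.read_ptr = 0            # advanced by the hardware instruction fetch unit
        self.tail_register = 0       # stands in for the MMIO tail pointer register

    def free_space(self) -> int:
        # One byte is always left unused so that write_ptr == read_ptr means "empty".
        return (self.read_ptr - self.write_ptr - 1) % self.size

    def submit(self, instructions: list[bytes]) -> None:
        if sum(len(ins) for ins in instructions) > self.free_space():
            raise BufferError("submission queue full")
        for ins in instructions:
            for byte in ins:
                self.buf[self.write_ptr] = byte
                # Advance one byte per byte written, wrapping at the end of the ring,
                # so the pointer moves forward by the full instruction length.
                self.write_ptr = (self.write_ptr + 1) % self.size
        # Single notification write after all instructions are queued (step S152).
        self.tail_register = self.write_ptr
```

A real driver would also have to synchronize with the hardware advancing `read_ptr` concurrently; that is omitted here.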

[0077] Step S153: The graphics instruction scheduling unit in the second graphics processing hardware set sequentially fetches the hardware instruction binary code sequences of the target hardware-executable graphics instruction stream from the hardware command ring buffer and distributes them to the shader core array, texture mapping unit, and rasterization operation unit of the second graphics processing hardware set for parallel pipelined processing.

[0078] The graphics instruction scheduling unit in the second graphics processing hardware set reads the read pointer of the hardware command ring buffer and fetches the next hardware instruction binary code sequence from the position it points to. The scheduling unit parses the opcode field and operand field of the sequence and dispatches the instruction to the corresponding execution unit according to the opcode type: instructions for vertex coordinate transformation and pixel shading are dispatched to idle compute cores in the shader core array; instructions for texture sampling and filtering are dispatched to the texture mapping unit; and instructions for primitive assembly and rasterization are dispatched to the rasterization operation unit. Each execution unit executes its received instructions in parallel, forming a deep pipeline.
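The opcode-based dispatch in step S153 can be sketched as a table lookup. The 32-bit instruction word with an 8-bit opcode in the top byte, the opcode values, and the unit names below are hypothetical, since the text does not specify the hardware instruction encoding.

```python
from collections import defaultdict

# Hypothetical opcode values mapped to execution units.
OPCODE_UNIT = {
    0x10: "shader_core",   # vertex coordinate transformation
    0x11: "shader_core",   # pixel shading
    0x20: "texture_unit",  # texture sampling and filtering
    0x30: "raster_unit",   # primitive assembly and rasterization
}

def decode(word: int) -> tuple[int, int]:
    """Split an assumed 32-bit instruction word into (opcode, operand)."""
    return (word >> 24) & 0xFF, word & 0x00FF_FFFF

def dispatch(words: list[int]) -> dict[str, list[tuple[int, int]]]:
    """Route each decoded instruction to the queue of its execution unit."""
    queues: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for word in words:
        opcode, operand = decode(word)
        unit = OPCODE_UNIT.get(opcode)
        if unit is None:
            raise ValueError(f"unknown opcode {opcode:#x}")
        queues[unit].append((opcode, operand))
    return dict(queues)
```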

[0079] Step S154: During the parallel pipeline processing, the shader core array obtains the shader binary code block from the target memory region of the target computing node according to the shader program start memory address carried in the hardware instruction binary code sequence and loads it into the instruction cache. The texture mapping unit obtains the texture image data block from the target memory region according to the texture data base address carried in the hardware instruction binary code sequence and loads it into the texture cache.

[0080] When the opcode of a hardware instruction binary code sequence indicates that a shader program must be executed, the graphics instruction scheduling unit passes the starting memory address of the shader program, carried in the instruction's operand field, to the shader core array. The direct memory access engine of the shader core array reads the shader binary code block from the target memory region of the target compute node at this starting address and loads it into the instruction cache of the shader core array. Similarly, when the opcode indicates that texture sampling is required, the texture mapping unit extracts the texture data base address from the operand field, reads the texture image data block from the target memory region through its direct memory access engine, and loads it into the texture cache.

[0081] Step S155: The rasterization operation unit obtains the vertex data memory block and the index data memory block from the target memory region according to the vertex data base address and the index data base address carried in the hardware instruction binary code sequence, and generates the pixel color value and pixel depth value of the graphics rendering result frame data through primitive assembly processing and fragment shading processing.

[0082] The rasterization operation unit extracts the vertex data base address and index data base address from the operand field of the hardware instruction binary code sequence and reads the vertex data memory block and index data memory block from the target memory region via its direct memory access engine. In the primitive assembly stage, it uses the index sequence in the index data memory block to fetch the corresponding vertex position coordinates and vertex attributes from the vertex data memory block, and assembles the vertices into basic primitives such as triangles, line segments, or points. The rasterization stage converts these primitives into a set of screen-space fragments, each corresponding to a pixel position in the frame buffer. The fragment shading stage determines the final pixel color value and pixel depth value of each fragment from the texture sampling results, lighting calculations, and other parameters. Once all fragments have been processed, the graphics rendering result frame data is output.
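Primitive assembly from an index buffer, as described for step S155, reduces to grouping indices by primitive size. This sketch covers only list topologies (points, lines, triangles); strip and fan topologies are not handled, and the vertex record format is an assumption.

```python
def assemble_primitives(vertices, indices, topology="triangles"):
    """Group indexed vertices into primitives.

    vertices: sequence of vertex records (e.g. (x, y) or (x, y, z) tuples)
    indices:  flat index sequence read from the index data memory block
    """
    per_prim = {"points": 1, "lines": 2, "triangles": 3}[topology]
    if len(indices) % per_prim:
        raise ValueError("index count is not a multiple of the primitive size")
    return [
        tuple(vertices[i] for i in indices[k:k + per_prim])
        for k in range(0, len(indices), per_prim)
    ]
```

For example, four corner vertices and the index sequence `[0, 1, 2, 2, 1, 3]` yield the two triangles of a quad.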

[0083] Step S156: Store the graphics rendering result frame data in the frame buffer memory associated with the second graphics processing hardware set, where the frame buffer memory is configured with a linear color space storage format and a data layout compatible with the second operating system architecture environment.

[0084] The rasterization operation unit writes the pixel color values and pixel depth values of the graphics rendering result frame data generated in step S155 into the corresponding storage units of the frame buffer memory associated with the second graphics processing hardware set, addressed by pixel coordinates. The frame buffer memory uses a linear color space storage format, meaning that each pixel's color component values are linearly related to physical brightness. The data layout of the frame buffer memory follows the frame buffer specification of the second operating system architecture environment, for example storing pixel data in row-major order and aligning each row of pixel data to a specific byte boundary.
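A row-major layout with per-row byte alignment means the distance between rows (the row pitch) can exceed width × bytes per pixel. A minimal sketch of the address arithmetic, assuming the alignment is a power of two; the alignment values in the tests are only examples.

```python
def row_pitch(width: int, bytes_per_pixel: int, alignment: int) -> int:
    """Bytes per row, rounded up to a multiple of `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    raw = width * bytes_per_pixel
    return (raw + alignment - 1) & ~(alignment - 1)

def pixel_offset(x: int, y: int, width: int, bytes_per_pixel: int, alignment: int) -> int:
    """Byte offset of pixel (x, y) from the start of the frame buffer."""
    return y * row_pitch(width, bytes_per_pixel, alignment) + x * bytes_per_pixel
```

For example, a 100-pixel row of 3-byte pixels with 64-byte alignment occupies 320 bytes rather than 300.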

[0085] Step S157: Read the complete graphics rendering result frame data from the frame buffer memory, and transmit the graphics rendering result frame data from the target computing node to the source computing node through the socket transmission channel of the network transmission protocol stack or the reverse transmission path of the one-way memory mapping channel.

[0086] The target compute node's graphics driver layer calls the frame buffer memory read interface to read the complete graphics rendering result frame data from the frame buffer memory written in step S156, ordered by pixel coordinates. Based on a pre-negotiated transmission strategy, the graphics driver layer selects either the socket transmission channel of the network transport protocol stack or the reverse transmission path of the unidirectional memory mapping channel established in step S146 for data transmission. If the socket transmission channel is selected, the graphics rendering result frame data is encapsulated into a transmission control protocol message and sent to the source compute node. If the reverse transmission path is selected, a remote direct memory access one-sided write operation is constructed from the target memory region to the source memory region, writing the graphics rendering result frame data back to the source memory region of the source compute node.

[0087] Step S158: After receiving the graphics rendering result frame data, the source computing node calls the frame buffer update function in its system graphics driver layer to write the frame data directly into the graphics display hardware frame buffer area of the source computing node, at the specified memory offset associated with the target application window handle.

[0088] After receiving the graphics rendering result frame data transmitted in step S157 via the network transport protocol stack, the source compute node extracts the data block and calls the frame buffer update function interface provided by the system graphics driver layer, passing the target application window handle, the starting memory address of the graphics rendering result frame data, and the width and height of the frame. Using the window handle, the system graphics driver layer looks up the window attribute table maintained by the window manager to obtain the starting memory address and line offset of the graphics display hardware frame buffer area corresponding to that window. It then copies each row of pixel data from the frame data to the corresponding memory offset in the graphics display hardware frame buffer area according to the line offset.
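The row-by-row copy of step S158 can be sketched as follows. The parameter names `dst_base` and `dst_stride` are hypothetical stand-ins for the window's starting offset and line offset obtained from the window attribute table, and the received frame is assumed to be tightly packed.

```python
def blit_rows(frame: bytes, width: int, height: int, bpp: int,
              dst: bytearray, dst_base: int, dst_stride: int) -> None:
    """Copy a tightly packed frame into a display buffer, one row at a time."""
    src_stride = width * bpp
    if dst_stride < src_stride:
        raise ValueError("destination line offset is smaller than a source row")
    if dst_base + (height - 1) * dst_stride + src_stride > len(dst):
        raise ValueError("window does not fit in the display frame buffer")
    for row in range(height):
        s = row * src_stride               # start of the row in the received frame
        d = dst_base + row * dst_stride    # start of the row in the display buffer
        dst[d:d + src_stride] = frame[s:s + src_stride]
```

The bounds checks matter because slice assignment past the end of a `bytearray` would silently grow it instead of failing.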

[0089] Step S159: When the next vertical blanking cycle arrives, the display control unit of the source computing node scans and reads the graphics rendering result frame data from the graphics display hardware frame buffer area, and outputs the graphics rendering result frame data to the physical display device connected to the source computing node for visual presentation.

[0090] The display control unit of the source compute node maintains a vertical blanking cycle counter. A vertical blanking signal marks the end of the current frame's display and the start of the next frame scan; when the next vertical blanking cycle arrives, the display control unit initiates the frame scanning process. Starting from the base address of the graphics display hardware frame buffer area, the address generator of the display control unit generates row and column addresses sequentially, one per pixel clock cycle, and the display control unit reads the pixel color value stored at each address over the memory bus. Each pixel color value is either converted into an analog voltage signal by a digital-to-analog converter or encoded into a digital video signal by the display output port, and is output to the physical display device connected to the source compute node. The physical display device lights its pixels row by row according to the received signal, completing the visual presentation.

[0091] Step S210: Perform resource usage frequency analysis on the resource binding relationship description block storage area in the intermediate representation structure of the graphics pipeline: count, for each Uniform Resource Identifier in that storage area, the total number of times it is referenced by draw-call-type opcodes in the drawing state machine description block storage area; sort the Uniform Resource Identifiers in descending order of this total; and mark the graphics resources pointed to by the Uniform Resource Identifiers whose total exceeds a preset frequency threshold as the set of frequently used graphics resources.

[0092] After the intermediate representation structure of the graphics pipeline is constructed in step S130 and before the target hardware-executable graphics instruction stream is transmitted in step S140, step S210 performs resource usage frequency analysis on the resource binding relationship description block storage area of that intermediate representation structure. All drawing state machine description blocks in the drawing state machine description block storage area are traversed. For each block of the draw-call type, the Uniform Resource Identifiers (URIs) in its associated resource binding relationship description block are parsed, and the count of each referenced URI in a frequency statistics table is incremented by one. After the traversal, the table holds the total reference count of every URI, and the URIs are sorted in descending order of this count. The vertex data memory blocks, index data memory blocks, texture image data blocks, and shader binary code blocks pointed to by URIs whose total reference count exceeds the preset frequency threshold are marked as the set of frequently used graphics resources.
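A compact sketch of the frequency analysis in step S210, assuming each drawing state machine description block is given as an `(opcode_type, uris)` pair; this input shape and the `"draw"` type tag are hypothetical.

```python
from collections import Counter

def frequent_resources(draw_blocks, threshold):
    """Count URI references from draw-call blocks and select the frequent ones.

    Returns (ranked, hot): ranked is a list of (uri, count) in descending count
    order; hot is the set of URIs whose count exceeds `threshold`.
    """
    counts = Counter()
    for op_type, uris in draw_blocks:
        if op_type == "draw":          # only draw-call-type blocks are counted
            counts.update(uris)        # +1 for every referenced URI
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    hot = {uri for uri, n in ranked if n > threshold}
    return ranked, hot
```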

[0093] Step S211: Analyze the original memory address information of each high-frequency graphics resource in the set of high-frequency graphics resources in the memory space of the source computing node. If the memory region pointed to by the original memory address information is located in the pageable system memory region, then request the operating system kernel of the source computing node to lock the memory region in physical memory to prevent it from being swapped out to external storage devices by the memory swapping mechanism.

[0094] For each frequently used graphics resource in the set of frequently used graphics resources obtained in step S210, its original memory address information in the source compute node's memory space is obtained. Based on this original memory address information, the memory region type query interface provided by the operating system kernel is called to determine whether the memory region belongs to a pageable system memory region or a non-pageable memory region. If it belongs to a pageable system memory region, a memory lock request is sent to the operating system kernel. After receiving the request, the operating system kernel traverses all memory pages covered by the memory region, clears the swappable flag in the page table entry of each memory page, and increments the reference count of the memory page to prevent it from being swapped out to external storage devices by the page swapping mechanism. After the memory lock operation is completed, the operating system kernel returns a lock success status.

[0095] Step S212: Pre-transmit the vertex data memory blocks, index data memory blocks, and texture image data blocks contained in the high-frequency graphics resource set from the source memory region of the source computing node to the target memory region of the target computing node through the unidirectional memory mapping channel established in the cross-node transmission channel. In the target memory region of the target computing node, allocate a contiguous physical memory block for the high-frequency graphics resource set, and reorganize the resource data in the set according to the target memory layout information of the target computing node.

[0096] Before the target hardware-executable graphics instruction stream is transmitted in step S140, step S212 pre-transmits the vertex data memory blocks, index data memory blocks, and texture image data blocks from the source memory region of the source computing node to the target memory region of the target computing node over the unidirectional memory mapping channel established in the cross-node transmission channel. The target computing node allocates a contiguous physical memory block for the high-frequency graphics resource set in its target memory region. For each high-frequency graphics resource, the resource data is reorganized according to the target memory layout information determined in step S133; this reorganization includes reordering the data elements, adjusting the field offsets within data structures, and inserting padding bytes where needed to satisfy alignment requirements.

[0097] Step S213: Construct a resource mapping table entry in the target computing node, establish an association mapping relationship between the Uniform Resource Identifier corresponding to the high-frequency graphics resource set and the physical memory start address of the contiguous physical memory block, and persistently store it in the kernel memory management data structure of the target computing node. During the hardware instruction translation process of the intermediate representation structure of the graphics pipeline, if the Uniform Resource Identifier in the high-frequency graphics resource set is encountered, the physical memory start address of the contiguous physical memory block recorded in the resource mapping table entry is used to directly fill the corresponding operand segment in the target hardware executable graphics instruction stream.

[0098] The target compute node allocates a resource mapping table in its kernel memory management data structure. For each frequently used graphics resource pre-transmitted in step S212, an entry is created in the resource mapping table. This resource mapping table entry records the association mapping between the Uniform Resource Identifier (URI) of the frequently used graphics resource and the physical memory start address of the contiguous physical memory block allocated in step S212. This resource mapping table entry is persistently stored in the kernel memory management data structure. During the hardware instruction translation process in step S135, when it is necessary to fill the operand segment in the target hardware executable graphics instruction stream, the URI corresponding to the operand segment is first checked to see if it exists in the resource mapping table. If it exists, the physical memory start address of the pre-recorded contiguous physical memory block is directly read from the resource mapping table entry, and the physical memory start address is written into the corresponding operand segment in the target hardware executable graphics instruction stream, skipping the subsequent memory address resolution and resource data reading process.
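The fast path of step S213 amounts to a table lookup followed by writing the address into the operand field. In this sketch the operand field is assumed to be a little-endian 64-bit slot at a known byte offset, and `resolve` is a hypothetical callback for the ordinary address resolution path; neither detail is given in the text.

```python
import struct
from typing import Callable

def patch_operand(instr: bytearray, offset: int, uri: str,
                  table: dict[str, int], resolve: Callable[[str], int]) -> bool:
    """Write the resource's physical start address into the operand field.

    Returns True on a resource mapping table hit (fast path), False if the
    slower resolution path had to be used.
    """
    addr = table.get(uri)
    hit = addr is not None
    if not hit:
        addr = resolve(uri)            # hypothetical normal resolution path
    struct.pack_into("<Q", instr, offset, addr)
    return hit
```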

[0099] Step S220: Maintain a historical rendering result frame buffer queue corresponding to the target application in the target computing node. The historical rendering result frame buffer queue stores historical copies of the most recently generated multiple consecutive frames of graphics rendering result frame data.

[0100] The graphics driver layer of the target compute node allocates a circular buffer in memory as the historical rendering result frame buffer queue. This circular buffer contains a fixed number of buffer slots, each holding a historical copy of one complete frame of graphics rendering result frame data. Each time a new frame of graphics rendering result frame data is generated in step S150, and before that frame is returned to the source compute node, the graphics driver layer writes a copy of it into the buffer slot currently pointed to by the write pointer of the historical rendering result frame buffer queue and then advances the write pointer to the next slot. When the write pointer reaches the end of the circular buffer, it wraps around to the beginning, so the oldest historical copy is overwritten.
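The historical rendering result frame buffer queue of step S220 behaves like a fixed-size circular buffer. A minimal Python model, with the class and method names chosen here for illustration:

```python
class FrameHistory:
    """Fixed-size circular buffer of historical frame copies."""

    def __init__(self, slots: int) -> None:
        self.slots = [None] * slots    # each slot holds one frame copy or None
        self.write = 0                 # write pointer: next slot to fill
        self.count = 0                 # number of slots filled so far

    def push(self, frame: bytes) -> None:
        self.slots[self.write] = bytes(frame)               # store a copy
        self.write = (self.write + 1) % len(self.slots)     # wrap, overwriting the oldest
        self.count = min(self.count + 1, len(self.slots))

    def previous(self):
        """Most recently stored copy (the slot just before the write pointer)."""
        if self.count == 0:
            return None
        return self.slots[(self.write - 1) % len(self.slots)]
```

Because `previous()` returns the most recently stored copy, a caller comparing the current frame against the prior one would call it before pushing the current frame.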

[0101] Step S221: Perform pixel-by-pixel difference comparison processing on the currently generated graphics rendering result frame data and the historical copy of the previously generated graphics rendering result frame data, and calculate the absolute difference between the pixel color value component at each pixel position in the current graphics rendering result frame data and the pixel color value component at the corresponding pixel position in the previous historical copy.

[0102] The graphics driver layer reads the historical copy of the previous frame's graphics rendering result frame data from the buffer slot preceding the write pointer in the historical rendering result frame buffer queue. For each pixel position in the currently generated graphics rendering result frame data, it extracts the red, green, and blue component values at that position in the current frame and at the same position in the previous frame's historical copy, and computes the absolute difference for each component, i.e., the absolute value of the current frame's component value minus the previous frame's component value.

[0103] Step S222: If the absolute difference at a certain pixel position is greater than the preset difference perception threshold, then the pixel position is marked as a changed pixel unit, and the pixel color value component of the changed pixel unit in the current graphics rendering result frame data and the relative intra-frame coordinate offset of the changed pixel unit are recorded in the inter-frame difference data set.

[0104] A preset difference perception threshold is set. For each pixel position, the absolute differences of the red, green, and blue components calculated in step S221 are compared with this threshold. If any of the three absolute differences exceeds the threshold, the pixel position is determined to be a changed pixel unit, and its red, green, and blue component values in the current frame, together with the horizontal and vertical coordinate offsets of the pixel position relative to the top-left corner of the frame, are recorded in the inter-frame difference data set.
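Steps S221 and S222 together amount to a per-channel threshold test. A sketch, assuming frames are flat row-major lists of `(r, g, b)` tuples (an assumption about the in-memory format):

```python
def changed_pixels(cur, prev, width, height, threshold):
    """Return (x, y, (r, g, b)) for every pixel whose R, G, or B component
    differs from the previous frame by more than `threshold`."""
    diffs = []
    for y in range(height):
        for x in range(width):
            i = y * width + x
            if any(abs(c - p) > threshold for c, p in zip(cur[i], prev[i])):
                diffs.append((x, y, cur[i]))   # offsets relative to top-left corner
    return diffs
```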

[0105] Step S223: Perform connected component clustering analysis on the changed pixel units in the inter-frame difference data set, and aggregate multiple changed pixel units that are spatially adjacent and have similar pixel color value components into a changed rectangular region block, and calculate the coordinates of the upper left corner and the lower right corner of the bounding box of the changed rectangular region block.

[0106] Iterate through all changed pixel units in the inter-frame difference data set, grouping spatially adjacent changed pixel units into the same connected component. The criteria for spatial adjacency are: the absolute value of the difference between the horizontal coordinate offsets of two changed pixel units is equal to one pixel unit and the absolute value of the difference between their vertical coordinate offsets is zero; or the absolute value of the difference between their horizontal coordinate offsets is zero and the absolute value of the difference between their vertical coordinate offsets is equal to one pixel unit. For each connected component, calculate its bounding box, i.e., find the minimum horizontal coordinate offset, minimum vertical coordinate offset, maximum horizontal coordinate offset, and maximum vertical coordinate offset of all changed pixel units within the connected component. Use the minimum horizontal and minimum vertical coordinate offsets as the top-left corner coordinates of the bounding box, and the maximum horizontal and maximum vertical coordinate offsets as the bottom-right corner coordinates of the bounding box. Treat each connected component and its bounding box as a changed rectangular region.
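The 4-connected clustering and bounding-box computation of step S223 can be sketched with a breadth-first flood fill. The input is assumed to be the set of `(x, y)` offsets recorded in the inter-frame difference data set; like the adjacency rule in the paragraph above, this sketch groups by position only and does not apply a color-similarity test.

```python
from collections import deque

def changed_regions(points):
    """Return one bounding box ((x_min, y_min), (x_max, y_max)) per
    4-connected component of the given (x, y) points, sorted."""
    remaining = set(points)
    boxes = []
    while remaining:
        seed = remaining.pop()
        queue = deque([seed])
        xs, ys = [seed[0]], [seed[1]]
        while queue:
            x, y = queue.popleft()
            # Neighbors differ by one unit in exactly one coordinate.
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    queue.append(nb)
                    xs.append(nb[0])
                    ys.append(nb[1])
        boxes.append(((min(xs), min(ys)), (max(xs), max(ys))))
    return sorted(boxes)
```

Diagonal neighbors such as `(5, 5)` and `(6, 6)` end up in separate regions under this rule.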

[0107] Step S224: Apply a two-dimensional discrete transform to the pixel color value components of the changed pixel units within each changed rectangular region block, converting them from a spatial-domain representation to a frequency-domain transform coefficient representation.

[0108] For each changed rectangular region block, extract the red, green, and blue component values of all pixel positions within the region to form three two-dimensional matrices of the same size as the region, and apply a two-dimensional discrete transform to each matrix. The transform converts the spatial-domain pixel color value component matrix into a frequency-domain transform coefficient matrix of the same size, in which the upper-left coefficient represents the DC component and the remaining coefficients represent AC components of different frequencies.
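The text does not name the transform, but its description (a DC coefficient in the upper-left corner and AC coefficients elsewhere) matches the two-dimensional discrete cosine transform (DCT). Assuming an orthonormal DCT-II applied separably to rows and then columns, a pure-Python sketch is:

```python
import math

def dct_1d(v):
    """Orthonormal 1-D DCT-II of a sequence."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT-II: transform each row, then each column.
    Works for non-square regions; result[0][0] is the DC coefficient."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

A 4 × 4 region of constant value 10 transforms to a DC coefficient of 40 with all AC coefficients zero, up to floating-point error. Because the transform is orthonormal, it preserves the sum of squares of the values.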

[0109] Step S225: Quantize the frequency-domain transform coefficients according to a preset quantization parameter table, numerically attenuating the high-frequency coefficient components; apply entropy coding to the quantized coefficients, using a variable-length coding table to convert the coefficient value sequence into a compact binary bitstream encoding sequence; and append the upper-left and lower-right corner coordinates of the bounding box of the changed rectangular region block to the binary bitstream encoding sequence as additional encoding information.

[0110] A preset quantization parameter table defines the quantization step size for each frequency position. Each coefficient of the frequency-domain transform coefficient matrix obtained in step S224 is divided by the quantization step size at the corresponding position, and the integer part is kept as the quantized coefficient value. Because high-frequency positions have larger quantization step sizes, the high-frequency components are numerically attenuated. The non-zero coefficients of the quantized coefficient matrix are then extracted in zigzag scanning order to form a coefficient value sequence; a variable-length coding table maps each value in this sequence to a binary codeword, and all codewords are concatenated into a binary bitstream encoding sequence. Finally, the upper-left and lower-right corner coordinates of the bounding box of the changed rectangular region block are encoded as a fixed-length binary prefix and placed at the head of the binary bitstream encoding sequence.
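The quantization, zigzag scan, and fixed-length coordinate prefix of step S225 can be sketched as below. The variable-length coding table is not specified in the text and is omitted, and the 16-bit little-endian coordinate fields are an assumption. Unlike the paragraph above, which extracts only the non-zero coefficients, this sketch keeps interior zeros and trims only the trailing run of zeros, so that a decoder can still recover each coefficient's position. That is a design choice made here, not the described method.

```python
import struct

def quantize(coeffs, qtable):
    """Divide each coefficient by its step size and keep the integer part."""
    return [[int(c / q) for c, q in zip(crow, qrow)] for crow, qrow in zip(coeffs, qtable)]

def zigzag_order(rows, cols):
    """(row, col) positions in zigzag order, walking the anti-diagonals."""
    order = []
    for s in range(rows + cols - 1):
        diag = [(i, s - i) for i in range(rows) if 0 <= s - i < cols]
        order.extend(diag if s % 2 else reversed(diag))
    return order

def scan(quantized):
    """Zigzag-scan a quantized matrix and drop the trailing run of zeros."""
    rows, cols = len(quantized), len(quantized[0])
    seq = [quantized[i][j] for i, j in zigzag_order(rows, cols)]
    while seq and seq[-1] == 0:
        seq.pop()
    return seq

def region_header(top_left, bottom_right):
    """Fixed-length prefix: four little-endian uint16 bounding-box coordinates."""
    return struct.pack("<4H", *top_left, *bottom_right)
```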

[0111] Step S226: The binary bitstream encoded sequence after inter-frame differential coding compression is used as the compressed representation of the graphics rendering result frame data and transmitted from the target computing node to the source computing node through the return channel.

[0112] The target computing node uses the binary bitstream encoding sequence generated in step S225 as the compressed representation of the current frame's graphics rendering result frame data, encapsulates it into a transmission data packet, and transmits it to the source computing node via the return channel used in step S150 (either the socket transmission channel of the network transport protocol stack or the reverse transmission path of the unidirectional memory mapping channel).

[0113] Step S227: After the source computing node receives the compressed representation, it uses the historical copy of the previously generated graphics rendering result frame data stored in the historical rendering result frame buffer queue and the changed rectangular region block data in the received binary bit stream encoding sequence to perform frame reconstruction processing and recover the complete graphics rendering result frame data.

[0114] After receiving the binary bitstream encoding sequence transmitted in step S226, the source compute node parses the additional encoding information at its head to obtain the upper-left and lower-right corner coordinates of the bounding box of each changed rectangular region block. The source compute node reads the historical copy of the previous frame's graphics rendering result frame data from its locally maintained historical rendering result frame buffer queue and uses it as the reconstruction base frame. For each changed rectangular region block in the binary bitstream encoding sequence, the quantized frequency-domain transform coefficient sequence is decoded from its binary codewords, and dequantization followed by an inverse two-dimensional discrete transform yields the reconstructed pixel color value components of the region block. These components are written to the pixel positions within the corresponding bounding box in the reconstruction base frame. After all changed rectangular region blocks have been processed, the reconstruction base frame holds the complete graphics rendering result frame data of the current frame.
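The decoder side of step S227 can be sketched as dequantization, an inverse transform, and a paste into the base frame. This sketch assumes the encoder used an orthonormal DCT-II, so the inverse is the corresponding DCT-III. It also assumes one color channel stored as a list of rows, and clamps to 0..255 on the assumption of 8-bit components. Because quantization discards information, the pasted values approximate the original pixels rather than reproducing them exactly.

```python
import math

def idct_1d(c):
    """Inverse of the orthonormal DCT-II (i.e. an orthonormal DCT-III)."""
    n = len(c)
    return [
        sum((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)) * c[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for k in range(n))
        for i in range(n)
    ]

def idct_2d(coeffs):
    """Separable inverse: columns first, then rows (order does not matter)."""
    cols = [idct_1d(list(col)) for col in zip(*coeffs)]
    return [idct_1d(list(r)) for r in zip(*cols)]

def dequantize(q, qtable):
    """Multiply each quantized value by its quantization step size."""
    return [[v * s for v, s in zip(qr, sr)] for qr, sr in zip(q, qtable)]

def paste_region(base, top_left, block):
    """Write a decoded block into the base frame at its bounding-box origin."""
    x0, y0 = top_left
    for dy, row in enumerate(block):
        for dx, v in enumerate(row):
            base[y0 + dy][x0 + dx] = max(0, min(255, round(v)))
```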

[0115] Step S230: Deploy a channel state monitoring module in the source computing node. The channel state monitoring module periodically collects the current available bandwidth, current round-trip delay, and current packet loss retransmission rate of the cross-node transmission channel.

[0116] A channel state monitoring module is deployed on the source compute node. The module runs an independent monitoring thread that performs data collection at fixed time intervals. To obtain the current available bandwidth, the monitoring thread records the total number of bytes successfully transmitted through the cross-node transmission channel during the interval and divides it by the interval length. To obtain the current round-trip time (RTT), the monitoring thread sends a timestamped probe message to the target compute node, which returns it immediately; the RTT is the difference between the system time at which the returned message is received and the timestamp recorded at transmission. To obtain the current packet loss retransmission rate, the monitoring thread divides the number of retransmitted packets reported by the transmission control protocol stack by the total number of packets transmitted during the interval.
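The three measurements of step S230 reduce to simple ratios and a timed echo. In this sketch `send_probe` and `wait_echo` are hypothetical transport callbacks. It uses `time.monotonic()` rather than wall-clock system time so that clock adjustments cannot produce negative round-trip times.

```python
import time
from typing import Callable

def available_bandwidth(bytes_sent: int, interval_s: float) -> float:
    """Bytes successfully sent during the interval divided by its length (bytes/s)."""
    return bytes_sent / interval_s

def round_trip_time(send_probe: Callable[[float], None],
                    wait_echo: Callable[[], None]) -> float:
    """Timestamp a probe, wait for the echo from the target node, return seconds."""
    t0 = time.monotonic()
    send_probe(t0)    # the probe carries its send timestamp
    wait_echo()       # blocks until the target compute node returns the probe
    return time.monotonic() - t0

def retransmission_rate(retransmitted: int, sent: int) -> float:
    """Fraction of packets retransmitted during the interval."""
    return retransmitted / sent if sent else 0.0
```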

[0117] Step S231: Compare and evaluate the current available bandwidth value, the current round-trip delay value, and the current packet loss retransmission rate value with the corresponding preset thresholds. Generate the current transmission quality evaluation result of the cross-node transmission channel based on the evaluation results of each indicator. Look up the corresponding transmission strategy switching operation type in the preset transmission strategy decision mapping table based on the current transmission quality evaluation result.

[0118] The channel state monitoring module compares the collected current available bandwidth value with a preset lower-limit threshold; if the value is below the threshold, the bandwidth indicator is marked as poor. It compares the current round-trip time (RTT) value with a preset upper-limit threshold; if the value exceeds the threshold, the latency indicator is marked as poor. It also compares the current packet loss retransmission rate value with a preset upper-limit threshold; if the value exceeds the threshold, the packet loss indicator is marked as poor. The current transmission quality assessment result is generated from the combined evaluation results of the bandwidth, latency, and packet loss indicators. A preset transmission strategy decision mapping table defines the correspondence between transmission quality assessment results and transmission strategy switching operation types. The switching operation type corresponding to the current transmission quality assessment result is looked up in this table.
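The threshold comparison and table lookup can be sketched as follows. The text does not publish the contents of the decision mapping table. The entries below, and the fallback to compression for combinations not listed, are purely hypothetical choices made for illustration.

```python
from enum import Enum


class Strategy(Enum):
    KEEP_CURRENT = "keep current strategy"
    INTRA_FRAME_BLOCKS = "intra-frame block transmission"
    SIMPLIFY_STREAM = "instruction stream simplification"
    COMPRESS = "compressed transmission"


def assess(bandwidth, rtt, loss_rate, bw_min, rtt_max, loss_max):
    """Return (bandwidth_poor, latency_poor, loss_poor) per the threshold rules."""
    return (bandwidth < bw_min, rtt > rtt_max, loss_rate > loss_max)


# Hypothetical decision mapping table; the text does not specify its contents.
DECISION_TABLE = {
    (False, False, False): Strategy.KEEP_CURRENT,
    (True, False, False): Strategy.COMPRESS,
    (False, True, False): Strategy.INTRA_FRAME_BLOCKS,
    (False, False, True): Strategy.SIMPLIFY_STREAM,
}


def select_strategy(flags):
    # Combinations missing from the table fall back to compression (an assumption).
    return DECISION_TABLE.get(flags, Strategy.COMPRESS)
```

Keying the table on the tuple of indicator flags keeps the lookup a single dictionary access. New policies can then be added by editing data rather than code.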

[0119] Step S232: If the transmission strategy switching operation type indicates that the intra-frame block transmission strategy needs to be enabled, the target hardware executable graphics instruction stream to be transmitted is divided into multiple instruction stream data blocks according to the instruction sequence length, and a block sequence number is attached to each instruction stream data block.

[0120] When the transmission strategy switching operation type found in step S231 is the intra-frame block transmission strategy, the source compute node obtains the total length of the target hardware-executable graphics instruction stream to be transmitted. It sets a block size parameter and divides the stream into instruction stream data blocks of that size; only the last block may be shorter than the block size parameter. Each instruction stream data block is tagged with a block sequence number. The numbers start from 0 and increase by one, identifying each block's position in the original instruction stream.
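The division into numbered blocks is a fixed-size slicing of the byte stream. The following is a minimal sketch, assuming the instruction stream is available as a `bytes` object. The function name is hypothetical.

```python
def split_into_blocks(stream: bytes, block_size: int) -> list[tuple[int, bytes]]:
    """Split a stream into (sequence_number, payload) blocks of block_size bytes.

    Sequence numbers start at 0 and increase by one; only the last block may
    be shorter than block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [(seq, stream[offset:offset + block_size])
            for seq, offset in enumerate(range(0, len(stream), block_size))]
```

For example, a 10-byte stream with a block size of 4 yields blocks 0 and 1 of four bytes each and a final 2-byte block 2.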

[0121] Step S233: The multiple instruction stream data blocks are transmitted in parallel from the source computing node to the target computing node through multiple parallel network transmission protocol stack connection sessions of the cross-node transmission channel. After the target computing node receives all the instruction stream data blocks, they are recombined and restored according to the block sequence number marking to restore the target hardware executable graphics instruction stream.

[0122] Multiple parallel Transmission Control Protocol (TCP) connection sessions are pre-established between the source and target computing nodes. The multiple instruction stream data blocks generated in step S232 are allocated to these parallel connection sessions, with each connection session responsible for transmitting one or more blocks. Multiple connection sessions transmit data simultaneously, achieving parallel transmission. The target computing node receives instruction stream data blocks from each connection session and extracts the block sequence number marker carried by each block. The target computing node maintains a reassembly buffer and writes the received instruction stream data blocks sequentially into the reassembly buffer according to the ascending order of the block sequence number markers. When all blocks corresponding to the block sequence number markers have been received, the data in the reassembly buffer constitutes the complete target hardware executable graphics instruction stream.
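The distribution and reassembly steps can be illustrated without real sockets. The following sketch makes several assumptions:

- Blocks are assigned to sessions round-robin, which is one possible allocation policy.
- The target learns the total block count in advance, for example from a header.
- Out-of-order arrival across sessions is simulated by the order of the `receive` calls.

All class and function names are hypothetical.

```python
class ReassemblyBuffer:
    """Target-side buffer that restores block order by sequence number.

    Assumes the total block count is known in advance (e.g. sent in a header).
    """

    def __init__(self, total_blocks: int):
        self.total_blocks = total_blocks
        self.slots: dict[int, bytes] = {}

    def receive(self, seq: int, payload: bytes) -> None:
        if not 0 <= seq < self.total_blocks:
            raise ValueError(f"unexpected block sequence number {seq}")
        self.slots[seq] = payload  # a duplicate delivery simply overwrites

    def complete(self) -> bool:
        return len(self.slots) == self.total_blocks

    def assemble(self) -> bytes:
        if not self.complete():
            raise RuntimeError("instruction stream incomplete: blocks missing")
        # Concatenate in ascending sequence-number order.
        return b"".join(self.slots[seq] for seq in range(self.total_blocks))


def assign_round_robin(blocks, n_sessions):
    """Distribute (seq, payload) blocks across parallel connection sessions."""
    sessions = [[] for _ in range(n_sessions)]
    for i, block in enumerate(blocks):
        sessions[i % n_sessions].append(block)
    return sessions
```

Because each block carries its own sequence number, the target does not need to know which session delivered which block. It only needs to know when every number from 0 to the total count minus one has arrived.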

[0123] Step S234: If the transmission strategy switching operation type indicates that the instruction stream simplification transmission strategy needs to be enabled, then in the source computing node, instruction redundancy elimination processing is performed on the target hardware executable graphics instruction stream. This processing removes redundant state-setting instruction sequences that do not contribute to the pixel color values of the final graphics rendering result frame data.

[0124] When the transmission strategy switching operation type found in step S231 is the instruction stream simplified transmission strategy, the source compute node parses the instruction sequence in the target hardware executable graphics instruction stream. It identifies multiple consecutive instruction sequences that set the same status register value, retains the first setting instruction, and removes subsequent redundant instructions that repeatedly set the same value. It identifies no-operation instruction sequences that do not change any status registers between two drawing calls and removes these no-operation instruction sequences directly. It identifies drawing instructions that do not contribute to the final frame buffer pixel color value, such as instructions whose drawing area is completely covered by subsequent drawing, marks these instructions, and removes them. After instruction redundancy elimination processing, the simplified target hardware executable graphics instruction stream is output.
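The three elimination rules can be sketched on a toy instruction model. This sketch assumes the following:

- The hypothetical opcodes `SET`, `NOP` and `DRAW` stand in for real machine instructions.
- Each draw writes an axis-aligned rectangle.
- All draws are opaque, with no blending, depth testing or frame buffer reads. Without this assumption, a fully covered draw could not safely be dropped.

A `SET` is treated as redundant when the register already holds the value being written.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str            # "SET", "NOP" or "DRAW" (hypothetical opcodes)
    reg: str = ""      # status register name for SET
    value: int = 0     # value written by SET
    rect: tuple = ()   # (x0, y0, x1, y1) area written by DRAW


def covers(outer, inner):
    """True if rectangle outer fully contains rectangle inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def simplify(stream):
    # Pass 1: drop NOPs and SETs that rewrite the value a register already holds.
    state, pass1 = {}, []
    for ins in stream:
        if ins.op == "NOP":
            continue
        if ins.op == "SET":
            if state.get(ins.reg) == ins.value:
                continue
            state[ins.reg] = ins.value
        pass1.append(ins)
    # Pass 2: drop DRAWs whose area is fully covered by a later DRAW
    # (valid only under the opaque, no-read assumption stated above).
    out = []
    for i, ins in enumerate(pass1):
        if ins.op == "DRAW" and any(
                later.op == "DRAW" and covers(later.rect, ins.rect)
                for later in pass1[i + 1:]):
            continue
        out.append(ins)
    return out
```

The occlusion pass is quadratic in the number of draws. This cost is acceptable for a sketch but would likely call for a spatial index in practice.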

[0125] Step S235: If the transmission strategy switching operation type indicates that a compressed transmission strategy needs to be enabled, the hardware compression acceleration unit in the source computing node is invoked to perform real-time data compression processing on the target hardware executable graphics instruction stream, generating a compressed bitstream of the target hardware executable graphics instruction stream after compression transformation, and transmitting the compressed bitstream to the target computing node through the cross-node transmission channel.

[0126] When the transmission strategy switching operation type found in step S231 is a compressed transmission strategy, the source compute node invokes its hardware compression acceleration unit. The data block of the target hardware executable graphics instruction stream is sent to the input buffer of the hardware compression acceleration unit. The compression algorithm engine inside the hardware compression acceleration unit performs real-time compression processing on the input data, generating a compressed bitstream. The source compute node transmits the compressed bitstream to the target compute node through the network transmission protocol stack.

[0127] Step S236: After the target computing node receives the compressed bitstream, it calls the hardware decompression acceleration unit in the target computing node to perform real-time data decompression processing on the compressed bitstream, and recovers the complete target hardware executable graphics instruction stream for subsequent hardware accelerated execution processing.

[0128] After receiving the compressed bitstream transmitted in step S235, the target computing node sends the compressed bitstream data into the input buffer of its hardware decompression acceleration unit. The decompression algorithm engine inside the hardware decompression acceleration unit performs real-time decompression processing on the compressed bitstream to recover the original target hardware executable graphics instruction stream data. The recovered target hardware executable graphics instruction stream is stored in the memory area of the target computing node for use in the hardware acceleration execution processing in step S150.
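The compress, transmit and decompress round trip in steps S235 and S236 can be sketched in software. In this sketch, Python's standard `zlib` module stands in for the hardware compression and decompression acceleration units. The text does not name the compression algorithm, so DEFLATE here is only an assumption. The low compression level reflects a bias toward speed for real-time use.

```python
import zlib


def compress_blocks(blocks, level: int = 1) -> bytes:
    """Compress instruction-stream blocks incrementally into one bitstream.

    zlib stands in for the hardware compression acceleration unit.
    """
    compressor = zlib.compressobj(level)
    parts = [compressor.compress(block) for block in blocks]
    parts.append(compressor.flush())  # emit buffered data and the stream trailer
    return b"".join(parts)


def decompress_bitstream(bitstream: bytes) -> bytes:
    """Recover the original instruction stream (stand-in for hardware decompression)."""
    return zlib.decompress(bitstream)
```

Using a streaming compressor object lets blocks be fed in as they are produced, which mirrors the real-time processing described above. Graphics instruction streams with repeated opcodes and register values tend to compress well.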

[0129] For example, the method may further include: step S240: registering multiple target computing nodes with graphics hardware acceleration capabilities in the source computing node, and storing the second operating system architecture environment type identifier and the hardware capability description information set of the second graphics processing hardware set corresponding to each target computing node in the target node capability registry.

[0130] The source compute node maintains a target node capability registry. For each target compute node in the network with graphics hardware acceleration capabilities, the source compute node obtains the second operating system architecture environment type identifier and the hardware capability description information set of the second graphics processing hardware set of the target compute node through the query process described in step S121. This information is then inserted as a record into the target node capability registry. The target node capability registry supports dynamically adding and deleting node records.

[0131] Step S241: Perform graphics task dependency analysis on the initial graphics instruction stream, construct a directed acyclic task relationship graph with drawing command opcodes as nodes and frame buffer read / write dependencies between drawing commands as edges, and call a graph segmentation algorithm to perform task graph segmentation on the directed acyclic task relationship graph, splitting the directed acyclic task relationship graph into multiple drawing task subgraphs.

[0132] The source compute node parses the graphics drawing command sequence units in the initial graphics instruction stream. Each drawing command opcode is treated as a node in a directed acyclic task graph. The framebuffer read / write dependencies between adjacent drawing command opcodes are analyzed: if a subsequent drawing command opcode reads the framebuffer area written by a previous drawing command opcode, a directed edge is established from the previous node to the next node. All nodes and edges constitute the directed acyclic task graph. A graph partitioning algorithm is called to partition the directed acyclic task graph, dividing the nodes into multiple subsets, each subset forming a drawing task subgraph. During the partitioning process, the number of directed edges between different drawing task subgraphs is minimized, and each drawing task subgraph contains multiple drawing command opcodes with tight coupling dependencies.
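The graph construction and a very simple partition can be sketched as follows. Each drawing command is modelled as a pair of hypothetical region-ID sets (reads, writes) in submission order. An edge i to j is added whenever command j reads a region written by an earlier command i. As a simplification, the sketch partitions by weakly connected components using union-find. This produces subgraphs with zero cut edges, but unlike a real graph partitioning algorithm it makes no attempt to balance subgraph sizes.

```python
def build_dag(commands):
    """commands: list of (reads, writes) region sets in submission order.

    Returns edges (i, j) meaning command j reads a region written by command i.
    """
    edges = set()
    for j, (reads_j, _) in enumerate(commands):
        for i in range(j):
            if commands[i][1] & reads_j:
                edges.add((i, j))
    return edges


def partition(n, edges):
    """Group nodes into weakly connected components (zero cut edges)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

Edges only ever point from earlier to later commands, so the graph is acyclic by construction.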

[0133] Step S242: Based on the available number of unified computing architecture cores and the dedicated graphics memory capacity in the hardware capability description information set of each target computing node in the target node capability registry, perform task subgraph allocation and scheduling processing on the multiple drawing task subgraphs; wherein, the allocation and scheduling is configured to: allocate the drawing task subgraphs with higher computational intensity to the target computing nodes with a larger number of available unified computing architecture cores.

[0134] For each drawing task subgraph obtained in step S241, its computational intensity is calculated by jointly evaluating the number of shader program instructions, vertices, and fragments contained in the subgraph. All target compute nodes registered in the target node capability registry are traversed, and the available number of unified computing architecture cores and the dedicated graphics memory capacity of each node are read. The target compute nodes are sorted in descending order of available unified computing architecture cores. Drawing task subgraphs with higher computational intensity are then assigned to target compute nodes with more available unified computing architecture cores. At the same time, the system checks whether the expected graphics memory footprint of the drawing task subgraph is less than the dedicated graphics memory capacity of the target compute node; if not, the next suitable target compute node is selected.
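The allocation rule can be sketched as a rank-matching greedy pass. The text does not give a formula for computational intensity, so the weighted sum and its weights below are hypothetical. The sketch also makes two further assumptions:

- The i-th most intensive subgraph starts at the i-th node in descending core order, wrapping around.
- If memory is insufficient, the search moves on to the next node.

As in the text, the memory check compares against each node's total dedicated capacity and does not track cumulative usage.

```python
from dataclasses import dataclass


@dataclass
class Subgraph:
    name: str
    shader_instrs: int
    vertices: int
    fragments: int
    mem_bytes: int  # expected graphics memory footprint


@dataclass
class TargetNode:
    name: str
    free_cores: int
    vram_bytes: int  # dedicated graphics memory capacity


def intensity(sg, w_instr=1.0, w_vert=0.1, w_frag=0.01):
    # Hypothetical weighted sum; the text does not give the formula.
    return w_instr * sg.shader_instrs + w_vert * sg.vertices + w_frag * sg.fragments


def allocate(subgraphs, nodes):
    ordered_nodes = sorted(nodes, key=lambda n: n.free_cores, reverse=True)
    plan = {}
    for rank, sg in enumerate(sorted(subgraphs, key=intensity, reverse=True)):
        for k in range(len(ordered_nodes)):
            node = ordered_nodes[(rank + k) % len(ordered_nodes)]
            if sg.mem_bytes < node.vram_bytes:  # memory check from the text
                plan[sg.name] = node.name
                break
        else:
            raise RuntimeError(f"no node has enough graphics memory for {sg.name}")
    return plan
```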

[0135] Step S243: Convert the instruction stream semantic unit sequence subset corresponding to each drawing task subgraph into a sub-instruction stream fragment adapted to the target hardware executable graphics instruction stream corresponding to the target computing node, and embed a frame buffer synchronization barrier instruction in each sub-instruction stream fragment to coordinate the rendering stage synchronization between different target computing nodes.

[0136] For the drawing task subgraph assigned to each target computing node in step S242, the instruction stream semantic unit sequence subset corresponding to that subgraph is extracted. Following the method in step S130, this subset is converted into sub-instruction stream fragments adapted to the respective target computing node, based on the hardware instruction encoding rules of that node's second graphics processing hardware set. A frame buffer synchronization barrier instruction is embedded at the end of each sub-instruction stream fragment. It notifies the graphics processor of the target computing node that subsequent operations must not read the contents of the current frame buffer region until all rendering operations of the current sub-instruction stream fragment have completed.

[0137] Step S244: Each of the sub-instruction stream fragments is transmitted to the corresponding target computing node through the cross-node transmission channel. Each of the target computing nodes calls its local second graphics processing hardware set in parallel to perform hardware-accelerated execution processing on the received sub-instruction stream fragments, generating their respective corresponding sub-rendering result frame data fragments.

[0138] The source compute node transmits the sub-instruction stream fragments generated in step S243 to their respective target compute nodes via the cross-node transmission channel established in step S140. Upon receiving the sub-instruction stream fragments, each target compute node concurrently invokes its local second graphics processing hardware set to perform hardware-accelerated execution processing according to the method in step S150. Each target compute node independently processes its received sub-instruction stream fragments, generating corresponding sub-rendering result frame data fragments. The sub-rendering result frame data fragments contain only pixel color values and pixel depth values for a portion of the complete frame buffer.

[0139] Step S245: Collect sub-rendering result frame data fragments returned from each of the target computing nodes in the source computing node, and perform fragment combination and splicing processing on the sub-rendering result frame data fragments according to the original frame buffer dependency relationship recorded in the directed acyclic task relationship graph, and fuse them into complete graphics rendering result frame data.

[0140] The source compute node receives sub-rendering result frame data fragments returned by each target compute node via a backhaul channel. Each sub-rendering result frame data fragment carries its corresponding drawing task subgraph identifier and rendering region range descriptor. Based on the original frame buffer dependencies recorded in the directed acyclic task relationship graph, the source compute node determines the writing order and position of each sub-rendering result frame data fragment in the full frame buffer. For fragments without overlap, pixel data is directly written to the corresponding position in the full frame buffer according to their respective region range descriptors. For fragments with overlap, according to the read / write order in the original frame buffer dependencies, the later-written fragment overwrites the pixel values of the corresponding overlapping regions in the earlier-written fragment. After all fragments are processed, the complete graphics rendering result frame data is obtained.
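The fusion rule amounts to writing fragments in dependency order, with later fragments overwriting overlaps. The following is a minimal sketch of that rule. It assumes each fragment has already been tagged with a write-order rank taken from a topological order of the task graph. Each fragment's region range descriptor is reduced to an (x0, y0) origin plus a row-major block of single-channel pixel values. Depth values are ignored here. The function name is hypothetical.

```python
def composite(width, height, fragments, clear=0):
    """Fuse sub-rendering fragments into one frame.

    fragments: iterable of (write_order, x0, y0, pixels), where pixels is a
    list of rows and write_order comes from the task graph's topological order.
    """
    frame = [[clear] * width for _ in range(height)]
    for _, x0, y0, pixels in sorted(fragments, key=lambda f: f[0]):
        for dy, row in enumerate(pixels):
            for dx, value in enumerate(row):
                # Later-ordered fragments overwrite earlier ones where they overlap.
                frame[y0 + dy][x0 + dx] = value
    return frame
```

Non-overlapping fragments are unaffected by the sort. The write order only matters where regions overlap, which is exactly the case the dependency graph is meant to resolve.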

[0141] Step S246: Write the fused complete graphics rendering result frame data into the frame buffer display area of the source computing node to complete the distributed heterogeneous parallel hardware-accelerated rendering processing of the initial graphics instruction stream generated by the target application.

[0142] The source compute node writes the complete graphics rendering result frame data obtained in step S245 into its graphics display hardware frame buffer area according to the methods described in steps S158 and S159, and outputs it to the physical display device for visual presentation. Thus, the distributed heterogeneous parallel hardware-accelerated rendering processing of the initial graphics instruction stream generated by the target application is completed. Multiple compute nodes running different operating system architectures (e.g., a first operating system and a second operating system, a RISC machine architecture and a complex instruction set architecture) and configured with different graphics processing hardware sets collaboratively complete the rendering task.

[0143] For example, after constructing the intermediate representation structure of the graphics pipeline containing drawing state machine description blocks and resource binding relationship description blocks based on the instruction stream semantic unit sequence, and before performing hardware instruction translation processing on the intermediate representation structure of the graphics pipeline based on the hardware instruction encoding rules of the second graphics processing hardware set, the method may further include: step S250: extracting multiple consecutively arranged drawing state machine description blocks in the storage area of the drawing state machine description blocks of the intermediate representation structure of the graphics pipeline, performing read-modify-write data hazard relationship analysis processing on the drawing command opcodes in the multiple drawing state machine description blocks, and identifying instruction pairs with write-before-read data dependency relationships among the multiple drawing state machine description blocks.

[0144] Step S250 runs after the intermediate representation structure of the graphics pipeline is constructed in step S130. It also runs before hardware instruction translation processing is performed on that structure in step S135, based on the hardware instruction encoding rules of the second graphics processing hardware set. Step S250 extracts multiple consecutively arranged drawing state machine description blocks from the drawing state machine description block storage area. The drawing command opcodes in these blocks are traversed, and the frame buffer regions or status registers written and read by each drawing command opcode are identified. For any two drawing command opcodes, suppose a resource region written by the first overlaps a resource region read by the second, and the first precedes the second in execution order. In that case, the two drawing command opcodes are marked as an instruction pair with a write-before-read data dependency.

[0145] Step S251: Perform dependency distance calculation on each instruction pair that has a write-before-read data dependency. This yields the instruction interval count value between the pair's write operation instruction and read operation instruction in the drawing state machine description block storage area. The instruction interval count value represents the number of other drawing state machine description blocks located between the write operation instruction and the read operation instruction.

[0146] For each instruction pair with a write-before-read data dependency identified in step S250, the position indexes of the write operation instruction and the read operation instruction in the drawing state machine description block storage area are obtained. The position index of the write operation instruction is subtracted from that of the read operation instruction, and 1 is then subtracted from the result. The resulting value is the number of other drawing state machine description blocks located between the write and read operation instructions, and it is recorded as the instruction interval count value.

[0147] Step S252: Obtain the instruction pipeline depth parameter of the second graphics processing hardware set. The instruction pipeline depth parameter represents the number of hardware clock cycles required for the second graphics processing hardware set to send machine-level graphics instructions from the instruction issuing unit to the execution pipeline and generate write-back result data from the machine-level graphics instructions.

[0148] The hardware capability description information set obtained by the source compute node during the capability exchange process in step S121 includes the instruction pipeline depth parameter of the second graphics processing hardware set. This parameter is an integer representing the number of hardware clock cycles between two moments. The first is when the instruction issuing unit issues a machine-level graphics instruction to the execution pipeline. The second is when the execution result of that instruction is written back to the register file or memory.

[0149] Step S253: Compare the instruction interval count value with the instruction pipeline depth parameter. If the instruction interval count value is less than the instruction pipeline depth parameter, mark the instruction pair as a pipeline bubble risk instruction pair. If such a pair were converted directly into the target hardware executable graphics instruction stream, the instruction pipeline of the second graphics processing hardware set would have to insert wait (bubble) cycles.

[0150] The instruction interval count obtained in step S251 is compared with the instruction pipeline depth parameter obtained in step S252. If the instruction interval count is less than the instruction pipeline depth parameter, too few instructions lie between the write and read instructions to cover the depth of the instruction pipeline. The read instruction would then start executing before the result of the write instruction has been written back. This causes the instruction pipeline to stall and insert wait (bubble) cycles. Such an instruction pair is marked as a pipeline bubble risk instruction pair.
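Steps S250 to S253 can be sketched together on a simplified model. In this model, each drawing state machine description block is a pair of hypothetical resource-ID sets (reads, writes), indexed by its position in the storage area. As in the text, the interval count (in blocks) is compared directly with the pipeline depth (in clock cycles). This implicitly treats each intervening block as occupying one issue cycle.

```python
def raw_pairs(blocks):
    """Return (write_idx, read_idx) pairs with a write-before-read dependency.

    blocks: list of (reads, writes) sets of frame buffer regions or status
    registers, in storage-area order.
    """
    pairs = []
    for j, (reads_j, _) in enumerate(blocks):
        for i in range(j):
            if blocks[i][1] & reads_j:
                pairs.append((i, j))
    return pairs


def interval_count(write_idx, read_idx):
    """Number of other blocks strictly between the write and the read (S251)."""
    return read_idx - write_idx - 1


def bubble_risk_pairs(blocks, pipeline_depth):
    """Pairs whose gap is too short to cover the pipeline depth (S253)."""
    return [(w, r) for w, r in raw_pairs(blocks)
            if interval_count(w, r) < pipeline_depth]
```

For example, with a depth of 2, a write at index 0 followed by its read at index 2 leaves only one intervening block. Such a pair is flagged.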

[0151] Step S254: For each instruction pair marked as a pipeline bubble risk instruction pair, perform an independent-instruction migration and insertion operation. The operation targets the position between the write operation instruction and the read operation instruction in the drawing state machine description block storage area of the graphics pipeline intermediate representation structure. From subsequent positions in the storage area, select other drawing state machine description blocks that have no read/write data hazard with either the write operation instruction or the read operation instruction. Migrate the selected blocks into the position between the write operation instruction and the read operation instruction.

[0152] For each instruction pair marked as a pipeline bubble risk instruction pair, the positions of the write and read instructions within the drawing state machine description block storage area are located. The part of the storage area following the read instruction is searched for other drawing state machine description blocks that have no read/write data hazard with either the write or the read instruction. Read/write data hazards include three types: write-after-read, write-after-write, and read-after-write. Each matching drawing state machine description block is deleted from its original position and inserted between the write and read instructions. The instruction interval count is incremented by 1 for each inserted block. The migration and insertion operation is repeated until either the instruction interval count is greater than or equal to the instruction pipeline depth parameter, or no further matching drawing state machine description block exists in subsequent positions.

[0153] Step S255: Subsequent positions of the drawing state machine description block storage area may not contain enough other drawing state machine description blocks that are free of read/write data hazards with the write operation instruction and the read operation instruction. In that case, insert no-operation description blocks at the position between the write operation instruction and the read operation instruction. When converted into the target hardware executable graphics instruction stream, each no-operation description block corresponds to a pipeline idle instruction of the second graphics processing hardware set.

[0154] If, in step S254, a sufficient number of other drawing state machine description blocks that meet the conditions cannot be found to fill the gap between the write and read instructions, a no-operation description block is created at the position between the write and read instructions. The no-operation description block is a virtual instruction that does not perform any actual graphics operation, and its format is compatible with the drawing state machine description block. During the hardware instruction translation process in subsequent step S135, this no-operation description block is converted into pipeline idle instructions for the second graphics processing hardware set. A sufficient number of no-operation description blocks are inserted so that the instruction interval count is greater than or equal to the instruction pipeline depth parameter.
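Steps S254 and S255 for a single risk pair can be sketched as follows. The sketch uses the same simplified block model, with hypothetical resource-ID sets and a hypothetical `NOP` block.

The text only requires a migrated block to be hazard-free with respect to the write and read blocks. However, moving a block ahead of the read also moves it ahead of every block between the read and its original position. The sketch therefore also checks those blocks, an added condition needed to preserve program order. Handling several pairs would mean repeating this after recomputing indices.

```python
from typing import NamedTuple


class Block(NamedTuple):
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


NOP = Block("NOP")  # hypothetical no-operation description block


def hazard(a, b):
    """True if a and b have a read-after-write, write-after-read or write-after-write conflict."""
    return bool(a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def relax_pair(blocks, w, r, depth):
    """Widen the gap between write block w and read block r to at least depth."""
    blocks = list(blocks)
    k = r + 1
    while r - w - 1 < depth and k < len(blocks):
        cand = blocks[k]
        # cand will move ahead of blocks[r:k] (the read block and everything up
        # to cand), so it must not conflict with them; the text also requires
        # no hazard with the write block.
        if not hazard(cand, blocks[w]) and not any(hazard(cand, o) for o in blocks[r:k]):
            blocks.insert(r, blocks.pop(k))
            r += 1  # the read block moved one slot later
        k += 1  # next unexamined block (later indices are unchanged by the move)
    while r - w - 1 < depth:  # step S255: pad with no-operation blocks
        blocks.insert(r, NOP)
        r += 1
    return blocks
```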

[0155] Step S256: After the independent-instruction migration and insertion operation, perform instruction address remapping on the drawing state machine description block storage area. Recalculate the offset address index value of each drawing state machine description block in the graphics pipeline intermediate representation structure. Then update the jump offset fields and branch target address fields between drawing state machine description blocks.

[0156] After completing the migration and insertion operations in steps S254 and S255, the order and position of each drawing state machine description block in the drawing state machine description block storage area have changed. The drawing state machine description block storage area is traversed, and the offset address index value of each drawing state machine description block in the storage area is recalculated according to the new order. For instructions in the drawing state machine description block that contain a jump offset field, the value of the jump offset field is updated according to the new offset address index value of the target drawing state machine description block. For instructions in the drawing state machine description block that contain a branch target address field, the value of the branch target address field is updated according to the new offset address index value of the branch target drawing state machine description block.
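Address remapping works as long as each block keeps a stable identity that survives reordering. The following sketch rests on several hypothetical assumptions:

- Each block carries a hypothetical `block_id` and an encoded size in bytes.
- Jump fields store an offset relative to the start of the jumping block.
- Branch fields store an absolute address within the storage area.

The text does not specify either encoding convention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DescBlock:
    block_id: str                        # stable identity that survives reordering
    size: int                            # encoded size in bytes (hypothetical)
    jump_target: Optional[str] = None    # block_id referenced by a jump offset field
    branch_target: Optional[str] = None  # block_id referenced by a branch target field
    offset: int = 0
    jump_offset: Optional[int] = None
    branch_address: Optional[int] = None


def remap(blocks):
    """Recompute offsets in the new order, then patch jump and branch fields."""
    addr, offset_of = 0, {}
    for b in blocks:
        b.offset = addr
        offset_of[b.block_id] = addr
        addr += b.size
    for b in blocks:
        if b.jump_target is not None:
            b.jump_offset = offset_of[b.jump_target] - b.offset  # relative
        if b.branch_target is not None:
            b.branch_address = offset_of[b.branch_target]        # absolute
    return blocks
```

Two passes are needed because a forward reference cannot be resolved until the offset of its target block is known.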

[0157] Step S257: The graphics pipeline intermediate representation structure with relaxed instruction dependencies is used as the input data source for the subsequent hardware instruction translation process. From it, that process generates a target hardware executable graphics instruction stream with fewer pipeline bubbles.

[0158] The drawing state machine description block storage area output in step S256 is used as the updated intermediate representation structure of the graphics pipeline. In this structure, the gap between write and read instructions has been widened by migrating independent instructions or inserting no-operation description blocks. As a result, the instruction interval count is greater than or equal to the instruction pipeline depth parameter. This updated intermediate representation structure is passed as the input data source to the hardware instruction translation process in step S135, which performs hardware instruction translation on it. The generated target hardware executable graphics instruction stream therefore has enough padding instructions between write and read instructions. This reduces the number of pipeline bubbles during execution by the second graphics processing hardware set.

[0159] For example, the method may further include display timing adaptation processing on the graphics rendering result frame data to eliminate the cross-operating-system vertical synchronization phase deviation. This processing takes place after the frame data is returned from the target computing node to the source computing node via the backhaul channel, and before it is written into the frame buffer display area of the source computing node. It comprises: step S260: obtaining, from the operating system kernel of the source computing node, the vertical blanking cycle timing parameters of the current display subsystem of the source computing node. These parameters include the start timestamp value and the end timestamp value of the current vertical blanking cycle, and the expected start timestamp value of the next vertical blanking cycle.

[0160] Step S260 performs display timing adaptation processing. It runs after step S150, in which the graphics rendering result frame data is returned from the target computing node to the source computing node via the backhaul channel. It also runs before step S158, in which the frame data is written to the frame buffer display area of the source computing node. The source computing node calls the display subsystem query interface provided by the operating system kernel. This interface reads the start and end timestamp values of the current vertical blanking cycle from the registers of the display controller. Because vertical blanking cycles recur at a fixed interval, the expected start timestamp value of the next cycle is calculated by adding this fixed interval length to the end timestamp value of the current cycle. These three timestamp values constitute the vertical blanking cycle timing parameters.

[0161] Step S261: Extract the target-side completion timestamp from the data packet received over the backhaul channel. The target computing node attached this timestamp when it generated the graphics rendering result frame data. The timestamp is generated in the system clock domain of the target computing node, which is offset from the system clock domain of the source computing node.

[0162] After the target computing node generates the graphics rendering result frame data in step S150, it encapsulates the frame data into a transmission data packet. At that moment it reads the current value of its system clock counter and appends this value to the packet header as the target-side completion timestamp. After receiving the data packet from the backhaul channel, the source computing node extracts this timestamp from the packet header. The system clock counters of the source and target computing nodes may be driven by different crystal oscillators and started at different times. There is therefore a clock offset between the two clock domains.

[0163] Step S262: Convert the target-end generation completion timestamp into the system clock domain of the source computing node. The conversion uses a clock offset calculated in advance from clock synchronization protocol messages. The source and target computing nodes exchange these messages over the cross-node transmission channel. Mapping the timestamp with this offset yields the source-end equivalent completion timestamp value. Calculate the time difference between this equivalent completion timestamp value and the expected start timestamp value of the next vertical blanking cycle. If the time difference is less than the preset frame submission safety margin, it is determined that submitting the graphics rendering result frame data to the frame buffer display area will cause display tearing artifacts or frame loss.

[0164] The source and target computing nodes calculate the clock offset between them in advance by exchanging network time protocol messages. This clock offset is added to the extracted target-end completion timestamp to obtain the equivalent completion timestamp value in the source computing node's system clock domain. Next, the difference is calculated between this equivalent completion timestamp value and the expected start timestamp value of the next vertical blanking cycle obtained in step S260. The absolute value of that difference is taken as the time difference and compared with the preset frame submission safety margin. A time difference below the margin means that too little time separates the arrival of the graphics rendering result frame data at the source computing node from the start of the next vertical blanking cycle. The system graphics driver layer may then be unable to complete the frame buffer update in time, which is likely to cause display tearing artifacts or frame loss.
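Steps S262 and S264 of the text can be illustrated with a short sketch. It assumes a standard NTP-style four-timestamp exchange with symmetric path delay, with the source node acting as the client. Under that assumption, the offset of the source clock relative to the target clock is ((t1 - t2) + (t4 - t3)) / 2. This value is added to the target-end timestamp, exactly as the text describes. The function names and example numbers are hypothetical.

```python
def clock_offset_source_minus_target(t1: int, t2: int, t3: int, t4: int) -> float:
    """NTP-style offset estimate (source clock minus target clock).

    t1: request sent (source clock)    t2: request received (target clock)
    t3: reply sent (target clock)      t4: reply received (source clock)
    Assumes the forward and return path delays are equal.
    """
    return ((t1 - t2) + (t4 - t3)) / 2


def tearing_risk(target_completion_ts: int,
                 offset_source_minus_target: float,
                 next_vblank_start_ts: int,
                 safety_margin: int) -> tuple[bool, float]:
    """Return (risk_detected, time_difference) as described in step S262."""
    # Map the target-domain timestamp into the source clock domain.
    equivalent_ts = target_completion_ts + offset_source_minus_target
    # Absolute distance to the expected start of the next vertical blanking cycle.
    time_difference = abs(equivalent_ts - next_vblank_start_ts)
    return time_difference < safety_margin, time_difference


# Target clock runs 500 units ahead; 100-unit one-way delay, 50-unit turnaround.
offset = clock_offset_source_minus_target(t1=1000, t2=1600, t3=1650, t4=1250)
print(offset)  # -500.0
print(tearing_risk(10_500, offset, next_vblank_start_ts=10_300, safety_margin=500))  # (True, 300.0)
```

The absolute value follows the text as written. It means a frame that arrives well after the expected vertical blanking start is not flagged.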

[0165] Step S263: If it is determined that the graphics rendering result frame data will cause the display tearing artifact or the frame loss phenomenon, then the graphics rendering result frame data is temporarily stored in the display timing retiming buffer queue of the source computing node, and a frame delay flag and a target display vertical blanking cycle sequence number are added to the graphics rendering result frame data.

[0166] When step S262 determines that display tearing artifacts or frame loss will occur, the source compute node allocates a display timing retiming buffer queue in memory. The graphics rendering result frame data is temporarily stored in this display timing retiming buffer queue instead of being immediately submitted to the frame buffer update function. A frame delay flag is appended to the frame data and set to true. Simultaneously, the target display vertical blanking cycle sequence number is calculated by adding a delay offset to the current vertical blanking cycle sequence number and appended to the metadata of the frame data.
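The deferral in step S263 can be modelled as appending a frame, with its metadata, to a queue instead of submitting it. This is a minimal sketch. It assumes a Python `deque` stands in for the display timing retiming buffer queue and that metadata is a plain dictionary. The lower bound on `delay_offset` is an added assumption, not something the text specifies, and the function name `defer_frame` is hypothetical.

```python
from collections import deque


def defer_frame(retiming_queue: deque, frame_data: bytes,
                current_vblank_seq: int, delay_offset: int) -> dict:
    """Park a frame instead of submitting it to the frame buffer update function.

    The frame delay flag is set to True and the target display vertical
    blanking cycle sequence number is the current sequence number plus a
    delay offset, both recorded in the frame's metadata.
    """
    # Assumption: a deferred frame must wait at least one vertical blanking cycle.
    if delay_offset < 1:
        raise ValueError("delay offset must be at least one vertical blanking cycle")
    entry = {
        "frame_data": frame_data,
        "metadata": {
            "frame_delay": True,
            "target_vblank_seq": current_vblank_seq + delay_offset,
        },
    }
    retiming_queue.append(entry)
    return entry


queue = deque()
entry = defer_frame(queue, b"pixels", current_vblank_seq=41, delay_offset=2)
print(entry["metadata"])  # {'frame_delay': True, 'target_vblank_seq': 43}
```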

[0167] Step S264: Maintain the mapping relationship between the graphics rendering result frame data and the target display vertical blanking cycle sequence number in the display timing retiming buffer queue. When the current vertical blanking cycle of the source computing node advances to a vertical blanking cycle that matches the target display vertical blanking cycle sequence number, retrieve the graphics rendering result frame data from the display timing retiming buffer queue.

[0168] Each time the display controller of the source computing node triggers a vertical blanking interrupt, the interrupt handler increments the current vertical blanking cycle sequence number by 1. The interrupt handler then traverses the display timing retiming buffer queue. It searches for frame data whose metadata records a target display vertical blanking cycle sequence number equal to the current vertical blanking cycle sequence number. When matching frame data is found, the graphics rendering result frame data is retrieved from the display timing retiming buffer queue and its frame delay flag is cleared.
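The per-interrupt bookkeeping of step S264 can be modelled in ordinary Python. This is a behavioural sketch only, not interrupt-handler code. It assumes each queue entry holds its frame data and a metadata dictionary. The class name `VblankRetimer` is hypothetical.

```python
class VblankRetimer:
    """Models the per-interrupt bookkeeping of step S264 (not real interrupt code)."""

    def __init__(self, start_seq: int = 0):
        self.current_seq = start_seq
        # Entries: {"frame_data": bytes, "metadata": {"frame_delay": bool, "target_vblank_seq": int}}
        self.queue: list[dict] = []

    def on_vblank_interrupt(self) -> list[dict]:
        """Advance the sequence number and release frames whose target cycle has arrived."""
        self.current_seq += 1
        due, waiting = [], []
        for entry in self.queue:
            if entry["metadata"]["target_vblank_seq"] == self.current_seq:
                entry["metadata"]["frame_delay"] = False  # clear the frame delay flag
                due.append(entry)
            else:
                waiting.append(entry)
        self.queue = waiting
        return due


retimer = VblankRetimer(start_seq=10)
retimer.queue.append({"frame_data": b"f0",
                      "metadata": {"frame_delay": True, "target_vblank_seq": 12}})
print(len(retimer.on_vblank_interrupt()))  # 0 (sequence number is now 11)
released = retimer.on_vblank_interrupt()
print(released[0]["metadata"])  # {'frame_delay': False, 'target_vblank_seq': 12}
```

The sketch uses exact equality, as the text describes. An implementation that can miss interrupts might prefer a `<=` comparison so that an overdue frame is still released.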

[0169] Step S265: Perform frame display lifetime extension processing on the graphics rendering result frame data retrieved from the display timing retiming buffer queue, extending the expected display duration parameter of the graphics rendering result frame data from the original single vertical blanking period to multiple consecutive vertical blanking periods. Within the multiple consecutive vertical blanking periods, the display control unit of the source computing node repeatedly scans and outputs the same set of graphics rendering result frame data.

[0170] For the graphics rendering result frame data retrieved from the display timing retiming buffer queue, its frame display lifecycle parameters are modified. The original single vertical blanking cycle display duration is changed to multiple consecutive vertical blanking cycle display durations. The number of extended cycles is determined based on the time difference calculated in step S262; the smaller the time difference, the more extended cycles. In the subsequent multiple consecutive vertical blanking cycles, the display control unit repeatedly scans and reads the same graphics rendering result frame data from the frame buffer area and outputs it to the physical display device until the display lifecycle of that frame data ends.
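The text fixes only the direction of the rule in step S265: a smaller time difference gives more extended cycles. It does not give a formula. The sketch below therefore uses a hypothetical linear policy, which adds one cycle per `step_us` of shortfall below the safety margin and is capped at `max_cycles`. Both the policy and the parameter names are assumptions for illustration.

```python
def extended_display_cycles(time_difference_us: int,
                            safety_margin_us: int,
                            step_us: int,
                            max_cycles: int = 4) -> int:
    """Hypothetical policy: one extra vertical blanking cycle per step_us of shortfall.

    The source only requires that a smaller time difference yields more
    cycles; the linear step and the max_cycles cap are assumptions.
    """
    if step_us <= 0 or max_cycles < 1:
        raise ValueError("step_us must be positive and max_cycles at least 1")
    if time_difference_us >= safety_margin_us:
        return 1  # no extension: display for the original single cycle
    shortfall_us = safety_margin_us - time_difference_us
    extra = -(-shortfall_us // step_us)  # integer ceiling division
    return min(1 + extra, max_cycles)


for diff in (5000, 3500, 2000, 0):
    print(diff, extended_display_cycles(diff, safety_margin_us=4000, step_us=1000))
# 5000 1
# 3500 2
# 2000 3
# 0 4
```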

[0171] Step S266: If the display control unit of the source computing node does not receive the next frame of graphics rendering result frame data from the target computing node within the multiple consecutive vertical blanking cycles, a rendering pacing adjustment feedback message is sent to the target computing node. The rendering pacing adjustment feedback message carries a request identifier instructing the target computing node to increase the priority of its graphics instruction processing.

[0172] During the frame display lifecycle extension process, the display control unit continuously checks whether it has received the next frame of graphics rendering result data from the target compute node. If the next frame data is not received before the end of the current frame display lifecycle, the source compute node sends a rendering pacing adjustment feedback message to the target compute node via the network transport protocol stack. The message body of this rendering pacing adjustment feedback message contains a request identifier, which instructs the target compute node to increase its graphics instruction processing priority. Upon receiving this message, the target compute node adjusts its graphics driver layer scheduling strategy, raising the priority of the graphics instruction submission queue associated with the source compute node to the highest level.
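The feedback loop of step S266 can be sketched end to end. The text says only that the message body carries a request identifier. The wire layout below, a one-byte request ID followed by a four-byte source node ID, is hypothetical, as are the constant `RAISE_PRIORITY_REQUEST` and the dictionary that models the target's queue priorities.

```python
import struct

# Hypothetical wire format and values; the source only says the message body
# carries a request identifier asking for higher processing priority.
RAISE_PRIORITY_REQUEST = 0x01
FEEDBACK_MESSAGE = struct.Struct("!BI")  # u8 request id, u32 source node id


def should_send_feedback(new_frame_received: bool,
                         vblanks_elapsed: int,
                         lifetime_cycles: int) -> bool:
    """Source side: no new frame arrived before the extended display lifecycle ended."""
    return not new_frame_received and vblanks_elapsed >= lifetime_cycles


def build_feedback(source_node_id: int) -> bytes:
    return FEEDBACK_MESSAGE.pack(RAISE_PRIORITY_REQUEST, source_node_id)


def handle_feedback(message: bytes, queue_priority: dict, highest: int) -> None:
    """Target side: raise the submission queue tied to the sender to the top level."""
    request_id, source_node_id = FEEDBACK_MESSAGE.unpack(message)
    if request_id == RAISE_PRIORITY_REQUEST:
        queue_priority[source_node_id] = highest


priorities = {5: 1, 6: 1}  # source node id -> submission queue priority
if should_send_feedback(False, vblanks_elapsed=3, lifetime_cycles=3):
    handle_feedback(build_feedback(5), priorities, highest=10)
print(priorities)  # {5: 10, 6: 1}
```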

[0173] Step S267: Submit the graphics rendering result frame data that has undergone display timing adaptation processing to the frame buffer update function in the system graphics driver layer of the source computing node. The frame buffer update function writes the graphics rendering result frame data into the graphics display hardware frame buffer area of the source computing node.

[0174] After the display timing adaptation processing in steps S263 to S265, the graphics rendering result frame data is submitted to the frame buffer update function in the system graphics driver layer of the source compute node. The frame buffer update function, as described in step S158, writes the graphics rendering result frame data into the graphics display hardware frame buffer area of the source compute node, preparing it for scanning and output by the display control unit at the start of the next vertical blanking cycle.
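The write performed by the frame buffer update function can be illustrated as a row-by-row copy into a linear frame buffer. This sketch makes several assumptions: a packed pixel format, a stride measured in bytes, and a window offset given as (x bytes, y rows). The offset stands in for the memory offset associated with the application's window handle, and how that offset is derived is not modelled. The function name is hypothetical.

```python
def write_frame_to_framebuffer(framebuffer: bytearray, fb_stride: int,
                               frame: bytes, row_bytes: int,
                               offset_x_bytes: int, offset_y_rows: int) -> None:
    """Copy a packed frame row by row into a window region of a linear frame buffer.

    The window's (x, y) offset stands in for the memory offset associated with the
    target application's window handle; how it is derived is not modelled here.
    """
    if row_bytes <= 0 or len(frame) % row_bytes:
        raise ValueError("frame size must be a whole number of rows")
    rows = len(frame) // row_bytes
    if offset_x_bytes + row_bytes > fb_stride:
        raise ValueError("window row exceeds frame buffer stride")
    if (offset_y_rows + rows) * fb_stride > len(framebuffer):
        raise ValueError("window exceeds frame buffer height")
    for row in range(rows):
        dst = (offset_y_rows + row) * fb_stride + offset_x_bytes
        src = row * row_bytes
        # Equal-length slice assignment keeps the bytearray size unchanged.
        framebuffer[dst:dst + row_bytes] = frame[src:src + row_bytes]


fb = bytearray(12)  # 3 rows x 4 bytes, zero-filled
write_frame_to_framebuffer(fb, fb_stride=4, frame=b"abcd", row_bytes=2,
                           offset_x_bytes=1, offset_y_rows=1)
print(fb)  # bytearray(b'\x00\x00\x00\x00\x00ab\x00\x00cd\x00')
```

The bounds checks matter in Python in particular. Assigning a longer slice to a bytearray would silently grow it instead of failing.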

[0175] Based on the same inventive concept, please refer to Figure 2, which shows a schematic block diagram of an application graphics interface redirection and hardware acceleration system for heterogeneous computing environments provided in an embodiment of this application. The system includes a central processing unit (CPU) and system memory comprising random access memory (RAM) and read-only memory (ROM). A system bus connects the system memory and the CPU. The system also includes a basic input/output system that facilitates information transfer between devices within the computer. It further includes a mass storage device for storing the operating system, application programs, and other program modules.

[0176] The basic input/output system includes a display for presenting information and input devices, such as a mouse and keyboard, for user input. Both the display and the input devices are connected to the central processing unit through an input/output controller connected to the system bus. The basic input/output system may also include the input/output controller for receiving and processing input from multiple other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller also provides output to a display screen, printer, or other type of output device.

[0177] The mass storage device is connected to the central processing unit through a mass storage controller connected to the system bus. The mass storage device and its associated computer-readable media provide non-volatile storage for the application graphics interface redirection and hardware acceleration system. Without loss of generality, computer-readable media can include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media. They may be implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. According to various embodiments of this application, the system can also operate by connecting to remote computers over a network such as the Internet. That is, the system can be connected to a network through a network interface unit connected to the system bus. The network interface unit can also be used to connect to other types of networks or remote computer systems.

[0178] In addition, the specific embodiments of this application involve data such as user information. When the above embodiments are applied to specific products or technologies, user permission or consent must be obtained. The collection, use, and processing of related data must also comply with the relevant laws, regulations, and standards of the relevant countries and regions.

[0179] The above are merely exemplary embodiments of this application and are not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application shall be included within the protection scope of this application.

Claims

1. A method for redirecting and accelerating application graphics interfaces in a heterogeneous computing environment, characterized in that the method includes: The source computing node intercepts the initial graphics instruction stream generated by the target application. The initial graphics instruction stream includes a sequence of graphics drawing commands and a set of graphics resource descriptors that conform to the first graphics interface specification. The source computing node runs in a first operating system architecture environment and is configured with a first graphics processing hardware set. The second operating system architecture environment type and the hardware capability description information of the second graphics processing hardware set of the target computing node are obtained. Based on the second operating system architecture environment type and the hardware capability description information of the second graphics processing hardware set of the target computing node, the initial graphics instruction stream is subjected to instruction stream semantic parsing processing to obtain the instruction stream semantic unit sequence. The target computing node and the source computing node are in the same physical network or virtual network layer, and the target computing node runs in the second operating system architecture environment. Based on the instruction stream semantic unit sequence, a graphics pipeline intermediate representation structure containing a drawing state machine description block and a resource binding relationship description block is constructed. Based on the hardware instruction encoding rules of the second graphics processing hardware set, the graphics pipeline intermediate representation structure is subjected to hardware instruction translation processing to generate a target hardware executable graphics instruction stream.
The target hardware executable graphics instruction stream is transmitted from the source computing node to the target computing node through a cross-node transmission channel. The cross-node transmission channel is constructed by the network transmission protocol stack and the shared memory mapping space, and the cross-node transmission channel supports binary data exchange between the first operating system architecture environment and the second operating system architecture environment. The second graphics processing hardware set is invoked in the target computing node to perform hardware-accelerated execution of the target hardware executable graphics instruction stream to obtain graphics rendering result frame data, and the graphics rendering result frame data is returned from the target computing node to the frame buffer display area of the source computing node through the return channel.

2. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 1, characterized in that the interception of the initial graphics instruction stream generated by the target application on the source computing node includes: An instruction stream interception intermediate layer is implanted between the graphics interface runtime library of the source computing node and the system graphics driver layer. The instruction stream interception intermediate layer is embedded in the software stack of the source computing node in the form of an operating system kernel module or a user-mode dynamic link library. When the target application calls the graphics application programming interface function of the first graphics interface specification, the instruction stream interception intermediate layer captures the input parameter data structure and the output return value data structure of the graphics application programming interface function. The input parameter data structure includes the drawing command opcode, vertex data buffer pointer, index data buffer pointer, texture sampler handle and shader program handle. The drawing command opcodes contained in the input parameter data structure are stored as graphics drawing command sequence units according to the calling sequence. Each drawing command element in the graphics drawing command sequence unit retains the original timing relationship and command dependency relationship when the target application initiates the call.
Extract the vertex data memory block pointed to by the vertex data buffer pointer, the index data memory block pointed to by the index data buffer pointer, the texture image data block associated with the texture sampler handle, and the shader binary code block associated with the shader program handle contained in the input parameter data structure, and encapsulate the vertex data memory block, the index data memory block, the texture image data block, and the shader binary code block into the graphics resource descriptor set; Each drawing command element in the graphics drawing command sequence unit is associated with a resource reference binding relationship with the corresponding resource descriptor entry in the graphics resource descriptor set. The resource reference binding relationship is used to maintain the consistency of association between drawing commands and graphics resources in the subsequent instruction stream processing stage. The graphics drawing command sequence unit and the set of graphics resource descriptors carrying the resource reference binding relationship are subjected to data integrity encapsulation processing. Meta-information headers containing interception timestamps, application process identifiers, and source computing node hardware architecture identifiers are appended to generate a complete initial graphics instruction stream. The initial graphics instruction stream is temporarily stored in the memory buffer queue of the source computing node. The memory buffer queue adopts a circular queue management strategy to coordinate the timing difference between the graphics instruction generation rate of the target application and the processing rate of subsequent instruction stream semantic parsing processing.

3. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 1, characterized in that this step includes the operations set out below. The step in question subjects the initial graphics instruction stream to instruction stream semantic parsing processing based on the second operating system architecture environment type of the target computing node and the hardware capability description information of the second graphics processing hardware set, to obtain an instruction stream semantic unit sequence. The second operating system architecture environment type identifier of the target computing node and the hardware capability description information set of the second graphics processing hardware set are received in advance. The hardware capability description information set includes the graphics processor core architecture code name, the highest supported graphics interface specification version number, the maximum number of texture samplers, the number of available unified computing architecture cores, and the dedicated graphics memory capacity. Load an instruction stream semantic parser instance that matches the second operating system architecture environment type according to the second operating system architecture environment type identifier. The instruction stream semantic parser instance is pre-set with memory alignment rule sets, data byte order conversion rule sets and system call convention rule sets corresponding to different operating system architecture environments. Based on the data byte order conversion rule set configured in the instruction stream semantic parser instance, all binary data fields in the initial graphics instruction stream are processed for byte order conversion.
If the first operating system architecture environment and the second operating system architecture environment use different data byte order representation methods, the multi-byte data fields in the initial graphics instruction stream are converted from the first byte order representation format to the second byte order representation format. Based on the memory alignment rule set configured in the instruction stream semantic parser instance, the initial graphics instruction stream after byte order conversion is memory aligned and rearranged. The starting memory addresses of the vertex data memory blocks, index data memory blocks, and texture image data blocks contained in the initial graphics instruction stream are adjusted to aligned memory addresses that meet the memory alignment granularity requirements of the second operating system architecture environment. The graphics drawing command sequence units in the initial graphics instruction stream after memory alignment and rearrangement are deconstructed, and the drawing command opcode, associated vertex data range descriptor, associated index data range descriptor, and associated texture state binding descriptor of each drawing command element in the graphics drawing command sequence unit are extracted. The drawing command opcode, the associated vertex data range descriptor, the associated index data range descriptor, and the associated texture state binding descriptor are encapsulated into an independent instruction stream semantic unit, and the instruction stream semantic unit is sequentially arranged according to the original timing relationship in the graphics drawing command sequence unit to form the instruction stream semantic unit sequence. 
Based on the highest supported graphics interface specification version number in the hardware capability description information set, the maximum number limit of the texture sampler, and the number of available cores of the unified computing architecture, the instruction stream semantic unit sequence is subjected to hardware capability adaptation and pruning processing, removing or replacing opcode elements and resource binding elements in the instruction stream semantic unit sequence that exceed the hardware capability range of the second graphics processing hardware set.

4. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 1, characterized in that the process set out below is included. That process constructs a graphics pipeline intermediate representation structure containing a drawing state machine description block and a resource binding relationship description block based on the instruction stream semantic unit sequence. It then performs hardware instruction translation processing on the graphics pipeline intermediate representation structure, based on the hardware instruction encoding rules of the second graphics processing hardware set, to generate a target hardware executable graphics instruction stream. A blank graphics pipeline intermediate representation structure container is created, which consists of a state machine description block storage area, a resource binding relationship description block storage area, and a pipeline stage dependency relationship graph storage area; starting from the first instruction stream semantic unit in the instruction stream semantic unit sequence, the sequence is traversed in order toward its end. For each instruction stream semantic unit, a drawing state machine tracking and update operation is performed. When the drawing command opcode in the instruction stream semantic unit is a state setting type opcode, the corresponding state register entry in the drawing state machine description block storage area is updated. When the drawing command opcode in the instruction stream semantic unit is a drawing call type opcode, the complete state register snapshot in the current drawing state machine description block storage area is encapsulated with the drawing command opcode into a drawing state machine description block. For each instruction stream semantic unit, the associated vertex data range descriptor, associated index data range descriptor, and associated texture state binding descriptor are processed by resource handle parsing.
The virtual resource handle identifier defined in the first graphics interface specification is mapped to the unified resource identifier defined in the intermediate representation structure of the graphics pipeline. Based on the unified resource identifier, the resource binding relationship description block is constructed. The resource binding relationship description block includes the original memory address information of the graphics resource pointed to by each unified resource identifier in the memory space of the source computing node, the target memory layout information of the graphics resource in the target computing node, and the resource access permission attribute information of the graphics resource. Obtain the hardware instruction encoding rule description table of the second graphics processing hardware set. The hardware instruction encoding rule description table defines the opcode bit field layout format, operand addressing mode encoding format and instruction pipeline delay slot filling rules of the machine-level graphics instructions supported by the second graphics processing hardware set. For each drawing state machine description block in the drawing state machine description block storage area, an instruction template matching operation is performed. The target machine-level graphics instruction template sequence that matches the complete state register snapshot and the drawing command opcode in the hardware instruction encoding rule description table is searched. 
According to the operand addressing mode encoding format in the target machine-level graphics instruction template sequence, the original memory address information and the target memory layout information in the resource binding relationship description block associated with the drawing state machine description block are filled into the operand field of the target machine-level graphics instruction template sequence to generate a hardware instruction binary code sequence in the target hardware executable graphics instruction stream optimized for the second graphics processing hardware set. The hardware instruction binary code sequence is concatenated and arranged according to the original timing relationship in the instruction stream semantic unit sequence, and delay slot filling instructions that conform to the instruction pipeline delay slot filling rules are inserted between adjacent hardware instruction binary code sequences to generate a complete target hardware executable graphics instruction stream.

5. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 1, characterized in that the step of transmitting the target hardware-executable graphics instruction stream from the source computing node to the target computing node via a cross-node transmission channel includes: Register a physical memory region in the operating system kernel space of the source computing node as the source memory region of the shared memory mapping space, and set the memory page table entry permissions of the source memory region to allow remote direct memory access read operations; register a physical memory region in the operating system kernel space of the target computing node as the target memory region of the shared memory mapping space, and set the memory page table entry permissions of the target memory region to allow remote direct memory access write operations. The remote direct memory access control protocol of the network transport protocol stack establishes a reliable connection session between the source computing node and the target computing node. The reliable connection session includes queue pair identifiers, protection domain identifiers and memory region access key sets. The physical memory address information, memory region length information, and source access key from the memory region access key set of the source computing node are encapsulated into a memory region descriptor message. The memory region descriptor message is then transmitted from the source computing node to the target computing node through the network transport protocol stack.
After the target computing node receives the memory region descriptor message, it uses the physical memory address information, memory region length information, and source access key carried in the memory region descriptor message to establish a one-way memory mapping channel pointing to the source memory region in the remote direct memory access network interface card of the target computing node. The target hardware executable graphics instruction stream is copied from the memory buffer queue of the source computing node to the registered source memory region, and after the copy is completed, a transmission completion flag value is written to the completion flag field at the end of the source memory region. A one-sided write operation is initiated through the remote direct memory access network interface card of the source computing node. The target hardware executable graphics instruction stream in the source memory region is directly written into the target memory region of the target computing node using the one-way memory mapping channel. The one-sided write operation bypasses the central processing unit scheduling of the target computing node. After the remote direct memory access network interface card of the target computing node completes the one-sided write operation, a hardware interrupt signal is triggered to the operating system kernel of the target computing node, notifying the graphics driver layer of the target computing node to read the target hardware executable graphics instruction stream from the target memory region in preparation for hardware accelerated execution processing.

6. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 5, characterized in that the step set out below is included. That step invokes the second graphics processing hardware set in the target computing node to perform hardware-accelerated execution of the target hardware executable graphics instruction stream and obtain graphics rendering result frame data. It then returns the graphics rendering result frame data from the target computing node to the frame buffer display area of the source computing node through a return channel. In the graphics driver layer of the target computing node, a hardware-accelerated instruction submission queue is created specifically for executing the target hardware executable graphics instruction stream. The hardware-accelerated instruction submission queue is mapped to the hardware command ring buffer of the second graphics processing hardware set. The target hardware executable graphics instruction stream is read from the target memory region of the target computing node into the hardware-accelerated instruction submission queue, and an instruction execution notification is sent to the second graphics processing hardware set by updating the tail pointer register of the hardware command ring buffer. The graphics instruction scheduling unit in the second graphics processing hardware set sequentially retrieves the hardware instruction binary code sequence of the target hardware executable graphics instruction stream from the hardware command ring buffer, and distributes the hardware instruction binary code sequence to the shader core array, texture mapping unit and rasterization operation unit in the second graphics processing hardware set to perform parallel pipeline processing.
During the parallel pipeline processing, the shader core array obtains the shader binary code block from the target memory region of the target computing node according to the shader program start memory address carried in the hardware instruction binary code sequence and loads it into the instruction cache area. The texture mapping unit obtains the texture image data block from the target memory region according to the texture data base address carried in the hardware instruction binary code sequence and loads it into the texture cache area. The rasterization operation unit obtains the vertex data memory block and the index data memory block from the target memory region based on the vertex data base address and the index data base address carried in the hardware instruction binary code sequence, and generates the pixel color value and pixel depth value of the graphics rendering result frame data through primitive assembly processing and fragment shading processing. The graphics rendering result frame data is stored in the frame buffer memory associated with the second graphics processing hardware set. The frame buffer memory is configured to use a linear color space storage format and the frame buffer data layout is compatible with the second operating system architecture environment. The complete graphics rendering result frame data is read from the frame buffer memory, and the graphics rendering result frame data is transmitted from the target computing node to the source computing node through the socket transmission channel of the network transmission protocol stack or the reverse transmission path of the one-way memory mapping channel. 
After receiving the graphics rendering result frame data, the source computing node calls the frame buffer update function in its system graphics driver layer to write the graphics rendering result frame data directly to the memory offset associated with the target application's window handle within the graphics display hardware frame buffer area of the source computing node. At the next vertical blanking interval, the display control unit of the source computing node scans out the graphics rendering result frame data from the graphics display hardware frame buffer area and outputs it to the physical display device connected to the source computing node for visual presentation.
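Writing a frame "to the memory offset associated with the window handle" amounts to a row-by-row copy into a linear frame buffer whose rows are `stride` bytes apart. A minimal sketch follows, assuming a 32-bit-per-pixel linear frame buffer held in a `bytearray` and a window described only by its top-left pixel position; the function names are hypothetical and the window-handle lookup itself is not shown.

```python
def window_offset(win_x: int, win_y: int, stride: int, bytes_per_pixel: int = 4) -> int:
    """Byte offset of a window's top-left pixel in a linear frame buffer."""
    return win_y * stride + win_x * bytes_per_pixel


def blit_frame(fb: bytearray, stride: int, win_x: int, win_y: int,
               frame: bytes, frame_w: int, frame_h: int, bytes_per_pixel: int = 4) -> None:
    """Copy a tightly packed frame into the frame buffer, one row at a time."""
    row_bytes = frame_w * bytes_per_pixel
    if len(frame) != row_bytes * frame_h:
        raise ValueError("frame size does not match frame_w * frame_h")
    base = window_offset(win_x, win_y, stride, bytes_per_pixel)
    if (win_x * bytes_per_pixel + row_bytes > stride
            or base + (frame_h - 1) * stride + row_bytes > len(fb)):
        raise ValueError("window does not fit inside the frame buffer")
    for row in range(frame_h):
        dst = base + row * stride  # rows are `stride` bytes apart in the frame buffer
        src = row * row_bytes      # but tightly packed in the incoming frame
        fb[dst:dst + row_bytes] = frame[src:src + row_bytes]
```

For example, on an 8-pixel-wide screen (stride 32 bytes) a window at pixel (3, 1) starts at byte offset 1 × 32 + 3 × 4 = 44.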

7. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 1, characterized in that, after constructing, based on the instruction stream semantic unit sequence, the intermediate representation structure of the graphics pipeline containing the drawing state machine description block and the resource binding relationship description block, and before transmitting the target hardware-executable graphics instruction stream from the source computing node to the target computing node through the cross-node transmission channel, the method further includes: performing resource usage frequency analysis on the resource binding relationship description block storage area in the intermediate representation structure of the graphics pipeline, by counting, for each Uniform Resource Identifier in the resource binding relationship description block storage area, the total number of times it is referenced by draw-call opcodes in the drawing state machine description block storage area; sorting the Uniform Resource Identifiers in descending order of total reference count; marking the graphics resources pointed to by the Uniform Resource Identifiers whose total reference count exceeds a preset frequency threshold as a high-frequency graphics resource set; analyzing the original memory address information of each graphics resource in the high-frequency graphics resource set within the memory space of the source computing node; and, if the memory region indicated by the original memory address information lies in pageable system memory, requesting the operating system kernel of the source computing node to lock that memory region in physical memory so that the memory swapping mechanism cannot swap it out to external storage devices.
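The frequency analysis above is a reference count followed by a descending sort and a threshold cut. The sketch below assumes the drawing state machine block can be flattened into `(opcode, [bound URIs])` pairs; the opcode names and the `res://` URIs are hypothetical. The subsequent page-locking step is not shown; on POSIX systems it would correspond to a call such as `mlock(2)`.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

# Hypothetical draw-call opcode names; a real IR would define its own.
DRAW_OPCODES = {"DRAW", "DRAW_INDEXED", "DRAW_INSTANCED"}


def rank_resource_usage(
    draw_state_block: Iterable[Tuple[str, Sequence[str]]], threshold: int
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Count how often draw calls reference each resource URI.

    Returns the (uri, count) pairs sorted by descending count (ties broken by
    URI so the order is deterministic) and the URIs whose count exceeds
    `threshold`, i.e. the high-frequency graphics resource set.
    """
    counts: Counter = Counter()
    for opcode, bound_uris in draw_state_block:
        if opcode in DRAW_OPCODES:  # only draw calls count as references
            counts.update(bound_uris)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    hot = [uri for uri, n in ranked if n > threshold]
    return ranked, hot
```

Excluding non-draw opcodes matters: a resource that is bound by a state-setting instruction but never actually drawn with should not be promoted to the high-frequency set.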
The vertex data memory blocks, index data memory blocks, and texture image data blocks contained in the high-frequency graphics resource set are pre-transmitted from the source memory region of the source computing node to the target memory region of the target computing node through the one-way memory mapping channel established within the cross-node transmission channel. In the target memory region of the target computing node, a contiguous physical memory block is allocated for the high-frequency graphics resource set, and the resource data of the high-frequency graphics resource set is reorganized according to the target memory layout information of the target computing node. The target computing node then constructs a resource mapping table entry that associates the Uniform Resource Identifiers of the high-frequency graphics resource set with the physical start address of the contiguous physical memory block, and persists this association in the kernel memory management data structure of the target computing node. During hardware instruction translation of the intermediate representation structure of the graphics pipeline, whenever a Uniform Resource Identifier belonging to the high-frequency graphics resource set is encountered, the physical memory start address of the contiguous physical memory block recorded in the resource mapping table entry is used to fill the corresponding operand field of the target hardware-executable graphics instruction stream directly.
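One way to picture the contiguous-block layout and the mapping table is sketched below. It assumes, beyond what the claim states, that each resource is placed at its own aligned offset inside the block so that every URI maps to the block's start address plus that offset, and that instructions can be modeled as `(opcode, operand)` pairs. The alignment value, the `res://` URIs, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Operand = Union[int, str]


def pack_resources(resources: Dict[str, bytes], base_addr: int,
                   align: int = 256) -> Tuple[bytes, Dict[str, int]]:
    """Lay resources out back to back in one contiguous block.

    Each resource starts on an `align`-byte boundary (assumed alignment).
    Returns the packed block and a mapping table from URI to absolute address.
    """
    block = bytearray()
    table: Dict[str, int] = {}
    for uri, data in resources.items():
        block.extend(b"\x00" * ((-len(block)) % align))  # pad to the next boundary
        table[uri] = base_addr + len(block)
        block.extend(data)
    return bytes(block), table


def patch_operands(instructions: List[Tuple[str, Operand]],
                   table: Dict[str, int]) -> List[Tuple[str, Operand]]:
    """Replace URI operands found in the mapping table with their physical addresses.

    Operands not in the table are left untouched for the normal translation path.
    """
    return [
        (opcode, table[operand]) if isinstance(operand, str) and operand in table
        else (opcode, operand)
        for opcode, operand in instructions
    ]
```

Because the table lookup happens at translation time, instructions touching hot resources carry a final address and need no further resolution on the target side.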

8. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 7, characterized in that, after the graphics rendering result frame data is obtained and before it is returned from the target computing node to the source computing node via the return channel, the method further includes: maintaining, at the target computing node, a historical rendering result frame buffer queue for the target application, which stores historical copies of the most recent several consecutive frames of graphics rendering result frame data; comparing the currently generated graphics rendering result frame data pixel by pixel against the historical copy of the previously generated frame, by calculating, at each pixel position, the absolute difference between the pixel color value component in the current frame and the pixel color value component at the corresponding position in the previous historical copy; and, if the absolute difference at a pixel position exceeds a preset perceptual difference threshold, marking that pixel position as a changed pixel unit and recording, in an inter-frame difference data set, the pixel color value component of the changed pixel unit in the current frame together with its relative intra-frame coordinate offset.
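The pixel-by-pixel comparison can be sketched as a single pass over two equally sized frames. This assumes frames are lists of rows of 8-bit `(R, G, B)` tuples, and it assumes a pixel counts as changed when its largest per-component absolute difference exceeds the threshold (the claim does not say how components are combined). The function name is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # 8-bit R, G, B


def diff_frames(current: List[List[Pixel]], previous: List[List[Pixel]],
                threshold: int) -> List[Tuple[int, int, Pixel]]:
    """Return (x, y, current_pixel) for every pixel whose color changed perceptibly."""
    if len(current) != len(previous) or any(len(a) != len(b) for a, b in zip(current, previous)):
        raise ValueError("frames must have the same dimensions")
    changed = []
    for y, (row_cur, row_prev) in enumerate(zip(current, previous)):
        for x, (pc, pp) in enumerate(zip(row_cur, row_prev)):
            # Assumption: use the largest per-component absolute difference.
            if max(abs(a - b) for a, b in zip(pc, pp)) > threshold:
                changed.append((x, y, pc))
    return changed
```

The threshold suppresses tiny fluctuations (for example dithering noise) that a viewer would not notice but that would otherwise inflate the difference set.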
Connected-component clustering is then performed on the changed pixel units in the inter-frame difference data set: changed pixel units that are spatially adjacent and have similar pixel color value components are aggregated into a changed rectangular region block, and the upper-left and lower-right corner coordinates of the bounding box of each changed rectangular region block are calculated. A two-dimensional discrete transform is applied to the pixel color value components of the changed pixel units within each changed rectangular region block, converting them from a spatial-domain representation into a frequency-domain transform coefficient representation. The frequency-domain transform coefficients are quantized according to a preset quantization parameter table, which attenuates the high-frequency coefficient components, and the quantized coefficients are then entropy-encoded: a variable-length coding table converts the sequence of frequency-domain transform coefficient values into a compact binary bitstream. The upper-left and lower-right corner coordinates of the bounding box of each changed rectangular region block are appended to the binary bitstream as additional coding information.
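The clustering step can be illustrated with a breadth-first flood fill. The sketch assumes changed pixels arrive as `(x, y, (R, G, B))` records, uses 4-connectivity, and treats two neighbors as "similar" when their largest per-component difference is within a tolerance; connectivity, tolerance, and function names are assumptions. The transform, quantization, and entropy-coding stages that follow are not shown.

```python
from collections import deque
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 (inclusive corners)


def similar(p: Pixel, q: Pixel, tol: int) -> bool:
    return max(abs(a - b) for a, b in zip(p, q)) <= tol


def cluster_changed_pixels(changed: List[Tuple[int, int, Pixel]], color_tol: int) -> List[Box]:
    """Group 4-connected changed pixels with similar colors; return their bounding boxes."""
    pixels: Dict[Tuple[int, int], Pixel] = {(x, y): c for x, y, c in changed}
    seen = set()
    boxes: List[Box] = []
    for start in sorted(pixels):  # deterministic scan order
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        x0 = x1 = start[0]
        y0 = y1 = start[1]
        while queue:  # breadth-first flood fill of one component
            x, y = queue.popleft()
            x0, x1 = min(x0, x), max(x1, x)
            y0, y1 = min(y0, y), max(y1, y)
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb in pixels and nb not in seen and similar(pixels[(x, y)], pixels[nb], color_tol):
                    seen.add(nb)
                    queue.append(nb)
        boxes.append((x0, y0, x1, y1))
    return boxes
```

Similarity is checked between neighbors rather than against a region seed, so a smooth color gradient can grow into one region; comparing against the seed color instead would split gradients into more, smaller blocks.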
The binary bitstream produced by this inter-frame differential compression serves as the compressed representation of the graphics rendering result frame data and is transmitted from the target computing node to the source computing node through the return channel. After receiving the compressed representation, the source computing node performs frame reconstruction using the historical copy of the previously generated graphics rendering result frame data stored in the historical rendering result frame buffer queue and the changed rectangular region block data contained in the received binary bitstream, thereby recovering the complete graphics rendering result frame data.
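Frame reconstruction reduces to copying the previous frame and overwriting each changed rectangle. The sketch below assumes the blocks have already been entropy-decoded and inverse-transformed into plain pixel rows, each paired with its inclusive bounding box; the decoding itself and the function name are not from the source.

```python
from typing import List, Sequence, Tuple, TypeVar

P = TypeVar("P")
Box = Tuple[int, int, int, int]  # x0, y0, x1, y1 (inclusive corners)


def reconstruct_frame(previous: Sequence[Sequence[P]],
                      blocks: List[Tuple[Box, List[List[P]]]]) -> List[List[P]]:
    """Rebuild the current frame from the previous frame plus decoded changed blocks."""
    frame = [list(row) for row in previous]  # copy so the history entry stays intact
    height, width = len(frame), len(frame[0]) if frame else 0
    for (x0, y0, x1, y1), pixels in blocks:
        if x0 < 0 or y0 < 0 or x1 >= width or y1 >= height:
            raise ValueError("block lies outside the frame")
        if len(pixels) != y1 - y0 + 1 or any(len(r) != x1 - x0 + 1 for r in pixels):
            raise ValueError("block pixels do not match the bounding box")
        for dy, row in enumerate(pixels):
            frame[y0 + dy][x0:x1 + 1] = row  # overwrite just the changed span
    return frame
```

Copying the previous frame rather than modifying it in place keeps the stored history usable as the reference for the next difference frame.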

9. The method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to claim 8, characterized in that the method further includes: deploying a channel state monitoring module in the source computing node, which periodically collects the current available bandwidth, current round-trip delay, and current packet loss retransmission rate of the cross-node transmission channel; comparing each of these values with its preset threshold and generating, from the per-indicator evaluation results, the current transmission quality evaluation result of the cross-node transmission channel; and looking up, in a preset transmission strategy decision mapping table that defines the correspondence between transmission quality evaluation results and transmission strategy switching operation types, the switching operation type corresponding to the current transmission quality evaluation result. If the switching operation type indicates that the intra-frame block transmission strategy is to be enabled, the target hardware-executable graphics instruction stream to be transmitted is divided into multiple instruction stream data blocks according to instruction sequence length, each block is tagged with a block sequence number, and the blocks are transmitted in parallel from the source computing node to the target computing node over multiple parallel network transport protocol stack connection sessions of the cross-node transmission channel.
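The monitoring and decision logic above can be sketched as a threshold check feeding a lookup table, plus a splitter that tags blocks with sequence numbers. Every threshold value, quality label, strategy name, and the priority order of the checks below is a hypothetical choice, since the claim only states that thresholds and the table are preset. The splitter cuts the byte stream at fixed lengths; cutting at instruction boundaries would be an equally valid reading of "according to instruction sequence length". Opening the parallel connection sessions is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChannelSample:
    bandwidth_mbps: float   # current available bandwidth
    rtt_ms: float           # current round-trip delay
    retransmit_rate: float  # current packet-loss retransmission rate, 0..1


# Hypothetical thresholds and decision table; the claim only says they are preset.
MIN_BANDWIDTH_MBPS = 100.0
MAX_RTT_MS = 20.0
MAX_RETRANSMIT_RATE = 0.01

DECISION_TABLE = {
    "GOOD": "NONE",
    "HIGH_LATENCY": "INTRA_FRAME_BLOCKS",
    "LOSSY": "SIMPLIFY_STREAM",
    "LOW_BANDWIDTH": "COMPRESS",
}


def evaluate(sample: ChannelSample) -> str:
    """Compare each indicator with its threshold; the first failing check wins."""
    if sample.bandwidth_mbps < MIN_BANDWIDTH_MBPS:
        return "LOW_BANDWIDTH"
    if sample.retransmit_rate > MAX_RETRANSMIT_RATE:
        return "LOSSY"
    if sample.rtt_ms > MAX_RTT_MS:
        return "HIGH_LATENCY"
    return "GOOD"


def split_into_blocks(stream: bytes, block_len: int) -> List[Tuple[int, bytes]]:
    """Cut the instruction stream into fixed-length blocks tagged with sequence numbers."""
    if block_len <= 0:
        raise ValueError("block_len must be positive")
    return [(seq, stream[off:off + block_len])
            for seq, off in enumerate(range(0, len(stream), block_len))]
```

Keeping the mapping in a table rather than in branching code mirrors the claim's "decision mapping table": the policy can be retuned without touching the evaluation logic.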
After receiving all of the instruction stream data blocks, the target computing node reassembles them in block sequence number order to restore the target hardware-executable graphics instruction stream. If the switching operation type indicates that the instruction stream simplification transmission strategy is to be enabled, the source computing node performs instruction redundancy elimination on the target hardware-executable graphics instruction stream, removing redundant state-setting instruction sequences that do not contribute to the pixel color values of the final graphics rendering result frame data. If the switching operation type indicates that the compressed transmission strategy is to be enabled, the hardware compression acceleration unit in the source computing node is invoked to compress the target hardware-executable graphics instruction stream in real time, generating a compressed bitstream that is transmitted to the target computing node through the cross-node transmission channel; upon receipt, the target computing node invokes its hardware decompression acceleration unit to decompress the bitstream in real time and recover the complete target hardware-executable graphics instruction stream for subsequent hardware-accelerated execution.
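The three receive-side and simplification operations above can be sketched as follows. Assumptions: blocks are `(sequence number, bytes)` pairs; instructions are modeled as `(opcode, state key, state value)` triples where `SET` changes one piece of state and any other opcode may read the current state; and Python's `zlib` is used purely as a software stand-in for the hardware compression and decompression units. The redundancy rule is one conservative interpretation (drop a `SET` that is overwritten before anything reads it, or that re-sets a value already in effect) and keeps trailing `SET`s, since a later stream might rely on them. Function names are hypothetical.

```python
import zlib
from typing import Dict, List, Optional, Tuple

Instr = Tuple[str, Optional[str], Optional[str]]  # (opcode, state key, state value)


def reassemble(blocks: List[Tuple[int, bytes]]) -> bytes:
    """Restore the stream from blocks that may arrive out of order."""
    ordered = sorted(blocks, key=lambda b: b[0])
    if [seq for seq, _ in ordered] != list(range(len(ordered))):
        raise ValueError("missing or duplicate block sequence numbers")
    return b"".join(data for _, data in ordered)


def eliminate_redundant_state(instrs: List[Instr]) -> List[Instr]:
    """Drop SET instructions that cannot change what any later instruction reads."""
    out: List[Optional[Instr]] = []
    applied: Dict[str, str] = {}              # state in effect at the last reader
    pending: Dict[str, Tuple[int, str]] = {}  # key -> (index in out, value) since then
    for ins in instrs:
        op, key, value = ins
        if op == "SET":
            if key in pending:                # overwritten before anything read it
                out[pending.pop(key)[0]] = None
            if key in applied and applied[key] == value:
                continue                      # re-sets the value already in effect
            pending[key] = (len(out), value)
            out.append(ins)
        else:                                 # conservatively treat as a state reader
            applied.update({k: v for k, (_, v) in pending.items()})
            pending.clear()
            out.append(ins)
    return [ins for ins in out if ins is not None]


def compress_stream(stream: bytes) -> bytes:
    # Software stand-in for the hardware compression acceleration unit.
    return zlib.compress(stream, 1)


def decompress_stream(blob: bytes) -> bytes:
    # Software stand-in for the hardware decompression acceleration unit.
    return zlib.decompress(blob)
```

Treating every non-`SET` instruction as a potential reader errs on the side of keeping instructions; a real implementation with knowledge of which opcodes consume which state could remove more.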

10. A system for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment, characterized in that it includes: a processor; and a machine-readable storage medium for storing machine-executable instructions of the processor; wherein the processor is configured to perform, by executing the machine-executable instructions, the method for application graphics interface redirection and hardware acceleration in a heterogeneous computing environment according to any one of claims 1 to 9.