Compilation optimization method, apparatus, device, medium, and program product for private arrays

By analyzing private arrays and converting accesses to them into on-chip register accesses, the method addresses the high latency of private array access, improving performance and reducing bandwidth contention.

CN122309449A · Pending · Publication Date: 2026-06-30 · GLENFLY TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: GLENFLY TECH CO LTD
Filing Date: 2026-03-20
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, access to private arrays has high latency, leading to performance bottlenecks, especially with significant performance loss during frequent access and severe bandwidth contention.

Method used

By analyzing the computational tasks to be processed, the number of on-chip registers required for the private array is determined, and access to the private array is converted into access to on-chip registers in the processor, thereby reducing memory access latency.

Benefits of technology

Moving private arrays from the memory subsystem to on-chip registers reduces access latency from tens or even hundreds of cycles to one or a few cycles, significantly improving execution performance and reducing bandwidth contention.



Abstract

This application relates to a compilation optimization method, apparatus, device, medium, and program product for private arrays. The method includes: analyzing the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array; traversing each private array to obtain target private arrays that can be mapped to registers; and performing code transformation on each target private array based on the number of on-chip registers required for register mapping, so as to convert access to the target private arrays into access to on-chip registers in the processor. This method moves private arrays from the memory subsystem to the processor's on-chip registers, reducing access latency from tens or even hundreds of cycles to one or a few cycles and fundamentally solving the memory access bottleneck caused by private arrays.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a compilation optimization method, apparatus, device, medium, and program product for private arrays.

Background Technology

[0002] Graphics processor-based general-purpose computing is used to perform general-purpose scientific computing and parallel processing tasks that do not require graphics rendering. To make it easier for developers to use GPUs for general-purpose computing, several important programming models and platforms have emerged, such as NVIDIA CUDA, OpenCL (an open, cross-vendor general-purpose parallel computing standard), DirectCompute (part of the Microsoft DirectX API, primarily used for GPU computing on the Windows platform), and ROCm (an open-source general-purpose computing programming model launched by AMD).

[0003] In the general-purpose computing programming model, a work item is the smallest unit of computation, typically mapped to the smallest processing unit of a general-purpose computing GPU for execution. In the general-purpose computing memory model, memory space is mainly divided into global memory, local memory, constant memory, and private memory.

[0004] Private memory is a private space allocated to each work item, primarily used to store its private variables and private arrays (called Private Arrays). Private variables are typically implemented as on-chip registers. However, in traditional implementations, due to limitations in compiler optimization strategies or hardware resource constraints, compilers often adopt a conservative strategy for private arrays of uncertain size or dynamically changing indices, allocating them in external global memory and accessing them through memory load / store operations. If private arrays are placed in off-chip global memory, the access latency will be significantly higher, becoming a performance bottleneck.

Summary of the Invention

[0005] Therefore, it is necessary to provide a compilation optimization method, apparatus, device, medium, and program product for private arrays that can reduce access latency and improve access efficiency in response to the above-mentioned technical problems.

[0006] Firstly, this application provides a compilation optimization method for private arrays, the method comprising:

[0007] The computational task to be processed is analyzed to obtain each private array and the number of on-chip registers required for register mapping of each private array;

[0008] Traverse each of the aforementioned private arrays to obtain the target private array that can be used for register mapping;

[0009] Based on the number of on-chip registers required for register mapping of each of the aforementioned private arrays, code conversion is performed on each of the aforementioned target private arrays to convert access to the target private arrays into access to on-chip registers in the processor.

[0010] In one embodiment, the analysis of the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array includes:

[0011] Static analysis is performed on the input kernel function corresponding to the computational task to be processed to obtain each private array;

[0012] Extract the key attributes of each of the private arrays, including element type and array size;

[0013] The number of registers occupied by each array element is determined based on the element type;

[0014] Based on the array size and the number of registers occupied by each array element, the number of on-chip registers required for register mapping of the private array is obtained.

[0015] In one embodiment, traversing each of the private arrays to obtain a target private array that can be used for register mapping includes:

[0016] Get the access pattern and / or access frequency for each private array;

[0017] Based on the access patterns and / or access frequencies of each private array, the target private arrays that can be used for register mapping are obtained.

[0018] In one embodiment, obtaining the target private array that can be used for register mapping based on the access patterns and / or access frequencies corresponding to each private array includes:

[0019] Select a private array whose indices are primarily constant or linear, and which is not a dynamically accessed array, as the target private array; and / or

[0020] If the number of on-chip registers available for this optimization in the processor is less than the total number of on-chip registers required for register mapping of each of the private arrays, a target private array is selected based on the access frequency of each of the private arrays that need to be mapped, wherein the access frequency of the target private array is greater than the access frequency of the other private arrays.

[0021] In one embodiment, the method further includes:

[0022] Obtain the total number of on-chip registers in the processor and the number of registers required for other non-private arrays;

[0023] Based on the total number of on-chip registers in the processor and the number of registers required by other non-private arrays, the number of on-chip registers in the processor that can be used for this optimization is obtained.

[0024] In one embodiment, the step of performing code translation on each target private array based on the number of on-chip registers required for register mapping of each of the private arrays, to convert access to the target private arrays into access to on-chip registers in the processor, includes:

[0025] Based on the number of on-chip registers required for register mapping of each of the aforementioned private arrays, the target on-chip registers are determined for the target private array;

[0026] Access to the target private array is converted into access to the target on-chip register.

[0027] In one embodiment, converting access to the target private array into access to the target on-chip register includes at least one of the following:

[0028] The index access to the target private array is converted into an index access to the target on-chip register;

[0029] Write operations on elements of the target private array are converted into MOV instructions with the target on-chip register as the destination operand, or merged and mapped into ALU instructions with the target on-chip register as the destination operand.

[0030] The read operation on the elements of the target private array is converted into a read operation with the target on-chip register as the source operand.

[0031] In one embodiment, the method further includes:

[0032] Record the lifecycle of the target private array, and determine the occupancy time of the target on-chip register based on the lifecycle of the target private array;

[0033] The step of determining the number of on-chip registers required for register mapping based on each of the private arrays, and determining the target on-chip registers for the target private array, includes:

[0034] Based on the occupancy time of each target on-chip register, select the available on-chip register;

[0035] Based on the number of on-chip registers required for register mapping of each of the private arrays, the target on-chip register corresponding to the target private array is selected from the available on-chip registers.

[0036] Secondly, this application also provides a compilation optimization apparatus for private arrays, the apparatus comprising:

[0037] The input module is used to analyze the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array;

[0038] The target private array determination module is used to traverse each of the private arrays to obtain the target private array that can be used for register mapping.

[0039] The code conversion module is used to perform code conversion on each of the target private arrays based on the number of on-chip registers required for register mapping of each of the private arrays, so as to convert access to the target private arrays into access to on-chip registers in the processor.

[0040] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method in any of the above embodiments.

[0041] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the methods in any of the above embodiments.

[0042] Fifthly, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method in any of the above embodiments.

[0043] The aforementioned compilation optimization method, apparatus, device, medium, and program product for private arrays analyze the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array; traverse each private array to obtain the target private arrays that can be mapped to registers; and, based on the number of on-chip registers required for register mapping of each private array, perform code conversion on each target private array so that access to the target private array becomes access to on-chip registers in the processor. This moves the private array from the memory subsystem to the on-chip registers of the processor, reducing the access latency from tens or even hundreds of cycles to one or a few cycles and fundamentally solving the memory access bottleneck caused by private arrays.

Description of the Drawings

[0044] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0045] Figure 1 is a flowchart illustrating the compilation optimization method for private arrays in one embodiment;

[0046] Figure 2 is a schematic diagram of a code transformation method that fixes n consecutive on-chip registers for PA in one embodiment;

[0047] Figure 3 is a flowchart illustrating the compilation optimization method for private arrays in another embodiment;

[0048] Figure 4 is a block diagram of the compilation optimization apparatus for private arrays in one embodiment;

[0049] Figure 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0051] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from the second element. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and / or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.

[0052] Private memory is a private space allocated to each work item, primarily used to store its private variables and private arrays (called Private Arrays). Private variables are typically implemented as on-chip registers. However, in current implementations, due to limitations in compiler optimization strategies or hardware resource constraints, compilers often adopt a conservative strategy for private arrays of uncertain size or dynamically changing indices. They allocate these arrays in external global memory and access them via memory load / store operations, rather than directly mapping them to the extremely fast on-chip registers.

[0053] Let's take OpenCL as an example.

[0054] For example, the private arrays temp[8] and temp2[16] in the OpenCL kernel code snippet-1 below will be mapped to Memory Store / Load operations in the relevant technical implementation schemes:

[0055]

[0056] However, the write operations to the private array described above are typically implemented as memory store operations, while accessing the private array by index is implemented as memory load operations. This implementation method has significant drawbacks:

[0057] High access latency: Even on-chip caches have an access latency several or even tens of cycles higher than that of on-chip registers, which can be accessed in a single cycle. If private arrays are placed in off-chip global memory, the access latency is even greater and becomes a performance bottleneck.

[0058] Bandwidth contention: When a large number of work items access the memory subsystem simultaneously, they will compete for limited memory bandwidth, which may further exacerbate access latency.

[0059] Therefore, implementing private arrays through memory operations severely restricts the execution efficiency of OpenCL kernel programs because of the high access latency and energy consumption. The performance loss is particularly obvious in computationally intensive kernels, where private arrays are accessed frequently.

[0060] To address at least one of the aforementioned technical problems, in one embodiment, as shown in Figure 1, a compilation optimization method for private arrays is provided. This embodiment illustrates the method as applied to a terminal. It is understood that this method can also be applied to a server, or to a system including both a terminal and a server, where it is implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:

[0061] S102: Analyze the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array.

[0062] The on-chip registers (CRF, Common Register File) are located on the processor. In this application, private arrays are mapped to the on-chip registers (CRF) of the processing unit as an optimization, thereby avoiding high-latency memory access operations.

[0063] Analyzing the computational task to be processed mainly involves performing static analysis on the kernel function of the task and collecting the private arrays in that kernel function.

[0064] Once the private arrays are determined, the number of on-chip registers required for register mapping of each private array is obtained; that is, the number of on-chip registers corresponding to each private array. In this application, the number of on-chip registers can be determined based on the attributes of each private array, such as the array size and the number of on-chip registers occupied by each element in the private array.

[0065] S104: Traverse each private array to obtain the target private array that can be used for register mapping.

[0066] Based on the valid private arrays obtained from the static analysis of the general computing kernel function, the compiler will further perform registerability evaluation and filtering to determine the set of private arrays that can be registered. For private arrays that are determined not to be registerable, they will be processed according to the conventional private array processing scheme, that is, mapped to memory load / store operations.

[0067] In some optional embodiments, traversing each private array to obtain a target private array that can be used for register mapping includes: obtaining the access mode and / or access frequency corresponding to each private array; and obtaining a target private array that can be used for register mapping based on the access mode and / or access frequency corresponding to each private array.

[0068] Optionally, the criteria for evaluating and filtering private arrays may include at least one of the private array's access pattern and access frequency. Whether a private array is suitable for register mapping is determined from at least one of these criteria; if it is suitable, it is selected as the target private array.

[0069] S106: Based on the number of on-chip registers required for register mapping of each private array, perform code conversion on each target private array to convert access to the target private array into access to on-chip registers in the processor.

[0070] Once the target private array is determined, register mapping can be performed on the target private array based on the number of on-chip registers required by the target private array. This involves code conversion of the code associated with the target private array to convert access to the target private array into access to on-chip registers in the processor.

[0071] The aforementioned compilation optimization method for private arrays analyzes the computational task to be processed, obtaining each private array and the number of on-chip registers required for register mapping of each private array; iterates through each private array to obtain the target private arrays that can be mapped to registers; based on the number of on-chip registers required for register mapping of each private array, code transformation is performed on each target private array to convert access to the target private arrays into access to on-chip registers in the processor, moving the private arrays from the memory subsystem to the on-chip registers of the processor. The access latency is reduced from tens or even hundreds of cycles to one or a few cycles, fundamentally solving the memory access bottleneck problem caused by private arrays.

[0072] In some optional embodiments, the computational task to be processed is analyzed to obtain each private array and the number of on-chip registers required for register mapping of each private array. This includes: performing static analysis on the input kernel function corresponding to the computational task to be processed to obtain each private array; extracting key attributes of each private array, including element type and array size; determining the number of registers occupied by each array element based on the element type; and obtaining the number of on-chip registers required for register mapping of the private array based on the array size and the number of registers occupied by each array element.

[0073] Specifically, the compiler performs static analysis on the input kernel function to identify valid private arrays within the kernel function.

[0074] Then extract the key attributes of each private array, including element type, array size, etc.

[0075] Finally, calculate the number of registers N required to perform register mapping on the private array, where N = array size × number of registers occupied by each array element, and the number of registers occupied by each array element is related to the element type of the array element.
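The calculation N = array size × registers per element can be illustrated with a minimal Python sketch. The per-type register counts in the table below are an assumption (32-bit registers), not values given in this application, and the function name is hypothetical.

```python
# Hypothetical table: registers occupied by one element of each type,
# assuming 32-bit on-chip registers (an assumption, not from the source).
REGS_PER_ELEMENT = {"char": 1, "int": 1, "float": 1, "double": 2, "float4": 4}


def registers_required(element_type: str, array_size: int) -> int:
    """N = array size x number of registers occupied by each array element."""
    if array_size <= 0:
        raise ValueError("array size must be a positive compile-time constant")
    return array_size * REGS_PER_ELEMENT[element_type]
```

Under these assumptions, an 8-element float array would need 8 registers and a 16-element double array would need 32.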

[0076] In some optional embodiments, obtaining a target private array that can be mapped to registers based on the access pattern and / or access frequency corresponding to each private array includes: selecting, as the target private array, a private array whose indices are primarily constant or linear and which is not accessed dynamically at random; and / or, when the number of on-chip registers in the processor that can be used for this optimization is less than the total number of on-chip registers required for register mapping of each private array, selecting a target private array based on the access frequency of each private array that needs to be mapped, wherein the access frequency of the target private array is greater than the access frequency of the other private arrays.

[0077] This application can analyze the access patterns of the indices of each private array. Target private arrays are typically selected from those whose indices are primarily constant or linear (linearly related to a loop index) and are therefore easy to analyze; register mapping is performed on these arrays, while arrays with dynamic random access are avoided to ensure access efficiency.

[0078] This application can also analyze the read and write operations of kernel functions on each private array to determine whether they are on performance-critical paths, so as to obtain the access frequency of each private array and thus determine which private arrays are frequently accessed. When the registers of the target hardware architecture are insufficient to accommodate all valid private arrays, the frequently accessed private arrays are selected for register mapping first, that is, the target private array with a higher access frequency than other private arrays is selected to maximize the performance advantages brought by register mapping.

[0079] In some optional embodiments, the method further includes: obtaining the total number of on-chip registers in the processor and the number of registers required by other non-private arrays; and obtaining the number of on-chip registers in the processor that can be used for this optimization based on the total number of on-chip registers in the processor and the number of registers required by other non-private arrays.

[0080] In this application, the compiler pre-evaluates the total number of registers required by other non-private array variables in the kernel and compares it with the total capacity of the on-chip registers (CRF) of the target processing unit (such as the GPU computing core, i.e., the processor in this case) to determine the number K of available on-chip registers that can be used for private array registerization. This ensures that the register mapping optimization of the private array will not lead to additional register allocation overflow.

[0081] In the above embodiments, a set of private arrays for on-chip register mapping is determined to ensure that, given limited on-chip register resources, the most critical private array is selected for on-chip register mapping.
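One possible way to combine the register budget K with the access-pattern and access-frequency criteria is sketched below in Python. The greedy "most frequently accessed first" policy, the `PrivateArray` record, and its field names are illustrative assumptions; the application only states that more frequently accessed arrays are preferred when registers are insufficient.

```python
from dataclasses import dataclass


@dataclass
class PrivateArray:
    name: str
    regs_needed: int    # N from the register-count step
    access_count: int   # static estimate of access frequency (assumed metric)
    analyzable: bool    # True if every index is constant or linear in a loop index


def available_registers(total_crf: int, non_array_regs: int) -> int:
    """K = total on-chip registers minus registers needed by non-array variables."""
    return max(total_crf - non_array_regs, 0)


def select_targets(arrays, k):
    """Pick analyzable arrays, most frequently accessed first, within budget k."""
    chosen, remaining = [], k
    candidates = [a for a in arrays if a.analyzable]
    for arr in sorted(candidates, key=lambda a: a.access_count, reverse=True):
        if arr.regs_needed <= remaining:
            chosen.append(arr)
            remaining -= arr.regs_needed
    return chosen
```

Arrays that are not selected would fall back to the conventional memory load / store path.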

[0082] In some optional embodiments, code conversion is performed on each target private array based on the number of on-chip registers required for register mapping of each private array, so as to convert access to the target private array into access to on-chip registers in the processor, including: determining target on-chip registers for the target private array based on the number of on-chip registers required for register mapping of each private array; and converting access to the target private array into access to the target on-chip registers.

[0083] Once the set of private arrays to be registered is determined, the compiler will traverse the set of private arrays, perform code transformations, and perform private array registerization.

[0084] Optionally, converting access to the target private array into access to the target on-chip register includes at least one of the following: converting an index access to the target private array into an index access to the target on-chip register; converting a write operation to an element of the target private array into a MOV instruction with the target on-chip register as the destination operand, or merging and mapping it into an ALU instruction with the target on-chip register as the destination operand; and converting a read operation to an element of the target private array into a read operation with the target on-chip register as the source operand.

[0085] Figure 2 is a schematic diagram illustrating a code transformation method that fixes n consecutive on-chip registers for a PA in one embodiment. As shown in Figure 2, taking a private array PA to be mapped to registers as an example, and assuming PA needs n on-chip registers for mapping, the code transformation must ensure that runtime read and write index operations on PA map correctly to indexes and operations on the target on-chip registers. n consecutive registers are fixed for PA: Rb, Rb+1, ..., Rb+n-1 (a consecutive on-chip register segment), allocated to PA for on-chip register mapping during its lifetime.

[0086] For example, array index access to PA can be converted into index access to Rb, Rb+1, ..., Rb+n-1. The mapping scheme is related to the number of on-chip registers s occupied by a single array element of PA. For example, in the original kernel function, accessing PA by index: PA[index] will be mapped to: R[b+index*s].
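The mapping PA[index] → R[b+index*s] can be written as a small Python helper. The function name and the bounds check are illustrative additions, not part of this application.

```python
def map_element(base: int, index: int, regs_per_element: int, length: int) -> int:
    """Return the register number b + index * s that holds PA[index].

    base             -- b, the first register of the segment fixed for PA
    regs_per_element -- s, registers occupied by a single array element
    length           -- number of elements in PA (used only for a bounds check)
    """
    if not 0 <= index < length:
        raise IndexError("index outside the private array")
    return base + index * regs_per_element
```

For an element that spans several registers (s > 1), the returned number is the first register of that element.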

[0087] Optionally, read and write operations on PA elements within the kernel function can be distinguished and mapped to different code transformation schemes:

[0088] A write operation to PA[index] will be mapped to a MOV instruction with the on-chip register R[b+index*s] as the destination operand, or it can be merged and mapped to a regular ALU instruction with R[b+index*s] as the destination operand. Here, we illustrate this with a standalone MOV instruction, for example: MOV R[b+index*s], vreg1, which means writing the source operand vreg1 to the on-chip register R[b+index*s]. If the source operand vreg1 comes from a regular ALU ADD instruction: vreg1 = ADD vreg10, vreg11, it can also be merged and mapped to R[b+index*s] = ADD vreg10, vreg11.

[0089] For reading operations on PA[index], no additional instructions are needed; R[b+index*s] can be directly used as the source operand.
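The read and write lowering described above can be sketched in Python as a pair of helpers that emit pseudo-instructions as text. This is a simplified illustration that assumes the index is a compile-time constant; the helper names and the tuple encoding of the value producer are hypothetical.

```python
def reg(base: int, index: int, s: int = 1) -> str:
    """Name of the on-chip register R[b + index * s]."""
    return f"R{base + index * s}"


def lower_write(base, index, producer, s=1):
    """Lower `PA[index] = value` to a pseudo-instruction string.

    producer is ("vreg", name) for an already computed value, which yields a
    standalone MOV, or (opcode, src1, src2) for an ALU result, which is merged
    so that the ALU instruction writes the register directly.
    """
    dst = reg(base, index, s)
    if producer[0] == "vreg":
        return f"MOV {dst}, {producer[1]}"
    opcode, src1, src2 = producer
    return f"{dst} = {opcode} {src1}, {src2}"


def lower_read(base, index, s=1):
    """A read of PA[index] emits no instruction; the register is the source operand."""
    return reg(base, index, s)
```

With b = 0 and s = 1, writing vreg1 to PA[2] yields `MOV R2, vreg1`, and merging an ADD producer yields `R2 = ADD vreg10, vreg11`.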

[0090] In particular, if all indices used to access PA are constants or linearly related to a loop index, the accesses can be directly converted to accesses to Rb, Rb+1, ..., Rb+n-1. For example, when each element occupies one register, PA[0] will be mapped to Rb, and PA[3] will be mapped to Rb+3.

[0091] In some optional embodiments, the method further includes: recording the lifetime of the target private array and determining the occupancy time of the target on-chip register based on the lifetime of the target private array; determining the target on-chip register for the target private array based on the number of on-chip registers required for register mapping of each private array, including: selecting available on-chip registers based on the occupancy time of each target on-chip register; and selecting the target on-chip register corresponding to the target private array from the available on-chip registers based on the number of on-chip registers required for register mapping of each private array.

[0092] In this application, after allocating on-chip registers for PA, the lifetime of the private array PA must also be recorded. During the on-chip register allocation phase, register allocation optimization will be performed based on this information.

[0093] In the subsequent on-chip register allocation phase, the compiler should consider the on-chip registers occupied by private arrays. While avoiding on-chip register allocation conflicts, the compiler should reuse the on-chip registers occupied by the private array register mappings outside the lifetime of the private array. Specifically: for each private array already mapped to an on-chip register, during its lifetime, avoid allocating contiguous on-chip registers mapped to that array to other variables; for each private array already mapped to an on-chip register, outside its lifetime, contiguous on-chip registers mapped to that array can be allocated to other variables.
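The reuse rule above amounts to an interval-overlap check. In the Python sketch below, lifetimes are modeled as inclusive ranges of instruction positions; that representation and the names `Reservation` and `register_free` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    first_reg: int  # Rb, first register of the contiguous segment
    count: int      # n, number of contiguous registers
    start: int      # first instruction position of the array's lifetime
    end: int        # last instruction position of the array's lifetime (inclusive)


def register_free(r, start, end, reservations):
    """True if register r may be given to a variable live over [start, end]."""
    for res in reservations:
        in_segment = res.first_reg <= r < res.first_reg + res.count
        overlaps = start <= res.end and res.start <= end
        if in_segment and overlaps:
            return False  # r belongs to a private array that is still live
    return True
```

A register inside an array's segment is refused while the array is live and becomes available again once the array's lifetime has ended.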

[0094] For ease of understanding, taking the OpenCL kernel code above as an example, in traditional technology, both temp and temp2 are mapped to memory access operations of external memory, which will generate machine code with the following structure (for simplicity, the following text uses a mixed description of pseudocode, virtual registers, and physical registers):

[0095] For example, STORE vreg0, vreg1 means storing the data vreg0 into the address pointed to by vreg1;

[0096] For example, vreg4 = LOAD vreg3 means loading the data at the address held in vreg3 into vreg4.

[0097]

[0098] In this application, as shown in Figure 3, the static analysis phase of the kernel function finds two valid private arrays, temp[8] and temp2[16], in the current kernel function. Assuming that a single array element corresponds to one on-chip register, register mapping of temp[8] and temp2[16] requires 8 and 16 CRF registers, respectively.

[0099] During the private array registerization screening stage, assuming that both private arrays temp[8] and temp2[16] can be mapped to on-chip registers, they are both recorded in the set S1 of private arrays to be registered (in this example, S2 = {}): S1 = {temp, temp2}

[0100] The code transformation stage of private array registerization: Traverse set S1, assuming that temp occupies a continuous register segment R0~R7 and temp2 occupies a continuous register segment R8~R23. Based on this, a register mapping scheme can be implemented, using the ALU MOV instruction to replace the original Memory Store instruction and the CRF-Index operation to replace the Memory Load instruction (comparison table-1), thereby improving code execution speed.
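The segment assignment in this example (temp in R0~R7, temp2 in R8~R23) corresponds to handing out contiguous register ranges in order. A minimal Python sketch follows; the function name and the sequential packing policy are assumptions for illustration.

```python
def assign_segments(arrays, first_free=0):
    """Assign each array a contiguous register segment, packed in order.

    arrays is a list of (name, regs_needed) pairs; the result maps each name
    to the (first, last) register numbers of its segment.
    """
    segments, next_reg = {}, first_free
    for name, n in arrays:
        segments[name] = (next_reg, next_reg + n - 1)
        next_reg += n
    return segments


# Set S1 from the example: temp needs 8 registers, temp2 needs 16.
print(assign_segments([("temp", 8), ("temp2", 16)]))
# prints {'temp': (0, 7), 'temp2': (8, 23)}
```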

[0101] The optimized code is as follows:

[0102]

[0103] Furthermore, this application maps private arrays to on-chip registers rather than to off-chip or on-chip caches (such as the L1 / L2 cache of a GPU), so that short-cycle arithmetic instructions replace the conventional Memory Store / Reload instructions. This can bring the following performance advantages:

[0104] Reduce access latency: By moving the private array from the memory subsystem to on-chip registers, the access latency is reduced from tens or even hundreds of cycles to one or a few cycles, fundamentally solving the memory access bottleneck problem caused by the private array.

[0105] Significantly improve execution performance: For OpenCL kernels that rely heavily on private arrays for temporary data storage and computation (such as encryption algorithms, filters in image processing, temporary matrices in mathematical transformations, etc.), the execution speed of kernel programs can be improved by orders of magnitude.

[0106] Reduce bandwidth contention: The access pressure on the memory subsystem is reduced, freeing up bandwidth for other memory access operations and thus indirectly improving overall system performance.

[0107] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.

[0108] Based on the same inventive concept, this application also provides a private array compilation optimization apparatus for implementing the aforementioned private array compilation optimization method. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more private array compilation optimization apparatus embodiments provided below can be found in the limitations of the private array compilation optimization method described above, and will not be repeated here.

[0109] In one exemplary embodiment, as shown in Figure 4, a compilation optimization apparatus for private arrays is provided, including: an input module 401, a target private array determination module 402, and a code conversion module 403, wherein:

[0110] The input module 401 is used to analyze the computational task to be processed and obtain each private array and the number of on-chip registers required for register mapping of each private array;

[0111] The target private array determination module 402 is used to traverse each private array to obtain the target private arrays that can be used for register mapping;

[0112] The code conversion module 403 is used to perform code conversion on each target private array based on the number of on-chip registers required for register mapping of each private array, so as to convert access to the target private array into access to on-chip registers in the processor.

[0113] In some optional embodiments, the input module 401 is specifically used to perform static analysis on the input kernel function corresponding to the computational task to be processed, to obtain each private array; extract the key attributes of each private array, including element type and array size, and determine the number of registers occupied by each array element based on the element type; and obtain the number of on-chip registers required for register mapping of the private array based on the array size and the number of registers occupied by each array element.

[0114] In some optional embodiments, the target private array determination module 402 is specifically used to obtain the access mode and / or access frequency corresponding to each private array; and based on the access mode and / or access frequency corresponding to each private array, to obtain the target private array that can be used for register mapping.

[0115] In some optional embodiments, the target private array determination module 402 is specifically used to select, as a target private array, a private array whose indices are primarily constant or linear and which is not a dynamically, randomly accessed array; and / or, when the number of on-chip registers available in the processor for this optimization is less than the total number of on-chip registers required for register mapping of the private arrays, to select target private arrays based on the access frequency of each private array that needs to be registered, wherein the access frequency of a target private array is greater than the access frequency of the other private arrays.

[0116] In some optional embodiments, the target private array determination module 402 is specifically used to obtain the total number of on-chip registers in the processor and the number of registers required by other non-private arrays; based on the total number of on-chip registers in the processor and the number of registers required by other non-private arrays, the number of on-chip registers in the processor that can be used for this optimization is obtained.
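The screening described in the preceding embodiments can be sketched as follows. The sketch assumes that access frequency is available as a static access count and that the index pattern has already been classified. The `PrivateArray` record, the function names, and all numeric values are hypothetical. Stopping at the first array that does not fit is one possible policy, chosen here so that every selected array has a higher access frequency than every unselected eligible array.

```python
from dataclasses import dataclass


@dataclass
class PrivateArray:
    """Hypothetical summary of one private array produced by static analysis."""
    name: str
    regs_required: int
    access_count: int   # static estimate of access frequency
    index_pattern: str  # "constant", "linear", or "dynamic"


def available_budget(total_regs: int, regs_for_other_variables: int) -> int:
    """On-chip registers left for this optimization after other variables are served."""
    return max(total_regs - regs_for_other_variables, 0)


def select_targets(arrays: list, budget: int) -> list:
    """Pick arrays with constant or linear indices, most frequently accessed first."""
    eligible = [a for a in arrays if a.index_pattern in ("constant", "linear")]
    selected = []
    remaining = budget
    for a in sorted(eligible, key=lambda a: a.access_count, reverse=True):
        if a.regs_required > remaining:
            break  # keep the "selected arrays are the most frequently accessed" property
        selected.append(a.name)
        remaining -= a.regs_required
    return selected


# Illustrative values only.
arrays = [
    PrivateArray("temp", 8, 100, "constant"),
    PrivateArray("temp2", 16, 40, "linear"),
    PrivateArray("lut", 4, 500, "dynamic"),
]
```

With a budget of 20 registers, only temp is selected in this sketch; with a budget of 56, both temp and temp2 are selected, while the dynamically indexed array is never a candidate.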

[0117] In some optional embodiments, the code conversion module 403 is specifically used to determine the target on-chip registers for the target private array based on the number of on-chip registers required for register mapping of each private array; and to convert access to the target private array into access to the target on-chip registers.

[0118] In some optional embodiments, the code conversion module 403 specifically converts access to the target private array into access to the target on-chip register according to at least one of the following: converting index access to the target private array into index access to the target on-chip register; converting write operations to elements of the target private array into MOV instructions with the target on-chip register as the destination operand, or merging and mapping them into ALU instructions with the target on-chip register as the destination operand; and converting read operations to elements of the target private array into read operations with the target on-chip register as the source operand.
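The conversion rules above can be illustrated on a toy intermediate representation. The tuple-based instruction format, the helper names, and the textual form `R[base + index]` used for indexed register access are all hypothetical notation for this sketch; they are not the instruction encoding of any particular processor.

```python
def reg_operand(base: int, index) -> str:
    """Register operand for element `index` of an array whose segment starts at `base`."""
    if isinstance(index, int):
        return f"R{base + index}"    # constant index: resolves to a fixed register
    return f"R[{base} + {index}]"    # variable index: indexed register access


def rewrite_access(instr: tuple, base_reg: dict) -> str:
    """Rewrite one toy-IR private array access into a register access."""
    op = instr[0]
    if op == "store":                # ("store", array, index, src)
        _, array, index, src = instr
        return f"MOV {reg_operand(base_reg[array], index)}, {src}"
    if op == "load":                 # ("load", dst, array, index)
        _, dst, array, index = instr
        return f"MOV {dst}, {reg_operand(base_reg[array], index)}"
    raise ValueError(f"unsupported instruction: {op}")


# Segment base registers as in the example: temp starts at R0, temp2 at R8.
base_reg = {"temp": 0, "temp2": 8}
```

In this sketch a write to temp[3] becomes a MOV whose destination is R3, and a read of temp2 with a run-time index becomes an indexed access relative to R8. Merging a write into the ALU instruction that produces the value would remove the MOV entirely and is not shown here.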

[0119] In some optional embodiments, the above apparatus further includes: a recording module for recording the lifecycle of the target private array and determining the occupancy time of the target on-chip register based on the lifecycle of the target private array; the above code conversion module 403 is specifically used to select available on-chip registers based on the occupancy time of each target on-chip register; and select the target on-chip register corresponding to the target private array from the available on-chip registers based on the number of on-chip registers required for register mapping of each private array.

[0120] The modules in the aforementioned compiler optimization device for private arrays can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0121] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 5. The computer device includes a processor, memory, input / output (I / O) interfaces, and a communication interface. The processor, memory, and I / O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I / O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores the data involved in the aforementioned methods. The I / O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements a compilation optimization method for a private array.

[0122] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0123] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0124] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0125] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0126] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0127] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0128] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0129] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A compilation optimization method for private arrays, characterized in that the method includes: analyzing the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array; traversing each of the private arrays to obtain the target private arrays that can be used for register mapping; and, based on the number of on-chip registers required for register mapping of each of the private arrays, performing code conversion on each of the target private arrays to convert access to the target private arrays into access to on-chip registers in the processor.

2. The method according to claim 1, characterized in that analyzing the computational task to be processed to obtain each private array and the number of on-chip registers required for register mapping of each private array includes: performing static analysis on the input kernel function corresponding to the computational task to be processed to obtain each private array; extracting the key attributes of each of the private arrays, including element type and array size; determining the number of registers occupied by each array element based on the element type; and obtaining the number of on-chip registers required for register mapping of the private array based on the array size and the number of registers occupied by each array element.

3. The method according to claim 1, characterized in that traversing each of the private arrays to obtain the target private arrays that can be used for register mapping includes: obtaining the access pattern and / or access frequency of each private array; and obtaining, based on the access pattern and / or access frequency of each private array, the target private arrays that can be used for register mapping.

4. The method according to claim 3, characterized in that obtaining the target private arrays that can be used for register mapping based on the access pattern and / or access frequency of each private array includes: selecting, as a target private array, a private array whose indices are primarily constant or linear and which is not a dynamically accessed array; and / or, if the number of on-chip registers available for this optimization in the processor is less than the total number of on-chip registers required for register mapping of the private arrays, selecting target private arrays based on the access frequency of each of the private arrays that need to be mapped, wherein the access frequency of a target private array is greater than the access frequency of the other private arrays.

5. The method according to claim 4, characterized in that the method further includes: obtaining the total number of on-chip registers in the processor and the number of registers required by other non-private arrays; and obtaining, based on the total number of on-chip registers in the processor and the number of registers required by other non-private arrays, the number of on-chip registers in the processor that can be used for this optimization.

6. The method according to any one of claims 1 to 5, characterized in that performing code conversion on each target private array based on the number of on-chip registers required for register mapping of each private array, to convert access to the target private array into access to on-chip registers in the processor, includes: determining the target on-chip registers for the target private array based on the number of on-chip registers required for register mapping of each of the private arrays; and converting access to the target private array into access to the target on-chip registers.

7. The method according to claim 6, characterized in that converting access to the target private array into access to the target on-chip registers includes at least one of the following: converting index access to the target private array into index access to the target on-chip registers; converting write operations on elements of the target private array into MOV instructions with the target on-chip register as the destination operand, or merging and mapping them into ALU instructions with the target on-chip register as the destination operand; and converting read operations on elements of the target private array into read operations with the target on-chip register as the source operand.

8. The method according to claim 6, characterized in that the method further includes: recording the lifecycle of the target private array, and determining the occupancy time of the target on-chip registers based on the lifecycle of the target private array; and determining the target on-chip registers for the target private array based on the number of on-chip registers required for register mapping of each of the private arrays includes: selecting available on-chip registers based on the occupancy time of each target on-chip register; and selecting, from the available on-chip registers, the target on-chip registers corresponding to the target private array based on the number of on-chip registers required for register mapping of each of the private arrays.

9. A compiler optimization device for private arrays, characterized in that the device includes: an input module, used to analyze the computational task to be processed and to obtain each private array and the number of on-chip registers required for register mapping of each private array; a target private array determination module, used to traverse each of the private arrays to obtain the target private arrays that can be used for register mapping; and a code conversion module, used to perform code conversion on each of the target private arrays based on the number of on-chip registers required for register mapping of each of the private arrays, so as to convert access to the target private arrays into access to on-chip registers in the processor.

10. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8.

11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.

12. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 8.