Code compiling method, electronic device and storage medium
By employing a runtime dynamic address space allocation mechanism on the GPU architecture, the lack of support for thread-local storage on the GPU is resolved, achieving efficient utilization and performance optimization of thread-local memory space.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI BIREN TECH CO LTD
- Filing Date
- 2025-04-27
- Publication Date
- 2026-06-19
Figure CN120215957B_ABST
Abstract
Description
Technical Field
[0001] Embodiments of this disclosure relate to a code compilation method, an electronic device, and a storage medium.
Background Technology
[0002] The Low-Level Virtual Machine (LLVM) is an open-source compiler framework designed as a modular, reusable collection of compiler and toolchain technologies. LLVM supports a Thread-Local Storage (TLS) model, which allows the definition of thread-local storage variables, ensuring that each thread has its own independent copy of the variable, thereby avoiding data race issues between threads.
Summary of the Invention
[0003] This disclosure provides at least one embodiment of a code compilation method, wherein the code includes at least one thread-local storage variable, and the code compilation method includes: obtaining a record of the thread-local storage variable used by each first function in the code, and passing the record to a driver, so that the driver allocates each thread-local storage variable to thread-local memory according to the record at runtime; generating machine code corresponding to the code according to the intermediate representation corresponding to the code, wherein the machine code includes a load instruction, and the load instruction, when executed, is used to obtain the thread-local memory address corresponding to each thread-local storage variable based on the allocation result of the driver.
[0004] In at least one embodiment of the code compilation method provided in this disclosure, the step of generating machine code corresponding to the code based on the intermediate representation corresponding to the code includes: converting the instruction in the intermediate representation corresponding to the code used to obtain the address of the thread local storage variable into the load instruction.
[0005] The code compilation method provided in at least one embodiment of this disclosure further includes: inserting instructions to initialize the thread-local storage variables used by each first function into the intermediate representation corresponding to the code.
[0006] In the code compilation method provided in at least one embodiment of this disclosure, obtaining the record of thread-local storage variables used by each first function in the code includes: for each thread-local storage variable, determining the first function that uses the thread-local storage variable to construct a thread-local storage variable-first function mapping relationship; performing reverse mapping on the mapping relationship to obtain the record of thread-local storage variables used by each first function in the code.
[0007] In the code compilation method provided in at least one embodiment of this disclosure, the first function directly uses the thread-local storage variable, or uses the thread-local storage variable by calling a second function.
[0008] In the code compilation method provided in at least one embodiment of this disclosure, the first function is a function called by the host and executed by the device, and the second function is a function called by the device and executed by the device.
[0009] The code compilation method provided in at least one embodiment of this disclosure further includes: for unused thread-local storage variables, removing instructions for obtaining the address of the unused thread-local storage variable from the intermediate representation corresponding to the code.
[0010] The code compilation method provided in at least one embodiment of this disclosure further includes: removing duplicate instructions for obtaining the address of thread-local storage variables from the intermediate representation corresponding to the code, so that the address of each thread-local storage variable in each first function is obtained only once.
[0011] In the code compilation method provided in at least one embodiment of this disclosure, the load instruction is further configured to: for each first function, read the address table of thread-local storage variables used by the first function from the register corresponding to the first function; obtain the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the address table, wherein the address table is created by the driver in the register corresponding to the first function at runtime according to the address allocation result.
[0012] In at least one embodiment of the code compilation method provided in this disclosure, the step of obtaining the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the address table includes: calculating the offset of the thread-local memory address corresponding to each thread-local storage variable used by the first function in the address table according to the order of each thread-local storage variable in the record corresponding to the first function and the address occupancy size of each thread-local storage variable used by the first function; and determining the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the offset.
[0013] In the code compilation method provided in at least one embodiment of this disclosure, the register includes a constant scalar register.
[0014] At least one embodiment of this disclosure provides a code compilation apparatus, wherein the code includes at least one thread-local storage variable, and the code compilation apparatus includes: an acquisition module configured to acquire a record of the thread-local storage variable used by each first function in the code, and pass the record to a driver so that the driver allocates each thread-local storage variable to thread-local memory according to the record at runtime; and a generation module configured to generate machine code corresponding to the code according to an intermediate representation corresponding to the code, wherein the machine code includes a load instruction, and the load instruction, when executed, is used to acquire the thread-local memory address corresponding to each thread-local storage variable based on the allocation result of the driver.
[0015] At least one embodiment of this disclosure provides an electronic device, the electronic device comprising: at least one processor; at least one memory including one or more computer program modules; wherein the one or more computer program modules are stored in the at least one memory and configured to be executed by the at least one processor, the one or more computer program modules being used to implement the code compilation method provided in at least one embodiment of this disclosure.
[0016] At least one embodiment of this disclosure provides a non-transitory computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by at least one processor, perform the code compilation method provided in at least one embodiment of this disclosure.
[0017] The code compilation method, code compilation apparatus, electronic device, and non-transitory computer-readable storage medium provided in at least one embodiment of this disclosure adopt a runtime dynamic address space allocation mechanism. The driver allocates the corresponding amount of thread-local memory space to the kernel function based on the number of thread-local storage variables it uses, thus avoiding waste of thread-local memory space and improving the utilization rate of thread-local memory space.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings of the embodiments will be briefly described below. Obviously, the drawings described below only relate to some embodiments of this disclosure and are not intended to limit this disclosure.
[0019] Figure 1 is a flowchart illustrating a code compilation method provided in at least one embodiment of this disclosure;
[0020] Figure 2 is a flowchart illustrating a code compilation method provided in at least one embodiment of this disclosure;
[0021] Figure 3 is a flowchart illustrating a code compilation method provided in at least one embodiment of this disclosure;
[0022] Figure 4 is an exemplary schematic diagram of a compile-time static address space allocation scheme;
[0023] Figure 5 is an exemplary schematic diagram illustrating a runtime dynamic address space allocation scheme provided in at least one embodiment of this disclosure;
[0024] Figure 6 is a schematic block diagram of a code compilation apparatus provided in at least one embodiment of this disclosure;
[0025] Figure 7 is a schematic block diagram of an electronic device provided in at least one embodiment of this disclosure;
[0026] Figure 8 is a schematic block diagram of another electronic device provided in at least one embodiment of this disclosure; and
[0027] Figure 9 is a schematic block diagram of a non-transitory computer-readable storage medium provided in at least one embodiment of this disclosure.
Detailed Implementation
[0028] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the described embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.
[0029] This disclosure uses flowcharts to illustrate the operations performed by the system according to embodiments of this application. It should be understood that the preceding or following operations are not necessarily performed in exact order. Instead, various steps can be processed in reverse order or simultaneously, as needed. Furthermore, other operations can be added to these processes, or one or more steps can be removed from them.
[0030] Unless otherwise defined, the technical or scientific terms used in this disclosure shall have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms “first,” “second,” and similar terms used in this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as “comprising” or “including” mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as “connected” or “linked” are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as “upper,” “lower,” “left,” and “right” are used only to indicate relative positional relationships, and these relative positional relationships may change accordingly when the absolute position of the described objects changes.
[0031] The present disclosure will now be described through several specific embodiments. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components may be omitted. When any component of an embodiment of the present disclosure appears in more than one drawing, the component is represented by the same or similar reference numerals in each drawing.
[0032] LLVM is an open-source compiler framework designed as a modular, reusable collection of compiler and toolchain technologies. LLVM consists of a frontend, a middle end (the optimizer), and a backend.
[0033] The LLVM frontend is responsible for converting source code from different programming languages or programming models into a common intermediate representation (IR). For example, Clang is the compiler frontend that LLVM natively supports for C, C++, and Objective-C. When a user declares a variable using `thread_local` (C++11) or `__thread` (GCC extension), Clang can mark that variable as a thread-local storage (TLS) variable, for example, by using the `thread_local` attribute and a runtime preemption specifier (such as `dso_local`) in the generated intermediate representation.
[0034] The LLVM middle end is responsible for optimizing the intermediate representation generated by the frontend to improve the performance of the final generated code. For example, it can simplify thread-local storage access patterns by merging multiple accesses into a single address calculation. The thread-local storage model can also be selected implicitly based on the scope of a variable, or specified explicitly in the code.
[0035] The LLVM backend is responsible for converting the optimized intermediate representation into machine code or assembly code based on the characteristics of the target hardware platform. Specific tasks of the backend can include instruction selection, register allocation, instruction scheduling, and code generation. For example, instruction selection refers to mapping the intermediate representation to the instruction set of the target hardware platform; register allocation refers to mapping the virtual registers in the intermediate representation to the actual registers of the target hardware; instruction scheduling refers to rearranging the instruction order to maximize the utilization of hardware resources; and code generation refers to generating machine code or assembly code corresponding to the target hardware platform.
[0036] LLVM supports a thread-local storage model. Using the `thread_local` keyword, thread-local variables can be defined, ensuring that each thread has its own independent copy of the variable and thus avoiding data races between threads. In other words, thread-local variables are not shared between threads: even if multiple threads access the same variable name, each thread operates on its own independent memory region.
[0037] However, the inventors of this disclosure note that the implementation of thread-local storage depends on the support of the target platform. Currently, in traditional Central Processing Unit (CPU) architectures (e.g., x86 and ARM), the operating system and Application Binary Interface (ABI) provide support for thread-local storage, for example, through specific sections in Executable and Linkable Format (ELF) files (e.g., the .tdata section for storing initialized thread-local storage variables and the .tbss section for storing uninitialized thread-local storage variables). However, there is no similar mechanism supporting thread-local storage in the backends for Graphics Processing Units (GPUs), General-Purpose Graphics Processing Units (GPGPUs), or other parallel computing architectures (e.g., those programmed with CUDA and OpenCL), which poses a challenge to applications requiring massively parallel processing, such as high-performance computing and machine learning.
[0038] GPU parallel thread models (such as CUDA and OpenCL) typically require each thread to access its private data independently. However, traditional thread-local storage implementations (such as the `__thread` keyword on CPUs) cannot be directly mapped to the GPU architecture, making it difficult to create thread-local data and to support the related data management and optimization. Furthermore, the lack of hardware-level thread-local storage support in GPUs forces developers to manage thread-local data manually (e.g., calculating offsets using thread indices), which increases development complexity and the likelihood of errors. In current GPU programming practice, developers need to explicitly pass thread indices (e.g., `threadIdx.x` in CUDA) and manually allocate global memory to simulate the storage of thread-private data, leading to code redundancy and performance degradation. In addition, existing LLVM compilers lack sufficient support for GPU thread-local storage and cannot automatically generate efficient memory allocation and access code.
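The manual bookkeeping described above can be sketched as follows. This is a sequential Python stand-in for a parallel launch, given only to show the index arithmetic; `NUM_THREADS`, `SLOTS_PER_THREAD`, `private_index`, and `kernel` are hypothetical names, not part of any GPU API.

```python
NUM_THREADS = 4        # hypothetical number of threads in the launch
SLOTS_PER_THREAD = 2   # hypothetical number of private values per thread

# One flat buffer standing in for global memory shared by all threads.
global_buffer = [0] * (NUM_THREADS * SLOTS_PER_THREAD)

def private_index(thread_idx: int, slot: int) -> int:
    """Manually compute where a thread's private slot lives in the shared buffer."""
    return thread_idx * SLOTS_PER_THREAD + slot

def kernel(thread_idx: int) -> None:
    # Every access to "private" data must carry the thread index explicitly.
    global_buffer[private_index(thread_idx, 0)] = thread_idx
    global_buffer[private_index(thread_idx, 1)] = thread_idx * 10

# Sequential stand-in for launching NUM_THREADS threads in parallel.
for tid in range(NUM_THREADS):
    kernel(tid)
```

A single wrong stride or slot number silently overwrites another thread's data, which is the kind of error the text refers to.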
[0039] To address the aforementioned issues, the inventors of this disclosure have proposed a code compilation method that provides support for a thread-local storage model under parallel computing architectures, solving the problem that traditional thread-local storage implementations cannot be directly applied to GPU architectures. This method provides a compile-time static address space allocation scheme. Through static allocation and address fixation, it eliminates the need to load the thread-local memory address corresponding to each thread-local storage variable from registers, effectively reducing register and move instruction overhead. However, the inventors of this disclosure have noted that in this method, if a thread-local storage variable is shared by multiple kernel functions, even kernel functions that do not use the thread-local storage variable will be forcibly allocated the thread-local memory space corresponding to that variable, leading to a waste of thread-local memory space in some scenarios.
[0040] At least one embodiment of this disclosure provides a code compilation method for code including at least one thread-local storage variable. The code compilation method includes: obtaining a record of the thread-local storage variable used by each first function in the code, and passing the record to a driver so that the driver allocates each thread-local storage variable to thread-local memory according to the record at runtime; generating machine code corresponding to the code according to the intermediate representation corresponding to the code, wherein the machine code includes a load instruction, which, when executed, is used to obtain the thread-local memory address corresponding to each thread-local storage variable based on the allocation result of the driver.
[0041] The code compilation method provided in at least one embodiment of this disclosure can be executed by a compiler. For example, the compiler can be located on the host side, and the code compilation method can be executed on the host side.
[0042] In the code compilation method provided in at least one embodiment of this disclosure, a mechanism of dynamic address space allocation at runtime is adopted. The driver will allocate the corresponding amount of thread-local memory space for the kernel function based on the number of thread-local storage variables it uses, thus avoiding waste of thread-local memory space and improving the utilization rate of thread-local memory space.
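The difference between the two schemes can be illustrated with a small worked calculation. This is a simplified model under stated assumptions: the variable sizes, the kernel names, and the rule that the static scheme reserves space for every variable in every kernel function are hypothetical simplifications, not figures from this disclosure.

```python
# Hypothetical sizes, in bytes, of three thread-local storage variables.
tls_sizes = {"a": 4, "b": 8, "c": 16}

# Hypothetical record of which variables each kernel function actually uses.
usage = {"kernel_1": ["a"], "kernel_2": ["a", "b"], "kernel_3": []}

def static_allocation(usage, sizes):
    """Simplified static scheme: every kernel reserves space for every variable."""
    total = sum(sizes.values())
    return {kernel: total for kernel in usage}

def dynamic_allocation(usage, sizes):
    """Runtime scheme: each kernel reserves space only for the variables it uses."""
    return {kernel: sum(sizes[v] for v in used) for kernel, used in usage.items()}

static = static_allocation(usage, tls_sizes)
dynamic = dynamic_allocation(usage, tls_sizes)

# Thread-local memory bytes saved per kernel by allocating at runtime.
saved = {kernel: static[kernel] - dynamic[kernel] for kernel in usage}
```

In this model, `kernel_3` uses no thread-local storage variable and therefore reserves no thread-local memory under the runtime scheme.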
[0043] Figure 1 is a flowchart illustrating a code compilation method provided in at least one embodiment of this disclosure, used to compile code including at least one thread-local storage variable. For example, as shown in Figure 1, the code compilation method provided in at least one embodiment of this disclosure includes the following steps S101 to S102.
[0044] Step S101: Obtain a record of the thread-local storage variables used by each first function in the code, and pass the record to the driver so that the driver allocates each thread-local storage variable to thread-local memory according to the record at runtime.
[0045] For example, the phrase "at least one thread-local storage variable" mentioned above may include thread-local storage variables used by the first function, or it may include thread-local storage variables that are not used by any first function. The phrase "each thread-local storage variable" mentioned later refers to used thread-local storage variables; unused thread-local storage variables are not allocated to thread-local memory.
[0046] Thread Local Memory (TLM) refers to a memory region private to each thread, primarily used to store thread-specific data that cannot be directly accessed by other threads. For example, TLM can be the thread-local memory within a GPU; by storing thread-local variables in TLM, the burden on GPU registers can be reduced. Alternatively, TLM can also be the thread-local memory in other parallel computing architectures; this disclosure does not limit the specific implementation.
[0047] It should be noted that thread-local memory and thread-local storage, as described above, are two different concepts. Thread-local memory refers to a memory region, while thread-local storage refers to a way of storing variables.
[0048] In the code compilation method provided in at least one embodiment of this disclosure, the code may include multiple first functions and multiple second functions. The first functions are functions called by the host and executed by the device, and the second functions are functions called by the device and executed by the device.
[0049] For example, in at least one embodiment of this disclosure, the host side may include a central processing unit (CPU), and the device side may include a graphics processing unit (GPU), a general-purpose graphics processing unit (GPGPU), etc.
[0050] For example, in at least one embodiment of this disclosure, the first function can be a kernel function, and the second function can be a device function. A kernel function is a function called by the host and executed by the device, and a device function is a function called and executed by the device. For example, a kernel function can be called by a host function running on the CPU, while a device function can be called by a kernel function or another device function executed on the GPU. The code typically includes multiple kernel functions and device functions; each kernel function may call multiple device functions, and each device function may also be called by multiple kernel functions.
[0051] "Thread-local storage variables used by the first function" refers to thread-local storage variables directly used by the first function, as well as thread-local storage variables used by the first function through calling the second function.
[0052] For example, when generating the executable file corresponding to the first function, the compiler typically embeds metadata. Metadata can be considered as additional information that can guide compiler optimization, debugging, code analysis, or runtime behavior. Metadata can provide the runtime environment with necessary information to guide the driver to perform appropriate processing, such as allocating thread-local storage variables to thread-local memory.
[0053] The thread-local storage variables used by each first function in the code can be recorded in the metadata and passed to the driver, so that the driver can allocate each thread-local storage variable to thread-local memory at runtime according to the records in the metadata.
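As a minimal sketch of what such a record might look like, the following models the per-function record as an ordered list of entries and serializes it for hand-off. The JSON layout, the class `TlsVarRecord`, the helper `build_metadata`, and the kernel names are all assumptions made for illustration; this disclosure does not specify the metadata encoding.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TlsVarRecord:
    name: str  # symbol name of the thread-local storage variable
    size: int  # size of the variable in bytes

def build_metadata(records):
    """Serialize the per-kernel records, preserving the order of the variables."""
    return json.dumps(
        {kernel: [asdict(v) for v in variables] for kernel, variables in records.items()}
    )

# Hypothetical records for two kernel functions.
records = {
    "kernel_1": [TlsVarRecord("a", 4), TlsVarRecord("b", 2)],
    "kernel_2": [TlsVarRecord("a", 4)],
}

metadata = build_metadata(records)  # what the compiler side would embed
decoded = json.loads(metadata)      # what the driver side would read back
```

Lists keep their order through serialization, so the reader of the metadata sees the variables in the same order in which they were recorded.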
[0054] A driver (for example, a device driver) is software that runs on the host (e.g., on the CPU) to interact with the hardware and manage its physical resources.
[0055] Step S102: Generate machine code corresponding to the code based on the intermediate representation of the code. The machine code includes load instructions. When the load instructions are executed, they are used to obtain the thread-local memory address corresponding to each thread-local storage variable based on the driver allocation result.
[0056] For example, in step S102, the intermediate representation corresponding to the code can be converted into machine code suitable for the target hardware platform through steps such as instruction selection, register allocation, instruction scheduling, and code generation described above. Here, the intermediate representation corresponding to the code can be the initial intermediate representation automatically generated by the LLVM Clang frontend, or it can be an intermediate representation obtained after optimizing the initial intermediate representation.
[0057] For example, during the instruction selection phase, the instructions in the intermediate representation of the code that obtain the addresses of thread-local storage variables are converted into load instructions, which are then embedded into the final generated machine code. When the machine code is executed, the load instructions contained within it are also executed and obtain the thread-local memory address corresponding to each thread-local storage variable based on the driver's allocation result.
[0058] An example of an instruction used in an intermediate representation to obtain the address of a thread-local storage variable is as follows:
[0059] @llvm.threadlocal.address.p0(ptr addrspacecast (ptr addrspace(1) @a to ptr))
[0060] In the above intermediate representation, the llvm.threadlocal.address.p0 instruction is used to obtain the address of the thread-local storage variable a.
[0061] In the code compilation method provided in at least one embodiment of this disclosure, during compilation, the compiler obtains a record of the thread-local storage variables used by each first function in the code and passes the record to the driver, enabling the driver to allocate each thread-local storage variable to thread-local memory at runtime based on the record. Furthermore, the machine code generated by the compiler includes a load instruction, which, when executed, can obtain the thread-local memory address corresponding to each thread-local storage variable based on the driver's allocation result. Therefore, the record obtained in step S101 is the basis for the driver to perform address allocation at runtime, while the driver's address allocation result is the basis for the load instruction generated in step S102 to obtain the address.
[0062] In the code compilation method provided in at least one embodiment of this disclosure, through the runtime dynamic allocation mechanism, the driver will allocate the corresponding amount of thread-local memory space to the first function based on the number of thread-local storage variables it uses, thus avoiding waste of thread-local memory space.
[0063] At runtime, the driver allocates each thread-local storage variable to thread-local memory based on the records of the thread-local storage variables used by each first function. Then, based on the address allocation results, it creates an address table for each first function, which records the thread-local memory address corresponding to each thread-local storage variable in that first function. For example, the order in which the thread-local memory addresses corresponding to each thread-local storage variable are recorded in the address table can be determined by the order of the thread-local storage variables recorded in the metadata.
[0064] It should be noted that the size of the thread-local memory address corresponding to the thread-local storage variable and the size of the thread-local storage variable are two different concepts.
[0065] For example, the size of the thread-local memory address corresponding to a thread-local storage variable (hereinafter referred to as the address size) is the space that the address occupies in the address table, not the size of the thread-local storage variable itself. Address sizes can differ between hardware units; a thread-local memory address can occupy, for example, 4 bytes. The offset of the thread-local memory address corresponding to a thread-local storage variable in the address table can be determined based on the address size.
[0066] For example, the size of a thread-local storage variable can be understood as the amount of storage space it requires, which is related to the variable's type and is usually measured in bytes. For instance, a thread-local storage variable of type `short` has a size of 2 bytes, while a thread-local storage variable of type `int` has a size of 4 bytes. It should be noted that the size of thread-local storage variables of the same type may differ across different systems or compilers. For example, the size of a thread-local storage variable can be determined by the `size` field in the metadata, allowing the driver to allocate each thread-local storage variable to thread-local memory at runtime based on this information.
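The distinction between variable size and address size can be sketched with a simplified model of the driver-side allocation. The function `allocate`, the base address `0x100`, the packing without alignment, and the example record are assumptions made for illustration; the actual driver behavior is not specified here.

```python
ADDRESS_SIZE = 4  # assumed size, in bytes, of one entry in the address table

def allocate(record, base=0):
    """Assign each variable a thread-local memory address, in record order.

    record: list of (name, size_in_bytes) tuples for one kernel function.
    Returns (address_table, bytes_of_thread_local_memory_used).
    Alignment is ignored to keep the sketch short.
    """
    address_table = []
    next_free = base
    for _name, size in record:
        address_table.append(next_free)  # entry order follows the record order
        next_free += size                # advance by the variable's own size
    return address_table, next_free - base

# Hypothetical record: an int, a short, and another int.
record = [("a", 4), ("b", 2), ("c", 4)]
table, used = allocate(record, base=0x100)

# The table itself occupies one fixed-size entry per variable.
table_bytes = len(table) * ADDRESS_SIZE
```

Here the variables occupy 10 bytes of thread-local memory, while the address table occupies 12 bytes, because each entry has the address size regardless of the size of the variable it points to.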
[0067] For example, a corresponding register can be set up for each first function, and the driver can create an address table in the register. One example of a register is a constant scalar register (CSR).
[0068] Correspondingly, in the code compilation method provided in at least one embodiment of this disclosure, the load instruction is further used to perform the following steps S110 to S120.
[0069] Step S110: For each first function, read the address table of the thread-local storage variables used by the first function from the register corresponding to the first function.
[0070] For example, in step S110, the address table is created by the driver at runtime in the register corresponding to the first function based on the address allocation result.
[0071] Step S120: Obtain the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the address table.
[0072] For example, in step S120, the thread-local memory address corresponding to each thread-local storage variable used by the first function can be read from the address table.
[0073] In the code compilation method provided in at least one embodiment of this disclosure, an example of step S120 may include steps S121 to S122.
[0074] Step S121: Based on the order of each thread-local storage variable in the record corresponding to the first function and the address size of each thread-local storage variable used by the first function, calculate the offset of the thread-local memory address corresponding to each thread-local storage variable used by the first function in the address table.
[0075] Since the driver determines the order in which the thread-local memory addresses corresponding to each thread-local storage variable are recorded in the address table according to the order of the thread-local storage variables recorded in the metadata when creating the address table, step S121 also needs to find the storage location of the thread-local memory address corresponding to each thread-local storage variable in the address table according to this order. For example, for each thread-local storage variable, the offset relative to the starting position of the address table can be calculated sequentially according to its recorded order in the metadata and the size of its corresponding thread-local memory address.
[0076] In some examples, the metadata records information about thread-local storage variables a, b, and c in sequence. Assuming that the address of each thread-local storage variable occupies 4 bytes, then the offset of the thread-local memory address corresponding to thread-local storage variable a in the address table is 0, the offset of the thread-local memory address corresponding to thread-local storage variable b in the address table is 4, and the offset of the thread-local memory address corresponding to thread-local storage variable c in the address table is 8.
[0077] Step S122: Determine the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the offset.
[0078] For example, in step S122, for each thread-local storage variable, the thread-local memory address corresponding to the thread-local storage variable can be found in the address table based on the calculated offset.
[0079] In the example above, the thread-local memory address corresponding to thread-local storage variable 'a' can be found at the beginning of the address table, the thread-local memory address corresponding to thread-local storage variable 'b' can be found at the beginning of the address table plus 4, and the thread-local memory address corresponding to thread-local storage variable 'c' can be found at the beginning of the address table plus 8.
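Steps S121 and S122 can be illustrated with a minimal sketch. The function names, the little-endian table encoding, and the table contents (0x100, 0x104, 0x108) are assumptions made for illustration only; the disclosure does not specify how the driver encodes the address table.

```python
from typing import Dict, List

ADDRESS_SIZE = 4  # bytes per table entry, as assumed in the example above


def table_offsets(record: List[str], address_size: int = ADDRESS_SIZE) -> Dict[str, int]:
    """Step S121: derive each entry's offset from the variable's order in the record."""
    return {name: index * address_size for index, name in enumerate(record)}


def lookup_address(table: bytes, offset: int, address_size: int = ADDRESS_SIZE) -> int:
    """Step S122: read the thread-local memory address stored at the given offset."""
    # Little-endian encoding is an assumption of this sketch.
    return int.from_bytes(table[offset:offset + address_size], "little")


# The metadata records a, b, c in sequence, so the offsets are 0, 4, and 8.
offsets = table_offsets(["a", "b", "c"])

# Hypothetical table contents written by the driver at runtime.
address_table = b"".join(
    addr.to_bytes(ADDRESS_SIZE, "little") for addr in (0x100, 0x104, 0x108)
)
addresses = {name: lookup_address(address_table, off) for name, off in offsets.items()}
```

Because the offsets depend only on the recorded order and the entry size, they can be fixed at compile time even though the table contents are only known at runtime.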
[0080] Figure 2 is a flowchart of a code compilation method provided by at least one embodiment of this disclosure.
[0081] As shown in Figure 2, in the code compilation method provided in at least one embodiment of this disclosure, an example of "obtaining the record of thread-local storage variables used by each first function in the code" in step S101 may include the following steps S201 to S202.
[0082] Step S201: For each thread-local storage variable, determine the first function that uses the thread-local storage variable to construct the thread-local storage variable-first function mapping relationship.
[0083] In step S201, the propagation path of variables within functions can be traced using a call graph (also known as a call relationship graph), thereby determining which functions use the thread-local storage variable. A call graph is a directed graph that represents the call relationships between functions in a program. The compiler can generate a call graph, for example, through static analysis. For instance, a call graph builder integrated into the compiler can parse the call relationships between functions and generate a call graph. For example, the thread-local storage variable-first function mapping records which first functions use the thread-local storage variable. The thread-local storage variable-first function mapping can be represented as a mapping graph.
[0084] It should be noted that, since some thread-local storage variables in the code may only be defined but not used by any first function, not every thread-local storage variable can find a corresponding first function in the thread-local storage variable-first function mapping relationship.
[0085] In step S201, "the first function that uses the thread-local storage variable" refers to the first function that uses the thread-local storage variable directly or indirectly. For example, "direct use" means that the first function directly uses the thread-local storage variable, and "indirect use" means that the first function uses the thread-local storage variable by calling a second function.
[0086] For example, one example of step S201 could be: for each thread-local storage variable, determine the kernel function that directly or indirectly uses the thread-local storage variable to construct a thread-local storage variable-first function mapping relationship.
[0087] Step S202: Perform reverse mapping on the mapping relationship to obtain the record of the thread-local storage variable used by each first function in the code.
[0088] For example, the mapping relationship obtained in step S201 is the mapping relationship between thread-local storage variables and first functions, that is, which first functions use each thread-local storage variable. In step S202, by performing the reverse mapping, we can obtain the reverse mapping relationship between first functions and thread-local storage variables, that is, which thread-local storage variables each first function uses.
[0089] In step S202, the reverse mapping relationship obtained through reverse mapping can be recorded in the metadata.
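Steps S201 and S202 can be sketched as follows. This is a simplified model, not the compiler's actual pass: the call graph is a plain dictionary, and the function names (kernel1, func1, and so on), the variable 'd', and the direct-use sets are hypothetical inputs chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def reachable(call_graph: Dict[str, List[str]], root: str) -> Set[str]:
    """Return root plus every function it calls directly or indirectly."""
    seen, stack = set(), [root]
    while stack:
        fn = stack.pop()
        if fn not in seen:
            seen.add(fn)
            stack.extend(call_graph.get(fn, []))
    return seen


def variable_to_kernels(call_graph, direct_uses, kernels, tls_vars):
    """Step S201: map each TLS variable to the kernels that use it directly or indirectly."""
    mapping = {var: [] for var in tls_vars}
    for kernel in kernels:
        funcs = reachable(call_graph, kernel)
        for var in tls_vars:
            if any(var in direct_uses.get(fn, ()) for fn in funcs):
                mapping[var].append(kernel)
    return mapping


def reverse_mapping(mapping):
    """Step S202: invert the mapping into a per-kernel record of TLS variables."""
    record = defaultdict(list)
    for var, kernels in mapping.items():
        for kernel in kernels:
            record[kernel].append(var)
    return dict(record)


# Hypothetical program: 'd' is defined but never used.
call_graph = {"kernel1": ["func1", "func2"], "kernel2": ["func3"]}
direct_uses = {"func1": {"a"}, "func2": {"b"}, "func3": {"b"}, "kernel2": {"c"}}
mapping = variable_to_kernels(call_graph, direct_uses, ["kernel1", "kernel2"], ["a", "b", "c", "d"])
record = reverse_mapping(mapping)
```

In this sketch the unused variable 'd' maps to an empty list and therefore does not appear in any kernel's record, which matches the observation that not every thread-local storage variable has a corresponding first function.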
[0090] The code compilation method provided in at least one embodiment of this disclosure may further include the following step S103. Step S103 may be performed before step S101.
[0091] Step S103: For unused thread-local storage variables, remove the instruction used to obtain the address of the unused thread-local storage variable from the intermediate representation corresponding to the code.
[0092] For example, the initial intermediate representation automatically generated by the Clang frontend of LLVM can be analyzed, that is, the usage of thread-local storage variables for each first function can be analyzed. For thread-local storage variables that are not used, the instructions used to obtain the address of the thread-local storage variable are removed from the intermediate representation corresponding to the first function.
[0093] The code compilation method provided in at least one embodiment of this disclosure may further include the following step S104. Step S104 may be performed before step S101.
[0094] Step S104: Remove duplicate instructions for obtaining the address of thread-local storage variables from the intermediate representation corresponding to the code, so that the address is obtained only once for each thread-local storage variable in each first function.
[0095] For example, the instructions used to obtain the address of thread-local storage variables in the intermediate representation of the code (e.g., llvm.threadlocal.address.p0) can be optimized by removing duplicate instructions, so that the address of each thread-local storage variable in each first function (e.g., kernel function) is obtained only once.
[0096] For example, if there are multiple identical instructions in the intermediate representation corresponding to a certain first function for obtaining the address of a thread-local storage variable, then the duplicate instructions need to be removed, and only the first instruction needs to be kept.
[0097] A special example is as follows: Suppose c is an array type variable. For kernel functions that use thread-local storage variables c[1] and c[2], it is necessary to remove the instructions in the intermediate representation used to obtain the addresses of thread-local storage variables c[1] and c[2], and only retain the instructions used to obtain the base address of thread-local storage variable c. The addresses of c[1] and c[2] can be calculated by adding the corresponding offset to the base address of c.
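The general case of step S104 (not the array case) can be sketched on a toy instruction list. The tuple format and the opcode name tls_addr are hypothetical stand-ins for the intermediate representation and for llvm.threadlocal.address.p0; a real pass would operate on LLVM IR rather than on tuples.

```python
from typing import Dict, List, Tuple

# (result name, opcode, operands); a toy stand-in for IR instructions.
Inst = Tuple[str, str, Tuple[str, ...]]


def dedup_tls_address(insts: List[Inst]) -> List[Inst]:
    """Keep only the first address instruction per TLS variable within one function."""
    first_result: Dict[str, str] = {}  # TLS variable -> result of its first address instruction
    renamed: Dict[str, str] = {}       # removed result -> surviving result that replaces it
    out: List[Inst] = []
    for result, opcode, operands in insts:
        # Redirect uses of removed results to the surviving one.
        operands = tuple(renamed.get(op, op) for op in operands)
        if opcode == "tls_addr":
            var = operands[0]
            if var in first_result:
                renamed[result] = first_result[var]
                continue  # drop the duplicate instruction
            first_result[var] = result
        out.append((result, opcode, operands))
    return out


body = [
    ("%0", "tls_addr", ("@a",)),
    ("", "store", ("1", "%0")),
    ("%1", "tls_addr", ("@a",)),  # duplicate address fetch for 'a'
    ("%2", "load", ("%1",)),
]
deduped = dedup_tls_address(body)
```

The sketch assumes straight-line code in which the first address instruction precedes all later uses; handling control flow would require dominance information, which is omitted here.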
[0098] For example, the code compilation method provided in at least one embodiment of this disclosure may include steps S103, S101, and S102 executed sequentially. Alternatively, the code compilation method provided in at least one embodiment of this disclosure may include steps S104, S101, and S102 executed sequentially.
[0099] In other examples, the code compilation method provided in at least one embodiment of this disclosure may include steps S103, S104, S101, and S102 executed sequentially. Alternatively, the code compilation method provided in at least one embodiment of this disclosure may include steps S104, S103, S101, and S102 executed sequentially.
[0100] By using the above methods, repeated address fetching operations can be avoided, significantly improving operating efficiency and reducing the overhead of redundant instructions.
[0101] The code compilation method provided in at least one embodiment of this disclosure may further include the following step S105. Step S105 may be performed before steps S103 and S104.
[0102] Step S105: Insert instructions to initialize the thread-local storage variables used by each first function into the intermediate representation corresponding to the code.
[0103] For example, in step S105, the code may include multiple first functions and multiple second functions. The thread-local storage variables directly or indirectly used by each first function can be determined based on the call graph. For each first function, instructions to initialize the thread-local storage variables it directly or indirectly uses can be inserted at the beginning of its corresponding intermediate representation.
[0104] For example, taking a thread-local storage variable 'a' as an example, here is an example of an instruction to initialize the thread-local storage variable 'a':
[0105] %0 = call ptr @llvm.threadlocal.address.p0(ptr addrspacecast (ptr addrspace(1) @a to ptr))
[0106] store i32 1, ptr %0, align 4
[0107] The above intermediate representation means that the address of the thread-local storage variable 'a' is first obtained using the llvm.threadlocal.address.p0 instruction, and then the initial value of variable 'a' (1 in this example) is stored at that address using the store instruction. It should be noted that the above initialization instruction is only an example, and different initialization instructions can be inserted according to actual needs.
[0108] It should be noted that the above initialization instructions will also be optimized in steps S103 and S104, that is, the duplicate llvm.threadlocal.address.p0 instructions will be removed.
[0109] In a CPU architecture, thread-local storage variables are typically initialized by the runtime environment when the program starts. In contrast to that initialization method, the code compilation method provided in at least one embodiment of this disclosure inserts instructions to initialize thread-local storage variables during compilation, rather than relying on the runtime environment to initialize them. This effectively avoids a runtime refresh of the entire thread-local memory, thus optimizing performance. Furthermore, this approach increases opportunities for compilation optimization, reduces the coupling between compilation and runtime, and decreases the workload for both compilation and runtime.
[0110] Figure 3 is a flowchart of a code compilation method provided by at least one embodiment of this disclosure, used to compile code including at least one thread-local storage variable. For example, as shown in Figure 3, the code compilation method provided in at least one embodiment of this disclosure includes the following steps S301 to S305.
[0111] Step S301: Insert instructions to initialize the thread-local storage variables used by each first function into the intermediate representation corresponding to the code. For a description of step S301, please refer to the description of step S105 in the above embodiments; it will not be repeated here.
[0112] Step S302: For unused thread-local storage variables, remove the instructions used to obtain the addresses of the unused thread-local storage variables from the intermediate representation corresponding to the code. A description of step S302 can be found in the description of step S103 in the above embodiments, and will not be repeated here.
[0113] Step S303: Remove duplicate instructions for obtaining the address of thread-local storage variables from the intermediate representation corresponding to the code, so that the address of each thread-local storage variable in each first function is obtained only once. For a description of step S303, please refer to the description of step S104 in the above embodiments, which will not be repeated here.
[0114] Step S304: Obtain a record of the thread-local storage variables used by each first function in the code, and pass the record to the driver so that the driver allocates each thread-local storage variable to thread-local memory according to the record at runtime. A description of step S304 can be found in the description of step S101 in the above embodiments, and will not be repeated here.
[0115] Step S305: Generate machine code corresponding to the code based on the intermediate representation of the code. The machine code includes load instructions, which, when executed, are used to obtain the thread-local memory address corresponding to each thread-local storage variable based on the driver's allocation result. A description of step S305 can be found in the description of step S102 in the above embodiments, and will not be repeated here.
[0116] The following examples illustrate the differences between a scheme for statically allocating address space at compile time and a scheme for dynamically allocating address space at runtime, as proposed in at least one embodiment of this disclosure.
[0117] Figure 4 is an exemplary schematic diagram of a compile-time static address space allocation scheme.
[0118] For example, as shown in Figure 4, the code includes three kernel functions and six device functions. The three kernel functions are, for example, represented as Kernel 1 to Kernel 3, and the six device functions are, for example, represented as Func 1 to Func 6. As shown in the call graph on the right side of Figure 4, kernel function Kernel 1 calls device functions Func 1 and Func 2, kernel function Kernel 2 calls device functions Func 3 and Func 4, and kernel function Kernel 3 calls device functions Func 5 and Func 6. Specifically, device function Func 1 uses thread-local storage variable 'a', device function Func 2 uses thread-local storage variable 'b', device function Func 3 uses thread-local storage variable 'b', device function Func 4 uses thread-local storage variable 'c', device function Func 5 uses thread-local storage variable 'c', and device function Func 6 uses thread-local storage variable 'a'.
[0119] First, thread-local storage variable 'a' is allocated. Thread-local storage variable 'a' is used by kernel functions Kernel 1 and Kernel 3, and the allocated offsets for both Kernel 1 and Kernel 3 are 0 bytes, meaning the maximum allocated offset is 0 bytes. Therefore, the starting address of thread-local storage variable 'a' in thread-local memory is 0, and 4 bytes of thread-local memory space are allocated.
[0120] Next, thread-local storage variable 'b' is allocated. Thread-local storage variable 'b' is used by kernel functions Kernel 1 and Kernel 2. The allocated offset for Kernel 1 is 4 bytes, and the allocated offset for Kernel 2 is 0 bytes, meaning the maximum allocated offset is 4 bytes. Therefore, the starting address of thread-local storage variable 'b' in thread-local memory is 4, and 4 bytes of thread-local memory space are allocated.
[0121] Finally, thread-local storage variable 'c' is allocated. Thread-local storage variable 'c' is used by kernel functions Kernel 2 and Kernel 3. The allocated offset for Kernel 2 is 8 bytes, and the allocated offset for Kernel 3 is 4 bytes, meaning the maximum allocated offset is 8 bytes. Therefore, the starting address of thread-local storage variable 'c' in thread-local memory is 8, and 4 bytes of thread-local memory space are allocated.
[0122] After address allocation is completed, this scheme optimizes the intermediate representation (directly replacing the instruction used to obtain the address of the thread-local storage variable with the thread-local memory address corresponding to the thread-local storage variable), and fixes the corresponding thread-local memory address directly during compilation. This eliminates the need to load the thread-local memory address corresponding to each thread-local storage variable from the register, effectively reducing the overhead of registers and move instructions.
[0123] In this scheme, a separate thread-local memory space is allocated for each kernel function. Therefore, the "starting address of the allocation in thread-local memory" mentioned above refers to the offset relative to the starting address of each thread-local memory space. It should be noted that the stack in Figure 4 is used to illustrate the allocation of thread-local storage variables, and can be considered a virtual concept. In LLVM, when the intermediate representation is converted into machine code, the allocation on the stack is mapped to the actual storage space of the target architecture, that is, the thread-local storage variable is mapped to the corresponding thread-local memory space according to its position on the stack.
[0124] In the case of static address space allocation at compile time, since multiple kernel functions and multiple device functions may use the same thread-local storage variable, and device functions cannot determine which kernel functions will call them, it is required that the addresses allocated to identical thread-local storage variables in all kernel functions and device functions be consistent. That is, regardless of which kernel function or device function accesses it, the thread-local storage variable can be found in thread-local memory through the same address. The example in Figure 4 demonstrates how to achieve the consistent address allocation for identical thread-local storage variables required above.
[0125] However, as can be seen from Figure 4, in the static allocation scheme, if a thread-local storage variable is shared by multiple kernel functions, even kernel functions that do not use the thread-local storage variable will be forcibly allocated the thread-local memory space corresponding to that variable. For example, kernel function Kernel 2 does not use thread-local storage variable 'a', and kernel function Kernel 3 does not use thread-local storage variable 'b', but they are still allocated the corresponding thread-local memory space, which results in a waste of thread-local memory space.
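The static allocation walk-through above can be reproduced with a short sketch. The update rule (after a variable is placed, the allocated offset of every kernel that uses it advances to the end of that variable) is inferred from the numbers in the example and is an assumption of this sketch, as are the function and variable names.

```python
from typing import Dict, List, Tuple


def static_allocate(uses: Dict[str, List[str]], sizes: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Place each variable at the maximum allocated offset among the kernels that use it."""
    allocated: Dict[str, int] = {}  # kernel -> allocated offset so far
    start: Dict[str, int] = {}      # variable -> starting address in thread-local memory
    for var, kernels in uses.items():
        begin = max((allocated.get(k, 0) for k in kernels), default=0)
        start[var] = begin
        for k in kernels:
            # Assumed update rule: every using kernel now extends to the end of var.
            allocated[k] = begin + sizes[var]
    return start, allocated


# Usage derived from the call graph in Figure 4.
uses = {
    "a": ["Kernel 1", "Kernel 3"],
    "b": ["Kernel 1", "Kernel 2"],
    "c": ["Kernel 2", "Kernel 3"],
}
sizes = {"a": 4, "b": 4, "c": 4}
start, footprint = static_allocate(uses, sizes)

# Bytes each kernel actually needs versus bytes it is forced to reserve.
needed = {k: 0 for k in footprint}
for var, kernels in uses.items():
    for k in kernels:
        needed[k] += sizes[var]
wasted = {k: footprint[k] - needed[k] for k in footprint}
```

Under these assumptions, Kernel 2 and Kernel 3 each reserve 12 bytes while using only 8, which corresponds to the unused slots for 'a' and 'b' described above.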
[0126] The code compilation method provided in at least one embodiment of this disclosure offers a scheme for dynamically allocating address space at runtime. Compared with the aforementioned scheme for statically allocating address space at compile time, this scheme can effectively reduce the waste of thread-local memory space.
[0127] Figure 5 is an exemplary schematic diagram of a runtime dynamic address space allocation scheme provided by at least one embodiment of the present disclosure.
[0128] For example, as shown in Figure 5, the code includes three kernel functions and six device functions. The three kernel functions are, for example, represented as Kernel 1 to Kernel 3, and the six device functions are, for example, represented as Func 1 to Func 6. As shown in the call graph on the right side of Figure 5, kernel function Kernel 1 calls device functions Func 1 and Func 2, kernel function Kernel 2 calls device functions Func 3 and Func 4, and kernel function Kernel 3 calls device functions Func 5 and Func 6. Specifically, device function Func 1 uses thread-local storage variable 'a', device function Func 2 uses thread-local storage variable 'b', device function Func 3 uses thread-local storage variable 'b', device function Func 4 uses thread-local storage variable 'c', device function Func 5 uses thread-local storage variable 'c', and device function Func 6 uses thread-local storage variable 'a'.
[0129] For example, as shown in Figure 5, the constant scalar register (CSR) corresponding to kernel function Kernel 1 records the address table of thread-local storage variables used by kernel function Kernel 1, which stores the thread-local memory addresses corresponding to thread-local storage variables 'a' and 'b', respectively. The constant scalar register corresponding to kernel function Kernel 2 records the address table of thread-local storage variables used by kernel function Kernel 2, which stores the thread-local memory addresses corresponding to thread-local storage variables 'b' and 'c', respectively. The constant scalar register corresponding to kernel function Kernel 3 records the address table of thread-local storage variables used by kernel function Kernel 3, which stores the thread-local memory addresses corresponding to thread-local storage variables 'a' and 'c', respectively. These address tables are stored in the corresponding constant scalar registers by the driver at runtime. That is, the thread-local memory address corresponding to each thread-local storage variable is allocated by the driver at runtime, and the compiler only needs to obtain the thread-local memory address corresponding to each thread-local storage variable from the constant scalar registers. Through the runtime dynamic allocation mechanism, the driver allocates to each kernel function an amount of thread-local memory space that matches the thread-local storage variables it uses, thus avoiding waste of thread-local memory space.
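The per-kernel address tables can be modeled with a brief sketch. The disclosure does not specify how the driver chooses addresses, so packing each kernel's variables consecutively from offset 0 of that kernel's thread-local memory space is an assumption of this sketch, as are the function name and the 4-byte variable sizes.

```python
from typing import Dict, List, Tuple


def build_address_tables(record: Dict[str, List[str]], sizes: Dict[str, int]) -> Tuple[Dict[str, List[int]], Dict[str, int]]:
    """Model the driver: allocate only the variables each kernel uses, in record order."""
    tables: Dict[str, List[int]] = {}  # kernel -> address table (one entry per variable)
    space: Dict[str, int] = {}         # kernel -> bytes of thread-local memory reserved
    for kernel, variables in record.items():
        offset, table = 0, []
        for var in variables:
            table.append(offset)   # address assigned to this variable
            offset += sizes[var]   # assumed policy: pack variables back to back
        tables[kernel] = table
        space[kernel] = offset
    return tables, space


# Per-kernel records derived from the call graph in Figure 5.
record = {
    "Kernel 1": ["a", "b"],
    "Kernel 2": ["b", "c"],
    "Kernel 3": ["a", "c"],
}
tables, space = build_address_tables(record, {"a": 4, "b": 4, "c": 4})
```

In this model every kernel reserves 8 bytes for its two variables, with no slot reserved for a variable the kernel does not use.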
[0130] It should be noted that the stack in Figure 5 is used to illustrate the allocation of thread-local storage variables, and can be considered a virtual concept. In LLVM, when the intermediate representation is converted into machine code, the allocation on the stack is mapped to the actual storage space of the target architecture, that is, the thread-local storage variable is mapped to the corresponding thread-local memory space according to its position on the stack.
[0131] It should be noted that thread-local memory can also be used to store local variables other than thread-local storage variables. For example, as shown in Figure 4 and Figure 5, the stack corresponding to each kernel function can also include local variables.
[0132] Compared to the waste of thread-local memory space caused by the static allocation of address space at compile time, the runtime dynamic allocation of address space provided by at least one embodiment of this disclosure effectively avoids the waste of thread-local memory space.
[0133] It should also be noted that the execution order of the various steps of the code compilation method in the various embodiments of this disclosure is not limited. Although the execution process of each step has been described in a specific order above, this does not constitute a limitation on the embodiments of this disclosure. The various steps in the code compilation method can be executed serially or in parallel, which can be determined according to actual needs.
[0134] For example, compared to the above description, the code compilation method provided in at least one embodiment of this disclosure may include more or fewer steps, and the embodiments of this disclosure do not limit this.
[0135] Figure 6 is a schematic block diagram of a code compilation apparatus provided by at least one embodiment of the present disclosure. The code compilation apparatus may be, for example, a compiler for compiling code that includes at least one thread-local storage variable.
[0136] For example, as shown in Figure 6, at least one embodiment of the present disclosure provides a code compilation apparatus including an acquisition module 601 and a generation module 602.
[0137] For example, the acquisition module 601 is configured to acquire a record of the thread-local storage variables used by each first function in the code, and pass the record to the driver so that the driver allocates each thread-local storage variable to thread-local memory according to the record at runtime. For details regarding the acquisition module 601, please refer to the description of step S101 in the above code compilation method embodiment, which will not be repeated here.
[0138] For example, generation module 602 is configured to generate machine code corresponding to the code based on the intermediate representation of the code. The machine code includes load instructions, which, when executed, are used to obtain the thread-local memory address corresponding to each thread-local storage variable based on the driver's allocation result. For details regarding generation module 602, please refer to the description of step S102 in the above code compilation method embodiment; it will not be repeated here.
[0139] For example, in at least one embodiment of this disclosure, the generation module 602 is further configured to convert the instruction in the intermediate representation corresponding to the code used to obtain the address of the thread local storage variable into a load instruction.
[0140] For example, in at least one embodiment of this disclosure, the code compilation apparatus 600 may further include an initialization module configured to insert instructions for initializing thread-local storage variables used by each first function into an intermediate representation corresponding to the code.
[0141] For example, in at least one embodiment of this disclosure, the acquisition module 601 is further configured to determine, for each thread-local storage variable, a first function that uses the thread-local storage variable to construct a thread-local storage variable-first function mapping relationship; and to perform a reverse mapping of the mapping relationship to obtain a record of the thread-local storage variable used by each first function in the code.
[0142] For example, in at least one embodiment of this disclosure, the first function directly uses the thread-local storage variable, or uses the thread-local storage variable by calling a second function.
[0143] For example, in at least one embodiment of this disclosure, the first function is a function called by the host and executed by the device, and the second function is a function called by the device and executed by the device.
[0144] For example, in at least one embodiment of this disclosure, the code compilation apparatus 600 may further include a first removal module configured to, for an unused thread-local storage variable, remove the instruction for obtaining the address of the unused thread-local storage variable from the intermediate representation corresponding to the code.
[0145] For example, in at least one embodiment of this disclosure, the code compilation apparatus 600 may further include a second removal module configured to remove duplicate instructions for obtaining the address of a thread-local storage variable from the intermediate representation corresponding to the code, such that the address of each thread-local storage variable within each first function is obtained only once.
[0146] For example, in at least one embodiment of this disclosure, the load instruction is further configured to: for each first function, read the address table of thread-local storage variables used by the first function from the register corresponding to the first function; and obtain the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the address table, wherein the address table is created by the driver at runtime in the register corresponding to the first function according to the address allocation result.
[0147] For example, in at least one embodiment of this disclosure, obtaining the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the address table includes: calculating the offset of the thread-local memory address corresponding to each thread-local storage variable used by the first function in the address table according to the order of each thread-local storage variable in the record corresponding to the first function and the address occupancy size of each thread-local storage variable used by the first function; and determining the thread-local memory address corresponding to each thread-local storage variable used by the first function based on the offset.
[0148] For example, in at least one embodiment of this disclosure, the register includes a constant scalar register.
[0149] For example, in at least one embodiment of this disclosure, thread-local memory is located in the graphics processor.
[0150] It should be noted that the various modules and units mentioned above can be implemented by software, hardware, firmware or any combination thereof. For example, the acquisition module and the generation module can be implemented as an acquisition circuit and a generation circuit, respectively. The embodiments of this disclosure do not limit their specific implementation methods.
[0151] It should be understood that the code compilation apparatus 600 provided in at least one embodiment of this disclosure can be used to implement the aforementioned code compilation method and can also achieve similar technical effects as the aforementioned code compilation method, which will not be elaborated here.
[0152] It should be noted that, in the embodiments of this disclosure, the code compilation apparatus 600 may include more or fewer modules or units, and the connection relationship between the various modules or units is not limited and can be determined according to actual needs. The specific configuration of each module or unit is not limited; it can be constructed from analog devices according to circuit principles, or from digital chips, or in other suitable ways.
[0153] Figure 7 is a schematic block diagram of an electronic device provided by at least one embodiment of the present disclosure.
[0154] For example, as shown in Figure 7, the electronic device 700 includes at least one processor 701 and at least one memory 702. The at least one memory 702 includes one or more computer program modules. These computer program modules are stored in the memory 702 and configured to be executed by the at least one processor 701. The one or more computer program modules include instructions for performing the code compilation method described above. When executed by the at least one processor 701, they can perform one or more steps of the code compilation method provided in at least one embodiment of this disclosure. The memory 702 and the processor 701 can be interconnected via a bus system and/or other forms of connection mechanisms (not shown).
[0155] For example, processor 701 can be a central processing unit (CPU), digital signal processor (DSP), graphics processing unit (GPU), general-purpose graphics processing unit (GPGPU), artificial intelligence (AI) accelerator, or other form of processing unit with data processing and/or program execution capabilities, such as a field-programmable gate array (FPGA); for example, the central processing unit (CPU) can be of an x86, ARM, or RISC-V architecture. Processor 701 can be a general-purpose processor or a special-purpose processor, capable of controlling other components in electronic device 700 to perform desired functions.
[0156] For example, memory 702 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, etc.
[0157] Figure 8 is a schematic block diagram of another electronic device provided by at least one embodiment of the present disclosure.
[0158] The electronic devices in at least one embodiment of this disclosure may include, but are not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), in-vehicle terminals (e.g., in-vehicle navigation terminals), wearable electronic devices, and fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 8 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0159] The electronic device includes at least one processor and a memory. The processor may be referred to as processing device 801 as described below, and the memory may include at least one of read-only memory (ROM), random access memory (RAM), and storage device 808 as described below. The memory is used to store programs for performing the methods described in the various method embodiments above; the processor is configured to execute the programs stored in the memory. The processor may include a central processing unit (CPU) or other forms of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
[0160] As shown in Figure 8, the electronic device 800 may include a processing device 801 (e.g., a central processing unit, a graphics processor, etc.), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. Various programs and data required for the operation of the electronic device 800 are also stored in RAM 803. The processing device 801, ROM 802, and RAM 803 are interconnected via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
[0161] Typically, the following devices can be connected to I/O interface 805: input devices 806 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 807 including, for example, displays, speakers, vibrators, etc.; storage devices 808 including, for example, magnetic tapes, hard disks, etc.; and communication devices 809. Communication device 809 allows electronic device 800 to communicate with other devices over a wireless or wired connection to exchange data. Although Figure 8 shows an electronic device 800 with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present.
[0162] In particular, according to at least one embodiment of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, at least one embodiment of this disclosure includes a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via a communication device 809, or installed from a storage device 808, or installed from a ROM 802. When the computer program is executed by a processing device 801, it performs the functions defined in the methods of at least one embodiment of this disclosure.
[0163] It should be noted that the computer-readable medium described above in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In at least one embodiment of this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In at least one embodiment of this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, radio frequency (RF), etc., or any suitable combination thereof.
[0164] The aforementioned computer-readable medium may be included in the aforementioned electronic device 800, or it may exist separately without being assembled into the electronic device 800.
[0165] Figure 9 is a schematic block diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure.
[0166] For example, as shown in Figure 9, a non-transitory computer-readable storage medium 900 stores computer-readable instructions 901, which, when executed by at least one processor, perform one or more steps of the code compilation method described above.
[0167] For example, the storage medium may include a memory card for a smartphone, a storage component for a tablet computer, a hard drive for a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), flash memory, or any combination of the above storage media, or other suitable storage media. For example, the readable storage medium may also be the memory 702 shown in Figure 7; the relevant description has been given above and is not repeated here.
[0168] Although the present disclosure has been described in detail above with general descriptions and specific embodiments, modifications or improvements can be made to the embodiments of the present disclosure, which will be obvious to those skilled in the art. Therefore, all such modifications or improvements made without departing from the spirit of the present disclosure are within the scope of protection claimed by the present disclosure.
[0169] The following points should be noted regarding this disclosure:
[0170] (1) The accompanying drawings of the embodiments of this disclosure relate only to the structures involved in those embodiments. For other structures, reference may be made to common designs.
[0171] (2) For clarity, the thickness of layers or regions in the drawings used to describe embodiments of the present disclosure is enlarged or reduced, i.e., these drawings are not drawn to actual scale.
[0172] (3) Where there is no conflict, the embodiments of this disclosure and the features in the embodiments can be combined with each other to obtain new embodiments.
[0173] The above description is merely a specific embodiment of this disclosure, but the scope of protection of this disclosure is not limited thereto. The scope of protection of this disclosure should be determined by the scope of protection of the claims.
Claims
1. A code compilation method, wherein the code includes at least one thread-local storage variable, and the code compilation method includes: recording, in metadata, the thread-local storage variables used by each first function in the code, and passing the metadata to a driver, so that the driver, at runtime, allocates each thread-local storage variable to thread-local memory according to the metadata and creates an address table according to the address allocation result; and generating machine code corresponding to the code based on the intermediate representation corresponding to the code, wherein the machine code includes a load instruction which, when executed, is used to obtain the thread-local memory address corresponding to each thread-local storage variable based on the address table created by the driver; wherein the load instruction is further used for, for each first function: reading the address table of the thread-local storage variables used by the first function from the register corresponding to the first function; and obtaining, based on the address table, the thread-local memory address corresponding to each thread-local storage variable used by the first function; wherein the address table is created by the driver at runtime in the register corresponding to the first function based on the address allocation result, and the order in which the thread-local memory addresses corresponding to the thread-local storage variables are recorded in the address table is determined by the order of the thread-local storage variables recorded in the metadata.
2. The code compilation method according to claim 1, wherein the step of generating the machine code corresponding to the code based on the intermediate representation corresponding to the code includes: converting the instruction in the intermediate representation corresponding to the code that is used to obtain the address of the thread-local storage variable into the load instruction.
3. The code compilation method according to claim 1, further comprising: inserting, into the intermediate representation corresponding to the code, instructions to initialize the thread-local storage variables used by each first function.
4. The code compilation method according to claim 1, wherein the step of recording, in the metadata, the thread-local storage variables used by each first function in the code includes: for each thread-local storage variable, determining the first functions that use the thread-local storage variable, so as to construct a mapping relationship from thread-local storage variables to first functions; and reversing the mapping relationship to obtain the thread-local storage variables used by each first function in the code, and recording them in the metadata.
5. The code compilation method according to claim 4, wherein the first function uses the thread-local storage variable either directly or by calling a second function that uses the thread-local storage variable.
6. The code compilation method according to claim 5, wherein the first function is a function called by the host and executed by the device, and the second function is a function called by the device and executed by the device.
7. The code compilation method according to claim 1, further comprising: for an unused thread-local storage variable, removing the instructions used to obtain the address of the unused thread-local storage variable from the intermediate representation corresponding to the code.
8. The code compilation method according to claim 1, further comprising: removing duplicate instructions for obtaining the address of a thread-local storage variable from the intermediate representation corresponding to the code, so that the address of each thread-local storage variable is obtained only once within each first function.
9. The code compilation method according to claim 1, wherein the step of obtaining, based on the address table, the thread-local memory address corresponding to each thread-local storage variable used by the first function includes: calculating, based on the order of each thread-local storage variable in the metadata corresponding to the first function and the address size of each thread-local storage variable used by the first function, the offset in the address table of the thread-local memory address corresponding to each thread-local storage variable used by the first function; and determining, based on the offset, the thread-local memory address corresponding to each thread-local storage variable used by the first function.
10. The code compilation method according to claim 1, wherein the register includes a constant scalar register.
11. An electronic device, comprising: at least one processor; and at least one memory, including one or more computer program modules; wherein the one or more computer program modules are stored in the at least one memory and configured to be executed by the at least one processor, and the one or more computer program modules are used to implement the code compilation method according to any one of claims 1-10.
12. A non-transitory computer-readable storage medium having computer instructions stored thereon, wherein, when the computer instructions are executed by at least one processor, the code compilation method according to any one of claims 1-10 is performed.
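For illustration only, the metadata recording of claim 4, the address table of claim 1, and the offset calculation of claim 9 might be sketched in Python as follows. This is a minimal host-side simulation under stated assumptions, not the disclosed compiler or driver: all function names, variable names, the uniform 8-byte address size, and the base address are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict

ADDRESS_SIZE = 8  # assumed size in bytes of one address-table entry (hypothetical)


def invert_tls_usage(var_to_functions):
    """Reverse a 'TLS variable -> first functions' mapping (claim 4).

    Returns 'first function -> ordered list of TLS variables'. Variables that
    no first function uses simply do not appear in the result.
    """
    function_to_vars = defaultdict(list)
    for var, functions in var_to_functions.items():
        for function in functions:
            function_to_vars[function].append(var)
    return dict(function_to_vars)


def address_table_offsets(ordered_vars, address_size=ADDRESS_SIZE):
    """Offset of each variable's entry in the address table (claim 9).

    The offset follows from the variable's position in the metadata and the
    (here uniform) address size.
    """
    return {var: index * address_size for index, var in enumerate(ordered_vars)}


def driver_build_address_table(ordered_vars, var_sizes, base=0x1000):
    """Simulated driver step: allocate thread-local memory for each variable
    and record the resulting addresses in metadata order (claim 1)."""
    table, next_address = [], base
    for var in ordered_vars:
        table.append(next_address)
        next_address += var_sizes[var]
    return table


def load_tls_address(table, offset, address_size=ADDRESS_SIZE):
    """Simulated load instruction: read one address from the table at a byte offset."""
    return table[offset // address_size]


# Hypothetical usage: two first functions and three TLS variables, one unused.
usage = {"tls_a": ["kernel0", "kernel1"], "tls_b": ["kernel1"], "tls_c": []}
metadata = invert_tls_usage(usage)
offsets = address_table_offsets(metadata["kernel1"])
table = driver_build_address_table(metadata["kernel1"], {"tls_a": 4, "tls_b": 16})
tls_b_address = load_tls_address(table, offsets["tls_b"])
```

Because the compiler and the simulated driver both follow the order recorded in the metadata, the offsets can be fixed at compile time while the addresses themselves are only filled in at runtime.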