Storage space processing method and apparatus, electronic device, and storage medium

By dynamically allocating and releasing shared storage space, the problem of wasted shared memory resources in general-purpose graphics processing units is solved, improving thread-level parallelism and computing performance.

CN122240285A · Pending · Publication Date: 2026-06-19 · KUNLUNXIN TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KUNLUNXIN TECHNOLOGY (BEIJING) CO LTD
Filing Date
2024-12-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The limited and wasteful shared memory resources of general-purpose graphics processing units restrict thread-level parallelism and affect computing performance.

Method used

By dynamically allocating and releasing shared storage space, the utilization rate of shared memory resources is improved, memory fragmentation is reduced, and a mapping table is used to manage the relationship between shared storage space and thread groups, thereby improving thread-level parallelism.

Benefits of technology

It improves the utilization of shared memory resources, reduces memory fragmentation, and enhances the performance and instruction execution efficiency of artificial intelligence computing units.

✦ Generated by Eureka AI based on patent content.


Abstract

This disclosure provides a method for processing storage space, relating to the field of artificial intelligence technology, and particularly to the fields of chips, scientific computing, general-purpose graphics processing technology, and parallel processing. The specific implementation involves: in response to determining that a first target shared storage space is free among multiple shared storage spaces, allocating the first target shared storage space to a target thread group in at least one thread group to be allocated, wherein the thread group to be allocated includes multiple threads; and executing a shared storage release instruction for the target thread group to release the first target shared storage space. This disclosure also provides an instruction execution device, an electronic device, and a storage medium.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, and more particularly to the fields of chips, scientific computing, general-purpose graphics processing technology, and parallel processing. More specifically, this disclosure provides a method, apparatus, electronic device, and storage medium for processing storage space.

Background Technology

[0002] With the development of artificial intelligence technology, the application scenarios of AI computing units are constantly increasing. AI computing units can execute instructions from multiple threads concurrently to improve instruction execution efficiency.

Summary of the Invention

[0003] This disclosure provides a method, apparatus, device, and storage medium for processing storage space.

[0004] According to one aspect of this disclosure, a method for processing storage space is provided, the method comprising: in response to determining that a first target shared storage space is free among a plurality of shared storage spaces, allocating the first target shared storage space to a target thread group in at least one thread group to be allocated, wherein the thread group to be allocated includes a plurality of threads; and executing a shared storage release instruction of the target thread group to release the first target shared storage space.

[0005] According to another aspect of this disclosure, an instruction execution apparatus is provided, the apparatus comprising: a shared storage unit including a plurality of shared storage spaces; and an execution unit configured to: in response to determining that a first target shared storage space is free among the plurality of shared storage spaces, allocate the first target shared storage space to a target thread group in at least one thread group to be allocated, wherein the thread group to be allocated includes a plurality of threads; and execute a shared storage release instruction of the target thread group to release the first target shared storage space.

[0006] According to another aspect of this disclosure, an instruction execution device is provided, which includes the instruction execution apparatus provided in this disclosure.

[0007] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform a method provided according to this disclosure.

[0008] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions for causing a computer to perform the methods provided according to this disclosure.

[0009] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method provided according to this disclosure.

[0010] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Drawings

[0011] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0012] Figure 1 is a flowchart of a storage space processing method according to an embodiment of the present disclosure;

[0013] Figure 2 is a schematic diagram of a mapping table between shared storage space and thread groups according to an embodiment of the present disclosure;

[0014] Figure 3 is a schematic diagram of a storage space processing method according to an embodiment of the present disclosure;

[0015] Figure 4 is a schematic diagram illustrating the determination of the physical address of a shared storage space according to an embodiment of the present disclosure;

[0016] Figure 5 is a schematic diagram illustrating the allocation of shared storage space according to an embodiment of the present disclosure;

[0017] Figure 6 is a block diagram of an instruction execution apparatus according to an embodiment of the present disclosure;

[0018] Figure 7 is a block diagram of an instruction execution device according to an embodiment of the present disclosure; and

[0019] Figure 8 is a block diagram of an electronic device to which a storage space processing method can be applied according to an embodiment of the present disclosure.

Detailed Description

[0020] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0021] Artificial intelligence computing units can be various hardware computing units such as general-purpose graphics processing units (GPGPU), tensor processing units (TPU), and neural network processing units (NPU).

[0022] Taking a general-purpose graphics processing unit (GPGPU) as an example, a GPGPU can execute a large number of threads concurrently to hide the impact of long-latency operations (such as memory accesses) on computational efficiency, thereby improving throughput. A GPGPU can include multiple streaming multiprocessors (SMs). A streaming multiprocessor can include one or more arrays of streaming processors (SPs). A streaming processor array executes multiple threads simultaneously according to the Single Instruction Multiple Threads (SIMT) paradigm. Such a set of threads is called a thread bundle (commonly known as a warp). The GPGPU can employ a hierarchical programming model, in which the program of a general-purpose GPU is called a kernel. A kernel can consist of thousands or even millions of threads. These threads are organized into multiple thread bundles in a SIMT manner, and these bundles further form thread blocks (TBs). Finally, multiple thread blocks constitute the kernel.

[0023] Thread-level parallelism (TLP) is the number of concurrently executing threads within a graphics processing unit (GPU), and it is a crucial factor affecting GPU performance. The magnitude of TLP is influenced by various factors, including register resources, scheduling resources, and shared memory resources. Scheduling resources can include block slots, warp slots, and the number of program counters (PCs). Shared memory resources are important on-chip storage resources on general-purpose GPUs, characterized by low memory access latency and explicit software management. Furthermore, shared memory enables communication and synchronization between threads within the same thread block with relatively low resource overhead.

[0024] However, the shared memory of a general-purpose graphics processing unit is limited and often scarce. For kernels that rely heavily on shared memory, the amount of shared memory available often becomes the bottleneck that limits the thread-level parallelism of the GPU, thus degrading kernel performance.

[0025] General-purpose graphics processing units (GPUs) can statically allocate and release shared memory resources at the thread block level. A thread block is only assigned to a streaming multiprocessor when the available shared memory resources can meet its needs. The shared memory resources occupied by the thread block are only released after all threads or thread bundles within it have finished executing.

[0026] When shared memory resources are allocated statically, they can be requested when a thread block is issued and released when the thread block finishes execution. The shared memory resources are occupied by that thread block throughout its entire lifecycle. However, only a portion of the thread block's lifecycle is spent using these shared memory resources; the rest of the time, they remain idle, leading to resource waste. Furthermore, general-purpose graphics processing units primarily allocate shared memory resources at the thread block level. Consequently, even after the instructions of a specific thread bundle within a thread block have finished executing, the shared memory resources occupied by that thread bundle are still not released. The shared memory resources are only released after all the instructions of all thread bundles within the thread block have completed execution, resulting in a high time cost.

[0027] Furthermore, if the total shared memory capacity is not an integer multiple of the shared memory used by each thread block in the kernel, memory fragmentation will occur, leading to resource waste. For example, a streaming multiprocessor may have 128 kilobytes (KB) of shared memory, and one thread block on this streaming multiprocessor consists of four thread bundles. If the thread block requires 68 kilobytes of shared memory, then even with sufficient other computing resources, at most one thread block can run on this streaming multiprocessor, leaving 60 kilobytes of shared memory unused.
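The arithmetic behind this example can be checked with a short sketch. It is illustrative only: the function name `static_block_occupancy` is hypothetical, and it assumes shared memory is the only limiting resource.

```python
def static_block_occupancy(total_kb: int, per_block_kb: int) -> tuple[int, int]:
    """Return (resident thread blocks, unused KB) under static per-block allocation.

    Assumption for illustration: shared memory is the only limiting resource.
    """
    if per_block_kb <= 0:
        raise ValueError("per_block_kb must be positive")
    blocks = total_kb // per_block_kb          # whole thread blocks that fit
    unused = total_kb - blocks * per_block_kb  # remainder that no block can use
    return blocks, unused


# Values from the example: 128 KB of shared memory, 68 KB per thread block.
blocks, unused = static_block_occupancy(128, 68)
print(blocks, unused)  # 1 60
```

If each thread block needed 64 KB instead, two blocks would fit and nothing would be left over, which is why the waste depends on how evenly the capacity divides.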

[0028] Therefore, in order to improve the utilization of shared memory resources and increase the efficiency of artificial intelligence computing units, this disclosure provides a method for processing storage space, which will be described below.

[0029] Figure 1 is a flowchart of a storage space processing method according to an embodiment of the present disclosure.

[0030] As shown in Figure 1, the method 100 may include operations S110 to S120.

[0031] In operation S110, in response to determining that there is a free first target shared memory space among a plurality of shared memory spaces, the first target shared memory space is allocated to a target thread group in at least one thread group to be allocated.

[0032] In this embodiment of the disclosure, the multiple shared storage spaces can be storage spaces of a shared storage unit. For example, the shared storage unit can be the shared memory described above. The capacity of the shared storage unit can be 128 kilobytes. The capacity of the shared storage space can be 4 kilobytes. There can be 32 shared storage spaces. It is understood that the capacity of the shared storage unit and the capacity of the shared storage space are merely examples, and this disclosure does not impose any limitations on them. The capacity of the shared storage space can be set when writing the code.

[0033] In this embodiment of the disclosure, the first target shared storage space may be an idle shared storage space among multiple shared storage spaces. The first target shared storage space may be a shared storage space not occupied by the thread group, or it may be a shared storage space released by the thread group.

[0034] In this embodiment of the disclosure, the thread group to be assigned includes multiple threads. The thread group to be assigned can be either the thread block or the thread bundle described above.

[0035] In operation S120, the shared memory release instruction of the target thread group is executed to release the first target shared memory space.

[0036] In this embodiment of the disclosure, one or more instructions for the target thread group include shared memory release instructions. Shared memory release instructions are used to release shared memory space occupied by the thread group.

[0037] Through the embodiments of this disclosure, when shared memory space is idle, it can be allocated to thread groups. After executing a shared memory release instruction, the shared memory space can be released. Dynamic allocation of shared memory resources can be achieved at the thread block or thread bundle granularity, thereby effectively improving the utilization rate of shared memory resources, reducing memory fragmentation, increasing thread-level parallelism, and enhancing the performance of artificial intelligence computing units.

[0038] It is understood that the above description, in conjunction with shared memory release instructions, has illustrated the method of this disclosure. However, this disclosure is not limited thereto; the following description will illustrate the method of this disclosure in conjunction with shared memory allocation instructions.

[0039] In some embodiments, the above method may further include: executing a shared memory allocation instruction for an initial thread group to determine whether a second target shared memory space corresponding to the initial thread group exists among a plurality of shared memory spaces of the shared memory unit.

[0040] In this embodiment of the disclosure, executing the shared memory allocation instruction for the initial thread group may include executing an instruction that blocks the initial thread group. For example, the initial thread group may be a first thread bundle, and an instruction may be used to block (stall) the first thread bundle.

[0041] In this embodiment of the disclosure, executing the shared memory allocation instruction for the initial thread group may further include: using the identifier of the initial thread group as an index to query the mapping table between the shared memory space and the thread group. The mapping table of this disclosure is explained below with reference to Figure 2.

[0042] Figure 2 is a schematic diagram of a mapping table between shared storage space and thread groups according to an embodiment of the present disclosure.

[0043] In this embodiment of the disclosure, the mapping table includes multiple entries corresponding to multiple thread groups. For example, as shown in Figure 2, mapping table t20 can include multiple entries. These multiple entries can include entries e201, e202, e203, ..., e232. Taking the case where a thread group is a thread bundle as an example, if the shared memory space has a capacity of 4 kilobytes and the shared memory unit of the streaming multiprocessor has a capacity of 128 kilobytes, instructions for 32 thread bundles can be executed concurrently, and the multiple entries can be 32 entries. These 32 entries can correspond to the first to the 32nd thread bundles. Entry e201 can correspond to the first thread bundle among the 32 thread bundles. Entry e202 can correspond to the second thread bundle. Entry e203 can correspond to the third thread bundle. Entry e232 can correspond to the 32nd thread bundle. It can be understood that the first thread bundle mentioned above can, for example, be the thread bundle corresponding to entry e201.

[0044] In this embodiment of the disclosure, an entry may include a shared memory identification (sm_id) field. The shared memory identification field can indicate the shared memory space corresponding to the thread group. For example, the value of the shared memory identification field of entry e201 can be 000000, which can represent the first shared memory space out of 32 shared memory spaces.

[0045] In this embodiment of the disclosure, the entry may further include a validity field. The validity field can indicate whether the correspondence between the thread group and the shared storage space is valid. For example, the value of the validity field can be a valid value or an invalid value. A valid value can be, for example, 1, and an invalid value can be, for example, 0. If the value of the validity field is valid, it indicates that the correspondence between the thread group and the shared storage space is valid, meaning that the shared storage space corresponding to the value of the shared storage space identifier field can be used as the shared storage space already allocated for the thread group. If the value of the validity field is invalid, it indicates that the correspondence between the thread group and the shared storage space is invalid. In this case, regardless of whether the value of the shared storage space identifier field is null, shared storage space needs to be reallocated to the thread group.

[0046] In this embodiment, the entry may further include a pending assignment flag field. The pending assignment flag field indicates whether a thread group is a pending thread group. For example, the value of the pending assignment flag field can be a first preset value or a second preset value. The first preset value can be, for example, 0. The second preset value can be, for example, 1. If the value of the pending assignment flag field is the first preset value, it indicates that the thread group corresponding to the entry is not a pending thread group. If the value of the pending assignment flag field is the second preset value, it indicates that the thread group corresponding to the entry is a pending thread group. Through this embodiment, a mapping table between shared storage space and thread groups is established, which can quickly determine the relationship between shared storage space and thread bundles or thread blocks, and accurately determine whether shared storage space has been allocated to thread bundles or thread blocks, thus helping to improve processing efficiency.
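The three fields of a mapping table entry can be modeled in software as follows. This is a minimal sketch for illustration, not the hardware layout of the disclosure: the names `MapEntry`, `make_table`, and `has_allocated_space` are hypothetical, and thread groups are indexed from 0, so index 0 stands for the first thread bundle (entry e201).

```python
from dataclasses import dataclass

NUM_ENTRIES = 32  # example from the text: 128 KB unit / 4 KB per shared storage space


@dataclass
class MapEntry:
    sm_id: int = 0    # shared memory identification field
    valid: int = 0    # validity field: 1 = mapping valid, 0 = mapping invalid
    pending: int = 0  # pending assignment flag: 1 = thread group awaits a space


def make_table(n: int = NUM_ENTRIES) -> list[MapEntry]:
    """Create a table with one entry per thread group, all initially invalid."""
    return [MapEntry() for _ in range(n)]


def has_allocated_space(table: list[MapEntry], group_id: int) -> bool:
    """True if the thread group currently owns a shared storage space."""
    return table[group_id].valid == 1


table = make_table()
table[0] = MapEntry(sm_id=0b000000, valid=1)  # first thread bundle owns space 0
print(has_allocated_space(table, 0), has_allocated_space(table, 1))  # True False
```

Indexing the table directly by the thread group identifier is what makes the lookup a single step in this sketch.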

[0047] As can be understood, the mapping table of this disclosure has been explained above, and the method of this disclosure will be further explained below.

[0048] Figure 3 is a schematic diagram of a storage space processing method according to an embodiment of the present disclosure.

[0049] As shown in Figure 3, the multiple instructions of the initial thread group may include a shared memory allocation instruction i30. The shared memory allocation instruction i30 of the initial thread group can be executed to perform operation S301.

[0050] In operation S301, it is determined whether there is a second target shared memory space corresponding to the initial thread group among the multiple shared memory spaces of the shared memory unit.

[0051] In this embodiment of the disclosure, the second target shared storage space may be a shared storage space that has already been allocated for the initial thread group.

[0052] In this embodiment of the disclosure, determining whether a second target shared storage space corresponding to the initial thread group exists among the multiple shared storage spaces of the shared storage unit may include: determining whether the value of the validity field of the entry corresponding to the initial thread group is a valid value.

[0053] In this embodiment of the disclosure, in response to determining that there is no second target shared storage space corresponding to the initial thread group among the multiple shared storage spaces, the initial thread group can be determined as the thread group to be allocated. For example, taking the case where the initial thread group is a second thread bundle as an example, the above-mentioned mapping table t20 can be queried according to the identifier of the second thread bundle to determine the value of the validity field in the entry corresponding to the second thread bundle. If the value of the validity field in the entry is an invalid value, the instruction of the second thread bundle can be blocked, or the value of the pending assignment flag field in the entry can be set to the second preset value. Next, it can be determined whether there is a free first target shared storage space among the multiple shared storage spaces. If there is a free first target shared storage space among the multiple shared storage spaces, operation S310 can be executed.

[0054] In operation S310, the first target shared memory space is allocated to the target thread group. It is understood that the description of operation S110 above also applies to operation S310, and this disclosure will not repeat it here. For example, if there is at least one thread group to be allocated, and that thread group is the aforementioned second thread bundle, the aforementioned second thread bundle can be used as the target thread group, and the first target shared memory space can be allocated to the second thread bundle.

[0055] In another embodiment of this disclosure, in response to determining that a second target shared storage space corresponding to the initial thread group exists among multiple shared storage spaces, the initial thread group can be used as the target thread group to continue executing the instructions of the initial thread group. For example, taking the initial thread group as the first thread bundle described above as an example, based on the identifier of the first thread bundle, querying the mapping table t20 above, an entry e201 corresponding to the first thread bundle can be found. If the value of the validity field in entry e201 is a valid value, it can be determined that the shared storage space corresponding to the value of the shared storage space identifier field in entry e201 is valid. Thus, it can be determined that a second target shared storage space corresponding to the first thread bundle exists.
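The two branches of the shared memory allocation instruction (operations S301 and S310) can be sketched as below. This is a simplified software model under stated assumptions: the function name `exec_alloc_instruction`, the dictionary layout of an entry, and the list of free space identifiers are all hypothetical, and clearing the pending flag after a successful allocation is an assumption that the text does not spell out.

```python
def exec_alloc_instruction(entry: dict, free_ids: list[int]) -> str:
    """Model the allocation instruction for one thread group.

    entry: hypothetical mapping table entry with keys 'sm_id', 'valid', 'pending'.
    free_ids: hypothetical list of identifiers of free shared storage spaces.
    """
    if entry["valid"] == 1:
        # A second target shared storage space already exists: keep executing.
        return "continue"
    # No valid mapping: the group becomes a thread group to be allocated.
    entry["pending"] = 1
    if not free_ids:
        # No free first target shared storage space yet: the group stays blocked.
        return "stalled"
    # S310: allocate a free space to the group.
    entry["sm_id"] = free_ids.pop(0)
    entry["valid"] = 1
    entry["pending"] = 0  # assumption: the flag is cleared once a space is assigned
    return "allocated"


second_bundle = {"sm_id": 0, "valid": 0, "pending": 0}
print(exec_alloc_instruction(second_bundle, [5]))  # allocated
print(second_bundle)  # {'sm_id': 5, 'valid': 1, 'pending': 0}
```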

[0056] Next, after obtaining the target thread group, the instructions of the target thread group can be executed. The following will explain this in conjunction with operations S321 to S322.

[0057] In operation S321, at least one instruction to be executed from the target thread group is executed, and at least one execution result is obtained.

[0058] In this embodiment, at least one instruction to be executed may include at least one of a first instruction to be executed, a second instruction to be executed, and a third instruction to be executed.

[0059] In this embodiment of the disclosure, the first instruction to be executed is executed based on data in the target shared memory space. For example, when executing the first instruction to be executed, data can be read from the target shared memory space, and then calculations can be performed to obtain the execution result of the first instruction to be executed. This execution result can be written to the L1 cache.

[0060] In this embodiment of the disclosure, the execution result of the second instruction to be executed is written to the target shared memory space. For example, when executing the second instruction to be executed, data can be read from the L1 cache or other storage units, and then calculations can be performed to obtain the execution result of the second instruction to be executed. The execution result is then written to the target shared memory space.

[0061] In this embodiment, the third instruction to be executed is executed based on data in the target shared storage space, and the execution result of the third instruction to be executed is written to the target shared storage space. For example, when executing the third instruction to be executed, data can be read from the target shared storage space, and then calculations can be performed to obtain the execution result of the third instruction to be executed. This execution result can be written to the target shared storage space. Through this embodiment, after allocating shared storage space to the thread group, the instructions of the thread group can be executed using the shared storage space, thereby improving instruction execution efficiency and enhancing the performance of the artificial intelligence computing unit.

[0062] In this embodiment of the disclosure, the target shared storage space can be either the first target shared storage space or the second target shared storage space.

[0063] In operation S322, the shared memory release instruction of the target thread group is executed.

[0064] For example, when the shared memory release instruction of the first thread bundle is executed, the value of the validity field in entry e201 can be set to an invalid value to release the second target shared memory space.

[0065] It can be understood that the method of this disclosure has been explained above; the manner of determining the address of the shared storage space is explained below.

[0066] In this embodiment of the disclosure, the physical address of the shared storage space can be determined based on the logical address of the shared storage unit, the capacity of the shared storage space, and the value of the shared storage space identifier field. For example, the physical address sm_paddr of the shared storage space corresponding to the thread group can be determined using the following formula:

[0067] sm_paddr = sm_addr + sm_size × sm_id (Formula 1)

[0068] Here, `sm_addr` can be the logical address of the shared memory unit, `sm_size` can be the capacity of the shared memory space, and `sm_id` can be the value of the shared memory space identifier field. This is explained below with reference to Figure 4.

[0069] Figure 4 is a schematic diagram illustrating the determination of the physical address of a shared storage space according to an embodiment of the present disclosure.

[0070] In this embodiment of the disclosure, when reading data from or writing data to the target shared storage space of a target thread group, the value of the shared storage space identifier field corresponding to the target thread group can be obtained by querying a mapping table based on the identifier of the target thread group. For example, as shown in Figure 4, taking the case where the target thread group is a target thread bundle as an example, the value of the shared storage space identifier field in the entry corresponding to the target thread bundle can be obtained by querying the mapping table t40 based on the identifier warp_slot_id of the target thread bundle. It can be understood that the explanation of mapping table t20 above also applies to mapping table t40, and will not be repeated here.

[0071] In this embodiment of the disclosure, the logical address of the shared memory unit can be generated using an address generation unit (AGU). For example, as shown in Figure 4, the logical address sm_addr of the shared memory unit can be generated using the address generation unit agu40.

[0072] In this embodiment of the disclosure, the physical address of the shared memory space can be determined based on the logical address of the shared memory unit, the capacity of the shared memory space, and the value of the shared memory space identifier field. For example, the capacity of the shared memory space can be multiplied by the value of the shared memory space identifier field corresponding to the target thread bundle to obtain the multiplication result. Then, the adder add40 is used to add the multiplication result and the logical address to obtain the physical address of the shared memory space.
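Formula 1 and the multiply-then-add path described above can be written as a short function. This is a sketch for illustration: `shared_space_paddr` and the lookup dictionary `sm_id_by_warp_slot` are hypothetical names, the 4 KB space size is the example value from the text, and addresses are expressed in bytes by assumption.

```python
SM_SIZE = 4 * 1024  # capacity of one shared storage space in bytes (example value)


def shared_space_paddr(sm_addr: int, sm_id: int, sm_size: int = SM_SIZE) -> int:
    """Formula 1: sm_paddr = sm_addr + sm_size * sm_id."""
    product = sm_size * sm_id  # multiplication step
    return sm_addr + product   # addition step (the role played by adder add40)


# Hypothetical mapping table contents: warp_slot_id -> sm_id
sm_id_by_warp_slot = {0: 0, 1: 3}

# The logical address 0x10 stands in for the output of the address generation unit.
paddr = shared_space_paddr(sm_addr=0x10, sm_id=sm_id_by_warp_slot[1])
print(hex(paddr))  # 0x3010
```

Because the offset depends only on `sm_id`, a thread group can be given a different shared storage space each time without changing the logical addresses used by its instructions.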

[0073] It can be understood that the manner of determining the address of the shared storage space has been explained above; the method of this disclosure is further explained below.

[0074] Figure 5 is a schematic diagram illustrating the allocation of shared storage space according to an embodiment of the present disclosure.

[0075] As shown in Figure 5, method 510 can allocate a first target shared memory space to at least one target thread group. For example, the kernel described above may include 128 thread bundles. While the 1st to 32nd thread bundles are executing, or after the shared memory release instructions of one or more of these 32 thread bundles have been executed, method 510 can be executed if the shared memory allocation instructions of one or more of the 33rd to 128th thread bundles are pending execution. This is described below in conjunction with operations S511 to S515.

[0076] In operation S511, query shared storage indication information.

[0077] In this embodiment of the disclosure, the shared storage space indication information includes multiple indicator bits corresponding to multiple shared storage spaces. For example, taking the aforementioned 32 shared storage spaces as an example, the shared storage space indication information may include 32 indicator bits. The aforementioned first to 32nd thread bundles may correspond to the first to 32nd indicator bits, respectively. The first indicator bit may be the least significant bit in the shared storage space indication information. The 32nd indicator bit may be the most significant bit in the shared storage space indication information. As another example, if there are 64 shared storage spaces, the shared storage space indication information may include 64 indicator bits. In this case, the data size of the shared storage space indication information may be 64 bits. It is understood that the capacity of the shared storage space can be set to various values, such as 2 kilobytes or 4 kilobytes. An address space can be provided for the shared storage indication information. The capacity of this address space may be 64 bits. Correspondingly, the shared storage indication information may be 64 bits. In the case of 32 shared storage spaces, the lower 32 bits (bits 1 to 32) of the 64-bit shared storage indication information are valid, and the upper 32 bits are invalid. In the case of 64 shared storage spaces, both the lower 32 bits (bits 1 to 32) and the upper 32 bits (bits 33 to 64) of the 64-bit shared storage indication information are valid. Through the embodiments of this disclosure, the existence of free shared memory space can be quickly determined using the shared memory space indication information, which can effectively improve the processing efficiency of the chip.

[0078] In this embodiment of the disclosure, an indicator bit can indicate whether the corresponding shared storage space is free. For example, the value of the indicator bit can be a first indicator value or a second indicator value. The first indicator value can indicate that the shared storage space corresponding to the indicator bit is free. The second indicator value can indicate that the shared storage space corresponding to the indicator bit is occupied. The first indicator value can be, for example, 0. The second indicator value can be, for example, 1. As another example, when the shared storage release instruction of the first thread bundle described above is executed, the value of the validity field in entry e201 can be set to an invalid value, and the value of an indicator bit in the shared storage space indication information can be set to the first indicator value. This indicator bit can be, for example, the first bit of the shared storage space indication information. The identifier of the shared storage space corresponding to this indicator bit can be 000000.
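The release behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the `Entry` class and the `release` function are hypothetical names, the first indicator value is taken to be 0 as in the example above, and a shared storage space identifier n is assumed to map to the n-th bit counted from zero, so that identifier 000000 maps to the first (least significant) indicator bit.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical mapping-table entry for one thread bundle."""
    space_id: int  # shared storage space identifier field
    valid: bool    # validity field


def release(entry: Entry, indication: int) -> int:
    """Invalidate the entry and mark its shared storage space as free.

    Returns the updated indication word (free = 0, occupied = 1).
    """
    if not entry.valid:
        raise ValueError("thread bundle holds no shared storage space")
    entry.valid = False
    # Assumption: identifier n corresponds to 0-based bit n of the word.
    return indication & ~(1 << entry.space_id)
```

For example, releasing the space with identifier 0 from a fully occupied 32-bit word `0xFFFFFFFF` yields `0xFFFFFFFE`.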

[0079] In this embodiment of the disclosure, shared storage space indication information can be queried according to a preset query order. For example, shared storage space indication information can be queried in order from least significant bit to most significant bit.

[0080] In operation S512, it is determined whether a first indicator value exists in the shared storage space indication information.

[0081] In this embodiment of the disclosure, in response to determining that no first indicator value exists in the shared storage space indication information, the process can return to operation S511. For example, before the respective shared storage release instructions of the first to 32nd thread bundles are executed, the multiple shared storage spaces are occupied. Accordingly, the values of the multiple indicator bits in the shared storage space indication information can all be the second indicator value. In this case, since there is no free first target shared storage space among the multiple shared storage spaces, the process can return to operation S511.

[0082] In this embodiment of the disclosure, operation S513 can be performed in response to determining that a first indicator value exists in the shared storage space indication information. For example, after the shared storage release instructions of one or more of the first to 32nd thread bundles are executed, one or more shared storage spaces are released. Accordingly, the values of one or more indicator bits in the shared storage space indication information can be set to the first indicator value. After the shared storage space of a thread bundle is released, the mapping table can be updated so that its entries correspond to the thread bundles whose shared storage allocation instructions are to be executed. In this case, a free first target shared storage space may exist among the multiple shared storage spaces. As described above, by querying the shared storage space indication information in order from the least significant bit to the most significant bit, it can be determined that the value of the k-th indicator bit is the first indicator value, and the corresponding shared storage space can be used as the first target shared storage space. k can be an integer greater than or equal to 1 and less than or equal to 32, for example, 1.
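The query of operations S511 and S512 can be illustrated with a short Python sketch. The function name `find_first_free` is hypothetical, and the sketch assumes the example encoding given above, in which the first indicator value (free) is 0 and the second indicator value (occupied) is 1.

```python
from typing import Optional


def find_first_free(indication: int, num_spaces: int = 32) -> Optional[int]:
    """Scan indicator bits from least significant to most significant.

    Returns the 1-based position k of the first bit equal to 0 (free),
    or None if every valid bit is 1 (occupied).
    """
    for k in range(1, num_spaces + 1):
        if (indication >> (k - 1)) & 1 == 0:
            return k
    return None
```

A return value of `None` corresponds to returning to operation S511, while a returned k identifies the first target shared storage space. A hardware design would more likely use a priority encoder than a loop; the loop is only meant to make the query order explicit.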

[0083] In operation S513, the allocation tag information is queried.

[0084] In this embodiment of the disclosure, the allocation tag information may include multiple allocation tag bits. The values of the multiple allocation tag bits correspond to the values of the multiple allocation flag fields mentioned above. For example, the column of mapping table t20 that corresponds to the allocation flag field can be used as the allocation tag information.

[0085] In operation S514, it is determined whether a second preset value exists in the allocation tag information.

[0086] In this embodiment of the disclosure, in response to determining that a second preset value exists in the allocation tag information, a target thread group is determined from at least one thread group to be allocated corresponding to at least one second preset value. For example, the value of the allocation tag field of the second thread bundle can be the second preset value. A polling arbitration strategy can be used to determine the m-th thread bundle as the target thread bundle from at least one thread bundle to be allocated. m can be, for example, an integer greater than 32. The m-th thread bundle can be the second thread bundle mentioned above. Next, operation S515 can be performed.
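The selection of a target thread bundle in operation S514 can be sketched as below. This sketch assumes that the polling arbitration strategy mentioned above behaves like a conventional round-robin arbiter, which the text does not spell out. The class name `RoundRobinArbiter` is hypothetical, thread bundles are indexed from 0, and `pending[i]` is `True` when the allocation flag of bundle i holds the second preset value.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Pick the next pending thread bundle, starting after the last winner."""

    def __init__(self, num_bundles: int) -> None:
        self.num_bundles = num_bundles
        self.last = num_bundles - 1  # so the first search starts at index 0

    def pick(self, pending: List[bool]) -> Optional[int]:
        """Return the 0-based index of the chosen bundle, or None if none is pending."""
        for offset in range(1, self.num_bundles + 1):
            idx = (self.last + offset) % self.num_bundles
            if pending[idx]:
                self.last = idx
                return idx
        return None
```

Because each search starts just after the previous winner, a thread bundle that keeps waiting is not starved by lower-numbered bundles. A result of `None` corresponds to the case in which no second preset value exists in the allocation tag information.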

[0087] In operation S515, the first target shared memory space is allocated to the target thread group.

[0088] For example, when the second thread bundle is determined as the target thread bundle and the identifier of the first target shared storage space is 000000, the value of the shared storage space identifier field of the entry corresponding to the second thread bundle in the mapping table can be set to 000000, the value of the validity field of the entry can be set to a valid value, and the value of the allocation flag field of the entry can be set to the first preset value. In addition, the indicator bit in the shared storage space indication information corresponding to the first target shared storage space can be set to the second indicator value.
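The bookkeeping of operation S515 can be sketched as follows. The names `Entry`, `assign`, `PENDING`, and `NOT_PENDING` are hypothetical. The numeric encodings of the first and second preset values are assumptions made for this sketch, as is the mapping of a shared storage space identifier n to the n-th indicator bit counted from zero.

```python
from dataclasses import dataclass

PENDING = 1      # assumed encoding of the second preset value (awaiting allocation)
NOT_PENDING = 0  # assumed encoding of the first preset value


@dataclass
class Entry:
    space_id: int = 0              # shared storage space identifier field
    valid: bool = False            # validity field
    alloc_flag: int = NOT_PENDING  # allocation flag field


def assign(entry: Entry, space_id: int, indication: int) -> int:
    """Bind a free shared storage space to a pending thread bundle.

    Returns the updated indication word (free = 0, occupied = 1).
    """
    bit = 1 << space_id  # assumption: identifier n maps to 0-based bit n
    if indication & bit:
        raise ValueError("shared storage space is already occupied")
    if entry.alloc_flag != PENDING:
        raise ValueError("thread bundle is not awaiting allocation")
    entry.space_id = space_id
    entry.valid = True
    entry.alloc_flag = NOT_PENDING
    return indication | bit
```

Updating the entry and the indicator bit in a single step keeps the mapping table and the indication information consistent with each other.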

[0089] In another embodiment of this disclosure, in response to determining that there is no second preset value in the allocation tag information, the process returns to operation S513.

[0090] It is understood that the above description uses the example of a thread group being a thread bundle to illustrate this disclosure. However, this disclosure is not limited to this. The following description will use the example of a thread group being a thread block to illustrate this disclosure.

[0091] In this embodiment of the disclosure, executing the shared memory allocation instruction of a thread group may include: in response to executing the shared memory allocation instruction of a thread block, blocking the instructions of at least one thread bundle of the thread block until all thread bundles of the thread block have reached the shared memory allocation instruction. Next, it can be determined whether a second target shared memory space corresponding to the thread block exists among the plurality of shared memory spaces.

[0092] In this embodiment of the disclosure, executing the shared memory release instruction of a thread group may include: releasing the shared memory space corresponding to the thread block in response to determining that the multiple thread bundles in the thread block have executed the shared memory release instruction. It is understood that when the thread group is a thread block, it is necessary to wait for the multiple thread bundles in the thread block to execute the shared memory allocation instruction or the shared memory release instruction. Other aspects of the thread-block-based method of this disclosure are the same as or similar to the thread-bundle-based method described above and are not repeated here.
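The waiting behaviour for thread blocks can be illustrated with a small Python sketch. The class name `BlockBarrier` is hypothetical, and the sketch assumes that every thread bundle of the thread block must reach the instruction before the allocation or release takes effect.

```python
class BlockBarrier:
    """Track which thread bundles of a thread block have reached an instruction."""

    def __init__(self, num_bundles: int) -> None:
        self.num_bundles = num_bundles
        self.arrived = set()

    def arrive(self, bundle_id: int) -> bool:
        """Record arrival; return True once every bundle of the block has arrived."""
        if not 0 <= bundle_id < self.num_bundles:
            raise ValueError("unknown thread bundle")
        self.arrived.add(bundle_id)
        return len(self.arrived) == self.num_bundles
```

One such tracker could be kept per thread block and per instruction; the shared memory space would be allocated or released only when `arrive` returns `True`.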

[0093] In embodiments of this disclosure, when communication is carried out between different thread bundles of a thread block using a shared memory unit, the method of this disclosure can be executed based on the thread block.

[0094] It is understood that the method of this disclosure has been described above, and the apparatus of this disclosure will be described below.

[0095] Figure 6 is a block diagram of an instruction execution apparatus according to an embodiment of the present disclosure.

[0096] As shown in Figure 6, the apparatus 600 may include a shared storage unit 610 and an execution unit 620.

[0097] The shared storage unit 610 may include multiple shared storage spaces. For example, the capacity of a shared storage space can be set when writing the program code and can take various values, such as 2 kilobytes, 4 kilobytes, or 64 kilobytes.

[0098] Execution unit 620 can be configured to: in response to determining that a first target shared memory space is free among multiple shared memory spaces, allocate the first target shared memory space to a target thread group in at least one thread group to be allocated, wherein the thread group to be allocated includes multiple threads; and execute a shared memory release instruction of the target thread group to release the first target shared memory space. For example, execution unit 620 can execute the above method 100.

[0099] In some embodiments, the execution unit is further configured to perform the following operations to execute the shared memory release instruction of the target thread group: executing at least one pending instruction of the target thread group to obtain at least one execution result. The at least one pending instruction includes at least one of a first pending instruction, a second pending instruction, and a third pending instruction. The first pending instruction is executed based on data in the first target shared memory space. The execution result of the second pending instruction is used to write to the first target shared memory space. The third pending instruction is executed based on data in the first target shared memory space, and the execution result of the third pending instruction is used to write to the first target shared memory space. The shared memory release instruction of the target thread group is executed.

[0100] In some embodiments, the execution unit is further configured to: execute a shared memory allocation instruction for an initial thread group to determine whether a second target shared memory space corresponding to the initial thread group exists among a plurality of shared memory spaces. In response to determining that no second target shared memory space corresponding to the initial thread group exists among the plurality of shared memory spaces, the initial thread group is determined as a thread group to be allocated. The execution unit also determines whether a free first target shared memory space exists among the plurality of shared memory spaces.

[0101] In some embodiments, the execution unit is further configured to perform the following operation to execute the shared memory allocation instruction of the initial thread group: blocking the instructions of the initial thread group.

[0102] In some embodiments, the execution unit is further configured to: continue executing the instructions of the initial thread group in response to determining that a second target shared storage space corresponding to the initial thread group exists.

[0103] In some embodiments, the execution unit is further configured to perform the following operation to execute the shared memory allocation instruction for the initial thread group: using the identifier of the initial thread group as an index, querying a mapping table between shared memory spaces and thread groups. The mapping table includes multiple entries corresponding to multiple thread groups. Each entry includes a shared memory space identifier field, a validity field, and an allocation flag field. The shared memory space identifier field indicates the shared memory space corresponding to the thread group. The validity field indicates whether the mapping between the thread group and the shared memory space is valid, and the allocation flag field indicates whether the thread group is a thread group to be allocated.
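The handling of a shared memory allocation instruction by means of the mapping table can be sketched as follows. This is a simplified illustration, not the implementation of the disclosure: the names `Entry` and `on_alloc_instruction` are hypothetical, the allocation flag is modelled as a boolean, and the returned strings merely label the two outcomes described in the preceding paragraphs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    space_id: int = 0      # shared memory space identifier field
    valid: bool = False    # validity field
    pending: bool = False  # allocation flag field: True = awaiting allocation


def on_alloc_instruction(table: Dict[int, Entry], group_id: int) -> str:
    """Handle a shared memory allocation instruction for one thread group."""
    entry = table[group_id]  # the thread group identifier is used as the index
    if entry.valid:
        return "continue"    # a shared memory space is already mapped
    entry.pending = True     # mark as a thread group to be allocated
    return "blocked"         # stall until a free space is assigned
```

For example, a thread group whose entry is valid continues executing, whereas a thread group without a valid entry is flagged and remains blocked until a free shared memory space is allocated to it.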

[0104] In some embodiments, the execution unit is further configured to perform the following operations to determine whether a second target shared storage space corresponding to the initial thread group exists among a plurality of shared storage spaces: determining whether the value of the validity field of the entry corresponding to the initial thread group is a valid value.

[0105] In some embodiments, the execution unit is further configured to perform the following operations to determine whether a free first target shared storage space exists among the plurality of shared storage spaces: determining whether a first indication value exists in shared storage space indication information. The shared storage space indication information includes a plurality of indication bits corresponding to the plurality of shared storage spaces. The value of the indication bit can be a first indication value or a second indication value. The first indication value indicates that the shared storage space corresponding to the indication bit is free, and the second indication value indicates that the shared storage space corresponding to the indication bit is occupied.

[0106] In some embodiments, the thread group to be allocated is a thread block to be allocated or a thread bundle to be allocated. The thread block to be allocated includes multiple thread bundles, and the thread bundle to be allocated includes multiple threads.

[0107] It is understood that the apparatus of this disclosure has been described above, and the equipment including the apparatus will be described below.

[0108] Figure 7 is a block diagram of an instruction execution device according to an embodiment of the present disclosure.

[0109] As shown in Figure 7, the device 70 may include an instruction execution apparatus 700. The instruction execution apparatus 700 may, for example, be the apparatus 600 described above.

[0110] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0111] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0112] Figure 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0113] As shown in Figure 8, device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

[0114] Multiple components in device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, etc.; output unit 807, such as various types of monitors, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0115] The computing unit 801 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as the storage space processing method. For example, in some embodiments, the storage space processing method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed on device 800 via ROM 802 and / or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the storage space processing method described above may be performed. Alternatively, in other embodiments, computing unit 801 may be configured to perform storage space processing methods by any other suitable means (e.g., by means of firmware).

[0116] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard parts (ASSPs), systems-on-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0117] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0118] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory, read-only memory, erasable programmable read-only memory (EPROM) or flash memory, optical fiber, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0119] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a cathode ray tube (CRT) monitor or a liquid crystal display (LCD)); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0120] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0121] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other.

[0122] It should be understood that steps can be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved, and no limitation is imposed herein.

[0123] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A method for processing storage space, comprising: in response to determining that a free first target shared storage space exists among a plurality of shared storage spaces, allocating the first target shared storage space to a target thread group among at least one thread group to be allocated, wherein the thread group to be allocated includes a plurality of threads; and executing a shared storage release instruction of the target thread group to release the first target shared storage space.

2. The method according to claim 1, wherein executing the shared storage release instruction of the target thread group comprises: executing at least one instruction to be executed of the target thread group to obtain at least one execution result, wherein the at least one instruction to be executed includes at least one of a first instruction to be executed, a second instruction to be executed, and a third instruction to be executed, the first instruction to be executed is executed based on data in the first target shared storage space, the execution result of the second instruction to be executed is to be written to the first target shared storage space, and the third instruction to be executed is executed based on data in the first target shared storage space and its execution result is to be written to the first target shared storage space; and executing the shared storage release instruction of the target thread group.

3. The method according to claim 1, further comprising: executing a shared storage allocation instruction of an initial thread group to determine whether a second target shared storage space corresponding to the initial thread group exists among the plurality of shared storage spaces of a shared storage unit; in response to determining that no second target shared storage space corresponding to the initial thread group exists among the plurality of shared storage spaces, determining the initial thread group as the thread group to be allocated; and determining whether a free first target shared storage space exists among the plurality of shared storage spaces.

4. The method according to claim 3, wherein executing the shared storage allocation instruction of the initial thread group comprises: blocking the instructions of the initial thread group.

5. The method according to claim 3, further comprising: in response to determining that a second target shared storage space corresponding to the initial thread group exists, continuing to execute the instructions of the initial thread group.

6. The method according to claim 3, wherein executing the shared storage allocation instruction of the initial thread group comprises: querying a mapping table between shared storage spaces and thread groups using the identifier of the initial thread group as an index, wherein the mapping table includes multiple entries corresponding to multiple thread groups, each entry includes a shared storage space identifier field, a validity field, and an allocation flag field, the shared storage space identifier field indicates the shared storage space corresponding to the thread group, the validity field indicates whether the correspondence between the thread group and the shared storage space is valid, and the allocation flag field indicates whether the thread group is a thread group to be allocated.

7. The method according to claim 6, wherein determining whether a second target shared storage space corresponding to the initial thread group exists among the plurality of shared storage spaces of the shared storage unit comprises: determining whether the value of the validity field of the entry corresponding to the initial thread group is a valid value.

8. The method according to claim 3, wherein determining whether a free first target shared storage space exists among the plurality of shared storage spaces comprises: determining whether a first indication value exists in shared storage space indication information, wherein the shared storage space indication information includes multiple indication bits corresponding to the multiple shared storage spaces, the value of each indication bit is the first indication value or a second indication value, the first indication value indicates that the shared storage space corresponding to the indication bit is free, and the second indication value indicates that the shared storage space corresponding to the indication bit is occupied.

9. The method according to claim 1, wherein the thread group to be allocated is a thread block to be allocated or a thread bundle to be allocated, the thread block to be allocated includes multiple thread bundles, and the thread bundle to be allocated includes multiple threads.

10. An instruction execution apparatus, comprising: a shared storage unit comprising a plurality of shared storage spaces; and an execution unit configured to: in response to determining that a free first target shared storage space exists among the plurality of shared storage spaces, allocate the first target shared storage space to a target thread group among at least one thread group to be allocated, wherein the thread group to be allocated includes a plurality of threads; and execute a shared storage release instruction of the target thread group to release the first target shared storage space.

11. The apparatus according to claim 10, wherein the execution unit is further configured to perform the following operations to execute the shared storage release instruction of the target thread group: executing at least one instruction to be executed of the target thread group to obtain at least one execution result, wherein the at least one instruction to be executed includes at least one of a first instruction to be executed, a second instruction to be executed, and a third instruction to be executed, the first instruction to be executed is executed based on data in the first target shared storage space, the execution result of the second instruction to be executed is to be written to the first target shared storage space, and the third instruction to be executed is executed based on data in the first target shared storage space and its execution result is to be written to the first target shared storage space; and executing the shared storage release instruction of the target thread group.

12. The apparatus according to claim 10, wherein the execution unit is further configured to: execute a shared storage allocation instruction of an initial thread group to determine whether a second target shared storage space corresponding to the initial thread group exists among the plurality of shared storage spaces; in response to determining that no second target shared storage space corresponding to the initial thread group exists among the plurality of shared storage spaces, determine the initial thread group as the thread group to be allocated; and determine whether a free first target shared storage space exists among the plurality of shared storage spaces.

13. The apparatus according to claim 12, wherein the execution unit is further configured to perform the following operation to execute the shared storage allocation instruction of the initial thread group: blocking the instructions of the initial thread group.

14. The apparatus according to claim 12, wherein the execution unit is further configured to: in response to determining that a second target shared storage space corresponding to the initial thread group exists, continue executing the instructions of the initial thread group.

15. The apparatus according to claim 12, wherein the execution unit is further configured to perform the following operation to execute the shared storage allocation instruction of the initial thread group: querying a mapping table between shared storage spaces and thread groups using the identifier of the initial thread group as an index, wherein the mapping table includes multiple entries corresponding to multiple thread groups, each entry includes a shared storage space identifier field, a validity field, and an allocation flag field, the shared storage space identifier field indicates the shared storage space corresponding to the thread group, the validity field indicates whether the correspondence between the thread group and the shared storage space is valid, and the allocation flag field indicates whether the thread group is a thread group to be allocated.

16. The apparatus according to claim 15, wherein the execution unit is further configured to perform the following operation to determine whether a second target shared storage space corresponding to the initial thread group exists among the plurality of shared storage spaces: determining whether the value of the validity field of the entry corresponding to the initial thread group is a valid value.

17. The apparatus according to claim 12, wherein the execution unit is further configured to perform the following operation to determine whether a free first target shared storage space exists among the plurality of shared storage spaces: determining whether a first indication value exists in shared storage space indication information, wherein the shared storage space indication information includes multiple indication bits corresponding to the multiple shared storage spaces, the value of each indication bit is the first indication value or a second indication value, the first indication value indicates that the shared storage space corresponding to the indication bit is free, and the second indication value indicates that the shared storage space corresponding to the indication bit is occupied.

18. The apparatus according to claim 10, wherein the thread group to be allocated is a thread block to be allocated or a thread bundle to be allocated, the thread block to be allocated includes multiple thread bundles, and the thread bundle to be allocated includes multiple threads.

19. An instruction execution device, comprising the apparatus according to any one of claims 10 to 18.

20. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 9.

21. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 9.

22. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 9.