Data processing system
By optimizing thread group synchronization and resource sharing in the graphics processor, allowing some threads to exit early, the problems of high energy consumption and insufficient resource utilization in iterative data processing operations are solved, and more efficient data processing is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ARM LTD
- Filing Date
- 2022-02-25
- Publication Date
- 2026-06-19
AI Technical Summary
The execution of shader programs in existing graphics processors suffers from low efficiency in synchronous thread operations, especially when performing iterative data processing operations, particularly reduction operations, which result in high energy consumption and insufficient resource utilization.
By grouping execution threads and allowing some threads to exit and be deactivated early during iterative data processing operations, unnecessary computations are reduced, thread group synchronization and resource sharing are optimized, and a thread group generator and scheduler are used to achieve more efficient iterative data processing.
It improves the energy efficiency of the graphics processor, reduces the energy consumption of iterative data processing operations, and maintains the accuracy and efficiency of data processing.
Figure CN114971997B_ABST
Abstract
Description
[0001] This invention relates to the operation of a data processing system and a data processor, and more particularly to the operation of a graphics processing system and a graphics processor comprising one or more programmable execution units.
[0002] Many graphics processing units (and their implemented graphics pipelines) now include and / or implement one or more programmable processing stages, often referred to as “shaders”. For example, a graphics processing pipeline will include one or more of the following, and often all of them: geometry shaders, vertex shaders, and fragment (pixel) shaders.
[0003] These shaders are programmable processing stages that execute shader programs on input data values to generate a desired set of output data, such as appropriately shaded and rendered fragment data in the case of a fragment shader. The "shaders" of the graphics processor and processing pipeline may share programmable processing circuitry, or they may each be executed by separate programmable processing units.
[0004] It is also known to use graphics processing units (GPUs) and graphics processing pipelines, and in particular their shader operations, to perform more general computing tasks, for example where similar operations need to be performed on a large number of distinct input data values. These operations are commonly referred to as "compute shading" operations, and several specific compute shading APIs, such as OpenCL and Vulkan, have been developed for situations where GPUs and graphics processing pipelines are to be used for more general computing operations. Compute shading is used to compute arbitrary information. It can be used to process graphics-related data if needed, but is typically used for tasks not directly related to performing graphics processing.
[0005] A graphics processing unit (GPU) shader core is therefore a processing unit that performs processing by running a small program for each "work item" in the output to be generated. Each "work item" in the output to be generated can be, for example, a vertex or fragment (pixel) or a compute shader work item. This typically enables a high degree of parallelism, because a typical output (e.g., a frame) features a fairly large number of vertices and fragments, each of which can be processed independently.
[0006] In graphics shader operation, including compute shader operation, each "work item" is processed by an execution thread, which executes the instructions of the shader program for the "work item" in question. There will typically be multiple execution threads executing at the same time (in parallel), and different threads may need to be synchronized with each other when executing instructions. For example, it is often desirable to synchronize each iteration of a data processing operation that is performed iteratively.
[0007] One way to synchronize execution threads is to provide a "barrier" operation. Typically, "work items" (and therefore the threads that process those work items) are grouped into "work groups," and the barrier operation ensures that when a thread in a work group reaches a barrier, it must wait until every other thread in the same work group has reached the barrier before it can proceed to cross it. Barriers can be used, for example, to ensure that all memory access operations pending in the work group before the barrier have been completed before any thread in the work group can proceed to cross it.
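The barrier behaviour described above can be illustrated on a CPU with ordinary threads. The following is a minimal sketch, not GPU code: the function name `run_work_group`, the doubling step, and the neighbour read are illustrative assumptions, and Python's `threading.Barrier` merely stands in for a work-group barrier.

```python
import threading

def run_work_group(values):
    """Each thread writes one value, waits at a barrier, then reads a neighbour's value."""
    n = len(values)
    shared = [None] * n            # stands in for memory shared by the work group
    results = [None] * n
    barrier = threading.Barrier(n)

    def work_item(i):
        shared[i] = values[i] * 2          # memory access issued before the barrier
        barrier.wait()                     # no thread proceeds until all n have arrived
        results[i] = shared[(i + 1) % n]   # safe: the neighbour's write has completed

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the `barrier.wait()` call, a thread could read its neighbour's slot before the neighbour has written it.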
[0008] Another way to synchronize execution threads is to group them into thread “groups” or “bundles,” where the threads of a group run in lockstep in the hardware, for example, one instruction at a time. This allows instruction fetching and scheduling resources to be shared among all threads in the thread group. Other terms used for such thread groups include “warp” and “wavefront.” For convenience, this document will use the term “thread group,” but unless otherwise specified, it is intended to cover all equivalent terms and arrangements. Work items processed by threads in a thread group typically correspond to a subset, or “subgroup,” of work items in a work group.
[0009] The applicant believes that there is still room for improvement in data processing systems and data processors, especially graphics processing systems and the execution of shader programs within graphics processors.
[0010] According to a first aspect of the present invention, a method for operating a data processing system is provided, the data processing system including a data processor operable to execute a program to perform data processing operations, wherein the program can be executed simultaneously by multiple execution threads; the method includes:
[0011] Including, in a program to be executed by the data processor, a set of one or more instructions which, when executed by the execution threads in a set of multiple execution threads, will cause those execution threads to perform an iterative data processing operation, wherein:
[0012] Each iteration of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads, each of which executes a corresponding data processing operation; and wherein
[0013] The first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads, each of which will perform its own data processing operation.
[0014] Each of one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread in the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration but do not perform a corresponding data processing operation in this iteration; and
[0015] Each of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation while at least one iteration remains for which that execution thread will not perform a corresponding data processing operation;
[0016] Such that at least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation will:
[0017] Perform a corresponding data processing operation for the first iteration of the iterative data processing operation;
[0018] Perform a corresponding data processing operation for each of zero or more subsequent iterations of the iterative data processing operation; and then
[0019] Exit the iterative data processing operation while at least one iteration remains for which the execution thread will not perform a corresponding data processing operation.
[0020] The method further includes, when a set of multiple execution threads is executing the program, in response to the set of one or more instructions:
[0021] The execution threads in the set of multiple execution threads performing the iterative data processing operation; and
[0022] At least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation:
[0023] Performing a corresponding data processing operation for the first iteration of the iterative data processing operation;
[0024] Performing a corresponding data processing operation for each of zero or more subsequent iterations of the iterative data processing operation; and then
[0025] Exiting the iterative data processing operation while at least one iteration remains for which the execution thread will not perform a corresponding data processing operation.
[0026] According to a second aspect of the present invention, a data processing system is provided, the data processing system comprising:
[0027] A data processor operable to execute a program to perform data processing operations, wherein the program can be executed concurrently by multiple execution threads; and
[0028] A processing circuit configured to include a set of one or more instructions in a program to be executed by the data processor, the instructions, when executed by an execution thread in a set of multiple execution threads, causing the execution thread in the set of multiple execution threads to perform iterative data processing operations, wherein:
[0029] Each iteration of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads, each of which executes a corresponding data processing operation; and wherein
[0030] The first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads, each of which will perform its own data processing operation.
[0031] Each of one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread in the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration but do not perform a corresponding data processing operation in this iteration; and
[0032] Each of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation while at least one iteration remains for which that execution thread will not perform a corresponding data processing operation;
[0033] Such that at least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation will:
[0034] Perform a corresponding data processing operation for the first iteration of the iterative data processing operation;
[0035] Perform a corresponding data processing operation for each of zero or more subsequent iterations of the iterative data processing operation; and then
[0036] Exit the iterative data processing operation while at least one iteration remains for which the execution thread will not perform a corresponding data processing operation.
[0037] The data processor is configured such that, when a set of multiple execution threads is executing the program, in response to the set of one or more instructions:
[0038] The execution threads in the set of multiple execution threads will perform the iterative data processing operation; and
[0039] At least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation will:
[0040] Perform a corresponding data processing operation for the first iteration of the iterative data processing operation;
[0041] Perform a corresponding data processing operation for each of zero or more subsequent iterations of the iterative data processing operation; and then
[0042] Exit the iterative data processing operation while at least one iteration remains for which the execution thread will not perform a corresponding data processing operation.
[0043] The invention is also extended to the operation of a data processor according to the invention in response to a set of one or more instructions.
[0044] Therefore, according to a third aspect of the invention, a method for operating a data processor is provided, the data processor including a programmable execution unit operable to execute a program to perform data processing operations, wherein the program can be executed simultaneously by multiple execution threads; the method includes:
[0045] When a set of multiple execution threads is executing a program that includes a set of one or more instructions for performing iterative data processing operations, where:
[0046] Each iteration of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads, each of which executes a corresponding data processing operation; and wherein
[0047] The first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads, each of which will perform its own data processing operation.
[0048] Each of one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread in the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration but do not perform a corresponding data processing operation in this iteration; and
[0049] Each of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation while at least one iteration remains for which that execution thread will not perform a corresponding data processing operation; in response to the set of one or more instructions:
[0050] The execution threads in the set of multiple execution threads performing the iterative data processing operation; and
[0051] At least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation:
[0052] Performing a corresponding data processing operation for the first iteration of the iterative data processing operation;
[0053] Performing a corresponding data processing operation for each of zero or more subsequent iterations of the iterative data processing operation; and then
[0054] Exiting the iterative data processing operation while at least one iteration remains for which the execution thread will not perform a corresponding data processing operation.
[0055] According to a fourth aspect of the present invention, a data processor is provided, the data processor comprising:
[0056] A programmable execution unit, operable to execute a program to perform data processing operations, wherein the program can be executed concurrently by multiple execution threads; and
[0057] A processing circuit, configured such that when a set of multiple execution threads is executing a program comprising a set of one or more instructions for performing iterative data processing operations, wherein:
[0058] Each iteration of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads, each of which executes a corresponding data processing operation; and wherein
[0059] The first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads, each of which will perform its own data processing operation.
[0060] Each of one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread in the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration but do not perform a corresponding data processing operation in this iteration; and
[0061] Each of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation while at least one iteration remains for which that execution thread will not perform a corresponding data processing operation; in response to the set of one or more instructions:
[0062] The execution threads in the set of multiple execution threads will perform the iterative data processing operation; and
[0063] At least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation will:
[0064] Perform a corresponding data processing operation for the first iteration of the iterative data processing operation;
[0065] Perform a corresponding data processing operation for each of zero or more subsequent iterations of the iterative data processing operation; and then
[0066] Exit the iterative data processing operation while at least one iteration remains for which the execution thread will not perform a corresponding data processing operation.
[0067] This invention relates to performing iterative data processing operations in a data processor (and system) capable of executing multiple execution threads simultaneously (in parallel). In this invention, when performing an iterative data processing operation, the first iteration (loop) of the overall iterative data processing operation involves all execution threads of a set of multiple execution threads, each performing a corresponding data processing operation. However, one or more subsequent iterations (loops) of the iterative data processing operation involve fewer execution threads performing the corresponding data processing operation than the previous iteration; that is, the number of execution threads performing the corresponding data processing operation decreases from one iteration (loop) of the iterative data processing operation to the next iteration.
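As an illustration of how the number of participating threads shrinks from one iteration to the next, the sketch below lists the active thread count per iteration. The halving schedule and the power-of-two thread count are assumptions made for the example; the text only requires that later iterations involve fewer threads. The function name `active_threads_per_iteration` is hypothetical.

```python
def active_threads_per_iteration(num_threads):
    """Assumed halving schedule: all threads in iteration 0, then half as many each time."""
    if num_threads < 1 or num_threads & (num_threads - 1):
        raise ValueError("this sketch assumes a power-of-two thread count")
    counts = []
    active = num_threads
    while active >= 1:
        counts.append(active)   # threads performing an operation in this iteration
        active //= 2            # the other half no longer participates
    return counts
```

For eight threads this gives four iterations with 8, 4, 2, and 1 participating threads.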
[0068] In other words, in this invention, when performing an iterative data processing operation, there is at least one execution thread that performs a corresponding data processing operation for each iteration of the overall iterative data processing operation; and there is at least one other execution thread that performs a corresponding data processing operation for each iteration of the first one or more iterations of the overall iterative data processing operation, but does not perform a corresponding data processing operation for each iteration of the remaining one or more iterations of the overall iterative data processing operation.
[0069] An example of such iterative data processing operations would be (and in one implementation, is) a so-called "reduction" operation, which combines multiple input data values into a single output data value. Reduction operations are typically performed iteratively, with each iteration involving a decreasing number of execution threads performing the corresponding combination operation. Any binary operation that is commutative and associative can be used as a reduction operator. Therefore, the combination operation of a reduction operation could be (and in one implementation, is) a summation operation to sum the data values, or a multiplication operation to multiply the data values. Other operators such as maximum and minimum are also possible.
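A pairwise (tree) reduction with a pluggable operator can be sketched as follows. This is a sequential illustration under stated assumptions, not the patented method: `tree_reduce` is a hypothetical name, and the scheme pairs adjacent values, so it relies on associativity; schemes that pair values in a different order additionally rely on commutativity.

```python
import operator

def tree_reduce(values, combine):
    """Reduce values pairwise, level by level, until a single value remains."""
    if not values:
        raise ValueError("need at least one value")
    level = list(values)
    while len(level) > 1:
        # Combine adjacent pairs; each pair could be handled by a separate thread.
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an unpaired value is carried to the next level
        level = nxt
    return level[0]
```

Calling `tree_reduce(data, operator.add)`, `operator.mul`, `max`, or `min` gives the same result as a left-to-right fold. A non-associative operator such as subtraction does not, which is why it is not a valid reduction operator.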
[0070] In this invention, one or more iterations of the iterative data processing (e.g., reduction) operation (such as, and preferably, the first iteration or a subsequent iteration other than the last iteration) each involve an execution thread exiting the iterative data processing operation (and, as detailed below, preferably being deactivated) while at least one iteration remains for which that execution thread will not perform a corresponding data processing operation. In other words, an execution thread in the set of multiple execution threads exits the iterative data processing operation when at least one iteration remains in which it will not perform a corresponding data processing operation, but in which one or more other execution threads in the set will.
[0071] Therefore, once an execution thread has completed the corresponding data processing (e.g., combination) operation for each of the one or more iterations in which it performs that operation, it exits (e.g., is deactivated from) the iterative data processing operation before the start of a subsequent iteration in which it will not perform a corresponding data processing (e.g., combination) operation. Thus, in this invention, at least one execution thread in the set of execution threads that started the iterative data processing operation exits the iterative data processing operation "early" (and is optionally deactivated), that is, before the start of the last one or more iterations of the iterative data processing operation.
[0072] The applicant has recognized that when performing iterative data processing operations such as "reduction" operations, certain execution threads will only need to perform a corresponding data processing (e.g., combination) operation for some, but not all, of the iterations of the iterative data processing operation. Furthermore, the number of execution threads that do not actually perform a data processing operation can increase from one iteration to the next. Additionally, the applicant has recognized that such execution threads can exit (e.g., be deactivated from) the iterative data processing operation "early" (before the start of the last one or more iterations of the iterative data processing operation) without affecting the final result of the iterative data processing operation.
[0073] Furthermore, by having such execution threads exit (e.g., be deactivated from) the iterative data processing operation "early" in this way, the overall energy consumption associated with performing the iterative data processing operation can be reduced (e.g., compared to an arrangement in which all execution threads participate in all iterations of the iterative data processing operation, regardless of whether they actually perform a corresponding data processing operation in those iterations).
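The early-exit behaviour can be simulated sequentially. The sketch below is an illustration under several assumptions that are not taken from the source: each simulated thread owns two inputs, iteration 0 combines that pair, later iterations use a doubling stride, and the thread count is a power of two. The name `reduce_with_early_exit` is hypothetical. The partial result of an exited thread stays in `partial`, standing in for storage that remains readable after the thread has left.

```python
def reduce_with_early_exit(inputs, combine):
    """Return (result, last_iteration_per_thread) for a simulated tree reduction."""
    n = len(inputs) // 2                      # number of simulated threads
    if n < 1 or len(inputs) != 2 * n or n & (n - 1):
        raise ValueError("this sketch assumes 2 * (power of two) inputs")
    last_iteration = [None] * n               # last iteration each thread took part in
    active = list(range(n))
    # Iteration 0: every thread performs its own combination operation.
    partial = [combine(inputs[2 * t], inputs[2 * t + 1]) for t in active]
    iteration, stride = 0, 1
    while stride < n:
        survivors = {t for t in active if t % (2 * stride) == 0}
        for t in active:
            if t not in survivors:
                last_iteration[t] = iteration  # exits early: no work left for it
        active = sorted(survivors)
        iteration += 1
        for t in active:
            partial[t] = combine(partial[t], partial[t + stride])
        stride *= 2
    last_iteration[0] = iteration              # thread 0 takes part in every iteration
    return partial[0], last_iteration
```

With eight inputs and four threads, threads 1 and 3 leave after iteration 0, thread 2 leaves after iteration 1, and only thread 0 runs the final iteration, yet the result equals that of a full reduction.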
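A rough way to see the potential saving is to count "thread-iteration slots", that is, how many threads remain resident in each iteration. The sketch below compares keeping every thread for every iteration with letting threads exit once they have no more work. The slot count is only a crude proxy chosen for illustration, not an energy measurement, and the halving schedule and the name `occupied_slots` are assumptions.

```python
def occupied_slots(num_threads):
    """Return (slots_without_early_exit, slots_with_early_exit) for a halving schedule."""
    if num_threads < 1 or num_threads & (num_threads - 1):
        raise ValueError("this sketch assumes a power-of-two thread count")
    iterations = num_threads.bit_length()          # log2(n) + 1 iterations
    without_exit = num_threads * iterations        # every thread stays for every iteration
    with_exit = sum(num_threads >> k for k in range(iterations))  # n + n/2 + ... + 1
    return without_exit, with_exit
```

Under these assumptions, eight threads occupy 32 slots when none exit and 15 slots when idle threads exit early.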
[0074] Therefore, the present invention allows iterative data processing operations such as reduction operations to be performed in a particularly efficient manner, thereby reducing the energy consumption of the data processor when performing such operations.
[0075] Therefore, it should be understood that the present invention provides an improved data processing system and data processor.
[0076] This invention can be used or applied to any suitable data processor, wherein the program can be executed concurrently by multiple execution threads, such as a (multithreaded) central processing unit (CPU).
[0077] In a preferred embodiment, as described above, the present invention is for a graphics processing unit (GPU) that executes a shader program, and therefore in a preferred embodiment, the data processing system is a graphics processing system, the data processor is a graphics processing unit (GPU), and the program is a (graphics) shader program.
[0078] In this context, the program can be, for example, a fragment (pixel) shader program, a vertex shader program, or a tile shader program. However, in a preferred embodiment, the program to be executed by the graphics processing unit (GPU) is a compute shader program, i.e., a program used to perform more general "compute" processing (rather than graphics processing as such), for example in accordance with the OpenCL or Vulkan APIs, or another form of kernel.
[0079] To facilitate program execution, in a preferred embodiment, the data (e.g., graphics) processor includes one or more programmable execution units (e.g., shader cores), each operable to execute a (e.g., shader) program and preferably to execute multiple execution threads simultaneously (in parallel). Therefore, in one embodiment, the execution threads from a set of multiple execution threads are issued to the one or more programmable execution units (shaders) of the data processor for execution. In one embodiment, the data (e.g., graphics) processor includes thread generation circuitry that generates execution threads and issues them to the one or more programmable execution units (shaders) of the data (e.g., graphics) processor for execution.
[0080] The data (e.g., graphics) processor (programmable execution unit (shader)) should be, and in one embodiment is, operable to execute the instructions in the (shader) program for each processing item that the data (e.g., graphics) processor (programmable execution unit (shader)) receives for processing. Therefore, in one embodiment, the thread generation circuitry generates and issues an execution thread for each item to be processed. In one embodiment, each execution thread in the set of multiple execution threads executes the program for a corresponding processing item in a set of multiple processing items to be processed to generate an output.
[0081] The processing items for which the program is executed can be any suitable and desired processing items. In the context of graphics processing, a processing item can be, for example, a vertex, primitive, or fragment (pixel). In the context of OpenCL "compute" processing, each processing item is preferably a corresponding (OpenCL) work item. Similarly, a set of multiple processing items can be any suitable and desired set of multiple processing items. The set of multiple processing items is preferably a corresponding (OpenCL) work group. Each processing item (work item) is preferably associated with a corresponding set of one or more initial data values to be processed.
[0082] Therefore, in one embodiment, the set of multiple execution threads corresponds to a set (workgroup) of corresponding multiple processing items, and each execution thread in the set of multiple execution threads corresponds to a corresponding processing item (work item) in the set (workgroup) of multiple processing items. In the presence of multiple sets (workgroups) of multiple processing items (and corresponding sets of multiple execution threads), one or more sets of processing items or all sets of processing items in the multiple sets (workgroups) of multiple processing items (sets of multiple execution threads) can each be processed by the method of the present invention.
[0083] In one implementation, the data (e.g., graphics) processor (programmable execution unit) is operable to group execution threads that are executing a program into thread “groups” or “bundles,” where the threads within a group execute the program together and in lockstep. This arrangement improves program (shader) execution efficiency because instruction fetching and scheduling resources can be shared, for example, among all threads in the group. (Other terms used for such thread groups include “subgroup,” “warp,” and “wavefront.” For convenience, the term “thread group” will be used herein, but unless otherwise specified, this is intended to cover all equivalent terms and arrangements.)
[0084] In this context, in one implementation, each thread group corresponds to a corresponding subset (some but not all) of the set of multiple processing items (the workgroup), and each execution thread in the thread group corresponds to a corresponding processing item (work item) of that subset. For example, in the case of OpenCL, each thread group preferably corresponds to an (OpenCL) subgroup, and each execution thread in the thread group preferably corresponds to a corresponding work item of that subgroup. In one implementation, the set of multiple processing items (workgroup) includes multiple such subsets (subgroups), and therefore the set of multiple execution threads preferably corresponds to multiple thread groups. Thus, in one implementation, thread groups are issued to the one or more programmable execution units of the data (e.g., graphics) processor for execution. In one implementation, the data processor includes a thread group generator and a scheduler that generate and schedule thread groups for execution. In one implementation, where execution threads are grouped into thread groups, the at least one execution thread from the set of multiple execution threads that exits (and is preferably deactivated from) the iterative data processing operation "early" is a thread group of execution threads.
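When whole thread groups exit early, the point at which a group can leave depends on which threads remain active. The sketch below assumes, purely for illustration, a halving schedule in which the active threads in each iteration are the lowest-numbered ones, so that trailing thread groups become entirely idle. The function name `last_active_iteration_per_group` and the indexing scheme are assumptions, not details from the source.

```python
def last_active_iteration_per_group(num_threads, group_width):
    """For each thread group, return the last iteration in which it has an active thread."""
    if num_threads < 1 or num_threads & (num_threads - 1):
        raise ValueError("this sketch assumes a power-of-two thread count")
    if group_width < 1 or num_threads % group_width:
        raise ValueError("group_width must divide num_threads")
    last = [0] * (num_threads // group_width)
    active, iteration = num_threads, 0
    while active >= 1:
        groups_needed = -(-active // group_width)   # ceiling division
        for g in range(groups_needed):
            last[g] = iteration                     # group g still has work here
        active //= 2                                # assumed: threads 0..active-1 stay active
        iteration += 1
    return last
```

For 64 threads in groups of 16, the last two groups could exit after iteration 0, the second group after iteration 1, and only the first group is needed for the remaining iterations.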
[0085] The execution threads can be grouped into thread groups of any suitable and desired size. In a preferred embodiment, there is a fixed thread group size supported by the data processor. These thread groups may contain, for example, 4, 8, 16, or 32 threads (i.e., a “warp width” of 4, 8, 16, or 32). Wider thread groups (warps) are possible if desired.
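The mapping from a workgroup to fixed-width thread groups can be sketched as a simple partition. The function name `split_into_thread_groups` and the handling of a short final group are illustrative assumptions; how real hardware treats a workgroup whose size is not a multiple of the group width is not specified here.

```python
def split_into_thread_groups(work_item_ids, group_width):
    """Partition work-item IDs into consecutive thread groups of at most group_width."""
    if group_width < 1:
        raise ValueError("group_width must be positive")
    ids = list(work_item_ids)
    return [ids[i:i + group_width] for i in range(0, len(ids), group_width)]
```

For example, a workgroup of 64 work items with a group width of 16 yields four thread groups, each covering one subgroup of work items.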
[0086] To facilitate parallel execution, in one embodiment, the data processor (programmable execution unit) is configured as multiple execution channels, each operable to perform processing operations for execution threads. The number of execution channels (and therefore the number of execution threads that can be processed in parallel) is preferably equal to the number of threads in the thread group. However, other arrangements are possible.
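A minimal Python sketch of lockstep execution across execution channels follows. One instruction is applied to the register value of every enabled channel in the same step. The per-channel enable mask is an assumption introduced here to model channels whose threads are no longer active; it is not described in this paragraph, and `execute_lockstep` is a hypothetical name.

```python
def execute_lockstep(instruction, lane_registers, active_mask):
    """Apply one instruction to every enabled execution channel in one step.

    `active_mask` is an illustrative assumption: a disabled channel keeps its
    register value unchanged.
    """
    if len(lane_registers) != len(active_mask):
        raise ValueError("one mask bit is required per execution channel")
    return [
        instruction(value) if enabled else value
        for value, enabled in zip(lane_registers, active_mask)
    ]


registers = [1, 2, 3, 4]
mask = [True, True, False, False]
print(execute_lockstep(lambda x: x * 10, registers, mask))  # [10, 20, 3, 4]
```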
[0087] Therefore, in one implementation, the execution threads from the set of multiple execution threads are issued to the execution channels for execution, preferably in the normal manner for the data processor under consideration. The execution channels are then preferably used to execute the program (including the set of one or more instructions for performing the iterative data processing operation) for these execution threads, for example, and preferably, in the normal manner for the data processor.
[0088] In one implementation, the execution channels are provided by one or more functional units operable to perform data processing operations for instructions being executed by the execution threads. Each functional unit is preferably able to process as many threads in parallel as there are execution channels (so each functional unit will include a set of multiple execution channels).
[0089] The one or more functional units may include any desired and suitable functional units operable to perform data processing operations in response to and according to program instructions. Thus, in one embodiment, the one or more functional units include one or more or all of the following: arithmetic units (arithmetic logic units) (addition, subtraction, multiplication, division, etc.), bit manipulation units (inverting, swapping, shifting, etc.), logic operation units (AND, OR, NAND, NOR, NOT, XOR, etc.), load-type units (such as varying, texture, or load units in the case of a graphics processor), store-type units (such as blend or store units), etc. In one embodiment, these functional units (at least) include arithmetic units (i.e., units operable to perform arithmetic (mathematical) operations). Each execution channel in the data processor (programmable execution unit) may (also) have a set of one or more registers associated with it (and available for its use), and preferably a set of multiple registers, for storing data values associated with or used by the execution channel (i.e., for storing data values being processed by the execution thread currently executing on the execution channel).
[0090] The data processor (programmable execution unit) preferably also includes appropriate control circuitry (control logic components) for controlling the execution channels (functional units that operate as execution channels) so that they perform desired and appropriate processing operations. This may include any suitable and desired control circuitry, such as appropriate thread issuing circuitry and / or instruction decoding circuitry.
[0091] In one embodiment, the data (e.g., graphics) processing system includes a host processor operable to issue data processing commands and data to the data (e.g., graphics) processor. The host processor can be any suitable and desired processor, such as, in one embodiment, a central processing unit (CPU) of the data (e.g., graphics) processing system.
[0092] In one embodiment, the host processor of the data (e.g., graphics) processing system is operable to generate data (e.g., graphics) processing commands and data for the data (e.g., graphics) processor in response to instructions from an application executing on the host processor. This is accomplished in one embodiment by a driver for the data (e.g., graphics) processor executing on the host processor.
[0093] The program that includes the set of one or more instructions can be any suitable and desired (shader) program that can be executed by the data (e.g., graphics) processor (programmable execution unit).
[0094] It should be understood that, as well as including the set of one or more instructions for performing the iterative data processing operation in the manner of the invention, the program can, and preferably does, include any other suitable and desired instructions that can be executed by the data (e.g., graphics) processor. The (shader) program may contain only a single set of one or more instructions for performing a single instance of the iterative data processing operation, or multiple such sets of instructions may exist within the program.
[0095] The set (or multiple sets) of one or more instructions may be included in the (shader) program in any suitable and desired manner and by any suitable and desired parts and elements of the data processing system.
[0096] In one implementation, the (shader) program will initially be provided using a high-level (shader) programming language such as GLSL, HLSL, OpenCL, C, etc., for example, through an application that runs on a host processor that requires data processing operations.
[0097] The high-level (shader) program is then preferably translated by a (shader language) compiler into a binary code (shader) program, which includes instructions for execution by the data (e.g., graphics) processor. This compilation process, which converts, for example, shader language expressions into binary code instructions, can be performed via multiple intermediate representations of the program within the compiler. Thus, a program written in the high-level shader language can be translated into an intermediate representation specific to a particular compiler (and several successive intermediate representations may exist within that compiler), wherein the final intermediate representation is translated into these binary code instructions for the target data (e.g., graphics) processor.
[0098] In these arrangements, the compiler (processing circuitry) is preferably part of, and executes on, the host processor of the data processing system, and is preferably part of the driver for the data (e.g., graphics) processor that executes on the host processor. In this case, the compiler and the compiled code will run on separate processors within the overall data processing system. However, other arrangements are possible, such as the compiler running on the same processor as the compiled code, or the compiler running on a (completely) separate processor, such as the program being pre-compiled on a separate system and distributed in compiled form.
[0099] Therefore, the set of one or more instructions can be included in the (compiled) program by a compiler that compiles the program from a higher-level version of the program (application code). Correspondingly, in one embodiment, the processing circuitry that includes the set of instructions in the program is a compiler for the data (e.g., graphics) processor (programmable execution unit).
[0100] It should be understood that the present invention also extends to the operation of a compiler that compiles a program that includes a set of one or more instructions according to the present invention.
[0101] Therefore, according to a fifth aspect of the present invention, a method for compiling a program to be executed by a data processor is provided, the method comprising:
[0102] including in the program a set of one or more instructions which, when executed by the execution threads of a set of multiple execution threads, will cause those execution threads to perform an iterative data processing operation, wherein:
[0103] each iteration of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads each performing a corresponding data processing operation; and wherein
[0104] the first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads each performing a corresponding data processing operation;
[0105] each of one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration not performing a corresponding data processing operation; and
[0106] each of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation when there remains at least one iteration in respect of which that execution thread will not perform a corresponding data processing operation;
[0107] such that, when the execution threads in the set of multiple execution threads execute the set of instructions to perform the iterative data processing operation, at least one of the execution threads performing the iterative data processing operation will:
[0108] perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation;
[0109] perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the iterative data processing operation; and then
[0110] exit the iterative data processing operation when there remains at least one iteration in respect of which that execution thread will not perform a corresponding data processing operation.
[0111] According to a sixth aspect of the present invention, a compiler is provided for compiling a program to be executed by a data processor, the compiler comprising:
[0112] a processing circuit configured to include, in a program to be executed by the data processor, a set of one or more instructions which, when executed by the execution threads of a set of multiple execution threads, will cause those execution threads to perform an iterative data processing operation, wherein:
[0113] each iteration of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads each performing a corresponding data processing operation; and wherein
[0114] the first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads each performing a corresponding data processing operation;
[0115] each of one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration not performing a corresponding data processing operation; and
[0116] each of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation when there remains at least one iteration in respect of which that execution thread will not perform a corresponding data processing operation;
[0117] such that, when the execution threads in the set of multiple execution threads execute the set of instructions to perform the iterative data processing operation, at least one of the execution threads performing the iterative data processing operation will:
[0118] perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation;
[0119] perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the iterative data processing operation; and then
[0120] exit the iterative data processing operation when there remains at least one iteration in respect of which that execution thread will not perform a corresponding data processing operation.
[0121] Those skilled in the art will understand that these aspects and embodiments of the invention can, and in one embodiment do, include one or more, and in one embodiment all, of the features of the invention described herein.
[0122] In one implementation, the set of one or more instructions is included in the program by the compiler (processing circuitry) in response to a sequence of one or more instructions in the initially provided (high-level) application code. Preferably, the compiler (processing circuitry) is configured to recognize a specific sequence of one or more instructions in the (high-level) application code and to include the set of one or more instructions in the program in response to recognizing that sequence.
[0123] The instructions included in the program can, and preferably do, include instructions for causing the execution threads to perform the iterative data processing operation, and instructions for causing at least one of the execution threads to exit the iterative data processing operation "early" (and preferably to be deactivated) in the manner of the invention.
[0124] In a preferred embodiment, the instructions included in the program for causing the execution thread to perform the iterative data processing operation cause the execution thread to do so by executing a loop, wherein each iteration of the loop corresponds to an iteration of the iterative data processing operation. In this case, the instructions included in the program for causing the execution thread to exit the iterative data processing operation "early" preferably cause the execution thread to exit (e.g., "break" out of) the loop before the last iteration. The execution thread may exit the loop, for example, by being deactivated, or by "breaking" out of the loop but remaining active, such that the execution thread can then continue beyond the loop to perform, for example, some other processing. Thus, the at least one execution thread may exit the iterative data processing operation and remain active to perform some other processing, or may exit the iterative data processing operation by being deactivated.
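The loop-based form described above can be sketched from the point of view of a single thread. The Python model below assumes, for illustration only, that the number of working threads halves on each iteration and that a thread works in an iteration when its ID is below that number; `run_thread` and the state labels are hypothetical.

```python
def run_thread(tid, num_threads, continue_after_exit):
    """Model one thread's path through the loop and what it does afterwards.

    Returns (iterations in which the thread did work, state after the loop).
    Assumption: thread `tid` works in an iteration only if tid < active,
    where `active` starts at num_threads and halves each iteration.
    """
    performed = []
    active = num_threads
    iteration = 0
    while active >= 1:
        if tid >= active:
            break  # leave the loop while at least one iteration remains
        performed.append(iteration)
        active //= 2
        iteration += 1
    exited_early = active >= 1  # True only if the loop was left via `break`
    if exited_early and not continue_after_exit:
        return performed, "deactivated"
    return performed, "active"  # free to run code placed after the loop


print(run_thread(7, 8, continue_after_exit=False))  # ([0], 'deactivated')
print(run_thread(3, 8, continue_after_exit=True))   # ([0, 1], 'active')
print(run_thread(0, 8, continue_after_exit=False))  # ([0, 1, 2, 3], 'active')
```

Thread 0 never exits early in this model because it takes part in every iteration, including the final one.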
[0125] Therefore, in one embodiment, the at least one execution thread exits the iterative data processing operation and performs other processing when at least one iteration of the iterative data processing operation remains, and the execution thread will not perform the corresponding data processing operation relative to the at least one iteration. In another embodiment, the at least one execution thread is deactivated when at least one iteration of the iterative data processing operation remains, and the execution thread will not perform the corresponding data processing operation relative to the at least one iteration.
[0126] The programmer may include an appropriate sequence of instructions in the (high-level) application code, in response to which the compiler (processing circuitry) then includes, in the (compiled) program, a set of one or more instructions for performing the iterative data processing operation and for triggering a thread to exit (e.g., to be deactivated) in the manner of the present invention.
[0127] In one implementation, a specific instruction for causing a thread to exit or be deactivated "early" in the manner of the invention is exposed through the application programming interface (API), such that a programmer can explicitly include an "early exit" or "early deactivate" instruction in the (high-level) application code, wherein the compiler then responds by including one or more corresponding instructions in the (compiled) program (in the set of one or more instructions) to trigger the execution thread to exit or be deactivated "early" in the manner of the invention.
[0128] Therefore, in one implementation, the execution thread exits (e.g., is deactivated) "early" in the manner of the invention in response to an instruction (or instructions) included in the program, preferably an instruction that is exposed through the application programming interface (API) for the program.
[0129] In one implementation, the compiler (processing circuitry) is operable to automatically (of its own accord) include one or more instructions in the (compiled) program (in the set of one or more instructions) to trigger the execution thread to exit (e.g., to be deactivated) "early" in the manner of the present invention, i.e., without such instructions having been explicitly included in the application code (e.g., by the programmer).
[0130] Therefore, in one implementation, the compiler (processing circuitry) for the program automatically inserts an instruction (or instructions) into the (compiled) program (into the set of one or more instructions) to trigger the thread to exit (e.g., to be deactivated) "early" in the manner of the present invention.
[0131] The compiler can automatically include instructions in the (compiled) program in any suitable and desired manner to trigger threads to exit "early" (e.g., deactivate) in the manner of the present invention. For example, the compiler can be configured to identify opportunities to insert "early exit" or "early deactivate" instructions into the program, for example, by analyzing the application code to identify one or more specific steps in the program code that can be executed in the manner of the present invention, when compiling the application code.
[0132] In one implementation, for example, a compiler may be operable to: determine whether a program (operation) (based on application code) enables one or more execution threads to exit or deactivate (“early”) without affecting the iterative data processing operation and / or the program’s output, and if so, preferably determine a set of one or more conditions that would be satisfied when the execution thread executing the program exits or deactivates (“early”) without affecting the iterative data processing operation and / or the program’s output (and these conditions would not be satisfied if the exit or deactivation of the execution thread would affect the iterative data processing operation and / or the program’s output); and include instructions in the program such that the execution thread executing the program exits or deactivates (“early”) when the set of one or more “early exit” or “early deactivation” conditions is satisfied.
[0133] The set of one or more "early exit" or "early deactivation" conditions identified may be satisfied, for example, by the executing thread when the executing thread will no longer contribute to the iterative data processing operation and / or the program's output (and these conditions are not satisfied when the executing thread would contribute to the iterative data processing operation and / or the program's output).
[0134] The compiler may, for example, be configured in one embodiment to recognize that an execution thread may exit (e.g., deactivate) (“early”) an iterative data processing operation when the execution thread will not perform further data processing operations relative to the iterative data processing operation, without affecting the output of the iterative data processing operation. The compiler may, for example, be configured in one embodiment to recognize that an execution thread may deactivate (“early”) a program when the execution thread will not perform further data processing operations relative to the iterative data processing operation and / or other processing operations (such as memory access operations and / or cross-thread operations) that may affect the program's output, without affecting the program's output. Therefore, in one embodiment, the set of one or more “early exit” or “early deactivate” conditions is based on whether the execution thread will perform the corresponding data processing operation relative to one or more remaining iterations of the iterative data processing operation, and / or whether the execution thread will perform any other operations (such as memory access operations and / or cross-thread operations) that may affect the program's output.
[0135] In a preferred embodiment, as detailed below, the program includes one or more barrier instructions that cause the execution threads to perform the iterations of the iterative data processing operation (e.g., the loop) synchronously. In this case, the compiler is preferably operable to identify when an execution thread can omit executing one or more of the barrier instructions without affecting the iterative data processing operation and / or the program's output, and to include instructions in the program such that an execution thread executing the program exits the iterative data processing operation (e.g., is deactivated) ("early") before executing those "omittable" one or more barrier instructions.
[0136] In one implementation, the compiler can recognize that an execution thread should exit the iterative data processing operation to perform other processing "early" only if that other processing does not require synchronization with other execution threads (e.g., if the other processing does not include barrier operations).
[0137] The instruction (or instructions) included in the (compiled) program can cause the execution thread to exit (e.g., to be deactivated) "early" in any suitable and desired manner. In one embodiment, these instructions cause the execution thread to exit the iterative data processing operation ("early") by causing it to skip the remaining instructions associated with the iterative data processing operation, so that it does not participate in any further iterations of the iterative data processing operation and / or does not execute any further barrier instructions, for example skipping the last barrier instruction for the iterative data processing operation (loop).
[0138] In embodiments where instructions cause the execution thread to exit (“early”) and perform other processing, these instructions preferably cause the execution thread to jump to the beginning of a set of one or more instructions for performing the other processing.
[0139] In embodiments where the instructions cause the execution thread to be deactivated "early", these instructions preferably cause the execution thread to jump to the end of the program, so that it does not participate in any further iterations of the iterative data processing operation and / or execute any further barrier instructions, and is thus deactivated. Alternatively, these instructions may include a "discard" instruction that triggers the execution thread to terminate (directly).
[0140] In one implementation, an execution thread conditionally executes an instruction (e.g., a "discard" or "jump" instruction) to trigger "early exit" or "early deactivation," wherein the condition is preferably based on an identifier associated with the thread (thread ID), an identifier associated with the execution channel (execution channel ID), or a variable such as an atomic counter. For example, in one implementation, a conditional branch is included in the program such that one or more execution threads associated with a specific identifier or set of identifiers follow the conditional branch and thus exit the iterative data processing operation (e.g., are deactivated) ("early").
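A thread-ID-based branch condition of this kind can be sketched in Python as follows. The specific predicate (exit when the thread ID is not below the number of threads needed in the next iteration) and the halving schedule are assumptions chosen for illustration; `takes_exit_branch` and `exiting_threads` are hypothetical names.

```python
def takes_exit_branch(thread_id, active_next):
    """Illustrative branch condition evaluated from the thread ID alone."""
    return thread_id >= active_next


def exiting_threads(num_threads, iteration):
    """Thread IDs that follow the exit branch after the given iteration.

    Assumption: num_threads >> iteration threads work in `iteration`, and
    half as many work in the next one.
    """
    active_now = num_threads >> iteration
    active_next = active_now // 2
    if active_next == 0:
        return []  # final iteration: nothing remains, so no exit is "early"
    return [t for t in range(active_now) if takes_exit_branch(t, active_next)]


for it in range(4):
    print(it, exiting_threads(8, it))
# 0 [4, 5, 6, 7]
# 1 [2, 3]
# 2 [1]
# 3 []
```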
[0141] In a preferred embodiment, the execution thread is caused to execute (e.g., "discard" or "jump") an instruction to trigger "early exit" or "early deactivation" when the execution thread satisfies a set of one or more "early exit" or "early deactivation" conditions (determined by the compiler). In one embodiment, where multiple distinct "early exit" or "early deactivation" conditions exist and these conditions are satisfied separately when the execution thread can "early" exit the iterative data processing operation or deactivate, each such "early exit" or "early deactivation" condition is combined using an AND operation to determine the overall "early exit" or "early deactivation" condition that will be satisfied when the execution thread can "early" exit or deactivate.
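The AND combination of separate conditions can be sketched as below. The three per-thread facts used here (remaining iteration work, later memory accesses, later cross-thread operations) follow the kinds of conditions discussed earlier in the description, but the `ThreadOutlook` structure and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadOutlook:
    """Hypothetical summary of what a thread would still do if it kept running."""
    remaining_iteration_work: bool  # would it still combine values in a later iteration?
    later_memory_access: bool       # would it still read or write memory that matters?
    later_cross_thread_op: bool     # would it still take part in a cross-thread operation?


def can_deactivate_early(outlook):
    """Overall condition: every individual condition must hold (logical AND)."""
    conditions = [
        not outlook.remaining_iteration_work,
        not outlook.later_memory_access,
        not outlook.later_cross_thread_op,
    ]
    return all(conditions)


print(can_deactivate_early(ThreadOutlook(False, False, False)))  # True
print(can_deactivate_early(ThreadOutlook(False, True, False)))   # False
```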
[0142] Therefore, in a preferred embodiment, the program includes one or more instructions that cause an execution thread executing the program to exit the iterative data processing operation (e.g., to be deactivated) ("early") if the processing that the execution thread would otherwise perform (were it to continue participating in the iterative data processing operation and / or executing the program) would not affect the output of the iterative data processing operation and / or the program. Correspondingly, in one embodiment, an execution thread executing the program exits the iterative data processing operation (e.g., is deactivated) in response to those one or more instructions under the same condition.
[0143] In one implementation, the one or more instructions included in the program cause the execution thread executing the program to exit the iterative data processing operation (e.g., to be deactivated) ("early") before executing one or more ("omittable") barrier instructions that the execution thread would otherwise execute were it to continue participating in the iterative data processing operation and / or executing the program.
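To give a rough sense of how many barrier instructions become "omittable", the Python sketch below counts barrier executions with and without early exit. It rests on several assumptions that are not taken from the description: one barrier sits between each pair of consecutive iterations, the number of working threads halves each iteration, and a thread that exits skips every later barrier. `barriers_executed` is a hypothetical name.

```python
def barriers_executed(num_threads, early_exit):
    """Total barrier instructions executed across all threads.

    Assumptions: a barrier separates consecutive iterations, and with early
    exit only the threads needed for the next iteration execute it.
    """
    total = 0
    active_next = num_threads // 2
    while active_next >= 1:
        total += active_next if early_exit else num_threads
        active_next //= 2
    return total


print(barriers_executed(8, early_exit=False))  # 24
print(barriers_executed(8, early_exit=True))   # 7
```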
[0144] Once the set of one or more instructions has been included in the program, those instructions will be executed by the data processor (programmable execution unit) using multiple execution threads, wherein the execution threads will operate together in response to the instructions to perform iterative data processing operations in the manner of the present invention.
[0145] In particular, in this invention, a set of multiple execution threads performs an iterative data processing operation, which includes multiple iterations, including a first iteration and one or more subsequent iterations (including a final iteration). Preferably, the first iteration of the iterative data processing operation includes an execution thread that processes the initial data values of the iterative data processing operation to generate intermediate data values, and the final iteration of the iterative data processing operation preferably includes one or more execution threads that process the intermediate data values to generate the overall output of the iterative data processing operation. Each subsequent iteration of the iterative data processing operation, excluding the final iteration, preferably includes one or more execution threads that process intermediate data values to generate intermediate data values.
[0146] The first iteration of the iterative data processing operation includes all execution threads in the set of multiple execution threads, each executing a corresponding data processing operation (for example, processing a corresponding set of one or more initial data values). Each iteration in one or more subsequent iterations of the iterative data processing operation includes (only) a subset (i.e., only some but not all) of the execution threads that performed the corresponding data processing operation in the (immediately adjacent) previous iteration, each of which executes a corresponding data processing operation (for example, processing a corresponding set of one or more intermediate data values).
[0147] Therefore, the number of execution threads performing the corresponding data processing operations decreases from one iteration of the iterative data processing operation to the next, and the number of execution threads performing the corresponding data processing operations relative to the last iteration of the iterative data processing operation is less than the number of the set of execution threads (which perform the corresponding data processing operations relative to the first iteration of the iterative data processing operation).
[0148] The number of execution threads performing the corresponding data processing operations can decrease from one iteration to the next, and remain the same from another different iteration to the next. However, in a preferred embodiment, the number of execution threads performing the corresponding data processing operations decreases from each iteration of the iterative data processing operation to the next. That is, each subsequent iteration of the iterative data processing operation preferably includes fewer execution threads in the set of multiple execution threads performing the corresponding data processing operations than the (immediately adjacent) previous iteration. However, it should be understood that there may be subsequent iterations of the iterative data processing operation in which one or more of the same execution threads that performed the corresponding data processing operations in the previous iteration each perform the corresponding data processing operation.
[0149] The number of execution threads performing the corresponding data processing operations can be reduced by any suitable amount from one iteration to the next. The number of execution threads can be reduced by a specific number from one iteration of the data processing operation to the next, such as by one, two, or four execution threads. In one implementation, the number of execution threads is reduced by a specific fraction from one iteration of the data processing operation to the next, such as by a quarter or a half.
[0150] In a preferred embodiment, each subsequent iteration of the iterative data processing operation includes (only) half of the execution threads that performed the corresponding data processing operation in the (immediately adjacent) previous iteration, each of these half of the execution threads performing the corresponding data processing operation (and the other half of the execution threads not performing the corresponding data processing operation).
[0151] In these implementations, the set of multiple execution threads (initiating the iterative data processing operation) preferably includes a power of two number of execution threads (such as 64, 128, or 256), such that each iteration includes (preferably decreasing) a power of two number of execution threads performing the corresponding data processing operation. Correspondingly, the set of multiple processing items being processed (workgroup) preferably includes a power of two number of processing items (work items), such as 64, 128, or 256 processing items (work items). However, other numbers of execution threads and processing items are possible.
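For the halving arrangement with a power-of-two thread count, the number of threads working in each iteration can be listed with a short Python sketch. The helper names are hypothetical, and the schedule simply applies the halving rule until a single thread remains.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def active_counts(num_threads):
    """Number of working threads in each iteration under the halving rule."""
    if not is_power_of_two(num_threads):
        raise ValueError("this sketch only handles power-of-two thread counts")
    counts = []
    active = num_threads
    while active >= 1:
        counts.append(active)
        active //= 2
    return counts


print(active_counts(64))  # [64, 32, 16, 8, 4, 2, 1]
```

Each entry is itself a power of two, and the final iteration is performed by a single thread.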
[0152] The iterative data processing operation can be any suitable processing operation that can be executed multiple times, wherein the number of execution threads performing the corresponding data processing operation decreases from one iteration to the next. The iterative data processing operation can be (e.g., in one embodiment) a prefix summation operation (e.g., in one embodiment, when generating a graphical mipmap).
[0153] In a preferred embodiment, as described above, the iterative data processing operation is a reduction operation. In this case, the overall iterative data processing operation will preferably combine a set of initial data values (for a set of multiple processing items (workgroup)) to produce a single combined overall output data value (for that set of multiple processing items (workgroup)), and will preferably be executed using multiple iterations, wherein each iteration preferably involves one or more execution threads, each of which performs a corresponding combining operation.
[0154] Reduction operations may include, for example, in one implementation, an addition (summation) reduction operation (i.e., adding the (initial) data values of all threads in a set of threads (for a set of multiple processing items (workgroup)) to each other), or a multiplication (product) reduction operation (i.e., multiplying the (initial) data values of all threads in a set of threads (for a set of multiple processing items (workgroup)) by each other). A reduction operation may also be a maximum value operation (i.e., determining the maximum of the (initial) data values of all threads in a set of threads (for a set of multiple processing items (workgroup))), or a minimum value operation (i.e., determining the minimum of the (initial) data values of all threads in a set of threads (for a set of multiple processing items (workgroup))). It may also be a bitwise operation, such as AND, OR, or XOR, or any other commutative and associative binary operation.
[0155] The data processing operations performed by the execution thread should correspond to the overall iterative data processing operations under consideration, relative to the iteration of this iterative data processing operation. These data processing operations are preferably all of the same type, but each data processing operation is preferably executed for its own corresponding input dataset and produces its own corresponding output from the input data.
[0156] Therefore, each data processing operation preferably includes an execution thread processing a set of one or more input (e.g., initial or intermediate) data values to produce a set of one or more output (e.g., intermediate or overall output) data values. Preferably, each data processing operation includes an execution thread reading a set of one or more input data values from a storage device (e.g., a memory), processing the set of one or more input data values to generate a set of one or more output data values, and then writing the set of one or more output data values into the storage device (e.g., a memory). Preferably, the number of output data values is less than the number of input data values. For example, each data processing (e.g., combination) operation preferably includes an execution thread processing (e.g., combining) two input data values to produce a single (e.g., combined) output data value. The output data value from one iteration preferably becomes the input data value for the next iteration.
[0157] For addition reduction operations, each data processing operation will preferably determine the sum of (preferably two) data values. Correspondingly, for multiplication reduction operations, each data processing operation will preferably determine the product of (preferably two) data values. For maximum value operations, each data processing operation will preferably determine the maximum of (preferably two) data values. Correspondingly, for minimum value operations, each data processing operation will preferably determine the minimum of (preferably two) data values. For other binary operations, such as the bitwise operations AND, OR, and XOR, each data processing operation will preferably (bitwise) combine (preferably two) data values accordingly.
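The pairwise combining described above can be sketched as follows. This is a sequential illustration only, not the processor's implementation: the inner loop stands in for the execution threads of one iteration, the function name `tree_reduce` is hypothetical, and the input length is assumed to be a power of two.

```python
import operator


def tree_reduce(values, combine):
    """Reduce `values` with a commutative, associative `combine`, halving the work each iteration."""
    data = list(values)
    n = len(data)
    # Assumption of this sketch: a non-empty, power-of-two number of input values.
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of values")
    stride = n // 2
    while stride >= 1:
        # One iteration: "thread" t reads two input values and writes one output value.
        for t in range(stride):
            data[t] = combine(data[t], data[t + stride])
        # The outputs of this iteration are the inputs of the next one.
        stride //= 2
    return data[0]


print(tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], operator.add))  # 31
```

The same function covers the other reductions mentioned above by passing `operator.mul`, `max`, `min`, `operator.and_`, `operator.or_`, or `operator.xor` as `combine`.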
[0158] The overall output generated by performing this iterative data processing operation can be used as needed. For example, the result can be output, such as to external memory, and / or it can be provided for and used by additional instructions in the executed program. Of course, other arrangements are possible. In all cases, the overall output of this iterative data processing operation can be used by the data processing system to generate output. The generated output can be any suitable and desired output, such as rendered output for a collection of data values (such as a data array) or other information (such as metadata).
[0159] As described above, in this invention, an iterative data processing operation (such as a reduction operation) is performed such that at least one execution thread exits the iterative data processing operation (e.g., is deactivated) while at least one iteration remains in respect of which that execution thread will not perform the corresponding data processing operation. In other words, at least one execution thread in the set of a plurality of execution threads exits the iterative data processing operation (e.g., is deactivated) before the start of at least the last iteration of the iterative data processing operation.
[0160] Preferably, where possible and appropriate, each of the plurality of execution threads in the set performing the iterative data processing operation exits the iterative data processing operation (e.g., is deactivated) "early" in this way. Thus, in a preferred embodiment, all execution threads in the set of plurality of execution threads (except those performing the corresponding data processing operation in the final iteration) exit the iterative data processing operation (e.g., are deactivated) before the start of the final iteration. This arrangement preferably causes different execution threads in the set to exit the iterative data processing operation (e.g., be deactivated) after participating in different numbers of iterations.
[0161] In a preferred embodiment, as described above, (preferably each) execution thread exits the iterative data processing operation (e.g., is deactivated) before performing any processing that would not affect the output of the iterative data processing operation and / or of the program (e.g., as determined by the compiler). Therefore, in a preferred embodiment, (preferably each) execution thread exits the iterative data processing operation (e.g., is deactivated) before the start of any iteration in respect of which that execution thread will not perform the corresponding data processing operation.
[0162] Therefore, this arrangement preferably causes (preferably each) execution thread to exit the iterative data processing operation (e.g., be deactivated): after the first iteration, and any (zero or more) subsequent iterations, in respect of which the execution thread performs the corresponding data processing operation; and before any iteration in respect of which the execution thread will not perform the corresponding data processing operation. Correspondingly, and as described above, when barrier instructions are used to synchronize iterations, (preferably each) execution thread exits the iterative data processing operation (e.g., is deactivated) before executing any "omittable" barrier instructions.
[0163] Therefore, where each subsequent iteration of the iterative data processing operation includes fewer execution threads performing the corresponding data processing operation than the (immediately) preceding iteration, each (i.e., the first and each subsequent) iteration (optionally except the last iteration) preferably includes at least one execution thread from the set of execution threads exiting the iterative data processing operation (e.g., being deactivated) (and optionally one or more other execution threads from the set connecting the barrier). However, it should be understood that there may be iterations of the iterative data processing operation in which no execution thread exits the iterative data processing operation (e.g., is deactivated).
[0164] Therefore, the set of multiple execution threads preferably performs the iterative data processing operation such that each execution thread in the set exits the iterative data processing operation (e.g., is deactivated) before the start of any (zero or more) iterations in respect of which that execution thread will not perform the corresponding data processing operation.
[0165] It should be understood that an execution thread that performs the corresponding data processing operation in the last iteration of the iterative data processing operation should, and preferably does, exit the iterative data processing operation (and deactivate) after the last iteration has been completed.
[0166] Therefore, in the set of multiple execution threads performing the iterative data processing operation, at least one execution thread should perform the corresponding data processing operation in each iteration, and then exit the iterative data processing operation (and deactivate) after all iterations have been completed, and at least one other execution thread should perform the corresponding data processing operation in at least the first iteration, and then exit the iterative data processing operation (e.g., be deactivated) before all of the iterations have begun. The latter at least one other execution thread can be, for example: at least one execution thread that performs the corresponding data processing operation only in the first iteration, and then exits (e.g., is deactivated); and / or at least one execution thread that performs the corresponding data processing operation only in the first iteration and some, but not all, subsequent iterations, and then exits (e.g., is deactivated); and / or at least one execution thread that performs the corresponding data processing operation in all iterations except the last iteration, and then exits (e.g., is deactivated).
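To make the different exit points concrete, the sketch below computes how many iterations each thread takes part in before exiting. It rests on two assumptions that go beyond the text: the thread count halves each iteration, and the lowest-indexed threads are the ones that stay active. The function name `iterations_participated` is hypothetical.

```python
def iterations_participated(thread_index: int, first_iteration_threads: int) -> int:
    """Number of iterations in which a thread performs an operation before it exits."""
    n = first_iteration_threads
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two thread count")
    if not 0 <= thread_index < n:
        raise ValueError("thread_index out of range")
    count = 0
    active = n  # threads 0 .. active-1 perform an operation in the current iteration
    while active >= 1 and thread_index < active:
        count += 1
        active //= 2
    return count


print([iterations_participated(t, 8) for t in range(8)])  # [4, 3, 2, 2, 1, 1, 1, 1]
```

For eight threads, threads 4 to 7 take part only in the first iteration, threads 2 and 3 in the first and one later iteration, thread 1 in all iterations but the last, and thread 0 in every iteration.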
[0167] The applicant has recognized that the effect of an execution thread exiting the iterative data processing operation is that the execution thread can then go on to perform other available processing. Similarly, the effect of an execution thread being deactivated is that another execution thread can then be issued to, and processed by, the execution channel that was processing the deactivated execution thread.
[0168] Therefore, in one implementation, as already mentioned, when an execution thread in a set of multiple execution threads performing an iterative data processing operation exits the iterative data processing operation (“early”), that execution thread then performs other processing during at least one remaining iteration of the iterative data processing operation. In this case, the other processing is preferably processing that can be executed asynchronously with the other execution threads in the set of multiple execution threads.
[0169] In a preferred embodiment, once an execution thread has been deactivated, another execution thread is issued for execution to the execution channel that was processing the deactivated execution thread. This arrangement preferably ensures that, concurrently with at least the last iteration of the iterative data processing operation, at least one execution channel that previously processed an execution thread deactivated from the iterative data processing operation is processing another execution thread for another data processing operation.
[0170] Therefore, in a preferred embodiment, when an execution thread in the set of multiple execution threads performing the iterative data processing operation is deactivated, a new execution thread is issued to the execution channel that was previously processing the deactivated execution thread, for performing another data processing operation during at least one remaining iteration of the iterative data processing operation.
[0171] In this context, the other data processing operation executed in parallel with the iterative data processing operation in this manner can be any suitable and desired processing operation, such as, in one embodiment, another instance of the iterative data processing operation, for example and preferably for another set (workgroup) of multiple processing items.
[0172] In fact, an advantage of this invention is that an execution thread can exit the iterative data processing operation (e.g., be deactivated) earlier than it otherwise would, making the execution thread and its execution channel available earlier than they otherwise would be, and so allowing other processing to begin earlier than it otherwise would. This in turn enables higher efficiency and greater parallelism.
[0173] It should be understood that, in this invention, execution threads may process data values generated by other execution threads in a previous iteration. To account for this, the iterations of the iterative data processing operation are preferably synchronized to ensure that all data values to be generated in one iteration have been generated before the start of the next iteration, so that the correct data values are then used for processing in the next iteration.
[0174] Therefore, each subsequent iteration of this iterative data processing operation should not, and preferably does not, begin before the previous iteration has been completed. This can be achieved by synchronizing threads with each other in any suitable and desired manner.
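The reuse of execution channels described in the preceding paragraphs can be pictured with a small simulation. It is a sketch under stated assumptions rather than a model of any real scheduler: the thread count halves each iteration, the lowest-indexed threads stay active, one thread occupies one channel, and the names `simulate_lane_reuse`, `reduce-N`, and `other-N` are all hypothetical.

```python
from collections import deque


def simulate_lane_reuse(first_iteration_threads, pending_threads):
    """Return what each execution channel is running when the final iteration starts."""
    n = first_iteration_threads
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two thread count")
    pending = deque(pending_threads)            # threads waiting to be issued
    lanes = [f"reduce-{t}" for t in range(n)]   # channel index -> thread occupying it
    active = n
    while active > 1:
        next_active = active // 2
        # Threads next_active .. active-1 have no work in later iterations, so they
        # are deactivated and their channels receive pending threads (if any remain).
        for lane in range(next_active, active):
            lanes[lane] = pending.popleft() if pending else None
        active = next_active
    return lanes


print(simulate_lane_reuse(4, ["other-0", "other-1", "other-2", "other-3"]))
# ['reduce-0', 'other-2', 'other-0', 'other-1']
```

In this example, by the time the last iteration runs on channel 0, the three other channels are already processing threads for another operation, for example for another workgroup.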
[0175] In a preferred embodiment, each subsequent iteration of the iterative data processing operation can only begin when a condition is met, wherein the condition is preferably met only once all participating (e.g., active, not deactivated) execution threads in the set of multiple execution threads (i.e., all execution threads in the set of multiple execution threads except those that have exited the iterative data processing operation (e.g., deactivated)) have completed the previous iteration (for the set of multiple processing items (workgroup)). Therefore, each subsequent iteration of the iterative data processing operation is preferably executed (started) in response to the condition being met.
[0176] This can be implemented as desired. In a preferred embodiment, as already mentioned, the arrangement is such that each iteration of the iterative data processing operation involves execution threads connecting “barriers,” wherein an execution thread that has connected a barrier is released from the barrier only once all participating (e.g., active) execution threads in the set of multiple execution threads (for the set of processing items (workgroup)) have connected the barrier. The released execution thread can then begin the next iteration of the iterative data processing operation, and can then connect a barrier relative to the next iteration, and so on. In this case, an execution thread can connect the barrier in response to executing a “barrier” instruction in the program (in its set of one or more instructions).
[0177] Therefore, in one implementation, the condition for starting a subsequent iteration of the iterative data processing operation is met when all participating (e.g., active) execution threads in the set of multiple execution threads have connected the barrier (relative to the previous iteration of the iterative data processing operation). In one implementation, this arrangement causes each iteration of the iterative data processing operation to involve each participating (e.g., active) execution thread performing a corresponding data processing operation and subsequently exiting the iterative data processing operation (e.g., deactivating) or connecting the barrier.
[0178] In a preferred embodiment, to account for execution threads exiting the iterative data processing operation (e.g., being deactivated) "early" in the manner of the invention, the condition (e.g., the barrier) that must be met before the next iteration can begin is changed from one iteration of the iterative data processing operation to the next. Preferably, the condition is changed when an execution thread exits the iterative data processing operation (e.g., is deactivated) "early" in the manner of the invention, and preferably in response to that "early" exit (e.g., deactivation).
[0179] For example, in one implementation, the condition is satisfied only when a specific number of execution threads in the set of multiple execution threads have connected the barrier, and preferably, the number of execution threads required to connect the barrier for the condition to be satisfied is changed when (and preferably in response to) an execution thread in the set exiting the iterative data processing operation (e.g., being deactivated) (while at least one iteration of the iterative data processing operation remains). This then allows the barrier operation to account for the “early” exit (e.g., deactivation) of execution threads.
[0180] It is believed that the idea of altering a (e.g., barrier) condition in response to an execution thread exiting an iterative data processing operation (e.g., being deactivated) “early” may be novel and inventive in itself.
[0181] Therefore, according to another aspect of the invention, a method for operating a data processor is provided, the data processor being operable to execute a program to perform data processing operations, wherein the program can be executed simultaneously by multiple execution threads; the method includes:
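A software analogue of a barrier whose required arrival count shrinks as threads exit is sketched below, using host threads from Python's standard `threading` module in place of GPU execution threads. The class `ShrinkingBarrier`, its `wait` and `leave` methods, and the function `parallel_sum` are hypothetical names for illustration; the sketch assumes 2n input values for n threads, with n a power of two, so that every thread performs an operation in the first iteration.

```python
import threading


class ShrinkingBarrier:
    """Barrier whose required number of arrivals drops when a participant leaves."""

    def __init__(self, parties: int) -> None:
        self._cond = threading.Condition()
        self._parties = parties   # threads that have not yet exited
        self._arrived = 0         # threads waiting at the current barrier
        self._generation = 0      # incremented each time the barrier releases

    def _release_if_complete(self) -> None:
        # Caller must hold self._cond.
        if self._parties > 0 and self._arrived == self._parties:
            self._arrived = 0
            self._generation += 1
            self._cond.notify_all()

    def wait(self) -> None:
        with self._cond:
            generation = self._generation
            self._arrived += 1
            self._release_if_complete()
            while generation == self._generation:
                self._cond.wait()

    def leave(self) -> None:
        # Early exit: lower the required count instead of arriving at the barrier.
        with self._cond:
            self._parties -= 1
            self._release_if_complete()


def parallel_sum(values: list[int]) -> int:
    n = len(values) // 2
    if n < 1 or n & (n - 1) or len(values) != 2 * n:
        raise ValueError("this sketch assumes 2n values with n a power of two")
    data = list(values)
    barrier = ShrinkingBarrier(n)

    def worker(t: int) -> None:
        stride = n
        while True:
            data[t] += data[t + stride]   # this thread's operation for the iteration
            stride //= 2
            if stride == 0:
                return                    # final iteration completed
            if t >= stride:
                barrier.leave()           # no work left: exit early
                return
            barrier.wait()                # synchronize before the next iteration

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return data[0]


print(parallel_sum(list(range(16))))  # 120
```

Because an exiting thread lowers the required count rather than arriving, the remaining threads are released as soon as every thread has either arrived or left, and no thread waits at a barrier for an iteration in which it has no work.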
[0182] When a set of multiple execution threads is executing a program to perform an iterative data processing operation including a first iteration and one or more subsequent iterations, wherein each subsequent iteration of the iterative data processing operation is executed in response to a condition being satisfied, and wherein an execution thread in the set of multiple execution threads may exit the iterative data processing operation when at least one iteration of the iterative data processing operation remains:
[0183] changing the condition when one of the execution threads in the set of the plurality of execution threads performing the iterative data processing operation exits the iterative data processing operation while at least one iteration of the iterative data processing operation remains.
[0184] According to another aspect of the present invention, a data processor is provided, the data processor comprising:
[0185] A programmable execution unit, operable to execute a program to perform data processing operations, wherein the program can be executed concurrently by multiple execution threads; and
[0186] A processing circuit configured to, when a set of multiple execution threads is executing a program to perform an iterative data processing operation including a first iteration and one or more subsequent iterations, wherein each subsequent iteration of the iterative data processing operation is executed in response to a condition being satisfied, and wherein an execution thread in the set of multiple execution threads may exit the iterative data processing operation when at least one iteration of the iterative data processing operation remains:
[0187] change the condition when one of the execution threads in the set of the plurality of execution threads performing the iterative data processing operation exits the iterative data processing operation while at least one iteration of the iterative data processing operation remains.
[0188] Those skilled in the art will understand that these aspects and embodiments of the invention can, and in one embodiment do, include any one or more or all of the features of the invention described herein, as appropriate. For example, the condition for performing a subsequent iteration is preferably satisfied when all participating (e.g., active) execution threads have connected the barrier (as described above). In one embodiment, the condition is changed when an execution thread is deactivated, and preferably in response to the deactivation of the execution thread.
[0189] The method preferably includes (and the data processor is configured accordingly):
[0190] performing an iteration of the iterative data processing operation;
[0191] determining whether the condition is met; and
[0192] when it is determined that the condition is met:
[0193] performing the next iteration of the iterative data processing operation;
[0194] an execution thread in the set of multiple execution threads exiting the iterative data processing operation and optionally being deactivated;
[0195] changing the condition (in response to the execution thread exiting the iterative data processing operation or being deactivated);
[0196] determining whether the changed condition is met; and
[0197] when it is determined that the changed condition is met:
[0198] performing another iteration of the iterative data processing operation.
[0199] The (e.g., barrier) condition can be changed when each individual execution thread exits the iterative data processing operation (e.g., is deactivated). However, in one implementation, where threads are grouped into thread groups, the (e.g., barrier) condition is changed when the last thread in a thread group exits the iterative data processing operation (e.g., is deactivated), i.e., such that all threads in the thread group must exit (e.g., be deactivated) before the (e.g., barrier) condition is changed.
[0200] The (e.g., barrier) condition can be changed in any suitable and desired manner when a thread (group) exits the iterative data processing operation (e.g., is deactivated) (“early”).
[0201] There could be a specific "change condition" instruction, visible to the application programming interface (API), that is included in the program to cause the (e.g., barrier) condition to change. In that case, the (e.g., barrier) condition is changed in response to an API-visible instruction included in the program. In a preferred embodiment, however, the data processor is configured to monitor for execution thread deactivation and to change the (e.g., barrier) condition in response to execution thread deactivation. Therefore, in a preferred embodiment, the condition is changed in response to execution thread deactivation.
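The thread-group variant can be illustrated with a small bookkeeping sketch in which the condition changes only when the last thread of a group exits. The class name `GroupExitTracker`, the attribute `required_groups`, and the assumption that thread indices map to groups by integer division are hypothetical choices made for this example.

```python
class GroupExitTracker:
    """Tracks thread exits and changes the condition only when a whole thread group has exited."""

    def __init__(self, num_threads: int, group_size: int) -> None:
        if group_size < 1 or num_threads % group_size:
            raise ValueError("this sketch assumes equally sized thread groups")
        self._group_size = group_size
        self._remaining = [group_size] * (num_threads // group_size)
        # Number of thread groups that must reach the barrier before the next iteration.
        self.required_groups = len(self._remaining)

    def thread_exit(self, thread_index: int) -> bool:
        """Record an exit; return True if it changed the condition."""
        group = thread_index // self._group_size
        if self._remaining[group] == 0:
            raise ValueError("all threads in this group have already exited")
        self._remaining[group] -= 1
        if self._remaining[group] == 0:
            self.required_groups -= 1   # last thread of the group: change the condition
            return True
        return False


tracker = GroupExitTracker(num_threads=8, group_size=4)
print([tracker.thread_exit(t) for t in (4, 5, 6, 7)])  # [False, False, False, True]
print(tracker.required_groups)                          # 1
```

Exits of threads 2 and 3 afterwards would leave `required_groups` at 1, because threads 0 and 1 of the same group are still participating.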
[0202] These implementation schemes can be achieved in any suitable and desired manner.
[0203] In a preferred embodiment, the data processor maintains a count, or "barrier count," representing the number of execution threads (for the set of multiple execution threads, or the set of multiple processing items, or the workgroup) that have completed a specific iteration of the iterative data processing operation and have connected the "barrier" relative to that iteration. This barrier count can count, for example, individual execution threads or groups of threads. The barrier count is preferably set at the start of each iteration (e.g., set to zero) and is preferably updated (e.g., incremented, for example, by one) in response to a thread (group) in the set of execution threads (for the set of multiple processing items, or the workgroup) completing the iteration in question and connecting the barrier in question.
[0204] The data processor preferably also maintains a count representing the number of participating (e.g., active) execution threads in the set of multiple execution threads (for the set of multiple processing items (workgroup)), i.e., a "participating thread count," which represents the number of execution threads in the set of multiple execution threads (for the set of multiple processing items (workgroup)) that have not yet exited (or been deactivated from) the iterative data processing operation. This participating thread count can count, for example, individual execution threads or thread groups. This participating thread count is preferably set at the start of the iterative data processing operation (the first iteration) (e.g., set to the number of threads or thread groups in the set of multiple execution threads).
[0205] In this case, the condition to start the next iteration (i.e., release the barrier) is preferably met when the participating thread count and barrier count indicate that all participating (e.g., active) threads in the set of multiple execution threads (for the set of multiple processing items (workgroup)) have completed the iteration in question of the iterative data processing operation and have connected the barrier in question, for example when the participating thread count equals the barrier count.
[0206] In this case, the condition (e.g., the barrier) is preferably changed by appropriately updating the participating thread count. For example, the participating thread count may be updated in response to a "change condition" instruction. In a preferred embodiment, the participating thread count is updated in response to a thread (group) being deactivated ("early"). For example, the participating thread count may be decreased (e.g., decreased by one) in response to a thread or thread group being deactivated ("early").
[0207] In this way, the barrier does not wait for an execution thread that has, e.g., been deactivated, so that the barrier can be released even though that execution thread was deactivated "early".
[0208] Therefore, in one implementation, an iteration of the iterative data processing operation involves each (participating, e.g., active) execution thread performing the corresponding data processing operation, and then either exiting the iterative data processing operation (e.g., being deactivated) and (possibly) updating the participating thread count, or connecting the barrier and (possibly) updating the barrier count (for the iteration).
[0209] Where there are multiple sets of multiple processing items (workgroups) (multiple sets of multiple execution threads), the data processor preferably maintains a corresponding barrier count and participating thread count for each set of multiple processing items (workgroup) (set of multiple execution threads) in this manner.
[0210] It can be determined whether a condition (e.g., a barrier) has been met at any suitable point when the iterative data processing operation is performed. In a preferred embodiment, the determination of whether the condition (e.g., a barrier) has been met is made in response to an update of the barrier count, and preferably also in response to an update of the participating thread count.
[0211] Therefore, in a preferred embodiment, in response to a thread (group) in the set of multiple execution threads (the set of multiple processing items (workgroup)) completing an iteration of the iterative data processing operation and connecting the barrier in question, it is determined whether the (e.g., barrier) condition is satisfied. In a preferred embodiment, in response to a thread (group) in the set of multiple execution threads (the set of multiple processing items (workgroup)) being deactivated ("early") (without connecting the barrier in question), or in response to a "change condition" instruction, it is (preferably) also determined whether the (e.g., barrier) condition is satisfied. Then, when it is determined that the (e.g., barrier) condition is satisfied, the next iteration of the iterative data processing operation is preferably executed (started).
[0212] This invention can be implemented in any suitable system, such as a system that can be suitably configured as a microprocessor-based system. In one embodiment, the invention is implemented in a computer and / or microprocessor-based system. In one embodiment, the invention is implemented in a portable device, such as, in one embodiment, a mobile phone or tablet computer.
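The two-counter scheme set out in the preceding paragraphs can be modelled sequentially as follows. This is a minimal sketch, assuming that counts are kept per individual thread, that the barrier count is reset to zero on release, and that the condition is "barrier count equals participating thread count"; the class `BarrierCounters` and its method names are hypothetical.

```python
class BarrierCounters:
    """Barrier count and participating thread count for one workgroup."""

    def __init__(self, num_threads: int) -> None:
        self.participating = num_threads  # threads that have not yet exited
        self.barrier_count = 0            # threads that have connected the current barrier
        self.releases = 0                 # number of times the barrier has been released

    def _check(self) -> bool:
        # The condition is evaluated after every update of either count.
        if self.participating > 0 and self.barrier_count == self.participating:
            self.barrier_count = 0        # reset for the next iteration
            self.releases += 1
            return True
        return False

    def connect(self) -> bool:
        """A thread finished its operation and connected the barrier."""
        self.barrier_count += 1
        return self._check()

    def deactivate(self) -> bool:
        """A thread exited early: the condition changes via the participating count."""
        self.participating -= 1
        return self._check()


counters = BarrierCounters(4)
# First iteration: two threads connect the barrier, two are deactivated early.
print([counters.connect(), counters.deactivate(), counters.connect(), counters.deactivate()])
# [False, False, False, True]
```

In the example, the barrier is released by the final deactivation rather than by an arrival, which is why the condition is checked on updates of the participating thread count as well as on updates of the barrier count.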
[0213] This invention is applicable to data processors and data processing systems of any suitable form or configuration, such as graphics processors (and systems) having a “pipeline” arrangement (in this case, the graphics processor includes a rendering pipeline). It is applicable, for example, to tile-based graphics processors and graphics processing systems. Therefore, the data processor can be a tile-based graphics processor.
[0214] In one embodiment, the various functions of the invention are performed on a single data processing platform for generating and outputting data, such as for a display device.
[0215] Those skilled in the art will understand that the data processor of the present invention can be part of an overall data (e.g., graphics) processing system that includes (e.g., in one embodiment) a host processor that, for example, executes an application that requires processing by the data (e.g., graphics) processor. The host processor sends appropriate commands and data to the data (e.g., graphics) processor to control the data (e.g., graphics) processor to perform data (e.g., graphics) processing operations and produce the data (e.g., graphics) processing output required by the application executing on the host processor. To facilitate this, the host processor should (and in one embodiment does) also execute a driver for the data processor and optionally one or more compilers for compiling programs (e.g., shaders) to be executed by the data processor (e.g., by its programmable execution unit).
[0216] The data processor may also include, and / or be in communication with, one or more memories and / or memory devices that store the data described herein and / or output data generated by the data processor, and / or store software (e.g., a (shader) program) for performing the processes described herein. The data processor may also communicate with a host microprocessor, and / or with a display for displaying an image based on the data generated by the data processor.
[0217] This invention can be used for all forms of output that can be generated using a data (e.g., graphics) processor. For example, the data (e.g., graphics) processor can execute a graphics processing pipeline that generates frames for display, render-to-texture output, etc. In one embodiment, output data values from the processing pipeline are exported to external (e.g., main) memory for storage and use, such as to a frame buffer for display.
[0218] The various functions of the present invention can be performed in any desired and suitable manner. For example, the functions of the present invention can be implemented in hardware or software as needed. Thus, for example, the various functional elements, stages, and "devices" of the present invention may include one or more suitable processors, one or more controllers, functional units, circuit systems, circuits, processing logic units, microprocessor arrangements, etc., capable of operating to perform various functions, such as suitable dedicated hardware elements (processing circuits) and / or programmable hardware elements (processing circuits) that can be programmed to operate in a desired manner.
[0219] It should also be noted here that, as those skilled in the art will understand, various functions of the present invention can be duplicated and / or performed in parallel on a given processor. Similarly, various processing stages can share processing circuitry if desired.
[0220] Furthermore, any one or more processing stages of the present invention may be embodied, for example, in the form of one or more fixed-function units (hardware) (processing circuitry systems / circuits) and / or in the form of a programmable processing circuitry system / circuit that can be programmed to perform desired operations. Similarly, any one or more of the processing stages and processing circuitry systems / circuits of the present invention may be provided as circuit elements separate from the other processing stages or processing circuitry systems / circuits, and / or any one or more or all of the processing stages and processing circuitry systems / circuits may be formed at least partially from a shared processing circuitry system / circuit.
[0221] Subject to any hardware required to perform the specific functions described above, the components of the data processing system may otherwise include any one or more or all of the usual functional units that such components include.
[0222] Those skilled in the art will also appreciate that all embodiments of the present invention may, as appropriate, include any one or more or all of the optional features described herein.
[0223] The method according to the invention can be implemented at least in part using software, such as a computer program. It will therefore be understood that, viewed from further embodiments, the invention provides: computer software specifically adapted to perform the method described herein when installed on a data processor; a computer program element comprising computer software code portions for performing the method described herein when the program element is run on a data processor; and a computer program comprising code adapted to perform all steps of the method described herein when the program is run on a data processing system. The data processor may be a microprocessor system, a programmable FPGA (Field-Programmable Gate Array), etc.
[0224] The invention also extends to a computer software carrier comprising software which, when used to operate a data processor, renderer, or other system including a data processor, causes the steps of the methods of the invention to be performed in conjunction with that data processor, renderer, or system. Such a computer software carrier can be a physical storage medium, such as a ROM chip, CD-ROM, RAM, flash memory, or disk, or it can be a signal, such as an electronic signal transmitted over wires, an optical signal, or a radio signal such as one sent to a satellite.
[0225] It will also be understood that not all steps of the method of the present invention need to be performed by computer software. Therefore, according to a broader embodiment, the present invention provides computer software, and such software installed on a computer software carrier, for performing at least one step of the method described herein.
[0226] This invention can therefore be suitably embodied as a computer program product for use with a computer system. Such an embodiment may include a series of computer-readable instructions fixed on a tangible, non-transitory medium, such as a computer-readable medium, for example, a disk, CD-ROM, ROM, RAM, flash memory, or hard disk. It may also include a series of computer-readable instructions that can be transmitted to a computer system via a modem or other interface device, either over a tangible medium (including, but not limited to, optical or analog communication lines) or intangibly using wireless techniques (including, but not limited to, microwave, infrared, or other transmission techniques). This series of computer-readable instructions embodies all or part of the functions described above.
[0227] Those skilled in the art will understand that such computer-readable instructions can be written in a variety of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions can be stored using any current or future memory technology (including, but not limited to, semiconductor, magnetic, or optical technologies), or transmitted using any current or future communication technology (including, but not limited to, optical, infrared, or microwave technologies). It is contemplated that such a computer program product can be distributed as removable media with accompanying printed or electronic documentation (e.g., shrink-wrapped software), pre-loaded with a computer system on, for example, a system ROM or a fixed disk, or distributed over a network (e.g., the Internet or the World Wide Web) from a server or electronic bulletin board.
[0228] Various embodiments of the invention will now be described by way of example only and with reference to the accompanying drawings, wherein:
[0229] Figure 1 shows an exemplary graphics processing system including a graphics processor;
[0230] Figure 2 schematically shows a graphics processing pipeline that can be executed by the graphics processor of Figure 1;
[0231] Figure 3 schematically shows the compilation of a shader program for execution by the graphics processor of Figure 1;
[0232] Figure 4 shows an example of four reduction operations being performed;
[0233] Figure 5 shows four reduction operations being performed according to an embodiment of the invention;
[0234] Figure 6 schematically shows a graphics processor capable of operating according to an embodiment of the present invention;
[0235] Figures 7A and 7B show a method of operating the graphics processor of Figure 6 according to an embodiment of the present invention; and
[0236] Figure 8 shows a method of operating a graphics processor according to an embodiment of the present invention.
[0237] Where appropriate in the accompanying drawings, similar reference numerals are used for similar parts.
[0238] Several embodiments of the invention will now be described. These embodiments will be described with particular reference to the use of the invention in graphics processors and graphics shader programs, but as stated above, the invention is equally applicable to other forms of data processors and programs.
[0239] Figure 1 shows a typical graphics processing system. An application 2 executing on host processor 1 will require processing operations to be performed by the associated graphics processing unit (GPU) (graphics processor) 3. To this end, the application generates API (Application Programming Interface) calls, which are interpreted by a driver 4 for the graphics processor 3 running on host processor 1 to generate appropriate commands for the graphics processor 3 to produce the graphics output required by application 2. To facilitate this, a set of "commands" is provided to the graphics processor 3 in response to commands from application 2 running on host system 1.
[0240] Figure 2 shows a graphics processing pipeline 33 that can be executed by the graphics processor 3.
[0241] The graphics processing pipeline 33 shown in Figure 2 is a tile-based renderer and will therefore produce tiles of a render output data array, such as an output frame to be generated.
[0242] In tile-based rendering, rather than the entire render output (e.g., a frame) effectively being processed in one go as in immediate mode rendering, the render output (e.g., a frame to be displayed) is divided into multiple smaller sub-regions (often called "tiles"). Each tile (sub-region) is rendered separately (usually one after another), and the rendered tiles (sub-regions) are then recombined to provide the complete render output, such as a frame for display. In this arrangement, the render output is typically divided into sub-regions (tiles) of regular size and shape (typically squares or rectangles), but this is not required.
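The subdivision of a render output into regularly sized tiles can be sketched as follows. This is a minimal illustration only: the function name `split_into_tiles`, the 16×16 default tile size, and the clipping of edge tiles are assumptions made for the example and are not taken from the description.

```python
def split_into_tiles(width, height, tile_w=16, tile_h=16):
    """Return (x, y, w, h) rectangles that together cover a width x height output.

    Hypothetical helper: the tile size and the clipping of edge tiles are
    illustrative assumptions, not details from the description.
    """
    tiles = []
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            # Clip tiles on the right and bottom edges so they stay inside the output.
            tiles.append((x, y, min(tile_w, width - x), min(tile_h, height - y)))
    return tiles
```

Each returned rectangle would then be rendered separately, and the results recombined to form the complete output.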
[0243] A render output data array can typically be an output frame intended for display on a display device such as a screen or printer, but it can also include, for example, intermediate data intended for use in later rendering passes (also known as "render to texture" output).
[0244] When a computer graphics image is to be displayed, it is typically first defined as a series of primitives (polygons), and these primitives are then divided (rasterized) into graphics fragments for graphics rendering. During normal graphics rendering operations, the renderer modifies the color (red, green, and blue, RGB) and transparency (alpha, a) data associated with each fragment so that the fragment can be displayed correctly. Once the fragments have fully traversed the renderer, their associated data values are stored in memory, ready for output, such as for display.
[0245] Figure 2 shows the main components and pipeline stages of a graphics processing pipeline 33 that are relevant to the operation of an embodiment of the present invention. As those skilled in the art will understand, this graphics processing pipeline may have other elements that are not shown in Figure 2. It should also be noted here that Figure 2 is merely schematic and that, in practice, the functional units and pipeline stages shown may share significant hardware circuitry even though they are shown schematically as separate stages in Figure 2. It should also be understood that each of the stages, elements, and units of the graphics processor shown in Figure 2 can be implemented as needed, and will accordingly include, for example, appropriate circuitry and / or processing logic for performing the required operations and functions.
[0246] As shown in Figure 2, the graphics processing pipeline 33 includes multiple stages, including vertex shader 20, hull shader 21, tessellator 22, domain shader 23, geometry shader 24, rasterization stage 25, early Z (depth) and stencil test stage 26, a renderer in the form of fragment shading stage 27, late Z (depth) and stencil test stage 28, blending stage 29, tile buffer 30, and downsampling and write-out (multisample resolve) stage 31.
[0247] Vertex shader 20 receives input data values associated with vertices and the like as defined for the output to be generated, and processes those data values to generate a set of corresponding "vertex-shaded" output data values for use by subsequent stages of the graphics processing pipeline 33. Vertex shading, for example, modifies the input data to account for the effects of lighting in the image to be rendered.
[0248] The hull shader 21 operates on sets of patch control points and generates additional data known as patch constants. The tessellation stage 22 subdivides geometry to create higher-order representations of the hull. The domain shader 23 operates on the vertices output by the tessellation stage (similar to a vertex shader), and the geometry shader 24 processes entire primitives such as triangles, points, or lines. These stages, together with the vertex shader 20, effectively perform all the necessary fragment front-end operations (such as transformation and lighting operations) and primitive setup (to set up the primitives to be rendered) in response to commands and vertex data provided to the graphics processing pipeline 33.
[0249] The rasterization stage 25 of the graphics processing pipeline 33 is used to rasterize the primitives constituting the rendering output (e.g., an image to be displayed) into individual graphic fragments for processing. To this end, when the rasterizer 25 receives graphic primitives for rendering, it rasterizes these primitives into sample points and generates graphic fragments with appropriate positions (indicating appropriate sample positions) for rendering the primitives.
[0250] The fragments generated by the rasterizer are then sent onward to the rest of the pipeline for processing.
[0251] The early Z / stencil stage 26 performs a Z (depth) test on the fragments it receives from rasterizer 25 to see whether any fragments can be discarded (culled) at this stage. To do this, it compares the depth values of (associated with) the fragments issued from rasterizer 25 with the depth values of fragments that have already been rendered (these depth values are stored in a depth (Z) buffer, which is part of tile buffer 30), to determine whether the new fragments will be occluded by fragments that have already been rendered. At the same time, an early stencil test is performed.
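The depth comparison performed by the early Z test can be sketched as below. This is an illustrative sketch, not the described hardware: the function name `early_depth_test`, the dictionary-based depth buffer, and the "smaller depth is nearer" comparison are all assumptions for the example.

```python
def early_depth_test(fragments, depth_buffer):
    """Return the fragments that are not occluded by already rendered fragments.

    fragments: iterable of (x, y, depth) tuples.
    depth_buffer: dict mapping (x, y) to the stored depth (hypothetical layout).
    Assumes a "less than" depth comparison, i.e. a smaller depth is nearer.
    """
    survivors = []
    for x, y, depth in fragments:
        # Positions with no stored depth are treated as infinitely far away.
        stored = depth_buffer.get((x, y), float("inf"))
        if depth < stored:
            survivors.append((x, y, depth))
    return survivors
```

Fragments that fail the comparison are culled before fragment shading, so no shading work is spent on them.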
[0252] The fragments that pass the early Z and stencil test stage 26 are then sent to the fragment shading stage 27. The fragment shading stage 27 performs appropriate fragment processing operations on the fragments that pass the early Z and stencil tests, so as to process the fragments to generate the appropriate rendered fragment data.
[0253] The fragment processing may include any suitable and desired fragment shading process, such as executing fragment shader programs on the fragments, applying textures to the fragments, or applying fogging or other operations to the fragments, to generate the appropriate fragment data. In an embodiment of the invention, fragment shading stage 27 takes the form of a shader pipeline (a programmable fragment shader).
[0254] There is then a "late" fragment Z and stencil test stage 28, which carries out, among other things, an end-of-pipeline depth test on the shaded fragments to determine whether a rendered fragment will actually be seen in the final image. This depth test uses the Z-buffer value for the fragment's position stored in the Z-buffer in the tile buffer 30 to determine whether the fragment data for the new fragment should replace the fragment data of the fragment that has already been rendered. As is well known in the art, this is done by comparing the depth value of (associated with) the fragment issued from the fragment shading stage 27 with the depth value of the already rendered fragment (as stored in the depth buffer). This late fragment depth and stencil test stage 28 also performs any necessary "late" alpha and / or stencil tests on the fragments.
[0255] Then, if necessary, the fragments that have passed the late fragment test stage 28 are subjected in blender 29 to any necessary blending operations with fragments already stored in the tile buffer 30. Any other remaining operations necessary on the fragments, such as dithering (not shown), are also performed at this stage.
[0256] Finally, the (blended) output fragment data (values) are written to tile buffer 30, from which they can be output to a frame buffer for display, for example. The depth values of the output fragments are also written appropriately to the Z-buffer within tile buffer 30. The tile buffer stores color and depth buffers, which store an appropriate color or Z value, respectively, for each sample point that the buffers represent (essentially for each sample point of the tile being processed). These buffers store arrays of fragment data representing a portion (tile) of the overall render output (e.g., the image to be displayed), with respective sets of sample values in the buffers corresponding to respective pixels of the overall render output (e.g., each 2×2 set of sample values may correspond to an output pixel, where 4× multisampling is used).
[0257] The tile buffer is provided as part of RAM that is located on (local to) the graphics processing pipeline (chip).
[0258] Data from tile buffer 30 is input to the downsampling (multisample resolve) write-out unit 31, and then output (written back) to an external memory output buffer, such as the frame buffer of a display device. The display device may include, for example, a display comprising a pixel array, such as a computer monitor or a printer.
[0259] The downsampling and write-out unit 31 downsamples the fragment data stored in the tile buffer 30 to the appropriate resolution for the output buffer (device) (i.e., so that a pixel data array corresponding to the pixels of the output device is generated) to generate output values (pixels) for output to the output buffer.
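As an illustration of downsampling when each 2×2 set of sample values corresponds to one output pixel (4× multisampling), the sketch below averages each 2×2 block. The function name `resolve_4x` and the use of a plain average (box filter) are assumptions for the example; the description does not specify how the samples are combined.

```python
def resolve_4x(samples):
    """Downsample a 2D grid of sample values to one value per 2x2 block.

    samples: list of equal-length rows with even dimensions.
    Assumption: samples are combined by a plain average (box filter).
    """
    pixels = []
    for y in range(0, len(samples), 2):
        row = []
        for x in range(0, len(samples[y]), 2):
            block = (samples[y][x], samples[y][x + 1],
                     samples[y + 1][x], samples[y + 1][x + 1])
            row.append(sum(block) / 4)
        pixels.append(row)
    return pixels
```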
[0260] Once a tile of the rendered output has been processed and its data has been exported to main memory (e.g., to a frame buffer in main memory (not shown)) for storage, the next tile is then processed, and so on, until enough tiles have been processed to generate the entire rendered output (e.g., a frame (image) to be displayed). The process is then repeated for the next rendered output (e.g., a frame), and so on.
[0261] Other arrangements of the graphics processing pipeline 33 will be possible.
[0262] The above describes certain features of the operation of the graphics processing system shown in Figure 1. Further features of the operation of the graphics processing system shown in Figure 1 will now be described.
[0263] As can be seen from Figure 2, the graphics processing pipeline 33 includes multiple programmable processing or "shader" stages, namely vertex shader 20, hull shader 21, domain shader 23, geometry shader 24, and fragment shader 27. These programmable shader stages execute corresponding shader programs, which are provided by the application, have one or more input variables, and generate sets of output variables.
[0264] To this end, application 2 provides shader programs implemented in a high-level shader programming language (such as GLSL, HLSL, OpenCL, etc.). These shader programs are then translated by a shader language compiler into binary code for the target graphics processing pipeline 33. This may include creating one or more intermediate representations of the program within the compiler. The compiler may be, for example, part of driver 4, with a special API call causing the compiler to run. The compiler execution can therefore be viewed as part of the draw call preparation performed by the driver in response to API calls generated by the application. (Of course, other compiler arrangements would be possible.)
[0265] Figure 3 illustrates this, and shows a shader program being provided in the high-level shader programming language 301 by application 2 to driver 4, which then compiles 302 the shader program into binary code 303 for the graphics processing pipeline 33.
[0266] As described above, each shader in this graphics processing pipeline is a processing stage that performs graphics processing by running a small program for each "work item" in the output to be generated (in this respect, a "work item" is typically a vertex or a sampling position). For each work item to be processed, an execution thread that will execute the corresponding shader program is issued to an appropriate shader core (programmable execution unit), which then executes the shader program for the execution thread in question.
[0267] To allow multiple execution threads to run simultaneously (in parallel), the shader core is typically arranged as multiple execution channels, each capable of performing processing operations on one execution thread.
[0268] It is also known to use the shader functions of the graphics processor and graphics processing pipeline to perform more general computational tasks, such as those performed using compute shader APIs such as OpenCL and Vulkan. In this case, the execution channels of the shader core of the graphics processor 3 would be used to perform more general data processing tasks that may not specifically involve the generation of graphics data for graphics output (e.g., for display).
[0269] Embodiments of the present invention relate in particular to efficient mechanisms for performing operations such as so-called "reduction" operations. Such "reduction" operations may be required when performing graphics processing or compute shading. One example is a "tile reduction" operation, which can be used, for example, in graphics processing (such as tiled lighting or spotlighting) to find, e.g., the minimum and / or maximum depth value in a lit tile. Another example of a "reduction" operation that may be required when performing compute shading is the workgroup reduction operation (work_group_reduce) described in OpenCL.
[0270] In OpenCL, applications can submit kernels to the graphics processor to perform processing operations over a three-dimensional iteration space called an NDRange (each iteration (element) in this space is a work item). Each NDRange is divided into workgroups, and each workgroup includes one or more subgroups. Each subgroup can be mapped, for example, to a group of threads (a "warp") that will be executed by the graphics processor (such that each thread corresponds to a respective work item in that subgroup).
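The hierarchy of work items, subgroups, and workgroups can be pictured with the following sketch. It is a deliberately simplified, one-dimensional model written in plain Python (not OpenCL code); the function name `partition_ndrange` and its parameters are hypothetical.

```python
def partition_ndrange(global_size, workgroup_size, subgroup_size):
    """Split work-item ids 0..global_size-1 into workgroups of subgroups.

    One-dimensional simplification of an NDRange (hypothetical helper).
    Returns a list of workgroups; each workgroup is a list of subgroups;
    each subgroup is a list of work-item ids (one thread per work item).
    """
    workgroups = []
    for wg_start in range(0, global_size, workgroup_size):
        wg_end = min(wg_start + workgroup_size, global_size)
        subgroups = [
            list(range(sg_start, min(sg_start + subgroup_size, wg_end)))
            for sg_start in range(wg_start, wg_end, subgroup_size)
        ]
        workgroups.append(subgroups)
    return workgroups
```

For example, `partition_ndrange(32, 16, 8)` yields two workgroups, each holding two subgroups of eight work items.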
[0271] A reduction operation can be performed for a respective workgroup, for example to sum (add) all data values of the workgroup (an addition reduction operation) or to determine the product of all data values of the workgroup (a multiplication reduction operation). A reduction operation can also determine the maximum or minimum of all data values of the workgroup.
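The four kinds of reduction named above can be expressed compactly as a fold over the data values of a workgroup. The sketch below uses Python's standard `functools.reduce`; the helper name `reduce_workgroup` and the operation labels are illustrative assumptions and do not correspond to any OpenCL API.

```python
from functools import reduce
import operator

# Combining operations mentioned in the text: sum, product, maximum, minimum.
COMBINERS = {
    "add": operator.add,
    "mul": operator.mul,
    "max": max,
    "min": min,
}

def reduce_workgroup(values, op):
    """Combine all data values of one workgroup into a single value."""
    return reduce(COMBINERS[op], values)
```

For the values `[3, 1, 4, 1, 5]`, the "add" reduction gives 14 and the "max" reduction gives 5.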
[0272] Figure 4 shows one method for performing such reduction operations. Figure 4 shows four reduction operations being performed by a graphics processor 3 for four corresponding workgroups 100, 120, 140, and 160. The graphics processor is arranged with sixteen execution channels, enabling it to execute sixteen threads in parallel. In the illustrated example, each reduction operation combines sixteen initial "input" data values of the corresponding workgroup into a single "output" data value for that workgroup.
[0273] In the example shown in Figure 4, each reduction operation is performed by a set of eight threads issued to a corresponding set of eight execution channels of the graphics processor 3. Each such set of eight execution threads performs the reduction operation in an iterative manner, wherein each iteration (loop) of the reduction operation involves one or more threads in the set of eight execution threads each combining two corresponding data values in a desired manner (e.g., adding, multiplying, or determining the maximum or minimum of two data values) to generate a single combined data value (e.g., the sum, product, maximum, or minimum of the two data values).
[0274] In the example shown in Figure 4, the first iteration of the reduction operation involves all eight execution threads in the corresponding set of execution threads each performing a corresponding combination operation. Each subsequent iteration then involves only half as many threads performing a combination operation as in the previous iteration. Thus, each reduction operation combines the sixteen initial input data values of the corresponding workgroup into a single output data value over a total of four iterations.
[0275] For example, with respect to workgroup 100, the first iteration of the reduction operation involves all eight execution threads 102 combining the corresponding pairs of sixteen initial input data values 101 of workgroup 100 into eight corresponding intermediate combined data values 103; the second iteration of the reduction operation involves four execution threads 104 combining the corresponding pairs of eight intermediate combined data values 103 into four corresponding intermediate combined data values 105; the third iteration of the reduction operation involves two execution threads 106 combining the corresponding pairs of four intermediate combined data values 105 into two corresponding intermediate combined data values 107; and the fourth iteration of the reduction operation involves one execution thread 108 combining the two intermediate combined data values 107 into a single output data value 109 of workgroup 100.
[0276] It should be understood that in this example, an execution thread will combine data values generated by other execution threads in the previous iteration. To account for this, a barrier operation is used to ensure that all data values to be generated in one iteration have been generated before the next iteration starts, so that the correct data values are used for combination in the next iteration.
[0277] In this example, the barrier operation requires all eight threads in the set of eight threads performing the reduction operation to arrive at (join) the barrier before the next iteration of the reduction operation can begin. Therefore, in this example, all eight threads in the set execute each iteration of the reduction operation synchronously with each other. Then, after all iterations have been completed, all eight threads are deactivated. A new set of eight threads can then be issued to the same execution channels to perform another reduction operation for a different workgroup.
[0278] For example, regarding workgroup 120, as shown in Figure 4, all eight threads 121-128 in the corresponding set of eight threads execute each iteration of the reduction operation synchronously with each other. Then, after all iterations of the reduction operation have been completed, all eight threads 121-128 are deactivated. In the example shown in Figure 4, a new set of eight threads is then issued to the same set of execution channels to perform another reduction operation for workgroup 160.
[0279] Because the graphics processor 3 in this example can execute sixteen threads in parallel, it can perform two reduction operations simultaneously (in parallel) in this way. Therefore, in the example shown in Figure 4, the reduction operations for the first two workgroups 100 and 120 are performed simultaneously, and the reduction operations for the other two workgroups 140 and 160 begin only after the reduction operations for workgroups 100 and 120 have been completed. It should therefore be understood that a total of eight iterations (loops) are required to complete all four reduction operations in this example.
[0280] As stated above, the applicant has recognized that performing the reduction operation in this manner causes some execution threads to not perform the corresponding combination operations in some iterations (loops) of the reduction operation, but instead only participate in the barrier operations of those iterations (loops). Furthermore, the number of such execution threads that do not perform the corresponding combination operations increases from one iteration (loop) of the reduction operation to the next.
[0281] For example, regarding workgroup 120, as shown in Figure 4, during the second iteration of the reduction operation for workgroup 120, four execution threads 121-124 will not perform a corresponding combination operation; during the third iteration of the reduction operation, six execution threads 121-126 will not perform a corresponding combination operation; and during the fourth iteration of the reduction operation, seven execution threads 121-127 will not perform a corresponding combination operation.
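The growth in the number of idle threads can be reproduced with a small sequential simulation of the Figure 4 style of reduction, in which all threads stay alive until the final iteration. This is a sketch under stated assumptions: the function name `tree_reduce_with_idle_counts` is hypothetical, adjacent values are paired (the description does not say which pairs are combined), and the barrier is only indicated by a comment rather than modelled with real threads.

```python
def tree_reduce_with_idle_counts(values, combine, num_threads=8):
    """Simulate a barrier-synchronised tree reduction where no thread exits early.

    Returns (result, idle_per_iteration), where idle_per_iteration[i] is the
    number of threads that only wait at the barrier in iteration i.
    Assumes len(values) == 2 * num_threads and num_threads is a power of two.
    """
    data = list(values)
    assert len(data) == 2 * num_threads
    idle_per_iteration = []
    active = num_threads
    while len(data) > 1:
        # Each active thread combines one adjacent pair (pairing is an assumption).
        data = [combine(data[2 * i], data[2 * i + 1]) for i in range(active)]
        idle_per_iteration.append(num_threads - active)
        # Barrier point: all num_threads threads would wait here, including idle ones.
        active //= 2
    return data[0], idle_per_iteration
```

For sixteen inputs and eight threads, the idle counts come out as 0, 4, 6, and 7 across the four iterations, matching the figures given for workgroup 120.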
[0282] Figure 5 shows a more efficient method for performing the reduction operation according to an embodiment of the present invention. Similarly to Figure 4, Figure 5 shows four reduction operations being performed by a graphics processor 3 for four corresponding workgroups 200, 220, 240, and 260. This graphics processor is arranged with sixteen execution channels, enabling it to execute sixteen threads simultaneously (in parallel). Each reduction operation again combines sixteen initial "input" data values of the corresponding workgroup into a single "output" data value for that workgroup. However, it should be understood that other numbers of workgroups, threads, input data values, etc., would be possible.
[0283] In an embodiment of the present invention, similarly to the Figure 4 example, each reduction operation is performed by a set of eight threads issued to a corresponding set of eight execution channels of the graphics processor 3. Each set of eight execution threads performs the reduction operation iteratively. However, it should also be understood that other numbers of threads would be possible in the set of threads performing the reduction operation.
[0284] In an embodiment of the present invention, similarly to Figure 4, each iteration (loop) of the reduction operation involves one or more threads from the corresponding set of eight execution threads each combining two corresponding data values in the desired manner (e.g., adding, multiplying, or determining the maximum or minimum of two data values) to produce a single combined data value (e.g., the sum, product, maximum, or minimum of the two data values).
[0285] As shown in Figure 5, and similarly to Figure 4, in embodiments of the present invention the first iteration of the reduction operation involves all eight execution threads in the corresponding set of execution threads each performing a corresponding combination operation. Each subsequent iteration then involves only half as many threads performing a combination operation as in the previous iteration. Thus, each reduction operation combines the sixteen initial input data values of the corresponding workgroup into a single output data value over a total of four iterations (loops).
[0286] However, rather than the execution threads in the set performing a reduction operation being deactivated only once all iterations of the reduction operation have been completed (as in the Figure 4 example described above), in the Figure 5 embodiment each execution thread in the set of execution threads performing the reduction operation is deactivated when it has completed (in response to its completing) all iterations of the reduction operation in which that execution thread actually performs a combination operation, i.e., such that the execution thread does not participate in any iteration of the reduction operation in which it would not perform a corresponding combination operation. Therefore, in this embodiment, some execution threads in the set of execution threads performing the reduction operation are deactivated "early" from the reduction operation, i.e., different execution threads in the set are deactivated after participating in different numbers of iterations.
[0287] For example, regarding workgroup 200, as shown in Figure 5, the first iteration of the reduction operation involves all eight execution threads 202 combining the corresponding pairs of sixteen initial input data values 201 of workgroup 200 into eight corresponding intermediate combined data values 203. Then, four of the execution threads are deactivated. The second iteration of the reduction operation then involves only the four remaining execution threads 204 combining the corresponding pairs of eight intermediate combined data values 203 into four corresponding intermediate combined data values 205. Then, two of the four remaining execution threads are deactivated. The third iteration of the reduction operation then involves only the two remaining execution threads 206 combining the corresponding pairs of four intermediate combined data values 205 into two corresponding intermediate combined data values 207. Then, one of the two remaining execution threads is deactivated. The fourth iteration of the reduction operation then involves only the last remaining execution thread 208 combining the two intermediate combined data values 207 into a single output data value 209 for workgroup 200.
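The early-exit behaviour described for workgroup 200 can be sketched as a sequential simulation that tracks how many threads are still live in each iteration. The function name `tree_reduce_early_exit` is hypothetical, the pairing of adjacent values is an assumption, and real thread deactivation and synchronization are not modelled.

```python
def tree_reduce_early_exit(values, combine):
    """Simulate a tree reduction in which threads are deactivated early.

    A thread is deactivated as soon as it has no further combination
    operations to perform, so only the live threads take part in each iteration.
    Returns (result, live_per_iteration).
    Assumes len(values) is a power of two and at least 2.
    """
    data = list(values)
    assert len(data) >= 2 and len(data) & (len(data) - 1) == 0
    live_per_iteration = []
    live = len(data) // 2
    while len(data) > 1:
        live_per_iteration.append(live)
        # Each live thread combines one adjacent pair (pairing is an assumption).
        data = [combine(data[2 * i], data[2 * i + 1]) for i in range(live)]
        live //= 2  # half of the threads that just ran are deactivated
    return data[0], live_per_iteration
```

For sixteen inputs the live counts are 8, 4, 2, and 1, i.e. 15 thread-iterations in total, compared with 8 × 4 = 32 when all eight threads are kept alive for all four iterations.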
[0288] Therefore, in embodiments of the present invention, execution threads are prevented from taking part in iterations of the reduction operation in which they would perform no combination operation and would merely participate in the barrier operation of those iterations. It should therefore be understood that embodiments of the present invention can allow the reduction operation to be performed in a more efficient manner, thereby reducing, for example, the power consumption of the graphics processor 3 performing the reduction operation.
[0289] Furthermore, it should be understood that deactivating execution threads "early" in this way will cause execution channels to become available earlier than would otherwise be the case, allowing new execution threads to be issued to the available execution channels and to begin execution earlier than would otherwise be possible. This enables higher parallelism.
[0290] For example, in the Figure 5 embodiment, the first iteration of the reduction operation for workgroup 200 is executed in parallel with the first iteration of the reduction operation for workgroup 220. Then, four of the execution threads that participated in the first iteration of the reduction operation for workgroup 220 and four of the execution threads that participated in the first iteration of the reduction operation for workgroup 200 are deactivated, making eight execution channels available.
[0291] Then, a set of eight threads for performing the reduction operation for workgroup 240 is issued to the eight available execution channels. The first iteration of the reduction operation for workgroup 240 is then executed in parallel with the second iteration of the reduction operation for workgroup 220 and the second iteration of the reduction operation for workgroup 200. Then, four execution threads from the first iteration of the reduction operation for workgroup 240, two execution threads from the second iteration of the reduction operation for workgroup 220, and two execution threads from the second iteration of the reduction operation for workgroup 200 are deactivated, making eight execution channels available.
[0292] Then, a set of eight threads for performing the reduction operation for workgroup 260 is issued to eight available execution channels. The first iteration of the reduction operation for workgroup 260 is then executed in parallel with the second iteration of the reduction operation for workgroup 240, the third iteration of the reduction operation for workgroup 220, and the third iteration of the reduction operation for workgroup 200. Then, four execution threads from the first iteration of the reduction operation for workgroup 260, two execution threads from the second iteration of the reduction operation for workgroup 240, one execution thread from the third iteration of the reduction operation for workgroup 220, and one execution thread from the third iteration of the reduction operation for workgroup 200 are deactivated, making the eight execution channels subsequently available. These eight available execution channels can then be used for another reduction operation (not shown) or for other purposes, and so on.
[0293] It should be understood that the embodiment of Figure 5 requires a total of six iterations (cycles) to complete all four reduction operations. Therefore, compared with the example of Figure 4, which requires a total of eight iterations (cycles) to complete all four reduction operations, the embodiment of Figure 5 allows the reduction operations to be performed more quickly and more efficiently.
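The six-versus-eight comparison can be checked with a small scheduling model. The sketch below is illustrative only and rests on several assumptions that are not stated in this form in the description: sixteen execution channels in total (inferred from two sets of eight threads running in parallel), whole sets of eight threads issued as soon as eight channels are free, and one iteration per resident workgroup per cycle. The function name `cycles_to_finish` is hypothetical.

```python
def cycles_to_finish(n_workgroups, lanes, wg_threads, early_deactivate):
    """Count cycles until every workgroup has finished its reduction."""
    if lanes < wg_threads:
        raise ValueError("a workgroup must fit in the available lanes")
    # wg_threads threads reduce 2 * wg_threads values, halving each iteration
    # (assumes wg_threads is a power of two).
    n_iters = (2 * wg_threads).bit_length() - 1

    def lanes_held(iters_done):
        # With early deactivation, half the threads leave after each iteration.
        return wg_threads >> iters_done if early_deactivate else wg_threads

    resident = []          # iterations completed by each resident workgroup
    pending = n_workgroups
    cycles = 0
    while pending or resident:
        free = lanes - sum(lanes_held(done) for done in resident)
        while pending and free >= wg_threads:   # issue whole sets of threads
            resident.append(0)
            pending -= 1
            free -= wg_threads
        # Every resident workgroup performs one iteration in this cycle;
        # workgroups that have completed all iterations are dropped.
        resident = [done + 1 for done in resident if done + 1 < n_iters]
        cycles += 1
    return cycles


print(cycles_to_finish(4, 16, 8, early_deactivate=True))   # 6
print(cycles_to_finish(4, 16, 8, early_deactivate=False))  # 8
```

Under these assumptions, the model reproduces the six cycles of the Figure 5 arrangement and the eight cycles of the Figure 4 arrangement for four workgroups.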
[0294] Furthermore, it should be understood that deactivating threads "early" in the manner of the embodiment of Figure 5 allows a larger number of workgroups to be active at the same time, for example in systems in which the total number of active execution threads is limited.
[0295] In the embodiment of Figure 5, as in the example of Figure 4, the execution threads performing a reduction operation are synchronized to ensure that all data values to be generated in one iteration of the reduction operation have been generated before the next iteration begins, so that the correct data values are combined in the next iteration.
[0296] However, to allow threads to be deactivated "early" (i.e., before the last one or more iterations of the reduction operation begin), the embodiment of Figure 5 does not synchronize threads using a regular barrier as in the example of Figure 4 (which would require all eight threads of the respective set of eight threads to reach the barrier before the next iteration could begin). Instead, the embodiment of Figure 5 uses a "partial barrier" operation, for which the requirement (condition) for releasing threads from the barrier to start the next iteration changes from one iteration of the reduction operation to the next.
[0297] In particular, in the "partial barrier" operation of the embodiment of Figure 5, the number of threads that must reach the barrier before it can be released corresponds to the number of threads remaining to perform the reduction operation, i.e., the number of threads that have not yet been deactivated. Therefore, in contrast to a regular barrier operation, this "partial barrier" operation allows the barrier to be released in response to some, but not all, of the execution threads issued for the workgroup reaching the barrier.
[0298] The number of threads required to meet the "partial barrier" release condition can vary at the granularity of individual threads; however, Figures 6, 7 and 8 illustrate an embodiment in which the "partial barrier" release condition changes at the granularity of thread groups ("thread bundles").
[0299] Figure 6 shows a graphics processor 3 according to these embodiments. Figure 6 shows the main components of the graphics processor 3 that are relevant to the operation of embodiments of the present invention. As those skilled in the art will understand, the graphics processor 3 may have other elements not shown in Figure 6. It should also be noted that Figure 6 is only schematic and that, for example, in practice the functional units shown may share significant hardware circuitry, even though they are shown schematically as separate components in Figure 6. It should also be understood that each of the elements and units of the graphics processor 3 shown in Figure 6 can be implemented as desired, and will accordingly include, for example, appropriate circuitry and / or processing logic for performing the required operations and functions.
[0300] As shown in Figure 6, the graphics processor 3 of this embodiment includes a thread bundle manager 600 and a thread bundle execution engine 630. The thread bundle manager 600 issues (schedules) thread groups ("thread bundles") of execution threads to the thread bundle execution engine 630 for execution. For example, each thread group ("thread bundle") may include a total of eight, sixteen, or thirty-two execution threads. Other thread groupings are possible.
[0301] Then, the thread bundle execution engine 630 executes the shader program for each execution thread in the thread group (“thread bundle”) to which it is issued, to generate appropriate output data for each execution thread. In this embodiment of the invention, the shader program is provided by the application 2 and can be compiled for execution by the driver 4.
[0302] To allow for parallel thread execution, the thread bundle execution engine 630 is arranged as multiple execution channels. In this embodiment of the invention, the thread bundle execution engine 630 is configured with the same number of execution channels as there are threads in a thread group ("thread bundle"). However, other numbers of execution channels would be possible.
[0303] Threads within a thread group ("thread bundle") execute one instruction at a time, in a lockstep manner. Grouping execution threads in this way improves the execution efficiency of the execution engine 630 because instruction fetching and scheduling resources can be shared among all threads in the group.
[0304] As shown in Figure 6, the thread bundle manager 600 maintains a workgroup table 620. The workgroup table 620 includes an entry for each workgroup to be processed. Each entry in the workgroup table 620 initially indicates the number of subgroups within the corresponding workgroup, and thus indicates the number of corresponding thread groups ("thread bundles") that the thread bundle manager 600 will issue (schedule) to the thread bundle execution engine 630 to process that workgroup.
[0305] As shown in Figure 6, the thread bundle execution engine 630 maintains a thread bundle table 640, which is initialized when the thread bundle manager 600 issues thread groups ("thread bundles") to it for execution. The thread bundle table 640 includes an entry for each thread in each thread group ("thread bundle") currently being executed by the thread bundle execution engine 630. Each entry in the thread bundle table 640 initially indicates that the corresponding execution thread is active; when a thread is deactivated, the corresponding entry in the thread bundle table 640 is updated to indicate that the thread is no longer active.
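A table of this kind can be modelled as a set of per-thread active flags. The following Python sketch is an assumption-laden illustration rather than a description of the hardware: the class name `ThreadBundleTable`, the methods `issue` and `deactivate`, and the string identifiers are hypothetical. The return value of `deactivate` reports whether every thread of the group is now inactive, which is the check that the text later describes with reference to Figure 7B.

```python
class ThreadBundleTable:
    """Per-thread active flags for the thread groups currently being executed."""

    def __init__(self):
        self._active = {}  # thread group id -> list of per-thread active flags

    def issue(self, group_id, n_threads):
        # Every thread starts out active when its thread group is issued.
        self._active[group_id] = [True] * n_threads

    def deactivate(self, group_id, thread):
        """Mark one thread inactive; return True if the whole group is inactive."""
        flags = self._active[group_id]
        flags[thread] = False
        return not any(flags)


table = ThreadBundleTable()
table.issue("group0", 4)
signals = [table.deactivate("group0", t) for t in (3, 2, 1, 0)]
print(signals)  # [False, False, False, True]
```

Only the final deactivation reports that the whole group is inactive, which is the point at which the execution engine would notify the thread bundle manager.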
[0306] As shown in Figure 6, to facilitate barrier operations, the thread bundle manager 600 also includes a barrier unit 610. The barrier unit 610 maintains a workgroup count 611 and a barrier count 612. The barrier count 612 counts the number of thread groups ("thread bundles") (subgroups) of a workgroup that have reached the barrier, and the workgroup count 611 ("participating thread count") indicates the number of thread groups ("thread bundles") (subgroups) of that workgroup that must reach the barrier before it can be released. The workgroup count 611 is initially equal to the total number of subgroups within the workgroup, and the barrier count 612 for that workgroup is initially zero.
[0307] Figure 7A illustrates the operation of the graphics processor 3 of Figure 6 when a thread reaches a barrier. As shown in Figure 7A, when a thread in a thread group ("thread bundle") being executed by the thread bundle execution engine 630 reaches a barrier 701, it is determined 702 whether all threads in the thread group ("thread bundle") have reached the barrier. If not all threads in the thread group ("thread bundle") have reached the barrier, no further action 706 is taken in respect of the barrier operation. Otherwise, if all threads in the thread group ("thread bundle") have reached the barrier, then, as shown in Figure 6, the thread bundle execution engine 630 signals 651 this to the barrier unit 610. In response to the signal 651, the barrier unit 610 increments 703 the barrier count 612 for the workgroup that the thread group ("thread bundle") is processing.
[0308] As shown in Figure 7A, the barrier unit 610 then compares 704 the incremented barrier count 612 with the workgroup count 611 to determine whether the condition for releasing the barrier has been met. If the workgroup count 611 is not equal to the barrier count 612, this indicates that not all of the thread groups ("thread bundles") (subgroups) of the workgroup that must reach the barrier before it can be released have reached the barrier. Therefore, in this case, no further action 706 is taken in respect of the barrier operation.
[0309] However, if the workgroup count 611 is equal to the barrier count 612, this indicates that all of the thread groups ("thread bundles") (subgroups) of the workgroup that must reach the barrier before it can be released have reached the barrier. Therefore, in this case, the barrier unit 610 causes the barrier to be released 705 which, as shown in Figure 6, is achieved by issuing a barrier release signal 652 to the thread bundle execution engine 630.
[0310] Figure 7B illustrates the operation of the graphics processor 3 of Figure 6 when a thread is deactivated. When a thread in a thread group ("thread bundle") being executed by the thread bundle execution engine 630 is deactivated 711, the corresponding entry in the thread bundle table 640 is updated 653 to indicate that the thread is no longer active. It is then determined 712 whether all threads in the thread group ("thread bundle") have been deactivated, by determining whether the thread bundle table 640 indicates that all threads in the thread group ("thread bundle") are no longer active. If not all threads in the thread group ("thread bundle") have been deactivated, no further action 716 is taken in respect of the barrier operation. Otherwise, if all threads in the thread group ("thread bundle") have been deactivated, then, as shown in Figure 6, the thread bundle execution engine 630 signals 654 this to the thread bundle manager 600.
[0311] In response to the signal 654, the thread bundle manager 600 decrements by one the entry in the workgroup table 620 for the workgroup that the thread group ("thread bundle") was processing, to indicate that the number of remaining subgroups (thread groups ("thread bundles")) to be processed for that workgroup has decreased by one. This in turn triggers 655 a decrement 713 of the workgroup count 611 for that workgroup.
[0312] As shown in Figure 7B, the barrier unit 610 then compares 714 the decremented workgroup count 611 with the barrier count 612 to determine whether the condition for releasing the barrier has been met. If the workgroup count 611 is not equal to the barrier count 612, this indicates that not all of the thread groups ("thread bundles") (subgroups) of the workgroup that must reach the barrier before it can be released have reached the barrier. Therefore, in this case, no further action 716 is taken in respect of the barrier operation.
[0313] However, if the workgroup count 611 is equal to the barrier count 612, this indicates that all of the thread groups ("thread bundles") (subgroups) of the workgroup that must reach the barrier before it can be released have reached the barrier. Therefore, in this case, the barrier unit 610 causes the barrier to be released 715 which, as shown in Figure 6, is achieved by issuing a barrier release signal 652 to the thread bundle execution engine 630.
[0314] Therefore, in this embodiment of the invention, the barrier unit 610 determines whether to release the barrier for a workgroup based on a comparison between the number of thread groups ("thread bundles") (subgroups) of the workgroup that have reached the barrier and the number of thread groups ("thread bundles") (subgroups) of the workgroup that have not yet been deactivated. Thus, in contrast to arrangements that determine whether to release a barrier by comparing the number of thread groups ("thread bundles") that have reached the barrier with the total number of thread groups ("thread bundles") in the workgroup, the condition for releasing a barrier in this embodiment of the invention changes as thread groups ("thread bundles") are deactivated. This facilitates performing the reduction operation efficiently in the manner described above with reference to Figure 5.
[0315] Figure 8 shows the operation of this "partial barrier" in the embodiment of the invention in more detail.
[0316] As shown in Figure 8, the barrier unit 610 maintains a workgroup count 611 ("activity line count") and a barrier count 612 for each workgroup being processed. In response to a workgroup being created 821, the workgroup count 611 for the workgroup is initially set 811 to the number of thread groups ("thread bundles") (subgroups) of the workgroup, and the corresponding barrier count 612 is initially set 812 to zero.
[0317] The execution engine 630 schedules 820, 830 thread groups ("thread bundles") to execute 840 the shader program and, in response to instructions in the shader program, the threads can generate an event 841 indicating that all threads in a thread group ("thread bundle") have been deactivated, or an event 842 indicating that all threads in a thread group ("thread bundle") have reached the barrier. These events 841 and 842 are transmitted to the barrier unit 610.
[0318] In response to receiving an event 841 indicating that all threads in a thread group ("thread bundle") have been deactivated, the barrier unit 610 decrements the workgroup count 611 for the workgroup in question. In response to receiving an event 842 indicating that all threads in a thread group ("thread bundle") have reached the barrier, the barrier unit 610 increments the barrier count 612 for the workgroup in question.
[0319] In response to decrementing the workgroup count 611 or incrementing the barrier count 612 for a workgroup, the barrier unit 610 determines 851 whether the workgroup count 611 and the barrier count 612 are now equal. If they are equal, a check 852 is made to determine whether they are both zero. If the workgroup count 611 and the barrier count 612 are both zero, this indicates that all thread groups ("thread bundles") of the workgroup have been deactivated and that no thread groups ("thread bundles") of the workgroup are waiting to be released from the barrier.
[0320] Otherwise, if the workgroup count 611 and the barrier count 612 are equal to each other but not zero, this indicates that there are thread groups ("thread bundles") of the workgroup waiting to be released from the barrier and that the condition for releasing the barrier has been met. Accordingly, the barrier unit 610 signals to the execution engine 630 that the barrier can be released, and resets 822 the barrier count 612 for the workgroup to zero 812. The execution engine then schedules 820, 830 the thread groups ("thread bundles") for execution 840, so that, for example, the shader program can continue execution past the barrier in question, and so on.
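The counting logic described for the barrier unit can be sketched in Python as below. This is a minimal software model under stated assumptions, not the circuit itself: the class name `BarrierUnit`, the method names, and the `releases` counter are hypothetical, and each method is assumed to be called once per event 841 or 842 for a single workgroup.

```python
class BarrierUnit:
    """Model of the workgroup count / barrier count comparison for one workgroup."""

    def __init__(self, n_subgroups):
        self.workgroup_count = n_subgroups  # thread groups not yet deactivated
        self.barrier_count = 0              # thread groups waiting at the barrier
        self.releases = 0                   # number of barrier releases so far

    def on_group_deactivated(self):
        """All threads of one thread group have been deactivated."""
        self.workgroup_count -= 1
        return self._check()

    def on_group_at_barrier(self):
        """All threads of one thread group have reached the barrier."""
        self.barrier_count += 1
        return self._check()

    def _check(self):
        """Return True if the barrier is released by the latest event."""
        if self.workgroup_count != self.barrier_count:
            return False        # some remaining thread group has not arrived
        if self.workgroup_count == 0:
            return False        # everything deactivated, nothing is waiting
        self.barrier_count = 0  # reset for the next use of the barrier
        self.releases += 1
        return True


unit = BarrierUnit(2)
first = unit.on_group_at_barrier()     # one of two groups waiting
second = unit.on_group_deactivated()   # other group leaves: barrier releases
third = unit.on_group_deactivated()    # last group leaves: nothing to release
print(first, second, third)  # False True False
```

In the usage shown, the release is triggered by a deactivation rather than by an arrival, which is the case that a fixed-count barrier would not handle.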
[0321] The following pseudocode shows exemplary high-level shader program code 301 for performing an addition reduction operation.
[0322]
[0323]
[0324]
[0325] In this example, the execution threads call the function reduce_operation to perform the corresponding addition operation in those iterations of the loop for which the condition gl_LocalInvocationID.x < n_threads_active is satisfied, where gl_LocalInvocationID.x is the identifier of each thread, and n_threads_active is initially set equal to the total number of execution threads and is halved at each iteration. Thus, in this example, the number of execution threads performing the corresponding addition operation is halved from one iteration to the next. To ensure thread synchronization, the function barrier() is called at each iteration to trigger a barrier synchronization event, as described above.
[0326] In one embodiment, it can be recognized (e.g., by a compiler compiling the program) that, since n_threads_active decreases monotonically at each iteration of the loop and gl_LocalInvocationID.x is constant for each thread, once the condition if (gl_LocalInvocationID.x < n_threads_active) is false it will remain false until the end of the loop. Additionally, it can be recognized (e.g., by the compiler) that all memory writes outside the loop are executed only when the condition if (gl_LocalInvocationID.x == 0) is satisfied.
[0327] Therefore, it can be determined (e.g., by the compiler) that a thread can be safely deactivated when both if (gl_LocalInvocationID.x < n_threads_active) and if (gl_LocalInvocationID.x == 0) are false. Thus, the inverses of these two conditions can be combined by a logical AND operation to give the overall condition for "early deactivation" of a thread. Therefore, in an embodiment of the present invention, an execution thread can be safely deactivated early when if (!(gl_LocalInvocationID.x < n_threads_active) && !(gl_LocalInvocationID.x == 0)) is true.
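The combined condition can be expressed as a small predicate and checked for the eight-thread case. The Python sketch below is illustrative: `can_exit_early` and `first_exit_point` are hypothetical helper names, `local_id` stands in for gl_LocalInvocationID.x, and the halving schedule of `n_threads_active` (8, 4, 2, 1) is taken from the description above.

```python
def can_exit_early(local_id, n_threads_active):
    """Early-deactivation condition: both original conditions are false."""
    return not (local_id < n_threads_active) and not (local_id == 0)


def first_exit_point(local_id, n_threads):
    """Value of n_threads_active at which a thread first satisfies the condition."""
    n_active = n_threads
    while n_active >= 1:
        if can_exit_early(local_id, n_active):
            return n_active
        n_active //= 2          # halved at each iteration of the loop
    return None                 # never satisfied (the thread writes the result)


exit_points = {tid: first_exit_point(tid, 8) for tid in range(8)}
print(exit_points)
# {0: None, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 4}
```

In this model, threads 4 to 7 can leave once n_threads_active reaches 4, threads 2 and 3 once it reaches 2, and thread 1 once it reaches 1, while thread 0 never satisfies the condition.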
[0328] In an embodiment of the present invention, therefore, a conditional branch is included in the program to cause an execution thread to be deactivated early when this "early deactivation" condition is met. The following pseudocode illustrates this.
[0329]
[0330]
[0331]
[0332] In this case, a thread that satisfies the determined early deactivation condition will jump to the end of the program and so be deactivated early. To account for the possibility of threads being deactivated early, the operation of the graphics processor 3 in response to barrier() is adjusted accordingly, as described above.
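The effect of the early-exit branch can be illustrated with a sequential Python model of the loop. This is not the pseudocode of the embodiment, which is not reproduced here; it is a sketch assuming that the exit check is made at the start of each loop iteration, that thread `tid` adds element `tid + n_active` to element `tid`, and that each surviving thread calls barrier() once per iteration. The function name `run_reduction` is hypothetical.

```python
def run_reduction(values, early_exit):
    """Return (sum of values, number of barrier() calls made by all threads)."""
    n_threads = len(values) // 2        # assumes a power-of-two input length
    data = list(values)
    alive = [True] * n_threads
    barrier_calls = 0
    n_active = n_threads
    while n_active >= 1:
        for tid in range(n_threads):    # threads modelled one after another
            if not alive[tid]:
                continue
            if early_exit and tid >= n_active and tid != 0:
                alive[tid] = False      # jump to end of program (deactivated)
                continue
            if tid < n_active:
                # Reads an index >= n_active, writes an index < n_active,
                # so running the threads sequentially gives the same result.
                data[tid] = data[tid] + data[tid + n_active]
            barrier_calls += 1          # this thread reaches barrier()
        n_active //= 2
    return data[0], barrier_calls


print(run_reduction(range(16), early_exit=False))  # (120, 32)
print(run_reduction(range(16), early_exit=True))   # (120, 15)
```

Both variants produce the same result, but with the early-exit branch the threads make 15 barrier calls in total instead of 32, which is why the barrier release condition has to track the number of threads that remain.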
[0333] Although the above has been specifically described with reference to early deactivation of an execution thread, in other embodiments, an execution thread can exit an iterative data processing operation early, remain active, and then continue with other processing.
[0334] Although the above has been described with particular reference to the “reduction” operation, other suitable operations such as prefix summation can also be performed in the manner of the present invention.
[0335] It should be understood from the above that the present invention, at least in its embodiments, provides an arrangement that reduces the processing required to perform iterative operations (such as reduction operations and prefix summations). This is achieved, in embodiments of the invention at least, by deactivating execution threads that are no longer needed to complete the iterative operation before all iterations of the iterative operation have been performed. In embodiments of the invention, this is facilitated by changing the barrier condition in response to execution threads being deactivated from the iterative operation.
[0336] The specific embodiments described above are presented for illustrative and descriptive purposes only. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in light of the teachings above. The described embodiments were chosen to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best utilize the techniques of the various embodiments, with various modifications suited to the particular use contemplated. The scope of the invention is intended to be defined by the appended claims.
Claims
1. A method of operating a data processing system, the data processing system comprising a data processor operable to execute a program to perform data processing operations, wherein the program can be executed by multiple execution threads simultaneously; the method comprising: including, in a program to be executed by the data processor, a set of one or more instructions that, when executed by execution threads of a set of multiple execution threads, will cause the execution threads in the set of multiple execution threads to operate together to perform an iterative data processing operation, the iterative data processing operation including multiple iterations, the multiple iterations including a first iteration and one or more subsequent iterations, and wherein: each iteration of the multiple iterations of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads each performing a corresponding data processing operation; the first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads each performing a corresponding data processing operation; each iteration of the one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration, the one or more other execution threads not performing a corresponding data processing operation; and each iteration of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation when at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, such that the at least one execution thread will not take part in all of the multiple iterations of the iterative data processing operation;
such that the set of one or more instructions will cause the execution threads in the set of multiple execution threads to operate together to perform the iterative data processing operation, such that at least one execution thread in the set of multiple execution threads performing the iterative data processing operation will: perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation; perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the one or more subsequent iterations of the iterative data processing operation; and then exit the iterative data processing operation when at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, such that the at least one execution thread will not take part in all of the multiple iterations of the iterative data processing operation;
the method further comprising, when a set of multiple execution threads is executing the program, in response to the set of one or more instructions: the execution threads in the set of multiple execution threads operating together to perform the iterative data processing operation; and at least one execution thread of the set of multiple execution threads performing the iterative data processing operation: performing a corresponding data processing operation in respect of the first iteration of the iterative data processing operation; performing a corresponding data processing operation in respect of each of zero or more subsequent iterations of the one or more subsequent iterations of the iterative data processing operation; and then exiting the iterative data processing operation when at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, such that the at least one execution thread will not take part in all of the multiple iterations of the iterative data processing operation.
2. The method of claim 1, wherein including the set of one or more instructions in the program comprises: including in the program one or more instructions to trigger an execution thread to exit the iterative data processing operation, in response to an instruction visible to an application programming interface (API) for the program.
3. The method of claim 1, wherein including the set of one or more instructions in the program comprises: automatically including in the program one or more instructions to trigger an execution thread to exit the iterative data processing operation.
4. The method according to claim 1, 2 or 3, wherein each subsequent iteration of the iterative data processing operation includes a portion of the execution threads that performed a corresponding data processing operation in the previous iteration each performing a corresponding data processing operation.
5. The method according to claim 1, 2 or 3, wherein the iterative data processing operation is a reduction operation or a prefix summation operation.
6. The method according to claim 1, 2, or 3, wherein the at least one execution thread exiting the iterative data processing operation comprises the at least one execution thread being deactivated, wherein the data processor is configured as a plurality of execution channels, and wherein each execution channel is operable to process an execution thread; the method comprising: when an execution thread of the set of multiple execution threads performing the iterative data processing operation is deactivated, issuing a new execution thread to the execution channel that was processing the deactivated execution thread, for performing another data processing operation while at least one iteration of the iterative data processing operation remains.
7. The method of claim 1, wherein each subsequent iteration of the iterative data processing operation is performed in response to a condition being satisfied; the method comprising: changing the condition when the at least one execution thread exits the iterative data processing operation.
8. The method of claim 7, wherein the data processor is operable to maintain a count representing the number of execution threads in the set of multiple execution threads that have not yet exited the iterative data processing operation; and wherein changing the condition includes updating the count.
9. The method according to any one of claims 1 to 3, the method comprising: determining a set of one or more conditions that will be satisfied by an execution thread executing the program when that execution thread can exit the iterative data processing operation without affecting the output of the program; and including in the program a set of one or more instructions that cause an execution thread executing the program to exit the iterative data processing operation when the set of one or more conditions is satisfied.
10. A data processing system, the data processing system comprising: a data processor operable to execute a program to perform data processing operations, wherein the program can be executed by multiple execution threads simultaneously; and a processing circuit configured to include, in a program to be executed by the data processor, a set of one or more instructions that, when executed by execution threads of a set of multiple execution threads, will cause the execution threads in the set of multiple execution threads to operate together to perform an iterative data processing operation, the iterative data processing operation including multiple iterations, the multiple iterations including a first iteration and one or more subsequent iterations, wherein: each iteration of the multiple iterations of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads each performing a corresponding data processing operation; the first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads each performing a corresponding data processing operation; each iteration of the one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration, the one or more other execution threads not performing a corresponding data processing operation; and each iteration of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation when at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, such that the at least one execution thread will not take part in all of the multiple iterations of the iterative data processing operation;
such that the set of one or more instructions will cause the execution threads in the set of multiple execution threads to operate together to perform the iterative data processing operation, such that at least one execution thread in the set of multiple execution threads performing the iterative data processing operation will: perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation; perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the one or more subsequent iterations of the iterative data processing operation; and then exit the iterative data processing operation when at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, such that the at least one execution thread will not take part in all of the multiple iterations of the iterative data processing operation;
wherein the data processor is configured such that, when a set of multiple execution threads is executing the program, in response to the set of one or more instructions: the execution threads in the set of multiple execution threads will operate together to perform the iterative data processing operation; and at least one execution thread of the set of multiple execution threads performing the iterative data processing operation will: perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation; perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the one or more subsequent iterations of the iterative data processing operation; and then exit the iterative data processing operation when at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, such that the at least one execution thread will not take part in all of the multiple iterations of the iterative data processing operation.
11. The system of claim 10, wherein the processing circuitry is configured to, in response to an instruction visible to an application programming interface for the program, include in the program one or more instructions that trigger an execution thread to exit the iterative data processing operation.
12. The system of claim 10, wherein the processing circuitry is configured to automatically include in the program one or more instructions that trigger an execution thread to exit the iterative data processing operation.
13. The system according to any one of claims 10 to 12, wherein each subsequent iteration of the iterative data processing operation includes a portion of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the portion performing a corresponding data processing operation.
14. The system according to any one of claims 10 to 12, wherein the iterative data processing operation is a reduction operation or a prefix sum operation.
15. The system according to any one of claims 10 to 12, wherein exiting the at least one execution thread from the iterative data processing operation comprises deactivating the at least one execution thread; wherein the data processor comprises a plurality of execution channels, each execution channel being operable to process an execution thread; and wherein the data processor is configured such that, when an execution thread in the set of multiple execution threads performing the iterative data processing operation is deactivated, a new execution thread is issued to the execution channel that was previously processing the deactivated execution thread, the new execution thread performing another data processing operation during the at least one remaining iteration of the iterative data processing operation.
16. The system of claim 10, wherein each subsequent iteration of the iterative data processing operation is performed in response to a condition being satisfied; and the data processor includes processing circuitry configured to change the condition when an execution thread exits the iterative data processing operation while at least one iteration of the iterative data processing operation remains.
17. The system of claim 16, wherein the data processor is operable to maintain a count representing the number of execution threads in the set of multiple execution threads that have not yet exited the iterative data processing operation; and wherein the processing circuitry is configured to change the condition by updating the count.
18. A data processor, the data processor comprising:
a programmable execution unit operable to execute a program to perform data processing operations, wherein the program can be executed simultaneously by multiple execution threads; and
a processing circuit configured such that, when a set of multiple execution threads is executing a program comprising a set of one or more instructions for performing an iterative data processing operation, the iterative data processing operation including multiple iterations, the multiple iterations including a first iteration and one or more subsequent iterations, wherein:
each iteration of the multiple iterations of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads each performing a corresponding data processing operation;
the first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads each performing a corresponding data processing operation;
each iteration of the one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration, wherein the one or more other execution threads do not perform a corresponding data processing operation; and
each iteration of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation while at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, so that the at least one execution thread will not participate in all of the multiple iterations of the iterative data processing operation;
in response to the set of one or more instructions:
the execution threads in the set of multiple execution threads will operate together to perform the iterative data processing operation; and
at least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation will:
perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation;
perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the one or more subsequent iterations of the iterative data processing operation; and then
exit the iterative data processing operation while at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, so that the at least one execution thread will not participate in all of the multiple iterations of the iterative data processing operation.
19. A method for compiling a program to be executed by a data processor, the method comprising:
including in the program a set of one or more instructions that, when executed by execution threads in a set of multiple execution threads, cause the execution threads in the set of multiple execution threads to operate together to perform an iterative data processing operation, the iterative data processing operation including multiple iterations, the multiple iterations including a first iteration and one or more subsequent iterations, wherein:
each iteration of the multiple iterations of the iterative data processing operation will include one or more execution threads from the set of multiple execution threads each performing a corresponding data processing operation;
the first iteration of the iterative data processing operation will include all execution threads in the set of multiple execution threads each performing a corresponding data processing operation;
each iteration of the one or more subsequent iterations of the iterative data processing operation will include: a subset of one or more of the execution threads that performed a corresponding data processing operation in the previous iteration, each execution thread of the subset performing a corresponding data processing operation; and one or more other execution threads that performed a corresponding data processing operation in the previous iteration, wherein the one or more other execution threads do not perform a corresponding data processing operation; and
each iteration of one or more iterations of the iterative data processing operation will include at least one execution thread from the set of multiple execution threads exiting the iterative data processing operation while at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, so that the at least one execution thread will not participate in all of the multiple iterations of the iterative data processing operation;
such that the execution threads in the set of multiple execution threads will operate together to perform the iterative data processing operation, and at least one execution thread in the set of multiple execution threads that is performing the iterative data processing operation will:
perform a corresponding data processing operation in respect of the first iteration of the iterative data processing operation;
perform a corresponding data processing operation in respect of each of zero or more subsequent iterations of the one or more subsequent iterations of the iterative data processing operation; and then
exit the iterative data processing operation while at least one iteration of the multiple iterations remains in respect of which that execution thread will not perform a corresponding data processing operation, so that the at least one execution thread will not participate in all of the multiple iterations of the iterative data processing operation.
20. The method of claim 19, the method comprising: determining a set of one or more conditions which, when satisfied by an execution thread executing the program, indicate that the execution thread can exit the iterative data processing operation without affecting the output of the program; and including in the program a set of one or more instructions that cause an execution thread executing the program to exit the iterative data processing operation when the set of one or more conditions is satisfied.
21. A computer program product comprising computer software code, which, when run on a data processing device, executes the method according to any one of claims 1 to 3.
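For illustration only, the iterative operation recited in claims 10, 14, 16 and 17 can be sketched as a sequential Python simulation of a tree reduction. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `reduce_with_early_exit`, the `partial` list standing in for memory shared by the thread group, the restriction to a power-of-two thread count, and the choice that each thread first sums two inputs are all hypothetical. The variable `remaining` plays the role of the count of threads that have not yet exited, which a real barrier between iterations would wait on.

```python
def reduce_with_early_exit(inputs):
    """Simulate a sum reduction in which threads exit once they have no work left.

    Returns (total, active_per_iteration), where active_per_iteration[i] is the
    number of threads that performed an operation in iteration i.
    """
    n_threads = len(inputs) // 2
    # Assumption of this sketch: 2 * (power of two) inputs, one thread per pair.
    if n_threads < 1 or len(inputs) != 2 * n_threads or n_threads & (n_threads - 1):
        raise ValueError("sketch assumes 2 * (power of two) inputs")

    partial = [0] * n_threads   # hypothetical stand-in for shared memory
    remaining = n_threads       # threads that have not yet exited the operation
    active_per_iteration = []

    # First iteration: every thread in the set performs an operation.
    active_per_iteration.append(remaining)
    for t in range(remaining):
        partial[t] = inputs[t] + inputs[t + n_threads]

    # Subsequent iterations: only a subset of the previous iteration's threads work.
    while remaining > 1:
        # The upper half of the surviving threads has no further work, so those
        # threads exit now. Updating the count changes the condition a barrier
        # would wait on, so the barrier no longer waits for exited threads.
        remaining //= 2
        active_per_iteration.append(remaining)
        for t in range(remaining):
            partial[t] += partial[t + remaining]

    return partial[0], active_per_iteration


if __name__ == "__main__":
    total, active = reduce_with_early_exit([1, 2, 3, 4, 5, 6, 7, 8])
    print(total, active)
```

In this sketch, four threads reducing eight inputs perform operations in 4, 2 and 1 thread slots over three iterations, compared with 12 slots if every thread stayed resident for all three iterations. The slots freed by exited threads are the ones to which, under claim 15, a new execution thread could be issued.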