Parameter buffer-based wave throttling

By selectively throttling geometry shader waves based on in-flight and pending work, the control circuit addresses cache thrashing issues, improving graphics pipeline performance through optimized resource management.

JP7876543B2 · Active · Publication Date: 2026-06-19 · ADVANCED MICRO DEVICES INC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: ADVANCED MICRO DEVICES INC
Filing Date: 2022-02-21
Publication Date: 2026-06-19

Abstract

The graphics pipeline (300) includes a first shader (305) that generates first waves, a shader processor input (SPI) (310) that launches the first waves for execution, and a scan converter (360) that generates second waves for execution based on the results of processing the first waves in one or more shaders. The first waves are selectively throttled based on a comparison of the first waves in flight and the second waves pending execution on at least one second shader. A cache (340) holds information that is written to the cache in response to the first waves finishing execution on the shaders. Information is read from the cache in response to read requests issued by the second waves. In some cases, the first waves are selectively throttled by comparing how many of the first waves are in flight and how many read requests to the cache are pending.

Description

Background Art

[0001] A graphics processing unit (GPU) implements a graphics processing pipeline that simultaneously processes copies of commands fetched from a command buffer. GPUs and other multi-threaded processing units typically implement multiple processing elements (also referred to as processor cores or compute units) that simultaneously execute multiple instances of a single program for multiple data sets as a single wave. A hierarchical execution model is used to conform to the hardware-implemented hierarchy. The execution model defines a kernel of instructions to be executed by all waves (also referred to as wavefronts, threads, streams, or work items). The graphics pipeline within a GPU includes one or more shaders that execute using resources of the graphics pipeline such as compute units, memory, and caches. The graphics pipeline is typically divided into a geometry portion that performs geometry operations on patches or other primitives representing portions of an image. Shaders within the geometry portion can include vertex shaders, hull shaders, domain shaders, and geometry shaders. The geometry portion of the graphics pipeline completes when the primitives generated by the geometry portion of the pipeline are rasterized (e.g., by one or more scan converters) to form a set of pixels representing a portion of the image. Subsequent processing on the pixels, called pixel processing, includes operations executed by shaders such as pixel shaders that execute using resources of the graphics pipeline.

[0002] The present disclosure will be better understood by reference to the accompanying drawings, and many of its features and advantages will become apparent to those skilled in the art. The use of the same reference numerals in different drawings indicates similar or identical items.

Brief Description of the Drawings

[0003]
[Figure 1] A block diagram of a processing system according to several embodiments.
[Figure 2] A graphics pipeline, according to several embodiments, that can process higher-order geometric primitives to generate a rasterized image of a three-dimensional (3D) scene at a predetermined resolution.
[Figure 3] A block diagram of a portion of a graphics pipeline that selectively throttles a wave or group of waves triggered by a geometry shader, according to several embodiments.
[Figure 4] A block diagram of a first embodiment of a control circuit for selectively throttling a wave or a group of waves, according to several embodiments.
[Figure 5] A block diagram of a second embodiment of a control circuit for selectively throttling a wave or a group of waves, according to several embodiments.
[Figure 6] A flowchart illustrating a method for selectively initiating geometry shader waves or wave groups according to several embodiments.

Modes for Carrying Out the Invention

[0004] Before dispatching a wave set (containing one or more waves) for processing by one or more shaders implemented by the compute units in the shader hub, the geometry engine reserves space in memory or a cache to store the output generated by processing the wave set in the shaders. For example, the geometry engine may send a reservation request to the PC manager for space to hold a parameter buffer. The PC manager reserves the requested space in a level 2 (L2) cache and returns information to the geometry engine identifying the reserved space in the L2 cache. Upon receiving the reservation confirmation, the geometry engine provides the wave set to the shader processor input (SPI), which then launches the wave set for processing by the compute units in the shader hub. Attributes from the shader output are stored in the reserved space in the L2 cache. The location of the attributes is provided to the primitive assembler, which assembles the primitives (triangles, etc.) and sends them to the scan converter via the primitive hub. The scan converter generates pixel waves, which are returned to the SPI, and the SPI fetches the attributes from the L2 cache. The compute units within the shader hub then perform pixel processing on the pixel waves using the attributes retrieved from the L2 cache. Therefore, there is a dependency between the geometry shader waves generated by the geometry engine and the pixel waves generated by the scan converter. This dependency can lead to excessive cache thrashing and degrade the performance of the graphics pipeline if the geometry engine launches too many waves and they write too much data to the L2 cache.

[0005] Figures 1-6 disclose systems and techniques for reducing thrashing of a cache shared by geometry shaders and pixel shaders by selectively throttling geometry shader (GS) wave sets (or wave sets associated with other shaders, such as vertex shaders) launched by the shader processor input (SPI) based on a comparison of in-flight GS work and pending pixel shader (PS) work generated by a scan converter. The scan converter provides the SPI with requests to read information from the cache. Several embodiments of the management circuit maintain counters for the following three events: (1) a first counter for launches of GS wave sets, (2) a second counter for GS wave sets that have finished execution on the shader by writing to the cache, and (3) a third counter for the number of requests to read from the cache for PS waves generated by the scan converter. The counters are incremented in response to corresponding events written to a windowing first-in-first-out (FIFO) buffer and decremented in response to corresponding events read from the windowing FIFO. The management circuit determines the amount of in-flight GS work based on the difference between the first counter and the second counter. The management circuit determines the amount of pending PS work based on the difference between the second counter and the third counter. If the amount of in-flight GS work is greater than the amount of pending PS work, the management circuit throttles the wave sets launched by the SPI. Otherwise, the SPI can freely launch wave sets according to an algorithm such as a greedy algorithm. In some embodiments, the throttling criterion is modified so that the management circuit throttles the wave sets launched by the SPI when the amount of in-flight GS work is greater than the amount of pending PS work plus an additional factor, thereby reducing the possibility that throttling will starve the graphics pipeline of work. For example, the additional factor can be determined based on a measure of the burstiness of the read requests issued for pending PS work.
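The comparison described in paragraph [0005] can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit itself: the function name `should_throttle`, its parameter names, and the sample values are hypothetical and do not appear in the patent text.

```python
def should_throttle(launched, finished, read_requests, extra_factor=0):
    """Return True when launching of GS wave sets should be throttled.

    launched      -- first counter: GS wave sets launched
    finished      -- second counter: GS wave sets that finished by writing to the cache
    read_requests -- third counter: cache read requests for PS waves
    extra_factor  -- optional burstiness allowance (0 gives the basic criterion)
    """
    in_flight_gs = launched - finished      # GS work still executing
    pending_ps = finished - read_requests   # written to the cache, not yet read
    return in_flight_gs > pending_ps + extra_factor


# Hypothetical counter values: 6 wave sets in flight, 2 results waiting to be read.
basic = should_throttle(launched=10, finished=4, read_requests=2)
relaxed = should_throttle(launched=10, finished=4, read_requests=2, extra_factor=5)
print(basic, relaxed)
```

With these sample values the basic criterion throttles (6 > 2), while an additional factor of 5 lets the SPI keep launching (6 is not greater than 7).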

[0006] Figure 1 is a block diagram of a processing system 100 in several embodiments. The processing system 100 includes, or has access to, memory 105 or other storage components implemented using a non-transitory computer-readable storage medium such as Dynamic Random-Access Memory (DRAM). However, in some cases, memory 105 may also be implemented using other types of memory, including Static Random-Access Memory (SRAM), non-volatile RAM, etc. Memory 105 is called external memory because it is implemented outside the processing units implemented in the processing system 100. The processing system 100 also includes a bus 110 to support communication between entities implemented in the processing system 100, such as memory 105. Some embodiments of the processing system 100 include other buses, bridges, switches, routers, etc., which are not shown in Figure 1 for clarity.

[0007] The techniques described herein are used in various embodiments with any of the various parallel processors (e.g., vector processors, graphics processing units (GPUs), general-purpose GPUs (GPGPUs), non-scalar processors, highly parallel processors, artificial intelligence (AI) processors, inference engines, machine learning processors, and other multithreaded processing units). Figure 1 shows an example of a parallel processor, in particular a graphics processing unit (GPU) 115, according to several embodiments. The graphics processing unit (GPU) 115 renders an image for presentation on a display 120. For example, the GPU 115 renders an object to generate pixel values to be provided to the display 120, and the display 120 uses the pixel values to display an image representing the rendered object. The GPU 115 implements a number of compute units (CUs) 121, 122, 123 (collectively referred to herein as "Compute Units 121-123") that execute instructions simultaneously or in parallel. In some embodiments, the compute units 121-123 include one or more single-instruction-multiple-data (SIMD) units, and the compute units 121-123 are aggregated into a workgroup processor, shader array, shader engine, etc. The number of compute units 121-123 implemented in the GPU 115 is a matter of design choice, and some embodiments of the GPU 115 include more or fewer compute units than those shown in Figure 1. The compute units 121-123 can be used to implement a graphics pipeline as described herein. Some embodiments of the GPU 115 are used for general-purpose computing. The GPU 115 executes instructions such as program code 125 stored in memory 105, and the GPU 115 stores information such as the results of the executed instructions in memory 105.

[0008] The processing system 100 also includes a Central Processing Unit (CPU) 130 connected to a bus 110 and therefore communicating with the GPU 115 and memory 105 via the bus 110. The CPU 130 implements a plurality of processor cores 131, 132, 133 (collectively referred to herein as "processor cores 131-133") that execute instructions simultaneously or in parallel. The number of processor cores 131-133 implemented in the CPU 130 is a matter of design choice, and some embodiments include more or fewer processor cores than those shown in Figure 1. The processor cores 131-133 execute instructions such as program code 135 stored in memory 105, and the CPU 130 stores information such as the results of the executed instructions in memory 105. The CPU 130 can also start graphics processing by issuing a draw call to the GPU 115. Some embodiments of the CPU 130 implement multiple processor cores (not shown in Figure 1 for clarity) that execute instructions independently, either simultaneously or in parallel.

[0009] The Input/Output (I/O) engine 145 handles input or output operations associated with the display 120, as well as other elements of the processing system 100, such as a keyboard, mouse, printer, and external disk. The I/O engine 145 is coupled to the bus 110 so that it can communicate with memory 105, GPU 115, or CPU 130. In the illustrated embodiment, the I/O engine 145 reads information stored in an external storage component 150, which is implemented using a non-transitory computer-readable storage medium such as a Compact Disk (CD) or Digital Video Disc (DVD). The I/O engine 145 can also write information, such as the results of processing by the GPU 115 or CPU 130, to the external storage component 150.

[0010] The processing system 100 implements a pipeline circuit for executing instructions in multiple stages of the pipeline. The pipeline circuit is implemented in several embodiments of the compute units 121-123 or processor cores 131-133. In some embodiments, the pipeline circuit including compute units 121-123 is used to implement a graphics pipeline that executes different types of shaders, including, but not limited to, vertex shaders, hull shaders, domain shaders, geometry shaders, and pixel shaders. Some embodiments of the processing system 100 include one or more caches that hold information written to the cache by the shader in response to the completion of the execution of a wave or a group of waves, such as a geometry shader wave or group of waves. The information written to the cache is then read out during the execution of other waves or groups of waves, such as pixel shader waves. Some embodiments of the geometry shader generate a first group of waves, and the shader processor input (SPI) launches the first group of waves for execution by the shader. A scan converter generates second waves for execution on the shader based on the results of processing the first group of waves in the one or more shaders. The first group of waves is selectively throttled based on a comparison between the first waves that are in flight and the second waves that are pending execution on at least one shader. The cache holds information that is written to the cache as the first group of waves finishes execution on the shader. This information is read from the cache in response to read requests issued by the second waves. In some cases, the first group of waves is selectively throttled by comparing how many first waves are in flight and how many read requests to the cache are pending.

[0011] Figure 2 shows a graphics pipeline 200, in several embodiments, capable of processing higher-order geometry primitives to generate a rasterized image of a three-dimensional (3D) scene at a predetermined resolution. The graphics pipeline 200 is implemented in several embodiments of the processing system 100 shown in Figure 1. The illustrated embodiments of the graphics pipeline 200 are implemented according to the DX11 specification. Other embodiments of the graphics pipeline 200 are implemented according to other Application Programming Interfaces (APIs) such as Vulkan, Metal, and DX12. The graphics pipeline 200 is subdivided into a geometry processing section 201, which includes the portions of the graphics pipeline 200 before rasterization, and a pixel processing section 202, which includes the portions of the graphics pipeline 200 after rasterization.

[0012] The graphics pipeline 200 has access to storage resources 205, such as one or more memory or cache hierarchies used to implement buffers and store vertex data, texture data, etc. In the illustrated embodiment, the storage resource 205 includes a local data store (LDS) 206 circuit used to store data. The storage resource 205 also includes one or more caches 207 for caching frequently used data. Caches 207 are used to implement parameter buffers. As described herein, a wave or wave group executing on a shader in the graphics pipeline 200 finishes execution by writing the results of processing the wave or wave group to the cache 207. Shaders further downstream in the graphics pipeline 200 can issue read requests to read information from the cache 207, such as the results of processing by a wave or wave group that previously finished execution on the shader. The storage resource 205 may be implemented using some embodiment of the memory 105 shown in Figure 1.

[0013] The input assembler 210 accesses information from the storage resource 205, which is used to define objects that represent parts of the scene model. An example of a primitive is shown in Figure 2 as a triangle 211, but in some embodiments of the graphics pipeline 200, other types of primitives are processed. The triangle 211 contains one or more vertices 212 connected by one or more edges 214 (for clarity, only one of each is shown in Figure 2). The vertices 212 are shaded during the geometry processing section 201 of the graphics pipeline 200.

[0014] The vertex shader 215, implemented in software in the illustrated embodiment, logically receives a single vertex 212 of a primitive as input and outputs a single vertex. Some embodiments of shaders, such as the vertex shader 215, perform single-instruction-multiple-data (SIMD) processing so that multiple vertices are processed simultaneously. The graphics pipeline 200 implements a unified shader model so that all shaders included in the graphics pipeline 200 have the same execution platform on a shared large-scale SIMD computing unit. Thus, shaders, including the vertex shader 215, are implemented using a common set of resources referred to herein as the unified shader pool 216.

[0015] The hull shader 218 operates on an input higher-order patch or control point used to define the input patch. The hull shader 218 outputs tessellation coefficients and other patch data, such as control points for the patch processed in the hull shader 218. The tessellation coefficients are stored in the storage resource 205 so that they can be accessed by other entities in the graphics pipeline 200.

[0016] The tessellator 220 receives objects (such as patches) from the hull shader 218. In some embodiments, primitives generated by the hull shader 218 are provided to the tessellator 220. The tessellator 220 generates information identifying primitives corresponding to the input objects by tessellating the input objects, for example, based on tessellation coefficients generated by the hull shader 218. Tessellation subdivides input higher-order primitives, such as patches, into a set of lower-order output primitives representing finer levels of detail, as indicated, for example, by tessellation coefficients that specify the granularity of the primitives generated by the tessellation process. Thus, the scene model is represented by fewer higher-order primitives (to save memory or bandwidth), and additional detail is added by tessellating the higher-order primitives.

[0017] The domain shader 224 receives a domain location and (optionally) other patch data as input. The domain shader 224 operates on the provided information and generates a single vertex for output based on the input domain location and other information. In the illustrated embodiment, the domain shader 224 generates a primitive 222 based on the triangle 211 and tessellation coefficients. The domain shader 224 launches the primitive 222 when processing is complete.

[0018] The geometry shader 226 receives input primitives from the domain shader 224 and outputs up to four primitives (for each input primitive) generated by the geometry shader 226 based on the input primitives. In the illustrated embodiment, the geometry shader 226 generates an output primitive 228 based on the tessellated primitive 222. Some embodiments of the geometry shader 226 generate a wave set (referred to herein as the "GS wave set") which is invoked by the corresponding shader processor input (SPI, not shown in Figure 2 for clarity). Upon completion of execution on the shader engine, the wave set writes its output back to the cache 207.

[0019] One stream of primitives is provided to one or more scan converters 230, and in some embodiments, up to four streams of primitives are concatenated into a buffer in a storage resource 205. The scan converters 230 perform operations such as shading, clipping, perspective division, scissoring, and viewport selection. The scan converters 230 generate a set of pixels 232 that will be processed later in the pixel processing section 202 of the graphics pipeline 200. Some embodiments of the scan converters 230 provide a request to read information from the cache 207, for example, by sending a request to an SPI implemented in the graphics pipeline 200.

[0020] In the illustrated embodiment, the pixel shader 234 takes a pixel flow (for example, including a set of pixels 232) as input and outputs zero or another pixel flow in response to the input pixel flow. The output merger block 236 performs blending, depth, stenciling, or other operations on the pixels received from the pixel shader 234.

[0021] Some or all of the shaders in the graphics pipeline 200 perform texture mapping using texture data stored in the storage resource 205. For example, a pixel shader 234 can read texture data from the storage resource 205 and use that texture data to shade one or more pixels. The shaded pixels are then provided to the display for presentation to the user.

[0022] Figure 3 is a partial block diagram of a graphics pipeline 300 that selectively throttles a wave or group of waves triggered by a geometry shader, according to several embodiments. The graphics pipeline 300 is implemented in several embodiments of the processing system 100 shown in Figure 1 and the graphics pipeline 200 shown in Figure 2.

[0023] The geometry engine 305 generates waves or wave groups for the geometry shader. Therefore, the waves or wave groups generated by the geometry engine 305 are referred to as GS wave groups. However, in some embodiments, the waves or wave groups are generated by or for other shaders, such as vertex shaders, in which case the waves or wave groups are referred to by other names, such as VS wave groups. The geometry engine 305 provides the GS wave groups to the SPI 310, which selectively activates or throttles the GS wave groups as described herein. The geometry engine 305 also provides information to the management circuit 315 for signaling the activation of the GS wave groups, as indicated by arrow 320. The management circuit 315 increments a first counter 325 in response to activation events being written to the windowing buffer 330. The management circuit 315 also decrements the first counter 325 in response to activation events being read from the windowing buffer 330.

[0024] The SPI 310 launches a group of GS waves for execution on one or more shaders within the shader hub 335. The group of GS waves is executed by the shader hub 335, and upon completion of execution, the group of GS waves writes the results to the cache 340. In response to the results being written to the cache 340, the shader hub 335 sends an indication of the completion of the group of GS waves to the SPI 310, and the SPI 310 sends a signal (referred to herein as an "end" signal or a "completion" signal) to the management circuit 315 to indicate that the group of GS waves has completed execution, as indicated by arrow 345. The management circuit 315 increments a second counter 326 in response to the execution end event being written to the windowing buffer 330. The management circuit 315 also decrements the second counter 326 in response to the execution end event being read from the windowing buffer 330.

[0025] The primitive assembler 350 generates primitives by processing a group of GS waves and provides the primitives to the crossbar 355 (also called the primitive hub), which provides the assembled primitives to the scan converter 360. The scan converter 360 generates pixel shader (PS) waves for execution by shaders in the shader hub 335. The scan converter 360 therefore signals the SPI 310, as indicated by arrow 365, so that the SPI 310 can launch the PS waves for execution in the shader hub 335. The SPI 310 also generates a read request to read information from the cache 340, which is used by the shader hub 335 to process the PS waves. In response to generating the read request, the SPI 310 sends a signal to the management circuit 315, as indicated by arrow 370, indicating that the read request is pending for the cache 340. The management circuit 315 increments a third counter 327 when a read request event is written to the windowing buffer 330. The management circuit 315 also decrements the third counter 327 when a read request event is read from the windowing buffer 330. Read requests do not leave the SPI 310 until the SPI 310 receives a "GS wave complete" signal from the management circuit 315.

[0026] Some embodiments of the management circuit 315 selectively throttle launches from the SPI 310 (or instruct the SPI 310 to selectively throttle launches) based on a comparison of the number of in-flight GS wave groups and the number of pending PS waves. The management circuit 315 determines a first number of in-flight first wave groups based on the difference between the first counter 325 and the second counter 326. The management circuit 315 also determines a second number of PS waves that are pending execution on the shaders within the shader hub 335 based on the difference between the second counter 326 and the third counter 327. The management circuit 315 throttles the GS wave group (or instructs the SPI 310 to throttle) in response to the first number being greater than the second number, consistent with the criterion stated in paragraphs [0005] and [0035]. Some embodiments of the management circuit 315 determine an additional "burstiness" factor that is applied to reduce the likelihood that throttling the GS wave group will starve the graphics pipeline 300 of work. This additional factor is determined based on an estimate of the burstiness of read requests associated with the PS waves. In that case, the management circuit 315 throttles the GS wave group (or instructs the SPI 310 to throttle) in response to the first number being greater than the sum of the second number and the additional burstiness factor.

[0027] Figure 4 is a block diagram of a first embodiment of a management circuit 400 that selectively throttles waves or wave groups, according to some embodiments. The first embodiment of the management circuit 400 is used to implement some embodiments of the management circuit 315 shown in Figure 3. The management circuit 400 receives information associated with an event from an event generation circuit 405. In some embodiments, the information includes signaling indicating a startup event, an end execution event, a read request event, etc.

[0028] The windowing buffer 410 stores information representing events in entries within the windowing buffer 410. Some embodiments of the windowing buffer 410 are implemented as first-in, first-out (FIFO) buffers such that events received from the event generation circuit 405 are added (or pushed) to the last entry of the windowing buffer 410 and removed (or popped) from the first entry of the windowing buffer 410.

[0029] The management circuit 400 includes a set of counters 415 used to count events in response to entries added to the windowing buffer 410. In the illustrated embodiment, the set 415 includes a startup counter 416 that counts GS waves or wave groups that are launched to execute in one or more shaders, a generation counter 417 that counts GS waves or wave groups that are completed by writing to the cache, and a consumption counter 418 that counts read requests to the cache for, for example, PS waves. The startup counter 416, generation counter 417, and consumption counter 418 are incremented in response to corresponding events added to the windowing buffer 410. The counters 416-418 in the set 415 are reset to 0 (or other predetermined value) when idle.

[0030] The management circuit 400 also includes a set of counters 420 used to count the number of startup events, generation events, and consumption events contained in the windowing buffer 410. The set 420 includes a startup event counter 421 which increments by the number of startup events written to the windowing buffer 410 and decrements by the number of startup events read from the windowing buffer 410. The set 420 also includes a generation event counter 422 which increments by the number of generation events written to the windowing buffer 410 and decrements by the number of generation events read from the windowing buffer 410. The set 420 further includes a consumption event counter 423 which increments by the number of consumption events written to the windowing buffer 410 and decrements by the number of consumption events read from the windowing buffer 410.
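The windowed event counters of paragraph [0030] can be modelled in software roughly as follows. This is only a sketch: the class name `WindowedEventCounters`, the fixed `window_size`, and the policy of reading the oldest entry whenever the buffer is full are assumptions made for illustration, since the text does not say when entries are read from the windowing buffer.

```python
from collections import Counter, deque


class WindowedEventCounters:
    """Per-type counts of the events currently held in a FIFO window."""

    def __init__(self, window_size):
        self.window_size = window_size  # assumed fixed capacity
        self.fifo = deque()             # stands in for the windowing buffer
        self.in_window = Counter()      # startup / generation / consumption counts

    def write(self, event):
        # Assumption: the oldest entry is read out when the buffer is full.
        if len(self.fifo) == self.window_size:
            self.read()
        self.fifo.append(event)
        self.in_window[event] += 1      # increment on write

    def read(self):
        event = self.fifo.popleft()
        self.in_window[event] -= 1      # decrement on read
        return event


counters = WindowedEventCounters(window_size=4)
for event in ["launch", "launch", "produce", "consume", "launch"]:
    counters.write(event)
print(dict(counters.in_window))
```

After the fifth write the oldest launch event has left the window, so the window holds two launch events, one produce event, and one consume event.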

[0031] The management circuit 400 further includes an event run counter 425 for each event type (e.g., launch events, production events, and consumption events). The event run counter 425 counts the burstiness of each event. The event run counter 425 for an event increments by one each time the event run is broken on the write side of the windowing buffer 410. For example, if there are 50 launch events with no production or consumption events, the LaunchRunCounter in the event run counter 425 will increment by one. After 50 launches, if there are 50 launch events and 50 production events in the next 50 cycles, the LaunchRunCounter will have a value of 51 and the ProduceRunCounter will have a value of 50.
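The run counting in paragraph [0031] can be illustrated with a short sketch. Two assumptions are made here and are not stated in the text: a run is counted when it starts (that is, when the event type written differs from the previously written type), and the 50 later launch and production events are interleaved as produce, launch, produce, launch, and so on. Under those assumptions the sketch reproduces the values 51 and 50 given in the example. The function name `run_counts` is hypothetical.

```python
from collections import Counter


def run_counts(events):
    """Count runs per event type; a new run starts when the type changes."""
    runs = Counter()
    previous = None
    for event in events:
        if event != previous:
            runs[event] += 1
            previous = event
    return runs


# 50 back-to-back launches, then 50 produce/launch pairs (assumed ordering).
events = ["launch"] * 50 + ["produce", "launch"] * 50
runs = run_counts(events)
print(runs["launch"], runs["produce"])
```

Dividing each event count by its run count then gives the AverageBurst metric: about 1.96 for the 100 launch events in 51 runs, and 1.0 for the 50 production events in 50 runs.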

[0032] The management circuit 400 uses the values of counters 415, 420, and 425 to calculate parameters indicating the burstiness of events. The average burst circuit 430 calculates a per-event metric as follows: AverageBurst=EventCount / EventRunCount

[0033] The high-rate circuit 435 calculates the metric for each event as follows: HighRate=EventCount+EventAverageBurst

[0034] The values generated by the average burst circuit 430 and the high-rate circuit 435 are provided to the startup decision circuit 440, which uses this information in combination with the values of counters 415, 420, and 425 to selectively throttle the startup of a GS wave or wave group.

[0035] In some embodiments of the startup decision circuit 440, the startup of a GS wave or wave group is selectively throttled based on a comparison of in-flight GS work and pending PS work. In-flight GS work (WorkInFlight) is estimated based on the difference between the values of the startup counter 416 and the generation counter 417. Pending PS work (WorkReady) is estimated based on the difference between the values of the generation counter 417 and the consumption counter 418. The startup decision circuit 440 throttles the startup of a GS wave or wave group if the in-flight GS work is greater than the pending PS work. In some embodiments, the startup decision circuit 440 throttles the startup of a GS wave or wave group if the following criteria are met: WorkInFlight>WorkReady+HighRate[Read]

[0036] An additional factor (HighRate[Read]) is included to account for the potential burstiness of the pending PS work, such as read requests for PS waves.
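The first-embodiment criterion can be put together in a few lines. This is a hedged sketch, not the startup decision circuit 440 itself: it reads HighRate as the event count plus that event's AverageBurst, takes all counts as plain inputs, and returns an AverageBurst of 0 when no run has been recorded (an assumption made to avoid dividing by zero). The function names and sample values are hypothetical.

```python
def average_burst(event_count, event_run_count):
    # AverageBurst = EventCount / EventRunCount (0.0 if no run seen: assumption)
    return event_count / event_run_count if event_run_count else 0.0


def high_rate(event_count, event_run_count):
    # HighRate = EventCount + AverageBurst for the same event type
    return event_count + average_burst(event_count, event_run_count)


def throttle_first_embodiment(launch, produce, consume, read_count, read_runs):
    work_in_flight = launch - produce   # startup counter minus generation counter
    work_ready = produce - consume      # generation counter minus consumption counter
    return work_in_flight > work_ready + high_rate(read_count, read_runs)


# Hypothetical values: 30 in flight, 5 ready, HighRate[Read] = 8 + 8/4 = 10.
busy = throttle_first_embodiment(40, 10, 5, read_count=8, read_runs=4)
calm = throttle_first_embodiment(20, 10, 5, read_count=8, read_runs=4)
print(busy, calm)
```

In the first call 30 exceeds 5 + 10, so launches are throttled; in the second call 10 does not exceed 15, so launching continues.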

[0037] Figure 5 is a block diagram of a second embodiment of a management circuit 500 for selectively throttling a wave or group of waves, according to several embodiments. The second embodiment of the management circuit 500 is used to implement several embodiments of the management circuit 315 shown in Figure 3. The management circuit 500 receives information associated with events from the event generation circuit 505. In some embodiments, the information includes signaling indicating start events, end execution events, read request events, etc. The management circuit 500 includes a set of counters 515 used to count events in accordance with entries added to the windowing buffer 510. The set 515 includes a start counter 516, a generation counter 517, and a consumption counter 518. The management circuit 500 also includes a set of counters 520 (counters 521, 522, and 523) used to count the number of start events, generation events, and consumption events contained in the windowing buffer 510. The event run counter 525 counts the burstiness of each event, including start events, generation events, and consumption events.

[0038] The control circuit 500 uses the values of counters 515, 520, and 525 to calculate parameters that indicate the burstiness of events. The average burst circuit 530 calculates the following metric for each event type: AverageBurst = EventCount / EventRunCount

[0039] The high-rate circuit 535 calculates the following metric for each event type: HighRate = EventCount + AverageBurst

[0040] The low-rate circuit 540 calculates the following metric for each event type: LowRate = EventCount - AverageBurst
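As a worked illustration of the three metrics in paragraphs [0038] to [0040], the following Python sketch computes them for one event type. The function name and the handling of a zero run count are assumptions; the text does not say what the circuits output when no runs have been recorded.

```python
def burst_metrics(event_count: int, event_run_count: int) -> tuple[float, float, float]:
    """Return (average_burst, high_rate, low_rate) for one event type."""
    # Assumption: treat the average burst as zero when no runs were recorded,
    # which avoids a division by zero.
    average_burst = event_count / event_run_count if event_run_count else 0.0
    high_rate = event_count + average_burst   # optimistic estimate
    low_rate = event_count - average_burst    # pessimistic estimate
    return average_burst, high_rate, low_rate
```

For instance, 12 events arriving in 3 runs give an average burst of 4, a high rate of 16, and a low rate of 8.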

[0041] The values generated by the average burst circuit 530, the high-rate circuit 535, and the low-rate circuit 540 are provided to the startup decision circuit 545, which uses this information in combination with the values of counters 515, 520, and 525 to selectively throttle the startup of a GS wave or wave group.

[0042] Some embodiments of the startup decision circuit 545 selectively throttle the startup of a GS wave or wave group based on a comparison of in-flight GS work and pending PS work. In-flight GS work (WorkInFlight) is estimated as the difference between the values of the start counter 516 and the generation counter 517. Pending PS work (WorkReady) is estimated as the difference between the values of the generation counter 517 and the consumption counter 518. In the illustrated embodiment, the startup decision circuit 545 defines the consumption rate as follows: ConsumeRate = HighRate[Consume] - LowRate[Produce]

[0043] Next, the startup decision circuit 545 estimates or predicts the amount of work that will be ready, for example, using the following definition: ReadyForecast = WorkReady - ConsumeRate

[0044] The startup decision circuit 545 throttles the startup of the GS wave or wave group if the following criterion is met: WorkInFlight > LowRate[Launch] - ReadyForecast

[0045] If this criterion is not met, an additional GS wave or wave group is started.
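The chain of definitions in paragraphs [0042] to [0044] can be followed step by step in a short Python sketch. The function and parameter names are hypothetical, and the inputs are assumed to be precomputed values of WorkInFlight, WorkReady, HighRate[Consume], LowRate[Produce], and LowRate[Launch].

```python
def should_throttle_forecast(work_in_flight: float,
                             work_ready: float,
                             high_rate_consume: float,
                             low_rate_produce: float,
                             low_rate_launch: float) -> bool:
    """Apply the forecast-based throttling criterion of the second embodiment."""
    # Net rate at which ready work is expected to drain.
    consume_rate = high_rate_consume - low_rate_produce
    # Predicted amount of work that will still be ready.
    ready_forecast = work_ready - consume_rate
    # Throttle when in-flight work exceeds the launch rate less the forecast.
    return work_in_flight > low_rate_launch - ready_forecast
```

With WorkReady = 5, HighRate[Consume] = 8, LowRate[Produce] = 6, and LowRate[Launch] = 12, the consumption rate is 2, the forecast is 3, and the threshold is 9, so 10 units of in-flight work trigger throttling while 9 do not.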

[0046] Figure 6 is a flowchart of a method 600 for selectively starting a GS wave or wave group, according to some embodiments. The method 600 is implemented in some embodiments of the processing system 100 shown in Figure 1, the graphics pipeline 200 shown in Figure 2, the graphics pipeline 300 shown in Figure 3, the control circuit 400 shown in Figure 4, and the control circuit 500 shown in Figure 5.

[0047] In block 605, the control circuit counts GS wave group activations. In block 610, the control circuit counts GS wave group terminations. In block 615, the control circuit counts read requests from PS waves. In decision block 620, the control circuit compares the amount of in-flight GS work (determined from the numbers of GS wave group activations and terminations) with the amount of pending PS work (determined from the numbers of GS wave group terminations and PS wave read requests). In some embodiments, the control circuit compares the amount of in-flight GS work with the sum of the amount of pending PS work and an additional factor that accounts for the burstiness of PS work, as described herein. If the in-flight GS work exceeds the pending PS work (which may be augmented by the additional factor), method 600 proceeds to block 625, where the control circuit throttles the activation of GS wave groups. If the in-flight GS work does not exceed the pending PS work (which may be augmented by the additional factor), method 600 proceeds to block 630, where the control circuit does not throttle the activation of GS wave groups.
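The flow of method 600 can be sketched as a small event-driven model in Python. This is an illustrative sketch only: the class name `Method600Model`, the event labels, and the `burst_margin` field (standing in for the optional additional factor) are assumptions, and the real method is carried out by counters in the control circuit.

```python
from dataclasses import dataclass


@dataclass
class Method600Model:
    activations: int = 0      # block 605: GS wave group activations
    terminations: int = 0     # block 610: GS wave group terminations
    read_requests: int = 0    # block 615: read requests from PS waves
    burst_margin: float = 0.0 # optional additional factor (assumed value)

    def record(self, event: str) -> None:
        """Count one event; the labels are hypothetical."""
        if event == "activate":
            self.activations += 1
        elif event == "terminate":
            self.terminations += 1
        elif event == "read":
            self.read_requests += 1
        else:
            raise ValueError(f"unknown event: {event}")

    def throttle(self) -> bool:
        """Decision block 620: True leads to block 625, False to block 630."""
        in_flight = self.activations - self.terminations
        pending = self.terminations - self.read_requests
        return in_flight > pending + self.burst_margin
```

Feeding the model three activations yields a throttle decision, since nothing is pending yet; after two terminations the pending work (2) outweighs the in-flight work (1) and activation proceeds without throttling.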

[0048] Computer-readable storage media include any non-transitory storage medium, or combination of non-transitory storage media, that is accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media may include, but are not limited to, optical media (e.g., compact discs (CDs), digital versatile discs (DVDs), Blu-ray® discs), magnetic media (e.g., floppy disks, magnetic tapes, magnetic hard drives), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or flash memory), or microelectromechanical system (MEMS) based storage media. A computer-readable storage medium may be embedded in the computer system (e.g., system RAM or ROM), fixedly attached to the computer system (e.g., a magnetic hard drive), removably attached to the computer system (e.g., an optical disc or Universal Serial Bus (USB) based flash memory), or coupled to the computer system via a wired or wireless network (e.g., network-accessible storage (NAS)).

[0049] In some embodiments, certain aspects of the technology described above are implemented by one or more processors of a processing system executing software. The software includes one or more sets of executable instructions that are stored or otherwise tangibly embodied on a non-transitory computer-readable storage medium. The software may include instructions and certain data that, when executed by the one or more processors, cause the one or more processors to perform one or more aspects of the technology described above. Non-transitory computer-readable storage media may include, for example, magnetic or optical disk storage devices, solid-state storage devices such as flash memory, caches, random-access memory (RAM), or one or more other non-volatile memory devices. Executable instructions stored on a non-transitory computer-readable storage medium may be in the form of source code, assembly language code, object code, or another instruction format that can be interpreted or otherwise executed by the one or more processors.

[0050] In addition to the foregoing, it should be noted that not all of the activities or elements described in the general description above are required, that a portion of a specific activity or device may not be required, and that one or more additional activities may be performed, or additional elements included, beyond those described. Furthermore, the order in which the activities are listed does not necessarily indicate the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, those skilled in the art will understand that various modifications and variations can be made without departing from the scope of the invention as described in the claims. Therefore, the specification and drawings should be considered illustrative rather than restrictive, and all such modifications are intended to fall within the scope of the invention.

[0051] Benefits, other advantages, and solutions to problems have been described above with respect to specific embodiments. However, the benefits, advantages, and solutions to problems, and any feature that may cause any benefit, advantage, or solution to occur or become more pronounced, are not to be construed as critical, required, or essential features of any or all of the claims. Furthermore, the disclosed invention can be modified and practiced in different but equivalent ways that are apparent to those skilled in the art having the benefit of the teachings of this specification; therefore, the specific embodiments described above are merely illustrative. No limitations are intended to the details of construction or design shown herein, other than as described in the appended claims. Accordingly, the specific embodiments described above may be modified or altered, and all such variations are considered to be within the scope of the disclosed invention. Accordingly, the protection sought herein is as set forth in the appended claims.

Claims

1. A device comprising: a wave activation circuit configured to activate a first wave group; and a scan converter circuit configured to generate a second wave based on the results of processing the first wave group, wherein the first wave group is selectively throttled based on a comparison between the first wave group in flight and the second wave pending execution.

2. The device according to claim 1, further comprising a cache configured to hold information that is stored in response to the termination of execution of the first wave group, wherein the information is read by the corresponding second wave.

3. The device according to claim 2, further comprising: a first counter circuit configured to count activations of the first wave group; a second counter circuit configured to count the first wave group whose execution has been completed by writing to the cache; and a third counter circuit configured to count requests from the second wave to read from the cache.

4. The device according to claim 3, wherein the first counter circuit, the second counter circuit, and the third counter circuit are incremented in response to a startup event, an execution completion event, and a read request event, respectively, being written to a windowing buffer.

5. The device according to claim 4, wherein the first counter circuit, the second counter circuit, and the third counter circuit are decremented in response to the startup event, the execution completion event, and the read request event, respectively, being read from the windowing buffer.

6. The device according to claim 5, further comprising a management circuit configured to determine a first number of first wave groups in flight based on the difference between the first counter circuit and the second counter circuit, and to determine a second number of second waves pending execution based on the difference between the second counter circuit and the third counter circuit.

7. The device according to claim 6, wherein the management circuit is configured to throttle the first wave group activated by the wave activation circuit in response to the first number being greater than the second number.

8. The device according to claim 7, wherein the management circuit is configured to throttle the first wave group activated by the wave activation circuit in response to the first number being greater than the sum of the second number and an additional factor estimated based on a burstiness index of read requests associated with the second wave.

9. The device according to claim 8, wherein the management circuit includes an event run counter for determining the burstiness index of the read requests associated with the second wave, and the event run counter is incremented in response to interruptions in the sequence of events written to the windowing buffer.

10. A method comprising: activating, in a wave activation circuit, a first wave group; generating, in a scan converter circuit, a second wave based on the results of processing the first wave group; and selectively throttling the first wave group based on a comparison between the first wave group in flight and the second wave pending execution.

11. The method according to claim 10, further comprising writing information to a cache in response to the completion of execution of the first wave group, wherein the information is read by the corresponding second wave.

12. The method according to claim 11, further comprising: counting, in a first counter circuit, activations of the first wave group; counting, in a second counter circuit, the first wave group whose execution has been completed by writing to the cache; and counting, in a third counter circuit, requests from the second wave to read from the cache.

13. The method according to claim 12, further comprising: writing a startup event to a windowing buffer, wherein counting the activations of the first wave group includes incrementing the first counter circuit in response to writing the startup event to the windowing buffer; and reading the startup event from the windowing buffer, wherein counting the activations of the first wave group further includes decrementing the first counter circuit in response to reading the startup event from the windowing buffer.

14. The method according to claim 13, further comprising: writing an execution completion event to the windowing buffer, wherein counting the first wave group whose execution has been completed includes incrementing the second counter circuit in response to writing the execution completion event to the windowing buffer; and reading the execution completion event from the windowing buffer, wherein counting the first wave group whose execution has been completed further includes decrementing the second counter circuit in response to reading the execution completion event from the windowing buffer.

15. The method according to claim 13, further comprising: writing a read request event to the windowing buffer, wherein counting the requests from the second wave includes incrementing the third counter circuit in response to writing the read request event to the windowing buffer; and reading the read request event from the windowing buffer, wherein counting the requests from the second wave further includes decrementing the third counter circuit in response to reading the read request event from the windowing buffer.

16. The method according to claim 15, further comprising: determining a first number of first wave groups in flight based on the difference between the first counter circuit and the second counter circuit; and determining a second number of second waves pending execution based on the difference between the second counter circuit and the third counter circuit.

17. The method according to claim 16, wherein selectively throttling the first wave group includes throttling the first wave group activated by the wave activation circuit in response to the first number being greater than the second number.

18. The method according to claim 17, wherein selectively throttling the first wave group includes throttling the first wave group activated by the wave activation circuit in response to the first number being greater than the sum of the second number and an additional factor estimated based on a burstiness index of read requests associated with the second wave.

19. The method according to claim 18, further comprising determining the burstiness index of the read requests associated with the second wave by incrementing an event run counter in response to interruptions in the sequence of events written to the windowing buffer.

20. A device comprising: a scan converter circuit configured to generate a second wave based on the results of processing a first wave group; and a cache configured to hold information that is stored in response to the termination of execution of the first wave group, wherein the first wave group is selectively throttled based on a comparison between a first number of first wave groups that have been started and have not yet completed execution by writing to the cache, and a second number of read requests to the cache from the second wave pending execution.