Thread bundle (warp) access pattern-aware cache
Detecting the locality of each GPU workload and managing the cache accordingly addresses memory latency and low cache efficiency in GPUs, yielding more efficient memory usage and better performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2024-11-13
- Publication Date
- 2026-06-26
AI Technical Summary
In graphics processing units (GPUs), memory operation latency becomes a performance bottleneck, especially due to cache contention in the L1 data cache and premature data eviction leading to wasted cache space, as well as inefficient cache management caused by inter-thread interference.
By detecting the locality of each workload, and leveraging a cache management scheme based on the locality behavior of each load, the use of GPU cache is optimized, including decisions on storing and bypassing data, to improve cache efficiency.
The scheme can help improve GPU processing speed and reduce the memory usage and power consumption of the GPU.
Figure CN122295656A_ABST
Description
Cross-references to related applications
[0001] This application claims the benefit of U.S. Non-Provisional Patent Application Serial No. 18/536,019, entitled “WARP ACCESS PATTERN-AWARE CACHES”, filed on December 11, 2023, the entire contents of which are expressly incorporated herein by reference.
Technical field
[0002] This disclosure relates generally to processing systems, and more specifically to one or more techniques for graphics or data processing.
Background
[0003] Computing devices typically perform graphics and / or display processing (e.g., utilizing a graphics processing unit (GPU), a central processing unit (CPU), a display processor, etc.) to render and display visual content. Such computing devices can include, for example, computer workstations, mobile phones (such as smartphones), embedded systems, personal computers, tablet computers, and video game consoles. A GPU is configured to execute a graphics processing pipeline that includes one or more processing stages that operate together to execute graphics processing commands and output frames. A CPU controls the operation of a GPU by issuing one or more graphics processing commands to it. Modern CPUs are typically capable of executing multiple applications concurrently, each of which may require the GPU during execution. A display processor is configured to convert digital information received from the CPU into analog values and can issue commands to a display panel to display visual content. Devices that provide content for visual presentation on a display may utilize a GPU and / or a display processor.
[0004] A GPU of a device can be configured to execute processes within the graphics processing pipeline. Additionally, a display processor or display processing unit (DPU) can be configured to perform display processing. However, with the advent of wireless communication and smaller handheld devices, the demand for improved graphics or display processing continues to increase.
Summary
[0005] The following is a simplified summary of one or more aspects to provide a basic understanding of these aspects. This summary is not a broad overview of all anticipated aspects, nor is it intended to identify key or essential elements of all aspects, nor to describe the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that follows.
[0006] In one aspect of this disclosure, a method, a computer-readable medium, and an apparatus are provided. The apparatus may be a graphics processing unit (GPU), a central processing unit (CPU), or any apparatus capable of performing data processing or graphics processing. The apparatus may obtain an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads. The apparatus may also convert information associated with a load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads. The apparatus may also store the set of load IDs for the set of workloads in an alias table based on the conversion of the information. Additionally, the apparatus may identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload is associated with an access pattern of at least one cache line of a cache for that workload. The apparatus may also configure locality information associated with the locality of each workload in a first set of workloads, the first set of workloads corresponding to a first set of data threads in the set of data threads. Furthermore, the apparatus may determine whether to configure or store locality information associated with the locality of each workload in at least one second set of workloads. The apparatus may also store the access pattern of the at least one cache line of the first set of workloads based on the locality of each workload in the set of workloads. The apparatus may also store, or avoid storing, data of the at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads. The apparatus may also output an indication of storing or avoiding storing the data for the at least one second set of workloads.
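As an illustration only, the conversion of load PCs into load IDs might be sketched as follows. The disclosure states only that information associated with a load PC is converted into load IDs and that those IDs are stored in an alias table; the class name `AliasTable`, the sequential ID allocation, the fixed capacity, and the behavior when the table is full are all assumptions made for this sketch.

```python
class AliasTable:
    """Hypothetical alias table mapping load-instruction PCs to compact load IDs."""

    def __init__(self, capacity=16):
        # Assumed fixed number of trackable loads; not specified by the disclosure.
        self.capacity = capacity
        self._pc_to_id = {}

    def load_id(self, pc):
        """Return the load ID for `pc`, allocating a new one if there is room.

        Returns None when the table is full and `pc` has no entry
        (an assumed policy: such loads are simply not tracked).
        """
        if pc in self._pc_to_id:
            return self._pc_to_id[pc]
        if len(self._pc_to_id) >= self.capacity:
            return None
        new_id = len(self._pc_to_id)  # IDs are handed out sequentially
        self._pc_to_id[pc] = new_id
        return new_id


# Usage: the same load PC always maps to the same ID, whichever thread issues it.
table = AliasTable(capacity=2)
first = table.load_id(0x4000)
again = table.load_id(0x4000)
```

A compact ID is convenient here because later bookkeeping can be indexed by a few bits rather than by a full PC.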
[0007] Details of one or more examples of this disclosure are set forth in the accompanying drawings and the following description. Other features, objects, and advantages of this disclosure will become apparent from the description, the drawings, and the claims.
Brief description of the drawings
[0008] Figure 1 is a block diagram illustrating an example content generation system.
[0009] Figure 2 illustrates an example graphics processing unit (GPU).
[0010] Figure 3 is a diagram illustrating an example processing component.
[0011] Figure 4 is a diagram illustrating an example GPU.
[0012] Figure 5 is a diagram illustrating an example GPU.
[0013] Figure 6 is a diagram illustrating an example mapping of a cache.
[0014] Figure 7 is a diagram illustrating an example cache.
[0015] Figure 8 is a diagram illustrating an example cache architecture.
[0016] Figure 9 is a diagram illustrating example data locality types of different GPU workloads.
[0017] Figure 10 is a diagram illustrating an example thread bundle execution scheme.
[0018] Figure 11 is a diagram illustrating an example cache management scheme.
[0019] Figure 12 is a diagram illustrating an example cache management scheme.
[0020] Figure 13 is a communication flowchart illustrating example communication between a GPU, GPU components, and memory.
[0021] Figure 14 is a flowchart of an example method for data processing.
[0022] Figure 15 is a flowchart of an example method for data processing.
Detailed description
[0023] In some aspects, the latency of memory operations can be a significant performance bottleneck in graphics processing units (GPUs). This is because a Level 1 (L1) data cache (L1D) (e.g., a data cache that stores global data structures) shared across dozens of thread bundles (i.e., collections of threads, also known as warps) can suffer significant cache contention and premature data eviction. In some per-thread-bundle cache management schemes, streaming data in active bundles (e.g., data brought into the cache but not subsequently used) can waste cache space. Similarly, bundle-level cache bypass schemes used to reduce evictions caused by inter-bundle interference may not be ideal, because any load instruction exhibiting strong temporal locality may be forced to bypass the cache if it originates from a bundle that is currently performing a bypass operation. Therefore, cache management schemes based on per-load locality behavior may be beneficial. Furthermore, in some GPU applications, each global load instruction may behave stably throughout the entire application execution. That is, whether a load instruction benefits from bundle throttling or cache bypass may be independent of the bundle ID or of the point in time at which the load is executed. This property can be based on the GPU's unique software execution model, in which all bundles originate from the same kernel code. Therefore, the cache-related attributes of a load (such as the data locality type or cache sensitivity of a certain load instruction detected in one thread bundle) can be broadly applied to the same load executed in all other thread bundles. Conversely, failing to determine the locality of each workload can lead to inefficient use of the GPU cache. Aspects of this disclosure can execute workloads (e.g., workloads at the GPU) based on the locality of each workload. For example, aspects of this disclosure can detect the locality of each workload at the GPU.
[0024] Aspects of this disclosure may provide several benefits or advantages. For example, aspects of this disclosure may utilize or bypass caches (e.g., L1 data caches) based on the detected locality of a workload, so that GPU cache usage can be optimized. More precisely, aspects presented herein can utilize a cache management scheme based on per-load locality behavior. For example, aspects presented herein can identify or detect the locality of each workload in a set of workloads corresponding to a set of data threads. Furthermore, aspects presented herein can determine and store the access patterns of cache lines for a first set of workloads. By doing so, aspects presented herein can store data for at least one second set of workloads based on the access patterns of cache lines for the first set of workloads. Additionally, aspects presented herein can bypass (i.e., avoid storing) the data for the at least one second set of workloads based on the access patterns of cache lines for the first set of workloads. By optimizing when to store data in the cache (or bypass the cache to avoid storing it), aspects presented herein can help optimize GPU processing speed, the amount of memory utilized by the GPU, and / or the amount of power consumed by the GPU.
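The store-or-bypass decision described above might be sketched, under stated assumptions, as a small software model. The disclosure says that access patterns of cache lines observed for a first set of workloads guide whether data for a second set is stored or bypassed; the class name `LocalityMonitor`, the 128-byte line size, the reuse-ratio metric, and the 0.25 threshold are hypothetical choices for illustration, not details taken from the disclosure.

```python
from collections import defaultdict

LINE_BYTES = 128  # assumed cache-line size


class LocalityMonitor:
    """Hypothetical per-load locality tracker fed by a first (monitored) set of threads."""

    def __init__(self, reuse_threshold=0.25):
        self.reuse_threshold = reuse_threshold  # assumed cut-off for "streaming" loads
        self._lines_seen = defaultdict(set)     # load_id -> cache lines already touched
        self._accesses = defaultdict(int)       # load_id -> total observed accesses
        self._reuses = defaultdict(int)         # load_id -> accesses to an already-seen line

    def observe(self, load_id, address):
        """Record one access issued by a monitored thread bundle."""
        line = address // LINE_BYTES
        self._accesses[load_id] += 1
        if line in self._lines_seen[load_id]:
            self._reuses[load_id] += 1
        else:
            self._lines_seen[load_id].add(line)

    def should_bypass(self, load_id):
        """Decide, for other thread bundles, whether this load should skip the cache."""
        total = self._accesses.get(load_id, 0)
        if total == 0:
            return False  # nothing observed yet: default to caching
        return self._reuses[load_id] / total < self.reuse_threshold


monitor = LocalityMonitor()
for i in range(8):
    monitor.observe(0, i * LINE_BYTES)        # load 0 touches a new line every time
for i in range(8):
    monitor.observe(1, (i % 2) * LINE_BYTES)  # load 1 keeps returning to two lines
```

In this model, load 0 looks like streaming data and would be bypassed, while load 1 shows reuse and would be cached. Because the decision is keyed by load ID rather than by thread bundle, a load with strong temporal locality is not forced to bypass merely because of which bundle issued it.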
[0025] Various aspects of the systems, apparatuses, computer program products, and methods will be described more fully below with reference to the accompanying drawings. However, this disclosure may be embodied in many different forms and should not be construed as limited to any particular structure or function presented throughout this disclosure. Rather, these aspects are provided to make this disclosure comprehensive and complete, and to fully convey the scope of this disclosure to those skilled in the art. Based on the teachings herein, those skilled in the art will understand that the scope of this disclosure is intended to cover any aspect of the systems, apparatuses, computer program products, and methods disclosed herein, whether implemented independently of or in combination with other aspects of this disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. Furthermore, the scope of this disclosure is intended to cover such apparatuses or methods practiced using other structures, functionalities, or structures and functionalities in addition to or other than the various aspects of the disclosure set forth herein. Any aspect disclosed herein may be embodied by one or more elements of the claims.
[0026] Although various aspects are described herein, many variations and substitutions of these aspects fall within the scope of this disclosure. While some potential benefits and advantages of the aspects of this disclosure are mentioned, the scope of this disclosure is not intended to be limited to a particular benefit, use, or objective. Rather, the aspects of this disclosure are intended to be broadly applicable to different wireless technologies, system configurations, networks, and transmission protocols, some of which are illustrated by way of example in the accompanying drawings and the description below. The detailed description and drawings are merely illustrative and not limiting of this disclosure, and the scope of this disclosure is defined by the appended claims and their equivalents.
[0027] Several aspects are presented with reference to various apparatuses and methods. These apparatuses and methods are described in detail and illustrated in the accompanying drawings by various blocks, components, circuits, processes, algorithms, etc. (collectively referred to as "elements"). These elements can be implemented using electronic hardware, computer software, or any combination thereof. Whether these elements are implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system.
[0028] For example, an element, any part of an element, or any combination of elements can be implemented as a “processing system” including one or more processors (which may also be referred to as processing units). Examples of processors include microprocessors, microcontrollers, graphics processing units (GPUs), general-purpose GPUs (GPGPUs), central processing units (CPUs), application processors, digital signal processors (DSPs), reduced instruction set computing (RISC) processors, systems-on-a-chip (SoCs), baseband processors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic components, discrete hardware circuits, and other suitable hardware configured to perform the various functionalities described in this disclosure. One or more processors in the processing system can execute software. Software can be broadly interpreted as instructions, instruction sets, code, code segments, program code, programs, subroutines, software components, applications, software applications, software packages, routines, objects, executable files, threads of execution, procedures, functions, etc., whether expressed in terms of software, firmware, middleware, microcode, hardware description languages, or other terms. The term “application” can refer to software. As described herein, one or more techniques can refer to an application, i.e., software, configured to perform one or more functions. In such examples, the application may be stored in memory (e.g., on-chip memory of a processor, system memory, or any other memory). Hardware described herein, such as a processor, may be configured to execute the application. For example, an application may be described as including code that, when executed by the hardware, causes the hardware to perform one or more techniques described herein. As an example, the hardware may access code from memory and execute it to perform one or more techniques described herein. In some examples, components are identified in this disclosure. In such examples, a component may be hardware, software, or a combination thereof. Each component may be a separate component or a subcomponent of a single component.
[0029] Therefore, in one or more examples described herein, the described functionality may be implemented in hardware, software, or any combination thereof. If implemented in software, the functionality may be stored or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media. Storage media may be any available medium accessible to a computer. By way of example and not limitation, such computer-readable media may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), optical disc storage devices, magnetic disk storage devices, other magnetic storage devices, combinations of computer-readable media of the types described above, or any other medium capable of being used to store computer-executable code in the form of instructions or data structures accessible to a computer.
[0030] In summary, this disclosure describes techniques for having a graphics processing pipeline in a single device or multiple devices, thereby improving the rendering of graphics content and / or reducing the load on processing units (i.e., any processing unit, such as a GPU, configured to perform one or more of the techniques described herein). For example, this disclosure describes techniques for performing graphics processing in any device that utilizes graphics processing. Other example benefits are described throughout this disclosure.
[0031] As used herein, instances of the term "content" can refer to "graphic content," "image," or vice versa. This is true regardless of whether these terms are used as adjectives, nouns, or other parts of speech. In some examples, as used herein, the term "graphic content" can refer to content produced by one or more processes in a graphics processing pipeline. In some examples, as used herein, the term "graphic content" can refer to content produced by a processing unit configured to perform graphics processing. In some examples, as used herein, the term "graphic content" can refer to content produced by a graphics processing unit.
[0032] In some examples, as used herein, the term "display content" can refer to content generated by a processing unit configured to perform display processing. Graphical content can be processed to become display content. For example, a graphics processing unit can output graphical content (such as frames) to a buffer (which may be referred to as a frame buffer). A display processing unit can read graphical content (such as one or more frames) from the buffer and perform one or more display processing techniques on that content to generate display content. For example, a display processing unit can be configured to perform compositing on one or more rendering layers to generate frames. As another example, a display processing unit can be configured to composite, blend, or otherwise combine two or more layers into a single frame. A display processing unit can be configured to perform scaling on frames, such as upscaling or downscaling. In some examples, a frame can refer to a layer. In other examples, a frame can refer to two or more layers that have been blended together to form the frame, i.e., the frame comprises two or more layers, and the frame comprising two or more layers can be subsequently blended.
[0033] Figure 1 is a block diagram illustrating an example content generation system 100 configured to implement one or more techniques of this disclosure. The content generation system 100 includes a device 104. Device 104 may include one or more components or circuitry for performing the various functions described herein. In some examples, one or more components of device 104 may be components of a system-on-a-chip (SoC). Device 104 may include one or more components configured to perform one or more techniques of this disclosure. In the illustrated example, device 104 may include a processing unit 120, a content encoder / decoder 122, and a system memory 124. In some aspects, device 104 may include multiple components, such as a communication interface 126, a transceiver 132, a receiver 128, a transmitter 130, a display processor 127, and one or more displays 131. Reference to display 131 may refer to one or more displays 131. For example, display 131 may include a single display or multiple displays. Display 131 may include a first display and a second display. The first display may be a left-eye display, and the second display may be a right-eye display. In some examples, the first and second displays may receive different frames for presentation. In other examples, the first and second displays may receive the same frames for presentation. In further examples, the results of graphics processing may not be displayed on the device; for example, the first and second displays may not receive any frames for presentation. Instead, the frames or graphics processing results may be transferred to another device. In some aspects, this may be referred to as split rendering.
[0034] Processing unit 120 may include internal memory 121. Processing unit 120 may be configured to perform graphics processing, such as in a graphics processing pipeline 107. Content encoder / decoder 122 may include internal memory 123. In some examples, device 104 may include a display processor (such as display processor 127) to perform one or more display processing techniques on one or more frames generated by processing unit 120 prior to being rendered by one or more displays 131. Display processor 127 may be configured to perform display processing. For example, display processor 127 may be configured to perform one or more display processing techniques on one or more frames generated by processing unit 120. One or more displays 131 may be configured to display or otherwise render the frames processed by display processor 127. In some examples, one or more displays 131 may include one or more of the following: liquid crystal display (LCD), plasma display, organic light-emitting diode (OLED) display, projection display device, augmented reality display device, virtual reality display device, head-mounted display, or any other type of display device.
[0035] Memory (such as system memory 124) external to processing unit 120 and content encoder / decoder 122 may be accessible to processing unit 120 and content encoder / decoder 122. For example, processing unit 120 and content encoder / decoder 122 may be configured to read from and / or write to external memory (such as system memory 124). Processing unit 120 and content encoder / decoder 122 may be communicatively coupled to system memory 124 via a bus. In some examples, processing unit 120 and content encoder / decoder 122 may be communicatively coupled to each other via the bus or a different connection.
[0036] Content encoder / decoder 122 can be configured to receive graphic content from any source, such as system memory 124 and / or communication interface 126. System memory 124 can be configured to store received encoded or decoded graphic content. Content encoder / decoder 122 can be configured to receive encoded or decoded graphic content from system memory 124 and / or communication interface 126, for example, in the form of encoded pixel data. Content encoder / decoder 122 can be configured to encode or decode any graphic content.
[0037] Internal memory 121 or system memory 124 may include one or more volatile or non-volatile memories or storage devices. In some examples, internal memory 121 or system memory 124 may include RAM, SRAM, DRAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic data media or optical storage media or any other type of memory.
[0038] According to some examples, internal memory 121 or system memory 124 may be a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or propagating signal. However, the term "non-transitory" should not be construed as meaning that internal memory 121 or system memory 124 is immovable or that its contents are static. For example, system memory 124 may be removed from device 104 and moved to another device. Alternatively, system memory 124 may not be removable from device 104.
[0039] Processing unit 120 may be a central processing unit (CPU), a graphics processing unit (GPU), a general-purpose GPU (GPGPU), or any other processing unit configured to perform graphics processing. In some examples, processing unit 120 may be integrated into the motherboard of device 104. In some examples, processing unit 120 may reside on a graphics card mounted in a port on the motherboard of device 104, or may otherwise be incorporated into a peripheral device configured to interoperate with device 104. Processing unit 120 may include one or more processors, such as one or more microprocessors, GPUs, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), arithmetic logic units (ALUs), digital signal processors (DSPs), discrete logic components, software, hardware, firmware, other equivalent integrated or discrete logic circuits, or any combination thereof. If the technology is partially implemented in software, processing unit 120 may store instructions for software in a suitable non-transitory computer-readable storage medium (e.g., internal memory 121) and may use one or more processors to execute instructions in hardware to perform the technology of this disclosure. Any of the above (including hardware, software, and combinations of hardware and software) can be considered as one or more processors.
[0040] The content encoder / decoder 122 can be any processing unit configured to perform content decoding. In some examples, the content encoder / decoder 122 may be integrated into the motherboard of device 104. The content encoder / decoder 122 may include one or more processors, such as one or more microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), arithmetic logic units (ALUs), digital signal processors (DSPs), video processors, discrete logic components, software, hardware, firmware, other equivalent integrated or discrete logic circuits, or any combination thereof. If the technology is partially implemented in software, the content encoder / decoder 122 may store instructions for software in a suitable non-transitory computer-readable storage medium (e.g., internal memory 123) and may use one or more processors to execute instructions in hardware to perform the technology of this disclosure. Any of the foregoing (including hardware, software, combinations of hardware and software, etc.) can be considered as one or more processors.
[0041] In some aspects, the content generation system 100 may include a communication interface 126. The communication interface 126 may include a receiver 128 and a transmitter 130. The receiver 128 may be configured to perform any of the receiving functions described herein with respect to device 104. Additionally, the receiver 128 may be configured to receive information from another device, such as eye or head positioning information, rendering commands, or location information. The transmitter 130 may be configured to perform any of the transmitting functions described herein with respect to device 104. For example, the transmitter 130 may be configured to transmit information to another device, which may include a request for content. The receiver 128 and the transmitter 130 may be combined to form a transceiver 132. In such an example, the transceiver 132 may be configured to perform any of the receiving and / or transmitting functions described herein with respect to device 104.
[0042] Referring again to Figure 1, in some aspects, processing unit 120 may include an access pattern component 198 configured to obtain an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads. Access pattern component 198 may also be configured to convert information associated with a load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads. Access pattern component 198 may also be configured to store the set of load IDs for the set of workloads in an alias table based on the conversion of the information. Access pattern component 198 may also be configured to identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload is associated with an access pattern of at least one cache line of a cache for that workload. Access pattern component 198 may also be configured to configure locality information associated with the locality of each workload in a first set of workloads, the first set of workloads corresponding to a first set of data threads in the set of data threads. Access pattern component 198 may also be configured to determine whether to configure or store locality information associated with the locality of each workload in at least one second set of workloads. Access pattern component 198 may also be configured to store the access pattern of the at least one cache line of the first set of workloads based on the locality of each workload in the set of workloads. Access pattern component 198 may also be configured to store, or avoid storing, data of the at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads. Access pattern component 198 may also be configured to output an indication of storing or avoiding storing the data for the at least one second set of workloads. Although the following description may focus on display processing, the concepts described herein are applicable to other similar processing techniques.
[0043] As described herein, a device such as device 104 can refer to any device, apparatus, or system configured to perform one or more of the technologies described herein. For example, a device can be a server, base station, user equipment, client device, station, access point, computer (e.g., personal computer, desktop computer, laptop computer, tablet computer, computer workstation, or mainframe computer), end product, apparatus, telephone, smartphone, server, video game platform or console, handheld device (e.g., portable video game device or personal digital assistant (PDA)), wearable computing device (e.g., smartwatch, augmented reality device, or virtual reality device), non-wearable device, display or display device, television, set-top box, intermediate network device, digital media player, video streaming device, content streaming device, in-vehicle computer, any mobile device, any device configured to generate graphical content, or any device configured to perform one or more of the technologies described herein. The processes described herein may be described as being performed by a specific component (e.g., GPU), but in further embodiments, other components consistent with the disclosed embodiments (e.g., CPU) may be used to perform them.
[0044] A GPU can process various types of data or data packets within its pipeline. For example, in some aspects, a GPU can process two types of data or data packets, such as context register packets and draw call data. Context register packets can be a collection of global state information, such as information about global registers, shaders, or constant data, which can regulate how the graphics context will be handled. For example, a context register packet may include information about the color format. In some aspects of a context register packet, there may be bits indicating which workload belongs to the context register. Furthermore, there may be multiple functions or programs running simultaneously and / or in parallel. For example, a function or program may describe an operation, such as a color mode or color format. Therefore, context registers can define multiple states of the GPU.
[0045] Context states can be used to determine how individual processing units (e.g., vertex fetchers (VFDs), vertex shaders (VSs), shader processors, or geometry processors) operate and / or in which mode a processing unit operates. For this purpose, the GPU can use context registers and programming data. In some aspects, the GPU can generate workloads (e.g., vertex or pixel workloads) in the pipeline based on the context register definitions of modes or states. Certain processing units (e.g., VFDs) can use these states to determine certain functions, such as how to assemble vertices. Because these modes or states can change, the GPU may need to modify the corresponding context. Additionally, the workload corresponding to a mode or state may follow the changed mode or state.
[0046] Figure 2 illustrates an example GPU 200 according to one or more techniques of this disclosure. As shown in Figure 2, GPU 200 includes a command processor (CP) 210, a draw call group 212, a VFD 220, a VS 222, a vertex cache (VPC) 224, a triangle setup engine (TSE) 226, a rasterizer (RAS) 228, a Z-process engine (ZPE) 230, a pixel interpolator (PI) 232, a fragment shader (FS) 234, a render backend (RB) 236, a level 1 (L1) cache (clustered cache (CCHE)) 237, a level 2 (L2) cache (UCHE) 238, and system memory 240. Although Figure 2 shows GPU 200 as including processing units 220 to 238, GPU 200 may include a number of additional processing units. Additionally, processing units 220 to 238 are merely examples, and any combination or order of processing units may be used in a GPU according to this disclosure. GPU 200 also includes a command buffer 250, a context register group 260, and a context state 261.
[0047] As shown in Figure 2, the GPU can use a CP (e.g., CP 210) or a hardware accelerator to parse the command buffer into context register groups (e.g., context register group 260) and / or draw call data groups (e.g., draw call group 212). CP 210 can then send the context register group 260 or the draw call group 212 to the processing units or blocks in the GPU via separate paths. Furthermore, the command buffer 250 can alternate between different states of context registers and draw calls. For example, the command buffer can be structured as follows: context register of context N, draw call of context N, context register of context N+1, and draw call of context N+1.
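As a rough illustration of the alternating command buffer layout described above, the following Python sketch models a buffer that interleaves context register packets and draw calls and splits it into two streams, loosely mirroring the separate paths mentioned for the CP. The `Packet` type, the `kind` strings, and `parse_command_buffer` are hypothetical names introduced only for this illustration; they are not taken from the disclosure or from any GPU driver interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    kind: str     # "context_register" or "draw_call" (labels assumed for this sketch)
    context: int  # context number, e.g., N or N+1


def parse_command_buffer(buffer):
    """Split an interleaved command buffer into two streams, preserving order."""
    context_registers = [p for p in buffer if p.kind == "context_register"]
    draw_calls = [p for p in buffer if p.kind == "draw_call"]
    return context_registers, draw_calls


N = 0
command_buffer = [
    Packet("context_register", N),      # context register of context N
    Packet("draw_call", N),             # draw call of context N
    Packet("context_register", N + 1),  # context register of context N+1
    Packet("draw_call", N + 1),         # draw call of context N+1
]

context_registers, draw_calls = parse_command_buffer(command_buffer)
```

In this sketch each stream keeps the packets in their original order, so a draw call of context N+1 never overtakes a draw call of context N within its own path.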
[0048] GPUs can render images in several different ways. In some cases, GPUs can use direct rendering and / or tiled rendering to render images. In a tiled rendering GPU, an image can be divided or segmented into different sections or tiles. After the image is divided, each section or tile can be rendered individually. A tiled rendering GPU can divide a computer graphics image into a grid format so that each part of the grid (i.e., a tile) is rendered individually. In some aspects, during a binning pass, the image can be divided into different bins or tiles. In some aspects, during a binning pass, a visibility stream can be constructed, where visible primitives or draw calls can be identified. In contrast to tiled rendering, direct rendering does not divide the frame into smaller bins or tiles. Instead, in direct rendering, the entire frame is rendered at once. Additionally, some types of GPUs allow both tiled rendering and direct rendering.
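The grid division used in tiled rendering can be sketched as follows. This is a minimal illustration only: the function name `split_into_tiles`, the frame size, and the 32 x 32 tile size are assumptions chosen for the example, and the sketch does not model the binning pass or the visibility stream.

```python
import math


def split_into_tiles(width, height, tile_w, tile_h):
    """Return (x, y, w, h) rectangles that cover a width x height frame."""
    tiles = []
    for ty in range(math.ceil(height / tile_h)):
        for tx in range(math.ceil(width / tile_w)):
            x, y = tx * tile_w, ty * tile_h
            # Tiles on the right and bottom edges are clipped to the frame.
            tiles.append((x, y, min(tile_w, width - x), min(tile_h, height - y)))
    return tiles


# A 100 x 60 frame divided into tiles of at most 32 x 32 pixels.
tiles = split_into_tiles(100, 60, 32, 32)
```

Each rectangle in `tiles` could then be rendered individually, whereas direct rendering would correspond to a single rectangle covering the whole frame.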
[0049] Instructions executed by the CPU (e.g., software instructions) or by the display processor can cause the CPU or display processor to search for and / or generate compositing strategies for compositing frames based on dynamic priorities and runtime statistics associated with one or more compositing strategy groups. A frame to be displayed by a physical display device (such as a display panel) may include multiple layers. Furthermore, frame compositing may be based on combining multiple layers into a frame (e.g., based on a frame buffer). After combining the multiple layers into a frame, the frame can be provided to the display panel for display on that display panel. The process of combining each of the multiple layers into a frame may be referred to as compositing, frame compositing, compositing process, compositing handling, etc.
[0050] A frame compositing process or strategy can correspond to a technique used to combine different layers from multiple layers into a single frame. Multiple layers can be stored in double data rate (DDR) memory. Each of the multiple layers can further correspond to a separate buffer. A compositor or hardware compositor (HWC) associated with a block or function determines the input to each layer / buffer and performs the frame compositing process to generate an output indicating the composite frame. That is, the input can be layers, and the output can be a frame compositing process used to synthesize the frame to be displayed on a display panel.
[0051] Some types of GPUs may include different types of pipelines, such as graphics processing pipelines. A graphics processing pipeline may include one or more of a vertex shader stage, a hull shader stage, a domain shader stage, a geometry shader stage, and a pixel shader stage. These stages of a graphics processing pipeline can be considered shader stages. These shader stages can be implemented as one or more shader programs executing on shader units at the GPU. Shader units can be configured as programmable pipelines of processing components. In some examples, a shader unit may be referred to as a "shader processor" or "unified shader" and can perform geometry, vertex, pixel, or other shading operations to render graphics. A shader unit may include shader processors, each of which may include one or more components for fetching and decoding operations, one or more arithmetic logic units (ALUs) for performing arithmetic computations, one or more memories, caches, and registers.
[0052] Figure 3 is a diagram 300 illustrating processing components (such as processing unit 120 and system memory 124) that may be identified in connection with device 104 for processing data. In various aspects, processing unit 120 may include CPU 302 and GPU 312. GPU 312 and CPU 302 may be formed as integrated circuits (e.g., system-on-a-chip (SOC)) and / or GPU 312 may be incorporated into a motherboard having CPU 302. Alternatively, CPU 302 and GPU 312 may be configured as different processing units communicatively coupled to each other. For example, GPU 312 may be incorporated into a graphics card mounted in a port on a motherboard including CPU 302.
[0053] CPU 302 can be configured to execute a software application. This software application enables the display of graphical content (e.g., on display 131 of device 104) based on one or more operations of GPU 312. The software application can issue instructions to a graphics application programming interface (API) 304, which can be a runtime program that translates instructions received from the software application into a format readable by GPU driver 310. After receiving instructions from the software application via the graphics API 304, GPU driver 310 can control the operation of GPU 312 based on the instructions. For example, GPU driver 310 can generate one or more command streams placed in system memory 124, instructing GPU 312 to execute the command streams (e.g., via one or more system calls). Command engine 314 included in GPU 312 is configured to retrieve one or more commands stored in the command streams. Command engine 314 can provide commands from the command streams for GPU 312 to execute. Command engine 314 can be hardware of GPU 312, software / firmware executing on GPU 312, or a combination thereof. Although GPU driver 310 is configured to implement graphics API 304, GPU driver 310 is not limited to being configured according to any particular API. System memory 124 may store code for GPU driver 310, which CPU 302 may retrieve for execution. In the example, GPU driver 310 may be configured to allow communication between CPU 302 and GPU 312, such as when CPU 302 offloads graphics or non-graphics processing tasks to GPU 312 via GPU driver 310.
[0054] System memory 124 may further store source code for one or more of the early preamble shader 324, feedback shader 325, or main shader 326. In such a configuration, shader compiler 308, executing on CPU 302, may compile the source code of shaders 324 to 326 to create object code or intermediate code that can be executed by shader core 316 of GPU 312 during runtime (e.g., while shaders 324 to 326 are executing on shader core 316). In some examples, shader compiler 308 may precompile shaders 324 to 326 and store the object code or intermediate code of the shader program in system memory 124. Shader compiler 308, executing on CPU 302 (or, in another example, GPU driver 310), may build a shader program with multiple components, including early preamble shader 324, feedback shader 325, and main shader 326. The main shader 326 may correspond to a portion or all of a shader program that does not include the early preamble shader 324 or the feedback shader 325. The shader compiler 308 may receive instructions from a program executing on the CPU 302 for compiling shaders 324 to 326. The shader compiler 308 may also identify constant loading instructions and common operations in the shader program so that the common operations can be included within the early preamble shader 324 (rather than within the main shader 326). The shader compiler 308 may, for example, identify such common instructions based on a (currently undetermined) constant 306 to be included in the common instructions. The constant 306 may be defined as a constant across the entire draw call within the graphics API 304. The shader compiler 308 may utilize instructions such as a preamble shader start indicating the start of the early preamble shader 324 and a preamble shader end indicating the end of the early preamble shader 324. Similar instructions may be used for the feedback shader 325 and the main shader 326. The feedback shader 325 will be described in further detail below.
[0055] Shader core 316 included in GPU 312 may include general-purpose registers (GPRs) 318 and constant memory 320. GPR 318 may correspond to a single GPR, a GPR file, and / or a GPR library. Each GPR in GPR 318 may store data accessible to a single thread. Software and / or firmware executing on GPU 312 may be shader programs 324 to 326, which may execute on shader core 316 of GPU 312. Shader core 316 may be configured to execute many instances of the same instructions of the same shader program in parallel. For example, shader core 316 may execute main shader 326 for each pixel defining a given shape. Shader core 316 may send and receive data from an application executing on CPU 302. In the example, constant 306 used for the execution of shaders 324 to 326 may be stored in constant memory 320 (e.g., read / write constant RAM) or GPR 318. Shader core 316 can load constant 306 into constant memory 320. In another example, execution of early preamble shader 324 or feedback shader 325 may enable the storage of constant values or sets of constant values in on-chip memory such as constant memory 320 (e.g., constant RAM), GPU memory 322, or system memory 124. Constant memory 320 may include memory accessible to all aspects of shader core 316, rather than just a specific portion reserved for a particular thread (e.g., values stored in GPR 318).
[0056] Figure 4 illustrates an example GPU 400. Specifically, Figure 4 illustrates an example of a stream processor (SP) system in GPU 400. As shown in Figure 4, GPU 400 includes a high-level sequencer (HLSQ) 402, a texture processor (TP) 406, a level 1 (L1) cache (clustered cache (CCHE)) 407, a level 2 (L2) cache (UCHE) 408, a render backend (RB) 410, and a vertex cache (VPC) 412. GPU 400 also includes an SP 420, a main engine 422, a sequencer 424, a local buffer 426, a wave scheduler 428, a texture array (TEX) 430, an instruction cache 432, an arithmetic logic unit (ALU) 434, a GPR 436, a dispatcher 438, and a memory load / store unit (MEM LDST) 440.
[0057] As shown in Figure 4, each unit or block in GPU 400 can transfer data or information to other blocks. For example, HLSQ 402 can transfer commands to the main engine 422. Furthermore, HLSQ 402 can transfer vertex threads, vertex attributes, pixel threads, pixel attributes, and / or computation commands to the sequencer 424. TP 406 can receive texture requests from TEX 430 and transfer texture elements (texels) back to TEX 430. Additionally, TP 406 can transfer memory read requests to CCHE 407 or UCHE 408 and receive memory data from CCHE 407 or UCHE 408. CCHE 407 or UCHE 408 can also receive memory read or write requests from MEM LDST 440 and transfer memory data back to MEM LDST 440, and receive memory read or write requests from RB 410 and transfer memory data back to RB 410. Furthermore, RB 410 can receive color-formatted output from GPR 436, for example, via dispatcher 438. VPC 412 can also receive vertex-formatted output from GPR 436, for example, via dispatcher 438. GPR 436 can transmit address data to, or receive write-back data from, MEM LDST 440. GPR 436 can also transmit temporary data to and receive temporary data from ALU 434. Additionally, ALU 434 can transmit address or predicate information to wave scheduler 428 and receive instructions from wave scheduler 428. Local buffer 426 can transmit constant data to ALU 434. TEX 430 can also receive texture attributes from GPR 436 or transmit texture data to GPR 436, and receive constant data from local buffer 426. Furthermore, TEX 430 can receive texture requests from wave scheduler 428. The MEM LDST 440 can transfer constant data to / receive constant data from the local buffer 426. The sequencer 424 can transfer wave data to the wave scheduler 428 and to the GPR 436. The sequencer 424 can allocate resources and local memory. Furthermore, the sequencer 424 can allocate wave slots and any associated GPR 436 space.
For example, when the HLSQ 402 issues a pixel tile workload to the SP 420, the sequencer 424 can allocate wave slots or GPR 436 space. The main engine 422 can transfer program data to the instruction cache 432, transfer constant data to the local buffer 426, and receive instructions from the MEM LDST 440. The instruction cache 432 can transfer instructions or decoded information to the wave scheduler 428. The wave scheduler 428 can transfer read requests to the local buffer 426 and memory requests to the MEM LDST 440.
[0058] As further shown in Figure 4, the HLSQ 402 can prepare one or more context states for the SP 420. For example, the HLSQ 402 can prepare context states for different types of data, such as global register data, shader constant data, buffer descriptors, instructions, etc. Additionally, the HLSQ 402 can embed context states into the command stream to the SP 420. The main engine 422 can parse the command stream from the HLSQ 402 and set the SP global state. Furthermore, the main engine 422 can populate or add to the instruction cache 432 and / or the local buffer 426 or constant buffer. In some aspects, an internal functional unit called a state processor 402a may exist within the HLSQ 402. The state processor 402a may be a single-fiber scalar processor capable of executing special shader programs (e.g., preamble shaders). Preamble shaders can be generated by the GPU compiler to load constant data from different buffer objects. Furthermore, the preamble shader can bind buffer objects to a single constant buffer, such as a processed constant buffer. Furthermore, the HLSQ 402 can execute a preamble shader, thereby skipping the use of the main shader. In some cases, the main shader can perform different shading tasks, such as normal vertex shading and / or fragment shading procedures. Additionally, the HLSQ 402 may include a data packer 402b.
[0059] Additionally, as shown in Figure 4, the SP 420 is not limited to executing preambles if the HLSQ 402 decides to skip preamble execution. For example, the SP 420 can also handle regular graphics workloads such as vertex shading and / or fragment shading. In some aspects, the SP 420 can utilize its execution units and memory to process computational tasks as a general-purpose GPU (GPGPU). Multiple parallel instruction execution units, such as ALUs, elementary function units (EFUs), branch units, TEXs, and general-purpose memory read and write units (also known as LDSTs), may be present within the SP 420. The SP 420 may also include on-chip memory, such as the GPR 436, which can store per-fiber private data. Furthermore, the SP 420 may include a local buffer 426 that stores per-shader or per-kernel constant data, per-wave uniform constants (also known as uGPRs), and per-workgroup (WG) local memory (LM). Processing a preamble shader can occupy one slot. Furthermore, most preamble shaders can use only uGPRs without using GPRs, and ALU instructions can be executed on a scalar ALU. Therefore, the execution of preamble shaders can be associated with high performance and can be power efficient, because any available slot can be used to execute the preamble shader even without GPR space allocation.
[0060] In addition, as shown in Figure 4, dispatcher 438 can extract data from GPR 436. Dispatcher 438 can also perform format conversion and then dispatch the final color to multiple render targets (RTs). Each RT can have one or more components, such as red (R), green (G), blue (B), alpha (A) (RGBA) data, or only the alpha component of RGBA data. Furthermore, each RT can typically be stored in a vector GPR; that is, R3.0 can store red data, R3.1 can store green data, R3.2 can store blue data, and so on. Additionally, the driver program in the SP context register can be used to define the GPR identifier (ID) for storing the RT data.
[0061] Figure 5 is a diagram illustrating another example GPU. More specifically, Figure 5 depicts GPU 500, which includes several different components. As shown in Figure 5, GPU 500 includes UCHE 510 (including L2 cache 511 and L2 cache 512), CCHE 516 (including L1 cache 517 and L1 cache 518), VFD 520, CP 530, HLSQ 540, multiple SPs (e.g., SP 550, SP 551, and SP 552), VPC 560, TSE 570, RAS 572, and a low-resolution Z (LRZ) component (e.g., LRZ 574). As shown in Figure 5, CP 530 can send and receive data to and from HLSQ 540. CCHE 516 can send and receive data to / from HLSQ 540. UCHE 510 can also send and receive data to / from HLSQ 540. L2 cache 511 and L2 cache 512 can send and receive data to / from VFD 520. Furthermore, VFD 520 can send data to HLSQ 540 and to SPs 550 to 552. Furthermore, SPs 550 to 552 can send and receive data to / from VPC 560. Furthermore, VPC 560 can send and receive data to / from HLSQ 540. Data can also be sent from VPC 560 to TSE 570, which can then send the data to RAS 572 and then to LRZ 574. The CCHE 516 can send / receive data to / from the VPC 560 and LRZ 574. Additionally, the UCHE 510 can also send / receive data to / from the VPC 560 and LRZ 574.
[0062] As depicted in Figure 5, a GPU (e.g., GPU 500) may include multiple different caches. GPUs utilize caches for various reasons, such as transferring data at sufficiently high speeds. That is, because the growth rate of GPU processing power exceeds memory access speeds, storage resources between the processor and memory (e.g., caches) are already being utilized to transfer data at sufficient speeds. Caches at the GPU are also utilized to transfer data more seamlessly. One benefit of caches is that they provide buffering, so caches and buffers can be similar. For example, a cache can reduce latency by reading data from memory in larger blocks based on subsequent accesses to nearby address locations. Furthermore, caches can increase throughput by assembling multiple small transfers into larger, more efficient memory requests. These benefits are achieved by storing data in blocks called cache lines. A cache line can be a portion of data that can be mapped into the cache. For example, a cache line can be the smallest portion of data that can be mapped into the cache.
[0063] In some respects, each mapped cache line can be associated with a block (e.g., a core line), which is a corresponding region on main memory or back-end storage. Back-end storage can allow for improvements in cache and GPU performance. For example, a database cache can allow for increased throughput and reduced data retrieval latency associated with a back-end database, which improves the overall performance of both the cache and GPU. Furthermore, in some respects, both the cache and main memory / back-end storage can be divided into blocks the size of a cache line. Moreover, all cache mappings can be aligned to these blocks. Cache lines can have a certain size (e.g., between 32 bytes and 512 bytes), and memory transactions can be performed in units of cache lines. Individual cache accesses performed by code executing on the GPU processor may be smaller than these cache line units (e.g., 4 bytes).
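The alignment of memory to cache-line-sized blocks can be illustrated with a short Python sketch. The 64-byte line size is an assumption picked from within the 32-byte to 512-byte range mentioned above, and `line_of` and `lines_touched` are hypothetical helper names used only for this example.

```python
LINE_SIZE = 64  # bytes per cache line (assumed; any power of two in the stated range works)


def line_of(address, line_size=LINE_SIZE):
    """Return (block number, byte offset within the block) for a byte address."""
    return address // line_size, address % line_size


def lines_touched(address, size, line_size=LINE_SIZE):
    """Return the block numbers covered by an access of `size` bytes."""
    first = address // line_size
    last = (address + size - 1) // line_size
    return list(range(first, last + 1))


block, offset = line_of(0x1234)         # a single byte address
aligned_access = lines_touched(0x1234, 4)   # 4-byte access inside one block
straddling_access = lines_touched(0x123E, 4)  # 4-byte access crossing a block boundary
```

Under these assumptions, a 4-byte access usually falls within one cache line, so the memory transaction still moves a full line even though the code requested only a small part of it.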
[0064] Figure 6 is a diagram 600 illustrating an example mapping of a cache. More specifically, Figure 6 depicts a cache mapping 602 for cache 610 and main memory 620. That is, Figure 6 depicts the relationship between cache lines in cache 610 (e.g., cache lines 611, 612, 613, and 614) and blocks in main memory 620 (e.g., blocks 621, 622, 623, 624, 625, 626, 627, and 628). As illustrated in diagram 600 of Figure 6, individual blocks 621 to 628 can be directly mapped to individual cache lines 611 to 614. For example, as illustrated in diagram 600, block 621 can be mapped to cache line 611, block 622 can be mapped to cache line 613, block 625 can be mapped to cache line 612, and block 626 can be mapped to cache line 614. Some of blocks 621 to 628 may not be directly mapped to cache lines 611 to 614. For example, blocks 623, 624, 627, and 628 may not be directly mapped to cache lines 611 to 614. In some aspects, the main memory 620 including blocks 621 to 628 may be a back-end storage device comprising multiple core lines.
[0065] In some caches, valid data (e.g., valid bits) and dirty data (e.g., dirty bits) may correspond to the current cache line state. For example, when a cache line is valid (i.e., in a valid state), it may refer to the cache line being mapped to a block in main memory (e.g., a core line determined by a core identifier (ID) and a core line number). When a cache line is invalid (i.e., in an invalid state), it may be used to map a core line accessed by some request (e.g., an input / output (I / O) request), and that cache line may subsequently become valid. A cache line may return to an invalid state for several different reasons. For example, a cache line may return to an invalid state if it is being evicted, if the core pointed to by the core ID is being removed, if the core pointed to by the core ID is being cleared, if the entire cache is being cleared, during a discard operation on the corresponding core line, or during the processing of a request (e.g., an I / O request) while selecting a cache mode that may perform invalidation.
[0066] In some respects, dirty data or modified data can refer to data associated with a memory block and indicating whether the corresponding memory block has been modified. For example, a dirty bit or modified bit can be a bit associated with a memory block and indicating whether the corresponding memory block has been modified. Dirty data (e.g., a dirty bit) can be set when the processor writes to (i.e., modifies) the memory. For example, dirty data (e.g., a dirty bit) can indicate that its associated memory block has been modified but has not yet been saved to the storage device. That is, "dirty data" can refer to modified data in the cache, but an old or outdated copy of that data still exists in memory. In some cases, when a memory block is to be replaced, its corresponding dirty data (e.g., a dirty bit) can be checked to determine whether the block may need to be written back to secondary memory before being replaced, or whether it can be simply removed. In addition, dirty data (e.g., a dirty bit) can determine whether cache line data stored in the cache is synchronized with the corresponding data on the back-end storage device. For example, if a cache line is dirty, the data on the cache storage device may be up-to-date, and that data may need to be flushed (i.e., removed) at some point in the future (e.g., after flushing, the data can be marked as clean by clearing the dirty bit). Furthermore, a cache line can be considered valid if at least one sector is valid. Similarly, a cache line can be considered dirty if at least one sector is dirty.
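The role of the dirty bit during replacement can be sketched as follows. This is a simplified write-back model under stated assumptions: the `CacheLine` class, the `write` and `evict` helpers, and the dictionary used as backing store are hypothetical and stand in for hardware state and secondary memory.

```python
class CacheLine:
    """One cached copy of a memory block, with valid and dirty state bits."""

    def __init__(self, block, data):
        self.block = block
        self.data = data
        self.valid = True
        self.dirty = False


def write(line, data):
    """A processor write modifies the cached copy and sets the dirty bit."""
    line.data = data
    line.dirty = True


def evict(line, backing_store):
    """Write the line back only if it is dirty, then invalidate it."""
    wrote_back = False
    if line.valid and line.dirty:
        backing_store[line.block] = line.data  # flush the modified data
        line.dirty = False                     # the line is clean after flushing
        wrote_back = True
    line.valid = False
    return wrote_back


backing_store = {0: "old", 1: "unchanged"}
modified = CacheLine(0, backing_store[0])
untouched = CacheLine(1, backing_store[1])

write(modified, "new")
stale_before_evict = backing_store[0]  # memory still holds the outdated copy
modified_written_back = evict(modified, backing_store)
untouched_written_back = evict(untouched, backing_store)
```

The clean line is simply dropped, while the dirty line costs an extra write to the backing store before it can be replaced.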
[0067] In some cases, the goal of caching (e.g., caches in GPUs or CPUs) may be to improve the performance of repeated accesses to the same data, because caches can maintain copies of a subset of data in memory. Therefore, subsequent accesses to data already present in the cache may not utilize expensive memory access transactions. Since some caches may have a capacity smaller than the memory size (e.g., the memory size of a GPU system), the current cached dataset may constantly change. This constant change in cached data may be due to memory access patterns of the executed code and / or the cache's data replacement strategy. In some aspects, a goal of caching may be to maximize the cache hit rate (i.e., the percentage of data accesses that can be served by data in the cache). By maximizing the cache hit rate, the overall performance of the cache (e.g., a cache at the GPU or CPU) may be improved. This performance improvement may be important for the overall system including the cache (e.g., GPU or CPU) because the system may utilize the data to serve many concurrently running threads.
[0068] A cache may receive multiple requests (e.g., data or content requests) to store or cache data. A cache hit can refer to the event when the data being processed (e.g., requested by a component or application) is successfully retrieved from the cache memory. For example, a cache hit can describe when data or content is successfully found in the cache. That is, a cache hit can mean that a system or application issues a request to retrieve data from the cache, and that specific data is currently in the cache memory. A cache miss can refer to the event when the data being processed (e.g., requested by a component or application) is not successfully retrieved from the cache memory. For example, a cache miss can describe when data or content is not successfully found in the cache. That is, a cache miss can mean that a system or application issues a request to retrieve data from the cache, but that specific data is not currently in the cache memory. A cache can be measured based on the number of data requests it can successfully satisfy. Cache hit rate (i.e., hit percentage or cache hit ratio) is a measure of how many data requests a cache can successfully satisfy compared to the total number of data requests it receives. For example, cache hit rate (i.e., hit rate or cache hit percentage) equals the number of cache hits divided by the total number of data requests. The formula is: Cache hit rate = (Number of cache hits) / (Number of cache hits + Number of cache misses).
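The hit rate formula above can be written directly as a small Python function. The function name `cache_hit_rate` and the sample counts are illustrative assumptions.

```python
def cache_hit_rate(hits, misses):
    """Cache hit rate = hits / (hits + misses)."""
    total = hits + misses
    if total == 0:
        raise ValueError("hit rate is undefined when no requests were received")
    return hits / total


# Example: 75 hits and 25 misses out of 100 data requests.
rate = cache_hit_rate(75, 25)
```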
[0069] There are several different types of caches (e.g., caches utilized by the GPU or CPU). For example, there are fully associative caches, direct-mapped caches, and set-associative caches. A fully associative cache utilizes a least recently used (LRU) caching strategy, where multiple units (e.g., M units) exist, each capable of holding a cache line corresponding to any memory location (e.g., N memory locations). In the event of cache contention, the cache line that has not been accessed for the longest time may be evicted and replaced with a new cache line. A direct-mapped cache maps memory blocks directly to their individual cache lines. A set-associative cache divides the address space into equal groups, each acting as a small fully associative cache.
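The three cache organizations described above can be compared with a small simulation. The sketch below is a behavioral model only: the class names are hypothetical, the set is selected as the block number modulo the number of sets (one common choice, assumed here), and a direct-mapped cache is modeled as a set-associative cache with one line per set.

```python
from collections import OrderedDict


class FullyAssociativeLRU:
    """Any block may occupy any of num_lines lines; evicts the least recently used."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # block -> None, ordered from oldest to newest use

    def access(self, block):
        """Return True on a cache hit, False on a cache miss."""
        if block in self.lines:
            self.lines.move_to_end(block)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[block] = None
        return False


class SetAssociativeLRU:
    """Blocks are split across sets; each set acts as a small fully associative cache."""

    def __init__(self, num_sets, ways):
        self.sets = [FullyAssociativeLRU(ways) for _ in range(num_sets)]

    def access(self, block):
        return self.sets[block % len(self.sets)].access(block)


trace = [0, 2, 0]  # blocks 0 and 2 fall into the same set when there are 2 sets

direct_mapped = SetAssociativeLRU(num_sets=2, ways=1)
direct_results = [direct_mapped.access(b) for b in trace]

two_way = SetAssociativeLRU(num_sets=1, ways=2)
two_way_results = [two_way.access(b) for b in trace]

fully = FullyAssociativeLRU(num_lines=2)
lru_results = [fully.access(b) for b in [1, 2, 1, 3, 2]]
```

With the same total capacity of two lines, the direct-mapped configuration misses on the second access to block 0 because block 2 displaced it, while the two-way configuration keeps both blocks.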
[0070] A cache set index can refer to the size of a cache set, or how many different cache lines each data block can map to. In other words, a cache set index can refer to the number of cache lines associated with a cache set for that cache. Furthermore, a set index (i.e., an index) can be the portion of a cache address that identifies which lines in the cache a particular address can be found in. A cache set can include the number of cache lines in the cache. Cache associativity can refer to the number of cache lines mapped to a set. In other words, cache associativity can refer to the number of multiple different cache lines mapped to the same set. Higher associativity may lead to more efficient cache utilization, but may also increase the power / cost of cache utilization. Similarly, lower associativity may reduce the power / cost of cache utilization, but may lead to less efficient cache utilization. Cache capacity can refer to the amount of data or information that can be stored in the cache. Additionally, cache capacity or associativity can be adjusted based on several different factors, such as cache hit rate. Furthermore, data allocation for a cache can refer to the way data is allocated to the cache.
[0071] Because caches store and retrieve data from memory, they can experience memory latency in some situations. For example, memory latency can be the time elapsed from the initial data request to the actual retrieval of the data (i.e., delay). In other words, memory latency can refer to the time elapsed from initiating a request for data (e.g., a byte or word) in memory until the data is retrieved from memory (e.g., by the processor). In contrast, memory latency measures the actual time elapsed to retrieve data from memory, while memory bandwidth measures the throughput of memory. In some aspects, if data is not in memory or the cache, it may take longer to retrieve the data, resulting in increased memory latency (e.g., the processor may have to communicate with external memory cells). For example, memory latency can be a measure of memory speed, such that faster read operations will have reduced memory latency, while slower read operations will have increased memory latency. Memory latency can be expressed in different time measures (e.g., in actual elapsed time (such as ns) or clock cycles). Furthermore, average memory latency can refer to the average time elapsed from the request for data until the data is actually retrieved. Average memory latency can be calculated or determined based on the average of multiple data requests to the cache.
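Average memory latency, and the conversion between clock cycles and elapsed time, can be illustrated as follows. The function names, the sample latencies, and the 1 GHz clock are assumptions made for this example and are not measurements of any device.

```python
def average_latency(latencies):
    """Average of the latencies observed for multiple data requests."""
    if not latencies:
        raise ValueError("at least one request is needed")
    return sum(latencies) / len(latencies)


def cycles_to_ns(cycles, clock_hz):
    """Convert a latency in clock cycles to nanoseconds for a given clock rate."""
    return cycles * 1e9 / clock_hz


# Two fast requests served by the cache and two slower requests served by memory.
observed_ns = [10, 10, 100, 200]
avg_ns = average_latency(observed_ns)

# 100 cycles at an assumed 1 GHz clock.
latency_ns = cycles_to_ns(100, 1_000_000_000)
```

In this example the two slow requests dominate the average, which is one reason a higher hit rate tends to reduce average memory latency.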
[0072] As described herein, caches can be used to store various types of data or information (e.g., addresses, some data, and some status information). A single cache can be used to store instructions and / or data (e.g., a unified cache when it stores both). A cache that stores instructions can be referred to as an instruction cache (I-cache), and a cache that stores data can be referred to as a data cache (D-cache). A “tag” can be a portion of a memory address stored within the cache that identifies the main memory address associated with a line of data. For example, the most significant bits of a memory address (e.g., a 64-bit address) can tell the cache where some information comes from in main memory (i.e., referred to as a tag). The total cache size can be a measure of the amount of data the cache can hold (e.g., random access memory (RAM) used to hold tag values may not be included in the computation). Additionally, tags may occupy physical space within the cache. In some aspects, storing a small amount of data (e.g., a word) for each tag address can be inefficient, so several locations can be grouped under the same tag. This type of logical block can be referred to as a cache line, which can refer to the smallest loadable unit of the cache (e.g., a contiguous block of words from main memory).
[0073] Additionally, a cache line can be valid when it contains cached data or instructions, and invalid when it does not. One or more status bits may be associated with each line of data. In some aspects, a valid bit may mark a cache line as valid, indicating that it contains usable data. For example, this might mean that the address tag represents a specific actual value. In a data cache, there may be one or more dirty bits indicating whether a cache line (or a portion of a cache line) holds data that differs from the main memory content (i.e., data newer than the main memory content). Furthermore, the stored data may correspond to a memory address (i.e., a location) in the cache. An "index" can be a portion of a memory address that identifies which lines of the cache the address can be found in. For example, an index (e.g., the middle bits of an address) can identify the line. An index can also be used as an address for the cache RAM and may not need to be stored as part of a tag. A "way" can be a subdivision of the cache, where each way is of equal size and is indexed in the same manner. A “set” or “cache set” can include the cache lines from all ways that share a particular index. Likewise, a few bits at the bottom of the address (i.e., the offset) might not be stored in the tag. In some instances, the address of the entire line (i.e., not of every byte within the line) can be utilized.
[0074] Figure 7 is diagram 700, illustrating an example address location mapping for a cache. More specifically, Figure 7 depicts an example storage layout for a cache. As shown in Figure 7, diagram 700 of cache 702 (e.g., a set-associative data cache) depicts address 710, which includes tag 712, set index 714, word 716, and byte 718, as well as data line 730 (e.g., data line 0), data line 731 (e.g., data line 1), data line 732 (e.g., data line 2), data line 733 (e.g., data line 3), data line 738 (e.g., data line N-1 or the 254th data line), data line 739 (e.g., data line N or the 255th data line), and cache line 740. Figure 7 depicts tag 712 as a portion of address 710 within cache 702 that identifies the main memory address associated with a line of data. Set index 714 (i.e., the index) is a portion of address 710 that identifies which lines of cache 702 may contain the address. For example, Figure 7 shows that cache 702 contains N lines of data (e.g., 256 lines), so set index 714 identifies which line (e.g., one of lines 0 to 255) a particular address is found in. Set index 714 maps to all data lines 730 to 739. Diagram 700 also shows word 716 corresponding to cache line 740. Furthermore, Figure 7 shows the valid bits (represented by "V" in Figure 7) and the dirty bits (represented by "D" in Figure 7) in cache 702. As shown in Figure 7, data lines 738 (e.g., data line N-1) and 739 (e.g., data line N) depict that cache 702 may be an N-way associative cache.
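As an illustration of the address mapping described for Figure 7, the following minimal Python sketch splits an address into tag, set index, word, and byte fields. The field widths (256 sets, 8 words per line, 4 bytes per word) and the names `AddressFields` and `split_address` are assumptions chosen for the example; they are not values taken from this disclosure.

```python
from typing import NamedTuple


class AddressFields(NamedTuple):
    tag: int
    set_index: int
    word: int
    byte: int


def split_address(addr: int, num_sets: int = 256,
                  words_per_line: int = 8,
                  bytes_per_word: int = 4) -> AddressFields:
    """Split an address into (tag, set index, word, byte).

    Assumes all three size parameters are powers of two (hypothetical values).
    """
    byte_bits = bytes_per_word.bit_length() - 1
    word_bits = words_per_line.bit_length() - 1
    index_bits = num_sets.bit_length() - 1

    byte = addr & (bytes_per_word - 1)                  # lowest bits
    word = (addr >> byte_bits) & (words_per_line - 1)   # word within the line
    set_index = (addr >> (byte_bits + word_bits)) & (num_sets - 1)
    tag = addr >> (byte_bits + word_bits + index_bits)  # remaining upper bits
    return AddressFields(tag, set_index, word, byte)


fields = split_address(0x12345678)
print(fields)
```

Under these assumed widths, the offset (word and byte) bits and the index bits do not need to be stored in the tag, which is why only the upper bits are kept as the tag.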
[0075] Figure 8 is diagram 800, illustrating an example cache architecture. More specifically, Figure 8 depicts an example architecture for a system cache and a level 1 (L1) cache. As shown in Figure 8, diagram 800 includes CPU 810 (which includes L1 cache 812), GPU 820, and system cache 830. Figure 8 depicts that CPU 810 can transfer information to system cache 830; for example, L1 cache 812 can transfer information to system cache 830. GPU 820 can likewise transfer information to system cache 830. As depicted in Figure 8, the associativity of L1 cache 812 and / or system cache 830 may be limited by the timing of information transfers to and from CPU 810 and / or GPU 820. Additionally, as shown in Figure 8, the associativity of L1 cache 812 and / or system cache 830 may be limited by the power utilized by CPU 810 and / or GPU 820.
[0076] Some aspects of graphics processing may utilize a particular GPU architecture and / or application architecture. For example, aspects of graphics processing may utilize a general-purpose GPU (GPGPU) architecture, which includes streaming multiprocessors (SMs), shader cores, interconnect units, dynamic random access memory (DRAM), and / or multiple different caches (e.g., L1 cache, L2 cache, and / or last level cache (LLC)). In some GPU architectures, multiple SMs, shader cores, and L1 caches may be connected to interconnect units, which may in turn connect to L2 caches and DRAM. Additionally, in application architectures, an application may include multiple kernels, and each kernel may include cooperative thread arrays (CTAs), where each CTA includes multiple thread bundles.
[0077] As indicated herein, a kernel can be a programming operation manager or a programming thread at the GPU. A kernel can be executed in parallel by an array of threads, where all threads run the same code. Each thread can have an identifier (ID) used to compute memory addresses and make control decisions. A thread bundle can be a collection of threads (e.g., 32 threads) executed simultaneously by a streaming multiprocessor (SM). A thread bundle can be the basic unit of execution, and multiple thread bundles can execute concurrently on an SM. When a program on the CPU invokes a kernel grid, the blocks of that grid can be enumerated and assigned to SMs with available execution capacity. The threads of a thread block can execute concurrently on an SM, and multiple thread blocks can execute concurrently on a single SM. As a thread block terminates, a new block is launched on the vacated SM. The mapping between thread bundles and thread blocks can impact kernel performance. Additionally, a clock, or GPU clock, can be a logical tick or time used to synchronize GPU actions. The clock source manages how GPU components derive their clocks. A streaming multiprocessor (SM) can be a single-instruction, multiple-thread processor with multiple shader cores for integer processing (e.g., shader processors (SPs)) and special function units (SFUs) (e.g., for calculating functions such as sine, cosine, and root mean square (RMS)). An SM may have load / store (LD / ST) units for loading and storing data between memory and registers. An SM may also have an L1 cache, a shared cache, and a large register file. A cooperative thread array (CTA) can be the basic workload unit allocated to an SM within the GPU. Threads in a CTA can be subgrouped into thread bundles / wavefronts, which are the smallest units of execution sharing the same program counter.
A last-level cache (LLC) can be the last level of cache from the GPU context, such as an extended cache for the SM. An interconnect unit can be a crossbar switch that performs multi-master arbitration, through which the GPU connects to the rest of the system. Furthermore, a point of serialization / point of coherence (PoS / PoC) can be a point in a system-on-chip (SoC) node where each master in the system sees the same coherent copy of data.
[0078] Furthermore, some types of graphics processing may leverage data locality patterns within GPU workloads. For example, data fetched by per-thread bundle load instructions may exhibit locality. This locality can be broadly categorized into four types: streaming locality, inter-thread bundle locality, intra-thread bundle locality, and inter-thread bundle + intra-thread bundle locality. Streaming data is brought into the data cache when it is fetched on demand but may never be reused; it therefore exhibits zero temporal locality. If data fetched by a load instruction from one thread bundle is also accessed by the same load program counter (PC) across multiple thread bundles, it can be defined as exhibiting inter-thread bundle locality. If data fetched by a load instruction from one thread bundle is used exclusively within the same thread bundle, it may exhibit intra-thread bundle locality. Additionally, inter-thread bundle + intra-thread bundle locality refers to data brought into the cache by one thread bundle and then repeatedly referenced by other thread bundles as well as by the original thread bundle. The locality of each workload can be the reusability of addresses or times associated with the cache (e.g., how close the next cache access is to the previous one), and can include spatial locality and / or temporal locality. Spatial locality is associated with the reusability of cache addresses (e.g., how close the location of the next cache access is to the location of the previous cache access). Temporal locality is associated with the reusability of cache access times (e.g., how close in time the next cache access is to the previous cache access).
[0079] Figure 9 is diagram 900, illustrating an example graph of data locality types at the GPU. More specifically, diagram 900 depicts graph 902, which shows data locality types for different GPU workloads. As shown in Figure 9, diagram 900 includes multiple data locality types: streaming locality 910, inter-thread bundle locality 920, intra-thread bundle locality 930, and inter-thread bundle + intra-thread bundle locality 940. Figure 9 also depicts several cache-sensitive (CS) workloads 950, cache-medium (CM) workloads 960, and cache-insensitive (CI) workloads 970. Cache-sensitive workloads 950 include breadth-first search (BFS) workloads, K-means (KMN) workloads, inverted index (IIX) workloads, word count (WC) workloads, graph coloring (GC) workloads, and single-source shortest path (SSP) workloads. Cache-medium workloads 960 include sparse matrix dense vector multiplication (SPMV) workloads, matrix multiplication (MM) workloads, similarity scoring (SS) workloads, and connected component labeling (CCL) workloads. In addition, cache-insensitive workloads 970 include Gaussian elimination (GE) workloads, speckle reduction anisotropic diffusion (SRD) workloads, magnetic resonance imaging gridding (MRI) workloads, register-tiled matrix-matrix multiplication (SGM) workloads, 2D stencil workloads, and all-pairs shortest path (APS) workloads. That is, Figure 9 shows representative GPU workloads (grouped as cache-sensitive workloads 950, cache-medium workloads 960, and cache-insensitive workloads 970) categorized by data locality pattern (e.g., streaming locality 910, inter-thread bundle locality 920, intra-thread bundle locality 930, and inter-thread bundle + intra-thread bundle locality 940).
[0080] In some aspects, the increased latency from memory operations can be a significant performance bottleneck in graphics processing units (GPUs). This is because sharing a level 1 (L1) data cache (L1D) (e.g., a data cache that stores global data structures) across dozens of thread bundles (i.e., collections of threads) can cause significant cache contention and premature data eviction. In some per-thread bundle cache management schemes, streaming data in active thread bundles (e.g., data brought into the cache but not subsequently used) can waste cache space. Similarly, thread bundle-level cache bypass schemes used to reduce evictions caused by inter-thread bundle interference may not be ideal, because a load instruction exhibiting strong temporal locality may be forced to bypass the cache if it originates from a thread bundle that is currently performing a bypass operation. Therefore, cache management schemes based on per-load locality behavior may be beneficial. Furthermore, in some GPU applications, each global load instruction may behave stably throughout the entire application execution. That is, whether a load instruction benefits from thread bundle throttling or cache bypass may be independent of the thread bundle ID or of the point in the code at which the load is executed. This property can follow from the GPU's unique software execution model, in which all thread bundles originate from the same kernel code. Therefore, the cache-relevant attributes of a load (such as the data locality type or cache sensitivity of a certain load instruction detected in one thread bundle) can be broadly applied to executions of the same load in all other thread bundles. Conversely, failing to determine the locality of each workload can lead to inefficient use of the GPU cache. Based on the above, it may be beneficial to execute workloads based on the locality of each workload. That is, detecting the locality of each workload at the GPU may be beneficial.
Furthermore, it may be beneficial to utilize or bypass a cache based on the detected locality of each workload. Additionally, cache management schemes that use the GPU cache efficiently based on the locality of each workload may be beneficial.
[0081] Aspects of this disclosure can execute workloads (e.g., workloads at the GPU) based on the locality of each workload. For example, aspects of this disclosure can detect the locality of each workload at the GPU. The aspects presented herein can utilize a cache (e.g., an L1 data cache) or bypass the cache based on the detected workload locality. By doing so, aspects of this disclosure can utilize GPU caches efficiently. That is, the aspects presented herein can utilize a cache management scheme that is based on per-load locality behavior and that makes efficient use of GPU caches. For example, the aspects presented herein can identify or detect the locality of each workload in a set of workloads corresponding to a data thread. Furthermore, the aspects presented herein can determine and store the cache line access patterns of a first set of workloads. Based on the cache line access patterns of the first set of workloads, the aspects presented herein can store data for at least one second set of workloads, or can bypass (i.e., avoid storing) data for the at least one second set of workloads. By optimizing when to store data in the cache (or when to bypass the cache to avoid storing it), the aspects presented herein can help optimize GPU processing speed, the amount of memory utilized by the GPU, and / or the amount of power consumed by the GPU.
[0082] The aspects presented herein can utilize certain types of cache management schemes to detect the locality type of each workload. For example, the aspects presented herein can utilize a thread bundle (warp) access pattern-aware cache (WAPC) management scheme that dynamically detects the locality type of each load instruction by monitoring accesses from an exemplary thread bundle. The WAPC management scheme can then use the detected locality type to selectively apply cache bypassing and / or cache line locking based on the locality characteristics of each load. Therefore, the WAPC management scheme described herein can significantly improve GPU performance (e.g., increase GPU processing speed and / or reduce the amount of power consumed at the GPU). Furthermore, combining locality observations with GPU workload characteristics (e.g., a load instruction that fetches data of a specific locality type may tend to fetch data of the same locality type throughout kernel execution) may be a novel idea of the WAPC management scheme utilized herein.
[0083] In some aspects of the GPU execution model, each kernel is executed by thousands of thread bundles, all of which execute the same code. The aspects presented herein utilize one of the running thread bundles as the bootstrap thread bundle for each kernel. The bootstrap thread bundle can be one of the earliest running thread bundles in each kernel. Furthermore, the bootstrap thread bundle can be used to collect access locality patterns for each kernel, which are stored in a table (e.g., a locality history table or an alias table). This table can be used to look up data locality patterns for subsequent cooperative thread arrays (CTAs). The table can also be used to make cache management policy decisions, such as cache bypassing or cache line locking, based on workload type (e.g., cache-sensitive (CS) workloads, cache-medium (CM) workloads, and cache-insensitive (CI) workloads). This, in turn, can help improve GPU performance because it mitigates the limitations imposed by currently implemented caching policies (e.g., for the L1 data cache (L1D)), and it avoids unnecessary evictions caused by a least recently used (LRU) replacement policy. This cache management scheme also classifies kernel data access patterns without any performance overhead and / or without requiring implementation in the load / store (LD / ST) pipeline of each shader processor (SP). Owing to the efficient cache management policy based on data access patterns, the aspects presented herein can also achieve a significant increase in instructions per clock cycle (IPC).
[0084] Figure 10 is diagram 1000, illustrating an example of an execution scheme. More specifically, diagram 1000 depicts thread bundle execution scheme 1002. As shown in Figure 10, thread bundle execution scheme 1002 includes bootstrap thread bundle 1010, first thread bundle (W1) 1020, second thread bundle (W2) 1030, nth thread bundle (Wn) 1040, and kernel 1050. Figure 10 depicts that bootstrap thread bundle 1010 completes execution after a certain time period. Once bootstrap thread bundle 1010 has completed, the aspects presented herein can be adjusted based on the collected statistics. Subsequently, several thread bundles (e.g., bootstrap thread bundle 1010, first thread bundle (W1) 1020, second thread bundle (W2) 1030, and nth thread bundle (Wn) 1040) can be executed, after which kernel 1050 can complete execution. The same sequence can then repeat: after a certain time period, bootstrap thread bundle 1010 completes execution, the aspects presented herein are adjusted based on the collected statistics, and several thread bundles (e.g., bootstrap thread bundle 1010, first thread bundle (W1) 1020, second thread bundle (W2) 1030, and nth thread bundle (Wn) 1040) are executed again.
[0085] The aspects presented herein can utilize a thread bundle access pattern-aware cache (WAPC) management scheme that takes thread bundle data locality into account. The WAPC management scheme builds on the observation that each load exhibits a persistent data locality type, and uses it to improve the utilization of GPU caches (e.g., L1 data caches). The WAPC management scheme first detects the data locality type of each load instruction by monitoring the cache access patterns of an exemplary thread bundle. Because multiple thread bundles cause significant interference in the cache, data locality types may not be inferable from the regular cache; therefore, the WAPC management scheme uses a dedicated cache tag array to track the data sharing behavior of a single thread bundle. The locality type inferred for each load in the bootstrap thread bundle can then be applied to the same loads across all thread bundles within the kernel, because cache access patterns can exhibit strong consistency across thread bundles. Subsequently, the WAPC management scheme can apply a load-specific cache management scheme for each data locality type (e.g., streaming locality, inter-thread bundle locality, intra-thread bundle locality, and / or inter-thread bundle + intra-thread bundle locality).
[0086] In some aspects, in cache-medium (CM) and cache-insensitive (CI) workloads, streaming data can account for a large portion of the requested data. This means that resources such as cache lines and miss status holding register (MSHR) entries can be wasted when streaming data is fetched into the cache. Furthermore, even if these applications have a few cache lines with strong locality, those lines can be evicted by streaming data. Therefore, the best way to handle streaming data might be to bypass the cache (e.g., the L1 cache) completely and feed data directly to the compute core from another cache (e.g., the L2 cache) and its associated interconnect network. Additionally, a major cause of inter-thread bundle locality might be strided access to large data arrays across different thread bundles. The index of the data array accessed by each thread can be calculated as a linear function of the thread identifier (ID) and / or the CTA ID, and the data addresses requested by the threads can be merged into a single cache line space. However, if the addresses are not aligned, the merged request might span two cache line regions. In this scenario, a single thread bundle access might bring in two cache lines, but only a portion of the second cache line might be accessed by the current thread bundle. An adjacent thread bundle might access the remaining data in the second cache line and then fetch a new cache line that is itself only partially used. This misaligned data access can lead to sharing between thread bundles. Another cause of inter-thread bundle locality is small data request sizes. Even if the thread request addresses are merged, small data sizes (e.g., 1 byte per thread) may not occupy an entire cache line. Adjacent thread bundles might then consume the remaining cache line space. In some cases, for cache-sensitive (CS) workloads, intra-thread bundle locality can be the dominant locality type.
Cache lines allocated by loads of the intra-thread bundle locality type may not be efficiently reused, even if they are referenced multiple times. Furthermore, the intra-thread bundle locality type can have long reuse distances. This is a consequence of GPU thread bundle schedulers interleaving thread bundles, so that instructions that are close to each other within a single thread bundle can actually be separated by long time intervals. Therefore, data of the intra-thread bundle locality type can be frequently disturbed by a large number of accesses from other thread bundles, leading to premature eviction of cache lines. The WAPC management scheme used herein can exploit this data locality by protecting cache lines allocated by loads of the intra-thread bundle locality type (until they have mostly been reused).
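The misaligned and small-request cases described above can be illustrated with a short Python sketch that lists which cache lines a merged thread bundle request touches. The 128-byte line size, the 32-thread bundle, and the function name `lines_touched` are assumptions made for the example, not parameters stated in this disclosure.

```python
from typing import List


def lines_touched(base_addr: int, bytes_per_thread: int = 4,
                  threads: int = 32, line_size: int = 128) -> List[int]:
    """Return the cache line numbers covered by one merged request.

    Assumes the threads access consecutive elements starting at base_addr
    (a linear function of thread ID); sizes are hypothetical.
    """
    first = base_addr // line_size
    last = (base_addr + bytes_per_thread * threads - 1) // line_size
    return list(range(first, last + 1))


# Aligned request: one thread bundle fills exactly one line.
print(lines_touched(0))      # [0]
# Misaligned request: the same amount of data spans two lines.
print(lines_touched(64))     # [0, 1]
# The adjacent thread bundle reuses line 1 and brings in line 2.
print(lines_touched(192))    # [1, 2]
# Small requests (1 byte per thread): two thread bundles share line 0.
print(lines_touched(0, bytes_per_thread=1), lines_touched(32, bytes_per_thread=1))
```

In the misaligned case, line 1 appears in the requests of both adjacent thread bundles, which is the kind of sharing that shows up as inter-thread bundle locality.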
[0087] Figure 11 is diagram 1100, illustrating an example of a cache management scheme. More specifically, diagram 1100 depicts cache management scheme 1102, which includes an architecture for a WAPC management scheme. As shown in Figure 11, diagram 1100 includes thread bundle (load) 1110, threads 1112, address generator 1114, merger 1116, alias table 1118, cache management policy criteria 1120, L1 data cache 1130, protection scoreboard 1140, thread bundle tag directory 1150, locality information table 1160, and path 1170 for bypassing the cache. Figure 11 depicts thread bundle (load) 1110 being separated into threads 1112, which are sent to address generator 1114 and subsequently to merger 1116. Information associated with the load program counters (PCs) of a set of workloads can be converted into a set of load identifiers (IDs) for that set of workloads. The load PCs and load IDs can be stored in alias table 1118. This information, along with information from merger 1116, is transferred from alias table 1118 to L1 data cache 1130. L1 data cache 1130 may include multiple tags and data, as well as protection bits, thread bundle IDs, access counts, and / or tag addresses. Furthermore, L1 data cache 1130 may store information for cache management policy criteria 1120, which may include the data locality type (e.g., streaming locality, inter-thread bundle locality, intra-thread bundle locality, and / or inter-thread bundle + intra-thread bundle locality), the total access count, and the access count of the allocating thread bundle. Information may also be transmitted to protection scoreboard 1140, which may determine whether locality information associated with the locality of each workload in the at least one second set of workloads is configured or stored. Protection scoreboard 1140 may transmit information to a thread bundle tag directory (WTD) (e.g., thread bundle tag directory 1150).
Additionally, information from L1 data cache 1130 may be transmitted to thread bundle tag directory 1150. Thread bundle tag directory 1150 may include a first load ID, a last load ID, an access count, a thread bundle access count, and / or a tag address. Thread bundle tag directory 1150 may transmit information to a locality information table (LIT) (e.g., locality information table 1160). Locality information table 1160 may include a valid bit, management information, a last load ID, and an access count. The management information may be cache line management information (e.g., information about whether to store or bypass a cache line). L1 data cache 1130 can output information to path 1170 for bypassing the cache, which may include an indication of whether to store or avoid storing data for a set of workloads. The information on path 1170 for bypassing the cache can be transmitted to the interconnect unit.
[0088] As shown in Figure 11, the aspects presented herein utilize cache management scheme 1102, which modifies the L1 cache access pipeline of the LD / ST unit in the GPU core. The aspects presented herein may include structures such as an alias table (AT) (e.g., alias table 1118), a thread bundle tag directory (WTD) (e.g., thread bundle tag directory 1150), a locality information table (LIT) (e.g., locality information table 1160), a protection scoreboard (PSB) (e.g., protection scoreboard 1140), and a cache bypass path (e.g., path 1170 for bypassing the cache). The WAPC management scheme utilized herein can track the access history of the bootstrap thread bundle in thread bundle tag directory 1150. Figure 11 shows the structure of an entry in the thread bundle tag directory. Each tag entry may contain four additional fields beyond the usual tag information. The first load ID and last load ID fields store the load instruction that initially allocated the given cache line and the load instruction that last accessed that cache line, respectively. The access count field stores the total number of times the given cache line was accessed by any thread bundle (including the bootstrap thread bundle), while the thread bundle access count field tracks only the number of times the bootstrap thread bundle accessed the cache line. When a global load instruction is executed for the first time, the load PC can be hashed to create a shorter load ID. The load PC and load ID can then be stored in an N-entry content-addressable memory (CAM) in alias table 1118.
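The alias table lookup described above can be sketched in Python as follows. This is a minimal illustration only: the XOR-folding hash, the 4-bit load ID width, the 16-entry capacity, and the behavior when the table is full are all assumptions, since the text specifies only that the load PC is hashed into a shorter load ID and stored in an N-entry CAM.

```python
from typing import Dict, Optional


class AliasTable:
    """Maps a load PC to a short load ID (models the CAM as a dict)."""

    def __init__(self, num_entries: int = 16, id_bits: int = 4) -> None:
        self.num_entries = num_entries   # hypothetical CAM capacity (N)
        self.id_bits = id_bits           # hypothetical load ID width
        self.entries: Dict[int, int] = {}

    def _hash_pc(self, pc: int) -> int:
        # Assumed hash: XOR-fold the PC into id_bits bits.
        mask = (1 << self.id_bits) - 1
        load_id = 0
        while pc:
            load_id ^= pc & mask
            pc >>= self.id_bits
        return load_id

    def lookup(self, pc: int) -> Optional[int]:
        """Return the load ID for pc, allocating an entry on first use."""
        if pc in self.entries:           # load executed before: CAM hit
            return self.entries[pc]
        if len(self.entries) >= self.num_entries:
            return None                  # assumed: untracked when the CAM is full
        load_id = self._hash_pc(pc)
        self.entries[pc] = load_id
        return load_id


table = AliasTable()
print(table.lookup(0x1A8))  # first execution: hash and store
print(table.lookup(0x1A8))  # later executions: retrieve the stored ID
```

Because the load ID is shorter than the PC, two different loads may hash to the same ID; how such collisions are handled is not covered by this sketch.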
[0089] As further shown in Figure 11, in some aspects, if the load has been executed at least once before, alias table 1118 may already contain an entry for the load PC, and the load ID can be retrieved. If the load instruction originates from the bootstrap thread bundle, an index into thread bundle tag directory 1150 can be generated from the load address. Thread bundle tag directory 1150 can function similarly to a cache tag array. For example, when a memory request from the bootstrap thread bundle misses in thread bundle tag directory 1150, the tag address of the request can be allocated in thread bundle tag directory 1150. Additionally, the load ID (the hashed PC of the load instruction) can be recorded in the first load ID field, and the access count field and the thread bundle access count field can each be set to 1. If the tag is hit by other requests after allocation, the load IDs of these requests can be stored in the last load ID field. Later accesses from the bootstrap thread bundle to this cache line can increment both the access count field and the thread bundle access count field. The access count field alone can be incremented if a load instruction from any other thread bundle (i.e., other than the bootstrap thread bundle) hits the WTD entry. If a single thread bundle load generates more than two memory addresses (e.g., an unmerged load), the first two requests may allocate WTD tags, while the other accesses from the same load may be discarded. Requests from the bootstrap thread bundle may also hit in L1 data cache 1130 but miss in thread bundle tag directory 1150. This can happen when a cache line is initially allocated by a thread bundle other than the bootstrap thread bundle, and a request from the bootstrap thread bundle then hits the cache. In this case, the access count from the L1 cache tag can be used to initialize the WTD tag.
To support this, the L1 cache tag can be augmented with an access count field that tracks only the number of times a cache line has been accessed by any thread bundle. The remaining fields (e.g., the protection bit and the thread bundle ID) can be used for cache line protection.
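The WTD bookkeeping described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the directory is direct-mapped with 64 sets and 128-byte lines, the class and field names are hypothetical, the limit of two allocations per unmerged load is omitted, and an evicted entry is simply overwritten.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WtdEntry:
    tag: int
    first_load_id: int            # load that allocated the line
    last_load_id: int             # load that most recently accessed the line
    access_count: int = 1         # accesses by any thread bundle
    bundle_access_count: int = 1  # accesses by the bootstrap thread bundle only


class ThreadBundleTagDirectory:
    def __init__(self, num_sets: int = 64, line_size: int = 128) -> None:
        self.num_sets = num_sets      # hypothetical geometry
        self.line_size = line_size
        self.entries: Dict[int, WtdEntry] = {}

    def access(self, addr: int, load_id: int, from_bootstrap: bool,
               l1_access_count: int = 1) -> Optional[WtdEntry]:
        line = addr // self.line_size
        index, tag = line % self.num_sets, line // self.num_sets
        entry = self.entries.get(index)
        if entry is not None and entry.tag == tag:
            # Hit: record the accessing load and bump the counters.
            entry.last_load_id = load_id
            entry.access_count += 1
            if from_bootstrap:
                entry.bundle_access_count += 1
            return entry
        if not from_bootstrap:
            return None               # only the bootstrap thread bundle allocates
        # Miss from the bootstrap thread bundle: allocate a new tag. If the
        # line already hit in L1, the caller passes the L1 tag's access count
        # (assumed here to include the current access).
        entry = WtdEntry(tag, load_id, load_id, access_count=l1_access_count)
        self.entries[index] = entry
        return entry


wtd = ThreadBundleTagDirectory()
wtd.access(0x1000, load_id=3, from_bootstrap=True)        # allocate
wtd.access(0x1000, load_id=5, from_bootstrap=False)       # another thread bundle hits
print(wtd.access(0x1040, load_id=3, from_bootstrap=True)) # bootstrap reuse, same line
```

Comparing `access_count` with `bundle_access_count` afterwards indicates whether the line was reused mainly by the bootstrap thread bundle or by other thread bundles.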
[0090] As further shown in Figure 11, locality information table 1160 manages the locality type and data access dependency information tracked for each load. Entries in locality information table (LIT) 1160 may be updated when a WTD tag is evicted due to an address conflict, or when the access count of a WTD entry exceeds a predefined threshold. Each entry in locality information table 1160 can be indexed by the load ID stored in the first load ID field of the WTD entry. The management method field can store the cache management scheme selected for the corresponding load instruction. The management method can be determined from the access count and thread bundle access count fields of the WTD entry, which indicate the locality type of the load according to the criteria defined herein. For streaming data, the management method can be set to bypass; if the load exhibits intra-thread bundle locality, the management method can be set to protect. Otherwise, the normal cache management scheme can be set. The access count and last load ID fields can be copied from the WTD tag. After the WTD tag associated with a load ID is evicted, the same load can be executed again by the bootstrap thread bundle, and a new WTD entry can be allocated. Therefore, WTD entries may be evicted and reallocated multiple times by the same load before the bootstrap thread bundle completes execution. Because the LIT entry is indexed by the load ID each time a WTD entry is evicted, the LIT entry can be updated on each eviction. If the access count in the WTD entry is greater than the access count currently stored in the LIT entry, the LIT entry can be overwritten with the new information. Finally, after the bootstrap thread bundle completes execution, all WTD entries can be scanned, and the LIT can be updated for each WTD entry. All WTD entries can then be invalidated and reused to monitor a different kernel execution.
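The LIT update rule described above can be sketched in Python as follows. The classification thresholds in `classify` are assumptions made for illustration (the text refers to criteria based on the two counters without listing them here), and the table size and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


def classify(access_count: int, bundle_access_count: int) -> str:
    """Assumed criteria for the locality type of a load."""
    if access_count <= 1:
        return "streaming"              # fetched once, never reused
    if bundle_access_count == access_count:
        return "intra"                  # reused only by the bootstrap thread bundle
    if bundle_access_count == 1:
        return "inter"                  # reused only by other thread bundles
    return "inter+intra"


# Streaming loads bypass, intra-thread bundle loads are protected,
# everything else uses the normal scheme.
MANAGEMENT = {"streaming": "bypass", "intra": "protect"}


@dataclass
class LitEntry:
    valid: bool = False
    management: str = "normal"
    last_load_id: int = 0
    access_count: int = 0


class LocalityInformationTable:
    def __init__(self, num_entries: int = 16) -> None:
        self.entries: List[LitEntry] = [LitEntry() for _ in range(num_entries)]

    def update(self, first_load_id: int, last_load_id: int,
               access_count: int, bundle_access_count: int) -> LitEntry:
        """Called when a WTD entry is evicted, crosses the threshold, or is scanned."""
        entry = self.entries[first_load_id % len(self.entries)]
        # Overwrite only if the WTD entry saw more accesses than recorded so far.
        if entry.valid and access_count <= entry.access_count:
            return entry
        locality = classify(access_count, bundle_access_count)
        entry.valid = True
        entry.management = MANAGEMENT.get(locality, "normal")
        entry.last_load_id = last_load_id
        entry.access_count = access_count
        return entry


lit = LocalityInformationTable()
print(lit.update(first_load_id=3, last_load_id=3, access_count=1, bundle_access_count=1))
print(lit.update(first_load_id=3, last_load_id=5, access_count=4, bundle_access_count=4))
```

In this sketch the second update replaces the first because its access count is larger, so load ID 3 ends up marked for protection rather than bypass.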
[0091] As further shown in Figure 11, in cache management scheme 1102, each load instruction can index the direct-mapped LIT using its load ID. If the LIT entry indexed by the load ID is valid, the cache management method specified in the management method field can be applied to the load. Possible cache management methods include normal, bypass, and / or protect. If the management method is specified as normal, the load may go through the normal GPU cache access process. If the load is classified as bypass, the request from the load may be sent directly to the interconnect without accessing L1 data cache 1130. If the management method for the load is protect, the allocated cache line can be pinned as a protected line by setting the protection bit of the cache tag in L1 data cache 1130. In addition, the thread bundle ID field corresponding to the cache line can be set to the thread bundle ID of the load instruction. Finally, the aspects presented herein can utilize a mechanism to determine when to unpin the protected cache line.
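The three management methods described above can be sketched as a small Python routing function. The names `L1Tag` and `route_load`, the string labels, and the use of `None` to represent an invalid LIT entry are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class L1Tag:
    tag: int
    protected: bool = False          # protection bit
    bundle_id: Optional[int] = None  # thread bundle that owns the protected line


def route_load(management: Optional[str], line_tag: int,
               bundle_id: int) -> Tuple[str, Optional[L1Tag]]:
    """Decide where a load request goes and how its L1 tag is set up.

    management is the LIT management method, or None if the LIT entry
    is invalid (treated here as the normal scheme).
    """
    if management == "bypass":
        # Send straight to the interconnect; no L1 line is allocated.
        return "interconnect", None
    tag = L1Tag(line_tag)
    if management == "protect":
        # Pin the allocated line and remember which thread bundle owns it.
        tag.protected = True
        tag.bundle_id = bundle_id
    return "l1", tag


print(route_load("bypass", 0x91A2, bundle_id=0))
print(route_load("protect", 0x91A2, bundle_id=0))
print(route_load(None, 0x91A2, bundle_id=1))
```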
[0092] As further described in connection with Figure 11, the validity of protected lines can be controlled by the Protection Status Board (PSB). The PSB may have one entry per thread bundle. When a load instruction in a thread bundle allocates a protected cache line, the PSB bit for that thread bundle can be set to 1. The last load ID from the locality information table 1160 can be copied to the PSB entry for that thread bundle. From then on, the PSB can track whether the load instruction mapped to that last load ID has completed execution (upon completion, the PSB bit is reset to zero, meaning the cache line for that thread bundle may no longer be protected). If the last load ID of an entry in the locality information table 1160 is the same as the load ID used to index that entry, the loop indicator bit of the PSB entry can be set to mark the load instruction as part of a loop. In this case, the protected line may remain pinned until the thread bundle exits the loop. The aspects presented herein can use a single instruction, multiple threads (SIMT) stack to track when a load instruction exits the loop. In the specific implementations presented herein, the PSB can track a single protected load instruction for each thread bundle. In practice, within any given thread bundle, it may only be necessary to protect a single load instruction. Therefore, if a previous load instruction in a thread bundle has already set a PSB entry, protection requests from other load instructions in that thread bundle may be ignored until that PSB entry is released.
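The PSB bookkeeping can be modeled with a short Python sketch. The class and method names are hypothetical, and loop exit is represented by an explicit `loop_exited` call rather than by a SIMT stack, which is a simplification made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PSBEntry:
    active: bool = False    # the PSB bit: 1 while the bundle protects a line
    last_load_id: int = -1  # copied from the LIT entry
    in_loop: bool = False   # loop indicator bit


class ProtectionStatusBoard:
    def __init__(self, num_bundles: int) -> None:
        self.entries: List[PSBEntry] = [PSBEntry() for _ in range(num_bundles)]

    def protect(self, bundle: int, load_id: int, lit_last_load_id: int) -> bool:
        """Register a protected allocation; ignored if the bundle already has one."""
        entry = self.entries[bundle]
        if entry.active:
            return False
        entry.active = True
        entry.last_load_id = lit_last_load_id
        # Same first and last load ID marks the load as part of a loop.
        entry.in_loop = (lit_last_load_id == load_id)
        return True

    def load_completed(self, bundle: int, load_id: int) -> None:
        """Release protection once the recorded last load has executed."""
        entry = self.entries[bundle]
        if entry.active and not entry.in_loop and entry.last_load_id == load_id:
            self.entries[bundle] = PSBEntry()

    def loop_exited(self, bundle: int) -> None:
        """Release loop-based protection when the bundle leaves the loop."""
        entry = self.entries[bundle]
        if entry.active and entry.in_loop:
            self.entries[bundle] = PSBEntry()

    def is_protecting(self, bundle: int) -> bool:
        return self.entries[bundle].active
```

Keeping one entry per thread bundle means a second protection request from the same bundle is simply dropped, as the paragraph above describes.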
[0093] Figure 12 includes diagrams 1200, 1220, 1240, and 1260, which illustrate four examples of cache management schemes, including WAPC management schemes. As shown in Figure 12, diagram 1200 includes thread bundle 0 1202, load A 1204, thread bundle 1 1206, PSB 1210, W0, W1, L1 cache 1216, a protection indication, and a thread bundle ID. Diagram 1220 includes thread bundle 0 1222, load A 1224, thread bundle 1 1226, load A 1228 (bypass), PSB 1230, W0, W1, L1 cache 1236, a protection indication, and a thread bundle ID. Diagram 1240 includes thread bundle 0 1242, load A 1244, load A (hit), a branch, thread bundle 1 1246, load A 1248 (bypass), PSB 1250, W0, W1, L1 cache 1256, a protection indication, and a thread bundle ID. Diagram 1260 includes thread bundle 0 1262, load A 1264, load A (hit), a branch, thread bundle 1 1266, load A 1268, PSB 1270, W0, W1, L1 cache 1276, a protection indication, and a thread bundle ID.
[0094] The example in diagram 1200 assumes that the bootstrap bundle has completed execution and that the data retrieved by the load instruction (load A) was found to be accessed multiple times within a loop during the execution of the bootstrap bundle. The WTD entry accessed via the address from load A 1204 will record the same load ID in the first load and last load fields, and the access count field and the intra-bundle access count field may also be equal at the end of the bootstrap bundle execution. Based on this WTD entry information, the LIT entry can be updated to indicate that load A 1204 is characterized as an intra-bundle locality load. For the LIT entry indexed by load A 1204, the management method may have been set to protect (for intra-bundle loads). Consider two thread bundles: thread bundle 0 1202 and thread bundle 1 1206. First, thread bundle 0 1202 executes load A 1204. The LIT entry indexed by this load ID may be valid, and its management method may be set to protect. At this point, the PSB bit of thread bundle 0 1202 can be set to 1, indicating that data fetched into the L1 cache by load A 1204 may be protected. Once the data is fetched into the L1 cache, the protection bit of the cache line can be set to 1, and the thread bundle ID field of that cache line can be set to thread bundle 0 1202. From then on, thread bundle 0 1202 can continue to protect the cache line fetched by load A 1204 until that load exits the loop.
[0095] As shown in diagram 1220, L1 cache 1236 can be direct-mapped, and thread bundle 1 1226 can execute load A 1228. Since the LIT is indexed by the load ID and is independent of the thread bundle, the LIT entry for load A 1228 from thread bundle 1 1226 may also indicate that the load specifies a protected cache line. Load A 1228 from thread bundle 1 1226 might attempt to fetch data into the same cache line, but the protection bit of this cache line has already been set to 1. For this protected cache line, the thread bundle ID might be set to thread bundle 0 1222, and the PSB entry for thread bundle 0 1222 might still be set to 1, meaning thread bundle 0 1222 has not yet exited the loop. Instead of evicting the protected cache line, the data fetched by load A 1228 from thread bundle 1 1226 can simply bypass the cache, as shown in diagram 1220.
[0096] As shown in diagram 1240, the bit in the PSB for thread bundle 1 1246 may not be set, because the request from thread bundle 1 1246 was not pinned. In diagram 1240, repeated executions of load A 1244 from thread bundle 0 1242 hit the protected cache line multiple times, and eventually the loop terminates. Then, because the lifetime of the protected cache line has ended, the bit in the PSB for thread bundle 0 1242 can be reset.
[0097] As shown in diagram 1260, if thread bundle 1 1266 is still executing load A 1268 in the loop, it might continue to attempt to allocate a cache line and protect it each time the load is executed. After thread bundle 0 1262 exits the loop, thread bundle 1 1266 might find that the cache line is still marked as protected. However, the thread bundle ID stored in that cache line (thread bundle 0 1262) can be used to look up the PSB, and the PSB might indicate that thread bundle 0 1262 is no longer protecting the line. At this point, the cache line protection bit is reset, and thread bundle 1 1266 can allocate the cache line, set its protection bit, and set the thread bundle ID to thread bundle 1 1266. As shown in diagram 1260, the PSB bit for thread bundle 1 1266 can now be set to 1.
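The four-step scenario above (allocate and protect, bypass, hit, then release and reallocate) can be replayed with a minimal Python model of a single direct-mapped cache line. The names `Line` and `protected_load`, the tag values, and the use of a plain list of booleans for the PSB are all assumptions made for this sketch; every load is assumed to carry the protect management method.

```python
from typing import List, Optional


class Line:
    """One direct-mapped L1 line with a protection bit and an owner bundle ID."""

    def __init__(self) -> None:
        self.tag: Optional[int] = None
        self.protect: bool = False
        self.bundle_id: Optional[int] = None


def protected_load(line: Line, psb: List[bool], bundle: int, tag: int) -> str:
    """Handle a load whose management method is protect."""
    if line.tag == tag:
        return "hit"
    # The line is pinned only while its owner's PSB bit is still set.
    if line.protect and line.bundle_id is not None and psb[line.bundle_id]:
        return "bypass"
    # Unprotected, or the owner already released it: take over the line.
    line.tag = tag
    line.protect = True
    line.bundle_id = bundle
    psb[bundle] = True
    return "allocate"


line, psb = Line(), [False, False]
steps = [
    protected_load(line, psb, 0, 0xA0),  # bundle 0 allocates and pins
    protected_load(line, psb, 1, 0xA1),  # bundle 1 conflicts and bypasses
    protected_load(line, psb, 0, 0xA0),  # bundle 0 hits its pinned line
]
psb[0] = False                           # bundle 0 exits the loop
steps.append(protected_load(line, psb, 1, 0xA1))  # bundle 1 takes over
```

The stale protection bit is cleared lazily: nothing happens to the line when bundle 0 leaves the loop, and the release is only noticed when the next conflicting load consults the PSB.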
[0098] The aspects of this disclosure may provide several benefits or advantages. For example, the aspects of this disclosure may utilize or bypass caches (e.g., L1 data caches) based on the detected locality of each workload, so that GPU cache usage can be optimized. More precisely, the aspects presented herein can utilize a cache management scheme based on per-load locality behavior. For example, the aspects presented herein can identify or detect the locality of each workload in a set of workloads corresponding to a set of data threads. Furthermore, the aspects presented herein can determine and store the access patterns of cache lines for a first set of workloads. By doing so, the aspects presented herein can store data for at least one second set of workloads based on the access patterns of cache lines for the first set of workloads. Additionally, the aspects presented herein can bypass (i.e., avoid storing) the data for at least one second set of workloads based on the access patterns of cache lines for the first set of workloads. By optimizing when to store data in the cache (or bypass the cache to avoid storing it), the aspects presented herein can help optimize GPU processing speed, the amount of memory utilized by the GPU, and / or the amount of power consumed by the GPU.
[0099] Figure 13 is a communication flow diagram 1300 for data processing or graphics processing according to one or more techniques of this disclosure. As shown in Figure 13, diagram 1300 includes example communication, in accordance with one or more techniques of this disclosure, between GPU 1302 (e.g., a GPU, a cache at a GPU, a GPU component, another graphics processor, a CPU, a CPU component, or another central processing unit), GPU component 1304 (e.g., a GPU, a cache at a GPU, a GPU component, another graphics processor, a CPU, a CPU component, or another central processing unit), and memory 1306 (e.g., memory, a cache, system memory, graphics memory, memory or a cache at a CPU, or memory or a cache at a GPU).
[0100] At 1310, GPU 1302 may receive an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads. In some aspects, the set of workloads may be a set of graphics workloads at a graphics processing unit (GPU). Furthermore, receiving an indication of the set of data threads may include receiving the indication of the set of data threads from at least one component at the GPU or a kernel at the GPU (e.g., GPU 1302 may receive indication 1312 from GPU component 1304).
[0101] At 1320, GPU 1302 may convert information associated with the load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads. Furthermore, at 1320, based on the conversion of the information, GPU 1302 may store the set of load IDs for the set of workloads in an alias table.
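One way to picture the conversion at 1320 is a table that maps each distinct load PC to a small integer ID. The Python sketch below is an assumption made for illustration: the source does not specify how the conversion is performed, and the `AliasTable` class, its sequential ID assignment, and its behavior when full are all hypothetical.

```python
from typing import Dict, Optional


class AliasTable:
    """Maps load program counters to compact load IDs (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pc_to_id: Dict[int, int] = {}

    def load_id(self, pc: int) -> Optional[int]:
        """Return the load ID for a PC, assigning a new one on first use."""
        if pc in self.pc_to_id:
            return self.pc_to_id[pc]
        if len(self.pc_to_id) >= self.capacity:
            return None  # assumed policy: loads beyond capacity are untracked
        new_id = len(self.pc_to_id)  # next unused sequential ID
        self.pc_to_id[pc] = new_id
        return new_id
```

A compact ID keeps the tables indexed by load ID small, since they need only as many entries as there are tracked load instructions rather than one per possible PC value.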
[0102] At 1330, GPU 1302 may identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with an access pattern of at least one cache line of the cache for each workload in the set of workloads. The locality of each workload in the set of workloads may be at least one of spatial locality or temporal locality. Spatial locality may be associated with the reusability of cache addresses, and temporal locality may be associated with the reusability of cache access times. The set of data threads may be associated with a thread bundle, the cache may be a Level 1 (L1) data cache, and the L1 data cache may include at least one of the following: protection bits, a set of thread bundle identifiers (IDs), access counts, or a set of tag addresses. Furthermore, the set of tag addresses may include information associated with at least one of the following: inter-thread bundle locality, intra-thread bundle locality, or inter-thread bundle-intra-thread bundle locality, wherein inter-thread bundle locality may be associated with data fetched by a load instruction of a first thread bundle also being accessed across multiple thread bundles by the same load program counter (PC); wherein intra-thread bundle locality may be associated with data fetched by a load instruction of a second thread bundle being reused within that same thread bundle; and wherein inter-thread bundle-intra-thread bundle locality may be associated with data being brought into the cache by a third thread bundle and reused by other thread bundles. In some aspects, the access pattern of the at least one cache line corresponds to the reusability of the at least one cache line, or the access pattern of the at least one cache line corresponds to whether the reusability level of the workload is greater than or less than a reusability threshold.
[0103] At 1340, GPU 1302 may configure locality information associated with the locality of each workload in a first group of workloads in the set of workloads, the first group of workloads corresponding to a first group of data threads in the set of data threads. In some aspects, configuring the locality information associated with the locality of each workload in the first group of workloads may include configuring the locality information based on a set of tag addresses for the access pattern of the at least one cache line. The set of tag addresses for the access pattern of the at least one cache line may be associated with a thread bundle tag directory. Furthermore, the locality information may include at least one of cache line management information, a validity bit, a last load identifier (ID), or an access count. Additionally, the locality information may be associated with a locality information table, and storing the access pattern of the at least one cache line may include storing the access pattern of the at least one cache line in the locality information table.
[0104] At 1350, GPU 1302 may determine whether to configure or store locality information associated with the locality of each workload in the at least one second set of workloads.
[0105] At 1360, GPU 1302 may store the access pattern of at least one cache line of the first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to the first set of data threads in the set of data threads (e.g., GPU 1302 may store indication 1364 in memory 1306).
[0106] At 1370, GPU 1302 may store or avoid storing data for at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads. In some aspects, storing data for the at least one second set of workloads may include storing the data for the at least one second set of workloads in memory or cache at the graphics processing unit (GPU) (e.g., GPU 1302 may store indication 1374 in memory 1306). Furthermore, avoiding storing data for the at least one second set of workloads may include bypassing the storage of data for the at least one second set of workloads at the graphics processing unit (GPU).
[0107] At 1380, GPU 1302 may output an indication of storing or avoiding storing data for the at least one second set of workloads. In some aspects, outputting an indication of storing or avoiding storing data for the at least one second set of workloads may include: sending an indication of storing or avoiding storing data for the at least one second set of workloads (e.g., GPU 1302 may send indication 1382 to GPU component 1304); or storing an indication of storing or avoiding storing data for the at least one second set of workloads (e.g., GPU 1302 may store indication 1384 in memory 1306).
[0108] Figure 14 is a flowchart 1400 of an example method for data processing or graphics processing according to one or more techniques of this disclosure. The method may be performed by: a GPU (e.g., a GPU, a cache at the GPU, a GPU component, another graphics processor, a CPU, a CPU component, or another central processing unit), a CPU (e.g., a CPU, a cache at the CPU, a CPU component, another central processing unit, a GPU, a GPU component, or another graphics processor), a display driver integrated circuit (DDIC), an apparatus for data or graphics processing, a wireless communication device, and / or any apparatus capable of performing data or graphics processing, as used in connection with the examples of Figures 1 to 13.
[0109] At 1402, the GPU may receive an indication of a set of data threads associated with graphics processing, where the set of data threads corresponds to a set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1310 of Figure 13, GPU 1302 may obtain an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads. Furthermore, step 1402 may be performed by the processing unit 120 in Figure 1. In some aspects, the set of workloads may be a set of graphics workloads at a graphics processing unit (GPU). Furthermore, obtaining the indication of the set of data threads may include receiving the indication of the set of data threads from at least one component at the GPU or a kernel at the GPU (e.g., GPU 1302 may receive indication 1312 from GPU component 1304).
[0110] At 1406, the GPU may identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with the access pattern of at least one cache line of the cache for each workload in the set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1330 of Figure 13, GPU 1302 may identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with the access pattern of at least one cache line of the cache for each workload in the set of workloads. Furthermore, step 1406 may be performed by the processing unit 120 in Figure 1. The locality of each workload in the set of workloads may be at least one of spatial locality or temporal locality. Spatial locality may be associated with the reusability of cache addresses, and temporal locality may be associated with the reusability of cache access times. The set of data threads may be associated with a thread bundle, the cache may be a Level 1 (L1) data cache, and the L1 data cache may include at least one of the following: protection bits, a set of thread bundle identifiers (IDs), access counts, or a set of tag addresses.
Furthermore, the set of tag addresses may include information associated with at least one of the following: inter-thread bundle locality, intra-thread bundle locality, or inter-thread bundle-intra-thread bundle locality, wherein inter-thread bundle locality may be associated with data fetched by a load instruction of a first thread bundle also being accessed across multiple thread bundles by the same load program counter (PC); wherein intra-thread bundle locality may be associated with data fetched by a load instruction of a second thread bundle being reused within that same thread bundle; and wherein inter-thread bundle-intra-thread bundle locality may be associated with data being brought into the cache by a third thread bundle and reused by other thread bundles. In some aspects, the access pattern of the at least one cache line corresponds to the reusability of the at least one cache line, or the access pattern of the at least one cache line corresponds to whether the reusability level of the workload is greater than or less than a reusability threshold.
[0111] At 1412, the GPU may store the access pattern of at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, where the first set of workloads corresponds to a first set of data threads in the set of data threads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1360 of Figure 13, GPU 1302 may store the access pattern of at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to the first set of data threads in the set of data threads. Furthermore, step 1412 may be performed by the processing unit 120 in Figure 1.
[0112] At 1414, the GPU may store or avoid storing data for at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1370 of Figure 13, GPU 1302 may store or avoid storing data for at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads. Furthermore, step 1414 may be performed by the processing unit 120 in Figure 1. In some aspects, storing the data for the at least one second set of workloads may include storing the data for the at least one second set of workloads in memory or cache at the graphics processing unit (GPU) (e.g., GPU 1302 may store indication 1374 in memory 1306). Furthermore, avoiding storing the data for the at least one second set of workloads may include bypassing the storage of the data for the at least one second set of workloads at the graphics processing unit (GPU).
[0113] Figure 15 is a flowchart 1500 of an example method for data processing or graphics processing according to one or more techniques of this disclosure. The method may be performed by: a GPU (e.g., a GPU, a cache at the GPU, a GPU component, another graphics processor, a CPU, a CPU component, or another central processing unit), a CPU (e.g., a CPU, a cache at the CPU, a CPU component, another central processing unit, a GPU, a GPU component, or another graphics processor), a display driver integrated circuit (DDIC), an apparatus for data or graphics processing, a wireless communication device, and / or any apparatus capable of performing data or graphics processing, as used in connection with the examples of Figures 1 to 13.
[0114] At 1502, the GPU may receive an indication of a set of data threads associated with graphics processing, where the set of data threads corresponds to a set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1310 of Figure 13, GPU 1302 may obtain an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads. Furthermore, step 1502 may be performed by the processing unit 120 in Figure 1. In some aspects, the set of workloads may be a set of graphics workloads at a graphics processing unit (GPU). Furthermore, obtaining the indication of the set of data threads may include receiving the indication of the set of data threads from at least one component at the GPU or a kernel at the GPU (e.g., GPU 1302 may receive indication 1312 from GPU component 1304).
[0115] At 1504, the GPU may convert information associated with the load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1320 of Figure 13, GPU 1302 may convert information associated with the load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads. Furthermore, step 1504 may be performed by the processing unit 120 in Figure 1. Furthermore, at 1504, the GPU may store the set of load IDs for the set of workloads in an alias table based on the conversion of the information, as described in connection with the examples in Figures 1 to 13. For example, as described at 1320 of Figure 13, GPU 1302 may store the set of load IDs for the set of workloads in an alias table. Furthermore, step 1504 may be performed by the processing unit 120 in Figure 1.
[0116] At 1506, the GPU may identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with the access pattern of at least one cache line of the cache for each workload in the set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1330 of Figure 13, GPU 1302 may identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with the access pattern of at least one cache line of the cache for each workload in the set of workloads. Furthermore, step 1506 may be performed by the processing unit 120 in Figure 1. The locality of each workload in the set of workloads may be at least one of spatial locality or temporal locality. Spatial locality may be associated with the reusability of cache addresses, and temporal locality may be associated with the reusability of cache access times. The set of data threads may be associated with a thread bundle, the cache may be a Level 1 (L1) data cache, and the L1 data cache may include at least one of the following: protection bits, a set of thread bundle identifiers (IDs), access counts, or a set of tag addresses.
Furthermore, the set of tag addresses may include information associated with at least one of the following: inter-thread bundle locality, intra-thread bundle locality, or inter-thread bundle-intra-thread bundle locality, wherein inter-thread bundle locality may be associated with data fetched by a load instruction of a first thread bundle also being accessed across multiple thread bundles by the same load program counter (PC); wherein intra-thread bundle locality may be associated with data fetched by a load instruction of a second thread bundle being reused within that same thread bundle; and wherein inter-thread bundle-intra-thread bundle locality may be associated with data being brought into the cache by a third thread bundle and reused by other thread bundles. In some aspects, the access pattern of the at least one cache line corresponds to the reusability of the at least one cache line, or the access pattern of the at least one cache line corresponds to whether the reusability level of the workload is greater than or less than a reusability threshold.
[0117] At 1508, the GPU may configure locality information associated with the locality of each workload in a first group of workloads in the set of workloads, the first group of workloads corresponding to a first group of data threads in the set of data threads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1340 of Figure 13, GPU 1302 may configure locality information associated with the locality of each workload in a first group of workloads in the set of workloads, the first group of workloads corresponding to a first group of data threads in the set of data threads. Furthermore, step 1508 may be performed by the processing unit 120 in Figure 1. In some aspects, configuring the locality information associated with the locality of each workload in the first group of workloads may include configuring the locality information based on a set of tag addresses for the access pattern of the at least one cache line. The set of tag addresses for the access pattern of the at least one cache line may be associated with a thread bundle tag directory. Furthermore, the locality information may include at least one of cache line management information, a validity bit, a last load identifier (ID), or an access count. Additionally, the locality information may be associated with a locality information table, and storing the access pattern of the at least one cache line may include storing the access pattern of the at least one cache line in the locality information table.
[0118] At 1510, the GPU may determine whether to configure or store locality information associated with the locality of each workload in the at least one second set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1350 of Figure 13, GPU 1302 may determine whether to configure or store locality information associated with the locality of each workload in the at least one second set of workloads. Furthermore, step 1510 may be performed by the processing unit 120 in Figure 1.
[0119] At 1512, the GPU may store the access pattern of at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, where the first set of workloads corresponds to a first set of data threads in the set of data threads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1360 of Figure 13, GPU 1302 may store the access pattern of at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to the first set of data threads in the set of data threads. Furthermore, step 1512 may be performed by the processing unit 120 in Figure 1.
[0120] At 1514, the GPU may store or avoid storing data for at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1370 of Figure 13, GPU 1302 may store or avoid storing data for at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads. Furthermore, step 1514 may be performed by the processing unit 120 in Figure 1. In some aspects, storing the data for the at least one second set of workloads may include storing the data for the at least one second set of workloads in memory or cache at the graphics processing unit (GPU) (e.g., GPU 1302 may store indication 1374 in memory 1306). Furthermore, avoiding storing the data for the at least one second set of workloads may include bypassing the storage of the data for the at least one second set of workloads at the graphics processing unit (GPU).
[0121] At 1516, the GPU may output an indication of storing or avoiding storing data for the at least one second set of workloads, as described in connection with the examples in Figures 1 to 13. For example, as described at 1380 of Figure 13, GPU 1302 may output an indication of storing or avoiding storing data for the at least one second set of workloads. Furthermore, step 1516 may be performed by the processing unit 120 in Figure 1. In some aspects, outputting the indication of storing or avoiding storing data for the at least one second set of workloads may include: sending the indication (e.g., GPU 1302 may send indication 1382 to GPU component 1304); or storing the indication (e.g., GPU 1302 may store indication 1384 in memory 1306).
[0122] In configurations, methods or apparatus for data or graphics processing are provided. The apparatus may be a GPU (or other graphics processing unit), a CPU (or other central processing unit), a DDIC, an apparatus for graphics processing, and / or some other processor capable of performing data or graphics processing. In various aspects, the apparatus may be the processing unit 120 within device 104, or may be some other hardware within device 104 or another device. The apparatus (e.g., processing unit 120) may include means for obtaining an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads. The apparatus (e.g., processing unit 120) may also include means for identifying the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with an access pattern of at least one cache line of a cache for each workload in the set of workloads. The apparatus (e.g., processing unit 120) may further include means for storing the access pattern of at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to a first set of data threads in the set of data threads. The apparatus (e.g., processing unit 120) may further include means for storing or avoiding storing data for at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads. The apparatus (e.g., processing unit 120) may further include means for converting information associated with the load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads.
The apparatus (e.g., processing unit 120) may further include means for storing the set of load IDs of the set of workloads in an alias table based on the information conversion. The apparatus (e.g., processing unit 120) may further include means for configuring locality information associated with the locality of each workload in the first set of workloads in the set of workloads, wherein the first set of workloads corresponds to a first set of data threads in the set of data threads. The apparatus (e.g., processing unit 120) may also include components for determining whether to configure or store locality information associated with the locality of each of the at least one second set of workloads. The apparatus (e.g., processing unit 120) may also include components for outputting an indication of storing or avoiding storage of data for the at least one second set of workloads.
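The flow enumerated above can be illustrated with a short Python model. This is a minimal sketch under stated assumptions, not the disclosed hardware: the function names `observe_first_set` and `decide_for_second_set`, the trace format, the reuse-ratio metric, and the threshold value are all hypothetical. A first set of thread bundles is monitored to record a per-load access pattern, and that recorded pattern then determines whether data for the remaining bundles is stored or bypassed.

```python
from collections import defaultdict

REUSE_THRESHOLD = 0.5  # hypothetical cutoff; the source names no value


def observe_first_set(trace):
    """Record, per load ID, how often the monitored bundles re-touch a line.

    `trace` is a list of (load_id, line_address) pairs issued by the first
    (monitored) set of thread bundles. Returns {load_id: reuse_ratio}.
    """
    seen = defaultdict(set)    # lines each load has already touched
    reuses = defaultdict(int)  # accesses that hit an already-seen line
    totals = defaultdict(int)  # all accesses per load
    for load_id, line in trace:
        totals[load_id] += 1
        if line in seen[load_id]:
            reuses[load_id] += 1
        seen[load_id].add(line)
    return {lid: reuses[lid] / totals[lid] for lid in totals}


def decide_for_second_set(patterns, load_id):
    """Return 'store' or 'bypass' for a load issued by a non-monitored bundle."""
    ratio = patterns.get(load_id)
    if ratio is None:
        return "store"  # assumption: unknown loads fall back to normal caching
    return "store" if ratio >= REUSE_THRESHOLD else "bypass"


# Load 0 keeps returning to one line; load 1 streams through new lines.
first_set_trace = [
    (0, 0x100), (0, 0x100), (0, 0x100),
    (1, 0x200), (1, 0x240), (1, 0x280),
]
patterns = observe_first_set(first_set_trace)
```

In this toy trace, load 0 would be stored and load 1 bypassed for the second set of workloads; a load that was never observed falls back to ordinary caching, which is a design choice of the sketch rather than something the source specifies.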
[0123] The subject matter described herein can be implemented to achieve one or more benefits or advantages. For example, the described data or graphics processing techniques can be used by a cache, a GPU, a CPU, or some other processor capable of performing data or graphics processing to implement the cache management techniques described herein. This can also be achieved at a lower cost compared to other data or graphics processing techniques. Furthermore, the data or graphics processing techniques of this disclosure can improve or accelerate data processing or execution. In addition, the data or graphics processing techniques of this disclosure can improve resource or data utilization and / or resource efficiency. Additionally, aspects of this disclosure can utilize cache management techniques to improve memory bandwidth efficiency and / or increase processing speed at a cache, CPU, GPU, or DPU.
[0124] It should be understood that the specific order or hierarchy of the blocks in the disclosed processes / flowcharts is merely an illustration of example approaches. Based on design preferences, the specific order or hierarchy of the blocks in the processes / flowcharts may be rearranged. Furthermore, some blocks may be combined or omitted. The appended method claims present the elements of the various blocks in a sample order, but this does not imply limitation to the specific order or hierarchy presented.
[0125] The foregoing description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects. Therefore, the claims are not intended to be limited to the aspects shown herein, but should be given the full scope consistent with the language of the claims, wherein references to elements in the singular form, unless specifically stated otherwise, are not intended to mean “one and only one,” but rather “one or more.” The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
[0126] Unless otherwise specified, the term "some" means one or more, and unless otherwise specified in the context, the term "or" may be interpreted as "and / or". Combinations such as "at least one of A, B, or C", "one or more of A, B, or C", "at least one of A, B, and C", "one or more of A, B, and C", and "A, B, C, or any combination thereof" include any combination of A, B, and / or C, and may include multiples of A, multiples of B, or multiples of C. Specifically, combinations such as "at least one of A, B, or C", "one or more of A, B, or C", "at least one of A, B, and C", "one or more of A, B, and C", and "A, B, C, or any combination thereof" may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, wherein any such combination may contain one or more members of A, B, or C. All structural and functional equivalents of the elements of the various aspects described throughout this disclosure that are known or later come to be known to a person skilled in the art are expressly incorporated herein by reference and are intended to be covered by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public, regardless of whether such disclosure is explicitly recited in the claims. The terms “module,” “mechanism,” “element,” “device,” and the like may not be a substitute for the word “means.” Therefore, no claim element is to be construed as a means-plus-function element unless the element is expressly recited using the phrase “means for.”
[0127] In one or more examples, the functionality described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, although the term "processing unit" is used throughout this disclosure, such a processing unit may be implemented in hardware, software, firmware, or any combination thereof. If any functionality, processing unit, technique, or other module described herein is implemented in software, then such functionality, processing unit, technique, or other module may be stored on or transmitted on a computer-readable medium as one or more instructions or code.
[0128] According to this disclosure, unless otherwise specified in the context, the term "or" may be understood as "and / or". Additionally, while phrases such as "one or more" or "at least one" may be used for some features disclosed herein but not others, features not using such language may be understood to have such implied meaning unless otherwise specified in the context.
[0129] In one or more examples, the functionality described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, although the term “processing unit” is used throughout this disclosure, such a processing unit may be implemented in hardware, software, firmware, or any combination thereof. If any functionality, processing unit, technique, or other module described herein is implemented in software, then the functionality, processing unit, technique, or other module described herein may be stored on or transmitted on a computer-readable medium as one or more instructions or code. A computer-readable medium may include computer data storage media and communication media, including any medium that facilitates the transfer of a computer program from one place to another. In this way, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium or (2) a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and / or data structures for implementing the techniques described herein. By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, wherein disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media. Computer program products may include computer-readable media.
[0130] The code can be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), arithmetic logic units (ALUs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Therefore, the term "processor" as used herein can refer to any of the above-described structures or any other structure suitable for implementing the techniques described herein. Furthermore, these techniques can be fully implemented in one or more circuit or logic elements.
[0131] The techniques disclosed herein can be implemented in a wide variety of devices or apparatuses, including wireless mobile phones, integrated circuits (ICs), or IC sets (e.g., chipsets). Various components, modules, or units are described in this disclosure to emphasize functional aspects of a device configured to perform the disclosed techniques, but they do not necessarily need to be implemented by different hardware units. Rather, as described above, various units can be combined in any hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) combined with suitable software and / or firmware. Therefore, the term "processor" as used herein can refer to any of the above-described structures or any other structure suitable for implementing the techniques described herein. Furthermore, these techniques can be fully implemented in one or more circuit or logic elements.
[0132] The following aspects are merely illustrative and may be combined with other aspects or teachings described herein without limitation.
[0133] Aspect 1 is an apparatus for data processing, the apparatus comprising at least one memory and at least one processor coupled to said at least one memory, said at least one processor being configured, individually or in any combination, and based at least in part on information stored in said at least one memory, to: obtain an indication of a set of data threads associated with graphics processing, wherein said set of data threads corresponds to a set of workloads; identify the locality of each workload in the set of workloads corresponding to said set of data threads, wherein said locality of each workload in the set of workloads is associated with an access pattern of at least one cache line of a cache for each workload in the set of workloads; store the access pattern of said at least one cache line of a first set of workloads in the set of workloads based on said locality of each workload in the set of workloads, wherein said first set of workloads corresponds to a first set of data threads in the set of data threads; and store or avoid storing data of at least one second set of workloads in the set of workloads based on said access pattern of said at least one cache line of the first set of workloads.
[0134] Aspect 2 is the apparatus according to aspect 1, wherein the locality of each workload in the set of workloads is at least one of spatial locality or temporal locality.
[0135] Aspect 3 is the apparatus according to aspect 2, wherein the spatial locality is associated with the reusability of the address of the cache, and wherein the temporal locality is associated with the reusability of the time of accessing the cache.
[0136] Aspect 4 is an apparatus according to any one of aspects 1 to 3, wherein the at least one processor is further configured, individually or in any combination, to convert information associated with the load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads.
[0137] Aspect 5 is the apparatus according to aspect 4, wherein the at least one processor is further configured, individually or in any combination, to store the set of load IDs of the set of workloads in an alias table based on the conversion of the information.
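The PC-to-ID conversion in Aspects 4 and 5 can be pictured as a small lookup table. The sketch below is illustrative only: the class name `AliasTable`, the sequential ID allocation, the capacity, and the behavior when the table is full are assumptions, since the source states only that PC-related information is converted into load IDs and that those IDs are stored in an alias table.

```python
class AliasTable:
    """Maps full load PCs to compact load IDs (illustrative only)."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self._pc_to_id = {}

    def load_id(self, pc):
        """Return the load ID for `pc`, allocating one if there is room.

        Returns None when the table is full; how a real design handles
        overflow is not stated in the source.
        """
        if pc in self._pc_to_id:
            return self._pc_to_id[pc]
        if len(self._pc_to_id) >= self.capacity:
            return None
        new_id = len(self._pc_to_id)  # next unused sequential ID
        self._pc_to_id[pc] = new_id
        return new_id


table = AliasTable(capacity=2)
# The third lookup repeats the first PC; the fourth arrives after the table is full.
ids = [table.load_id(pc) for pc in (0x4A10, 0x4B28, 0x4A10, 0x4C00)]
```

A compact ID of this kind is presumably cheaper to carry in per-line metadata than a full program counter, although the source does not state the motivation.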
[0138] Aspect 6 is an apparatus according to any one of aspects 1 to 5, wherein the set of data threads is associated with a thread bundle, wherein the cache is a Level 1 (L1) data cache, and wherein the L1 data cache includes at least one of the following: a protection bit, a set of thread bundle identifiers (IDs), an access count, or a set of tag addresses.
[0139] Aspect 7 is the apparatus according to aspect 6, wherein the set of tag addresses includes information associated with at least one of the following: inter-thread bundle locality, intra-thread bundle locality, or inter-thread bundle-intra-thread bundle locality, wherein inter-thread bundle locality is associated with the following: data retrieved by a load instruction of a first thread bundle is also accessed by the same load program counter (PC) across multiple thread bundles; wherein intra-thread bundle locality is associated with the following: data retrieved by a load instruction of a second thread bundle is used within the same thread bundle that retrieved the data; and wherein inter-thread bundle-intra-thread bundle locality is associated with the following: data is introduced into the cache by a third thread bundle and is re-referenced by other thread bundles.
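The per-line fields of Aspect 6 and the locality categories of Aspect 7 can be combined into a small record type. This is a sketch under assumptions: the name `LineRecord`, the extra `fetch_bundle`, `fetch_pc`, and `same_pc_only` fields, and the exact mapping from observed accesses to a category are one possible reading of the text, not the disclosed logic.

```python
from dataclasses import dataclass, field


@dataclass
class LineRecord:
    """Per-line metadata named in Aspect 6, plus the fetching load's identity."""
    tag: int
    fetch_bundle: int          # thread bundle that brought the line in
    fetch_pc: int              # load PC that brought the line in
    protected: bool = False    # protection bit
    bundle_ids: set = field(default_factory=set)  # bundles that re-referenced
    access_count: int = 0
    same_pc_only: bool = True  # every re-reference so far used fetch_pc

    def touch(self, bundle, pc):
        """Record a re-reference to this line."""
        self.bundle_ids.add(bundle)
        self.access_count += 1
        if pc != self.fetch_pc:
            self.same_pc_only = False

    def classify(self):
        """Label the observed reuse; the mapping is one reading of Aspect 7."""
        if self.access_count == 0:
            return "no_reuse"
        others = self.bundle_ids - {self.fetch_bundle}
        if not others:
            return "intra"        # reused only by the fetching bundle
        if self.fetch_bundle not in self.bundle_ids and self.same_pc_only:
            return "inter"        # same load PC, different bundles
        return "inter_intra"      # re-referenced by other bundles otherwise


intra = LineRecord(tag=0x1F, fetch_bundle=0, fetch_pc=0x40)
intra.touch(0, 0x48)

inter = LineRecord(tag=0x2A, fetch_bundle=1, fetch_pc=0x40)
inter.touch(2, 0x40)
inter.touch(3, 0x40)

mixed = LineRecord(tag=0x3C, fetch_bundle=2, fetch_pc=0x40)
mixed.touch(2, 0x40)
mixed.touch(5, 0x58)
```

Here `intra.classify()`, `inter.classify()`, and `mixed.classify()` yield the three categories in turn, and a line that is never touched again is reported as `no_reuse`, a fourth label added by the sketch for completeness.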
[0140] Aspect 8 is an apparatus according to any one of aspects 1 to 7, wherein the at least one processor is further configured individually or in any combination to: configure locality information associated with the locality of each workload in the first set of workloads, the first set of workloads corresponding to the first set of data threads in the set of data threads.
[0141] Aspect 9 is the apparatus according to aspect 8, wherein, in order to configure the locality information associated with the locality of each workload in the first set of workloads, the at least one processor is configured individually or in any combination to configure the locality information associated with the locality of each workload in the first set of workloads based on a set of tag addresses of the access pattern for the at least one cache line.
[0142] Aspect 10 is the apparatus according to aspect 9, wherein the set of tag addresses for the access pattern of the at least one cache line is associated with a thread bundle tag directory.
[0143] Aspect 11 is an apparatus according to any one of aspects 8 to 10, wherein the locality information includes at least one of cache line management information, a valid bit, a last load identifier (ID), or an access count.
[0144] Aspect 12 is an apparatus according to any one of aspects 8 to 11, wherein the at least one processor is further configured individually or in any combination to: determine whether to configure or store the locality information associated with the locality of each workload in the at least one second set of workloads.
[0145] Aspect 13 is an apparatus according to any one of aspects 8 to 12, wherein the locality information is associated with a locality information table, and wherein, in order to store the access patterns of the at least one cache line, the at least one processor is configured individually or in any combination to store the access patterns of the at least one cache line in the locality information table.
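A locality information table holding the fields listed in Aspect 11 might look like the following. The sketch is hypothetical throughout: keying the table by tag address, the string encoding of the management field, and gating updates on a set of monitored thread bundles (one way to realize the determination in Aspect 12) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LocalityEntry:
    valid: bool = False
    management: str = "normal"  # cache line management info (hypothetical encoding)
    last_load_id: int = -1
    access_count: int = 0


class LocalityInfoTable:
    def __init__(self, monitored_bundles):
        self.monitored = set(monitored_bundles)
        self.entries = {}  # tag address -> LocalityEntry

    def record(self, bundle_id, tag, load_id):
        """Store the access only if the bundle belongs to the monitored set.

        Returns True when the table was updated.
        """
        if bundle_id not in self.monitored:
            return False
        entry = self.entries.setdefault(tag, LocalityEntry())
        entry.valid = True
        entry.last_load_id = load_id
        entry.access_count += 1
        return True


lit = LocalityInfoTable(monitored_bundles=[0, 1])
updated = [
    lit.record(0, 0x1F, 3),  # monitored bundle: recorded
    lit.record(1, 0x1F, 3),  # monitored bundle: recorded
    lit.record(7, 0x1F, 5),  # non-monitored bundle: ignored
]
entry = lit.entries[0x1F]
```

Restricting updates to a sample of bundles keeps the table small, which is a plausible reason for distinguishing a first set of workloads from a second set, though the source does not spell this out.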
[0146] Aspect 14 is an apparatus according to any one of aspects 1 to 13, wherein the at least one processor is further configured, individually or in any combination, to output an indication of the storage or the avoidance of storage of the data of the at least one second set of workloads.
[0147] Aspect 15 is the apparatus according to aspect 14, wherein, in order to output the indication for storage or avoidance of storage of the data of the at least one second set of workloads, the at least one processor is configured individually or in any combination to: send the indication for storage or avoidance of storage of the data of the at least one second set of workloads; or store the indication for storage or avoidance of storage of the data of the at least one second set of workloads.
[0148] Aspect 16 is an apparatus according to any one of aspects 1 to 15, wherein the access pattern of the at least one cache line corresponds to the reusability of the at least one cache line, or wherein the access pattern of the at least one cache line corresponds to whether the reusability level of the workload is greater than or less than a reusability threshold.
[0149] Aspect 17 is an apparatus according to any one of aspects 1 to 16, wherein the set of workloads is a set of graphics workloads at a graphics processing unit (GPU), and wherein, in order to obtain the indication of the set of data threads, the at least one processor is configured individually or in any combination to receive the indication of the set of data threads from at least one component at the GPU or a kernel at the GPU.
[0150] Aspect 18 is an apparatus according to any one of aspects 1 to 17, wherein, in order to store the data of the at least one second set of workloads, the at least one processor is configured individually or in any combination to store the data of the at least one second set of workloads in memory at a graphics processing unit (GPU) or in the cache.
[0151] Aspect 19 is an apparatus according to any one of aspects 1 to 18, wherein, in order to avoid storing the data of the at least one second set of workloads, the at least one processor is configured individually or in any combination to bypass the storage of the data of the at least one second set of workloads at the graphics processing unit (GPU).
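The difference between storing and bypassing in Aspects 18 and 19 can be seen in a toy cache model. The sketch below assumes a fully associative cache with least-recently-used replacement, which the source does not specify; `BypassingCache`, `run`, and the address values are hypothetical names and data chosen for the demonstration.

```python
from collections import OrderedDict


class BypassingCache:
    """Tiny fully associative LRU cache whose fills can be skipped."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # line address -> None, oldest first
        self.hits = 0

    def access(self, line, bypass=False):
        """Return True on a hit; on a miss, fill the line unless bypassed."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        if not bypass:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[line] = None
        return False


def run(bypass_streaming):
    """Interleave one reused line with two streaming loads; return hit count."""
    cache = BypassingCache(num_lines=2)
    for i in range(4):
        cache.access(0xA0)                                        # reused line
        cache.access(0x1000 + i * 0x40, bypass=bypass_streaming)  # streaming
        cache.access(0x2000 + i * 0x40, bypass=bypass_streaming)  # streaming
    return cache.hits


hits_without_bypass = run(bypass_streaming=False)
hits_with_bypass = run(bypass_streaming=True)
```

In this model the reused line is evicted before every revisit when all loads allocate, and survives once the streaming loads bypass the cache, which mirrors the premature-eviction problem the disclosure sets out to address.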
[0152] Aspect 20 is an apparatus according to any one of aspects 1 to 19, the apparatus further comprising at least one of an antenna or a transceiver coupled to the at least one processor, wherein, in order to obtain the indication of the set of data threads, the at least one processor is configured to obtain the indication of the set of data threads via the antenna or the transceiver.
[0153] Aspect 21 is a method for implementing any one of aspects 1 to 20.
[0154] Aspect 22 is an apparatus for data processing, the apparatus including components for implementing any one of aspects 1 to 20.
[0155] Aspect 23 is a computer-readable medium (e.g., a non-transitory computer-readable medium) storing computer-executable code (e.g., code for data processing) that, when executed by at least one processor, causes the at least one processor to implement any one of aspects 1 to 20.
Claims
1. An apparatus for data processing, the apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory and configured, individually or in any combination, based at least in part on information stored in the at least one memory, to: obtain an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads; identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with an access pattern of at least one cache line of a cache for each workload in the set of workloads; store the access pattern of the at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to a first set of data threads in the set of data threads; and store or avoid storing data of at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads.
2. The apparatus of claim 1, wherein the locality of each workload in the set of workloads is at least one of spatial locality or temporal locality.
3. The apparatus of claim 2, wherein the spatial locality is associated with the reusability of the addresses of the cache, and wherein the temporal locality is associated with the reusability of the time of accessing the cache.
4. The apparatus of claim 1, wherein the at least one processor is further configured, individually or in any combination, to: convert information associated with the load program counter (PC) of the set of workloads into a set of load identifiers (IDs) for the set of workloads.
5. The apparatus of claim 4, wherein the at least one processor is further configured, individually or in any combination, to: store the set of load IDs of the set of workloads in an alias table based on the conversion of the information.
6. The apparatus of claim 1, wherein the set of data threads is associated with a thread bundle, wherein the cache is a Level 1 (L1) data cache, and wherein the L1 data cache includes at least one of the following: a protection bit, a set of thread bundle identifiers (IDs), an access count, or a set of tag addresses.
7. The apparatus of claim 6, wherein the set of tag addresses includes information associated with at least one of the following: inter-thread bundle locality, intra-thread bundle locality, or inter-thread bundle-intra-thread bundle locality, wherein inter-thread bundle locality is associated with the following: data retrieved by a load instruction of a first thread bundle is also accessed by the same load program counter (PC) across multiple thread bundles; wherein intra-thread bundle locality is associated with the following: data retrieved by a load instruction of a second thread bundle is used within the same thread bundle that retrieved the data; and wherein inter-thread bundle-intra-thread bundle locality is associated with the following: data is introduced into the cache by a third thread bundle and is re-referenced by other thread bundles.
8. The apparatus of claim 1, wherein the at least one processor is further configured, individually or in any combination, to: configure locality information associated with the locality of each workload in the first set of workloads, wherein the first set of workloads corresponds to the first set of data threads in the set of data threads.
9. The apparatus of claim 8, wherein, in order to configure the locality information associated with the locality of each workload in the first set of workloads, the at least one processor is configured individually or in any combination to configure the locality information associated with the locality of each workload in the first set of workloads based on a set of tag addresses of the access pattern for the at least one cache line.
10. The apparatus of claim 9, wherein the set of tag addresses for the access pattern of the at least one cache line is associated with a thread bundle tag directory.
11. The apparatus of claim 8, wherein the locality information includes at least one of cache line management information, a valid bit, a last load identifier (ID), or an access count.
12. The apparatus of claim 8, wherein the at least one processor is further configured, individually or in any combination, to: determine whether to configure or store the locality information associated with the locality of each workload in the at least one second set of workloads.
13. The apparatus of claim 8, wherein the locality information is associated with a locality information table, and wherein, in order to store the access patterns of the at least one cache line, the at least one processor is configured individually or in any combination to store the access patterns of the at least one cache line in the locality information table.
14. The apparatus of claim 1, wherein the at least one processor is further configured, individually or in any combination, to: output an indication of storage or avoidance of storage for the data of the at least one second set of workloads.
15. The apparatus of claim 14, wherein, in order to output the indication of storage or avoidance of storage for the data of the at least one second set of workloads, the at least one processor is configured individually or in any combination to: send the indication of storage or avoidance of storage for the data of the at least one second set of workloads; or store the indication of storage or avoidance of storage for the data of the at least one second set of workloads.
16. The apparatus of claim 1, wherein the access pattern of the at least one cache line corresponds to the reusability of the at least one cache line, or wherein the access pattern of the at least one cache line corresponds to whether the reusability level of the workload is greater than or less than a reusability threshold.
17. The apparatus of claim 1, wherein the set of workloads is a set of graphics workloads at a graphics processing unit (GPU), and wherein, in order to obtain the indication of the set of data threads, the at least one processor is configured individually or in any combination to receive the indication of the set of data threads from at least one component at the GPU or a kernel at the GPU.
18. The apparatus of claim 1, wherein, in order to store the data of the at least one second set of workloads, the at least one processor is configured individually or in any combination to store the data of the at least one second set of workloads in memory at a graphics processing unit (GPU) or in the cache; and wherein, in order to avoid storing the data of the at least one second set of workloads, the at least one processor is configured individually or in any combination to bypass the storage of the data of the at least one second set of workloads at the GPU.
19. A method for data processing, the method comprising: obtaining an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads; identifying the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with an access pattern of at least one cache line of a cache for each workload in the set of workloads; storing the access pattern of the at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to a first set of data threads in the set of data threads; and storing or avoiding storing data of at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads.
20. A computer-readable medium storing computer-executable code for data processing, said code, when executed by at least one processor, causing said at least one processor to: obtain an indication of a set of data threads associated with graphics processing, wherein the set of data threads corresponds to a set of workloads; identify the locality of each workload in the set of workloads corresponding to the set of data threads, wherein the locality of each workload in the set of workloads is associated with an access pattern of at least one cache line of a cache for each workload in the set of workloads; store the access pattern of the at least one cache line of a first set of workloads in the set of workloads based on the locality of each workload in the set of workloads, wherein the first set of workloads corresponds to a first set of data threads in the set of data threads; and store or avoid storing data of at least one second set of workloads in the set of workloads based on the access pattern of the at least one cache line of the first set of workloads.