Persistent graphics memory for rendering targets
By selecting and managing rendering targets within the graphics processing unit and optimizing memory bandwidth using persistent graphics memory, the problem of low storage efficiency in existing technologies is solved, achieving more efficient memory utilization and reduced power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2024-10-17
- Publication Date
- 2026-06-16
AI Technical Summary
Existing graphics processing node storage technologies cannot efficiently store acceleration structures, leading to increased memory bandwidth and performance costs during deferred rendering.
By selecting and managing rendering targets within the graphics processing unit, persistent graphics memory can be used to optimize memory bandwidth, reduce unnecessary read and write operations, and allocate buffer or cache resources appropriately.
It effectively reduces memory bandwidth consumption during deferred rendering, lowers power consumption and processing unit load, and improves memory utilization efficiency.
Figure CN122228516A_ABST
Abstract
Description
Cross-references to related applications
[0001] This application claims the benefit of U.S. non-provisional patent application No. 18/520,517, filed November 27, 2023, entitled “PERSISTENT GRAPHICS MEMORY FOR RENDER TARGETS”, the entire contents of which are expressly incorporated herein by reference.
Technical Field
[0002] This disclosure relates generally to processing systems and, more specifically, to one or more techniques for graphics processing.
Background Technology
[0003] Computing devices typically perform graphics and/or display processing (e.g., utilizing a graphics processing unit (GPU), a central processing unit (CPU), a display processor, etc.) to render and display visual content. Such computing devices can include, for example, computer workstations, mobile phones (such as smartphones), embedded systems, personal computers, tablet computers, and video game consoles. A GPU is configured to execute a graphics processing pipeline that includes one or more processing stages that operate together to execute graphics processing commands and output frames. A CPU controls the operation of a GPU by issuing one or more graphics processing commands to it. Modern CPUs are typically capable of executing multiple applications concurrently, each of which may require the GPU during execution. A display processor is configured to convert digital information received from the CPU into analog values and can issue commands to a display panel to display visual content. Devices that provide content for visual presentation on a display may utilize a GPU and/or a display processor.
[0004] Currently, there is a need to improve graphics processing. For example, current node storage techniques in graphics processing may not be efficient at storing acceleration structures. Therefore, there is an increasing demand for improved node storage techniques to efficiently store acceleration structures.
Summary of the Invention
[0005] The following is a simplified summary of one or more aspects to provide a basic understanding of these aspects. This summary is not a broad overview of all anticipated aspects, nor is it intended to identify key or essential elements of all aspects, nor to describe the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that follows.
[0006] In one aspect of this disclosure, a method, a computer-readable medium, and an apparatus are provided. The apparatus may be a graphics processing unit (GPU) or any apparatus capable of performing graphics processing. The apparatus may obtain indications of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process. The apparatus may also select at least one RT from the plurality of RTs based on the subset of graphics surfaces associated with the at least one RT. The apparatus may also determine whether space exists in a buffer or cache for storing the at least one RT. Additionally, the apparatus may allocate the subset of graphics surfaces associated with the at least one RT to the buffer or cache based on the existence of space in the buffer or cache for storing the at least one RT. The apparatus may also remove a portion of the buffer or cache, based on the absence of space in the buffer or cache for storing the at least one RT, in order to allocate the subset of graphics surfaces associated with the at least one RT; or determine whether an updated command buffer or an updated command list exists for the subset of graphics surfaces associated with the at least one RT. The apparatus may also store the at least one selected RT in the buffer or cache, or refrain from storing the at least one selected RT in the buffer or cache. Furthermore, the apparatus may write one or more remaining RTs from the plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the at least one selected RT. The apparatus may also output an indication of the at least one selected RT from the plurality of RTs.
[0007] Details of one or more examples of this disclosure are set forth in the accompanying drawings and the following description. Other features, objects, and advantages of this disclosure will become apparent from the description, the drawings, and the claims.
Attached Figure Description
[0008] Figure 1 is a block diagram illustrating an example content generation system.
[0009] Figure 2 illustrates an example graphics processing unit (GPU).
[0010] Figure 3 is a diagram illustrating an example image or surface used for graphics processing.
[0011] Figure 4 is a diagram illustrating system memory and graphics memory.
[0012] Figure 5A is a diagram illustrating an example rendering method.
[0013] Figure 5B is a diagram illustrating an example rendering method.
[0014] Figure 6 is a diagram illustrating an example flowchart for a rendering target selection algorithm.
[0015] Figure 7A is a diagram illustrating an example rendering method.
[0016] Figure 7B is a diagram illustrating an example rendering method.
[0017] Figure 8A is a diagram illustrating an example rendering method.
[0018] Figure 8B is a diagram illustrating an example rendering method.
[0019] Figure 9A is a diagram illustrating a surface assignment method.
[0020] Figure 9B is a diagram illustrating a surface assignment method.
[0021] Figure 10 is a flowchart illustrating example communication between a GPU, a CPU, and memory.
[0022] Figure 11 is a flowchart of an example method for graphics processing.
[0023] Figure 12 is a flowchart of an example method for graphics processing.
Detailed Implementation
[0024] Compared to forward rendering, deferred rendering reduces the number of raw computation cycles by shading fewer fragments. In 3D graphics, for example, deferred rendering is a popular choice because it saves raw computational power relative to forward rendering. However, deferred rendering comes at the cost of multiple render targets (RTs) and high memory bandwidth. It consumes high memory bandwidth because the render targets may need multiple passes for certain types of graphics computations (e.g., calculations for lighting and special effects) to produce the final image. That is, deferred rendering can utilize render targets that are associated with an increased number of such computations. Deferred rendering may therefore initially render to render targets and save them to memory, and later fetch those render targets back from memory, which adds delayed read and write-back instructions to the pipeline. Indeed, deferred rendering may not execute certain instructions initially, but may save the render targets to memory and retrieve them at a later time. Deferred rendering can thus save bandwidth and performance up front, but it can incur delayed bandwidth and performance costs. For example, any savings in shader instructions due to deferred rendering may come at the cost of higher bandwidth, which increases power consumption in the memory subsystem (e.g., cache, channels, and double data rate (DDR) memory). Additionally, some render targets are more dominant than others in terms of the number of frames they are used to render, so some RTs contribute more memory bandwidth than others. Based on the above, it may be beneficial to identify RTs that contribute to an increase in memory bandwidth (e.g., system memory bandwidth). Aspects of this disclosure can identify such render targets.
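To make the bandwidth argument concrete, the following Python sketch estimates per-frame memory traffic for a hypothetical set of render targets and ranks them by contribution. The RT names, the 1920x1080 resolution, the pixel sizes, and the per-frame read and write counts are illustrative assumptions, not values taken from this disclosure, and the traffic model (one full-surface transfer per read or write) is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderTarget:
    name: str              # hypothetical label
    bytes_per_pixel: int
    writes_per_frame: int  # times the RT is written out to memory
    reads_per_frame: int   # times later passes fetch it back


def traffic_bytes(rt, width, height):
    """Bytes moved to and from memory for one frame (simplified model)."""
    transfers = rt.writes_per_frame + rt.reads_per_frame
    return width * height * rt.bytes_per_pixel * transfers


def rank_by_traffic(rts, width, height):
    """Return the RTs ordered from highest to lowest memory traffic."""
    return sorted(rts, key=lambda rt: traffic_bytes(rt, width, height), reverse=True)


# Assumed example values, for illustration only.
WIDTH, HEIGHT = 1920, 1080
G_BUFFER = [
    RenderTarget("albedo", bytes_per_pixel=4, writes_per_frame=1, reads_per_frame=1),
    RenderTarget("normals", bytes_per_pixel=8, writes_per_frame=1, reads_per_frame=3),
    RenderTarget("depth", bytes_per_pixel=4, writes_per_frame=1, reads_per_frame=4),
]

ranked = rank_by_traffic(G_BUFFER, WIDTH, HEIGHT)
for rt in ranked:
    mib = traffic_bytes(rt, WIDTH, HEIGHT) / (1024 * 1024)
    print(f"{rt.name}: {mib:.1f} MiB per frame")
```

Under these assumptions, an RT that is small per pixel but read many times can outrank a larger RT that is read once, which is why a ranking of this kind considers access counts as well as surface size.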
[0025] Aspects of this disclosure may include several benefits or advantages. For example, aspects of this disclosure may reduce the amount of memory bandwidth used by identifying rendering targets that contribute to increased memory bandwidth. To this end, aspects of this disclosure may identify rendering targets that contribute an increased amount of memory bandwidth compared to other rendering targets. Furthermore, the aspects presented herein may reduce the number of read and write operations that certain rendering targets cause in deferred rendering. That is, the aspects presented herein may identify rendering targets that contribute to an increased number of read and write operations in deferred rendering. The aspects presented herein may also utilize a certain type of memory (e.g., persistent graphics memory) to identify rendering targets that contribute to an increased amount of memory bandwidth. Furthermore, the aspects presented herein may utilize a certain type of memory (e.g., persistent graphics memory) to identify rendering targets that contribute to an increased number of read and write operations.
[0026] Various aspects of the systems, apparatuses, computer program products, and methods will be described more fully below with reference to the accompanying drawings. However, this disclosure may be embodied in many different forms and should not be construed as limited to any particular structure or function presented throughout this disclosure. Rather, these aspects are provided to make this disclosure comprehensive and complete, and to fully convey the scope of this disclosure to those skilled in the art. Based on the teachings herein, those skilled in the art will understand that the scope of this disclosure is intended to cover any aspect of the systems, apparatuses, computer program products, and methods disclosed herein, whether implemented independently of or in combination with other aspects of this disclosure. For example, any number of the aspects set forth herein may be used to implement an apparatus or practice a method. Furthermore, the scope of this disclosure is intended to cover such apparatuses or methods implemented using other structures, functionalities, or structures and functionalities in addition to or different from the various aspects of the disclosure set forth herein. Any aspect disclosed herein may be embodied by one or more elements of the claims.
[0027] Although various aspects are described herein, many variations and substitutions of these aspects fall within the scope of this disclosure. While some potential benefits and advantages of the aspects of this disclosure are mentioned, the scope of this disclosure is not intended to be limited to a particular benefit, use, or objective. Rather, the aspects of this disclosure are intended to be broadly applicable to different wireless technologies, system configurations, networks, and transmission protocols, some of which are illustrated by way of example in the accompanying drawings and the description below. The detailed description and drawings are merely illustrative and not limiting of this disclosure, and the scope of this disclosure is defined by the appended claims and their equivalents.
[0028] Several aspects are presented with reference to various apparatuses and methods. These apparatuses and methods are described in detail and illustrated in the accompanying drawings by various blocks, components, circuits, processes, algorithms, etc. (collectively referred to as "elements"). These elements can be implemented using electronic hardware, computer software, or any combination thereof. Whether such elements are implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system.
[0029] For example, an element, any part of an element, or any combination of elements can be implemented as a “processing system” including one or more processors (which may also be referred to as processing units). Examples of processors include microprocessors, microcontrollers, graphics processing units (GPUs), general-purpose GPUs (GPGPUs), central processing units (CPUs), application processors, digital signal processors (DSPs), reduced instruction set computing (RISC) processors, systems-on-chip (SoCs), baseband processors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic units, discrete hardware circuits, and other suitable hardware configured to perform the various functionalities described in this disclosure. One or more processors in a processing system can execute software. Software can be broadly interpreted as instructions, instruction sets, code, code segments, program code, programs, software components, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, etc., whether expressed in terms of software, firmware, middleware, microcode, hardware description languages, or other terms. The term “application” can refer to software. As described herein, one or more techniques can refer to an application, i.e., software, configured to perform one or more functions. In such examples, the application may be stored in memory (e.g., on-chip memory of a processor, system memory, or any other memory). Hardware described herein, such as a processor, may be configured to execute the application. For example, an application may be described as including code that, when executed by the hardware, causes the hardware to perform one or more techniques described herein. As an example, the hardware may access code from memory and execute that code to perform one or more techniques described herein. In some examples, components are identified in this disclosure. In such examples, a component may be hardware, software, or a combination thereof. Each component may be a separate component or a subcomponent of a single component.
[0030] Therefore, in one or more examples described herein, the described functionality can be implemented in hardware, software, or any combination thereof. If implemented in software, the functionality can be stored or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer storage media. Storage media can be any available medium accessible to a computer. By way of example and not limitation, such computer-readable media may include random access memory (RAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), optical disc storage devices, magnetic disk storage devices, other magnetic storage devices, combinations of computer-readable media of the types described above, or any other medium capable of being used to store computer-executable code in the form of instructions or data structures accessible to a computer.
[0031] In general, this disclosure describes techniques for having a graphics processing pipeline in a single device or multiple devices, thereby improving the rendering of graphics content and/or reducing the load on processing units (i.e., any processing unit, such as a GPU, configured to perform one or more of the techniques described herein). For example, this disclosure describes techniques for performing graphics processing in any device that utilizes graphics processing. Other example benefits are described throughout this disclosure.
[0032] As used herein, instances of the term "content" can refer to "graphic content," "image," or vice versa. This is true regardless of whether these terms are used as adjectives, nouns, or other parts of speech. In some examples, as used herein, the term "graphic content" can refer to content produced by one or more processes in a graphics processing pipeline. In some examples, as used herein, the term "graphic content" can refer to content produced by a processing unit configured to perform graphics processing. In some examples, as used herein, the term "graphic content" can refer to content produced by a graphics processing unit.
[0033] In some examples, as used herein, the term "display content" can refer to content generated by a processing unit configured to perform display processing. Graphical content can be processed to become display content. For example, a graphics processing unit can output graphical content (such as frames) to a buffer (which may be referred to as a frame buffer). A display processing unit can read the graphical content (such as one or more frames) from the buffer and perform one or more display processing techniques on it to generate display content. For example, a display processing unit can be configured to perform compositing on one or more rendering layers to generate frames. As another example, a display processing unit can be configured to composite, blend, or otherwise combine two or more layers into a single frame. A display processing unit can also be configured to scale frames, such as by zooming in or out. In some examples, a frame can refer to a layer. In other examples, a frame can refer to two or more layers that have been blended together to form the frame, i.e., the frame comprises two or more layers, and a frame comprising two or more layers can itself be subsequently blended.
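As a minimal illustration of combining layers into a single frame, the Python sketch below blends layers with the standard "over" operator on non-premultiplied RGBA pixels. The choice of this operator, the representation of a layer as a flat list of RGBA tuples with channels in the range 0.0 to 1.0, and the function names are assumptions made for illustration; the disclosure does not specify how a display processing unit blends layers.

```python
def blend_over(top, bottom):
    """Composite one RGBA pixel over another (non-premultiplied alpha)."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent.
        return (0.0, 0.0, 0.0, 0.0)

    def mix(t, b):
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (mix(tr, br), mix(tg, bg), mix(tb, bb), out_a)


def composite_layers(layers):
    """Combine equally sized layers, listed bottom to top, into one frame."""
    frame = list(layers[0])
    for layer in layers[1:]:
        frame = [blend_over(t, b) for t, b in zip(layer, frame)]
    return frame


# Hypothetical two-pixel layers: an opaque blue background and a red overlay
# that is half transparent in the first pixel and fully transparent in the second.
background = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)]
overlay = [(1.0, 0.0, 0.0, 0.5), (1.0, 0.0, 0.0, 0.0)]
frame = composite_layers([background, overlay])
```

A frame produced this way can itself be passed back in as a layer, which matches the statement above that a frame comprising two or more layers can be subsequently blended.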
[0034] Figure 1 is a block diagram illustrating an example content generation system 100 configured to implement one or more techniques of this disclosure. The content generation system 100 includes a device 104. Device 104 may include one or more components or circuitry for performing the various functions described herein. In some examples, one or more components of device 104 may be components of a system-on-chip (SoC). Device 104 may include one or more components configured to perform one or more techniques of this disclosure. In the illustrated example, device 104 may include a processing unit 120, a content encoder/decoder 122, and a system memory 124. In some aspects, device 104 may include multiple components, such as a communication interface 126, a transceiver 132, a receiver 128, a transmitter 130, a display processor 127, and one or more displays 131. Reference to display 131 may refer to one or more displays 131. For example, display 131 may include a single display or multiple displays. Display 131 may include a first display and a second display. The first display may be a left-eye display, and the second display may be a right-eye display. In some examples, the first and second displays may receive different frames for presentation. In other examples, the first and second displays may receive the same frames for presentation. In further examples, the results of graphics processing may not be displayed on the device; for example, the first and second displays may not receive any frames for presentation. Instead, the frames or graphics processing results may be transferred to another device. In some aspects, this may be referred to as split rendering.
[0035] Processing unit 120 may include internal memory 121. Processing unit 120 may be configured to perform graphics processing, such as in a graphics processing pipeline 107. Content encoder/decoder 122 may include internal memory 123. In some examples, device 104 may include a display processor (such as display processor 127) to perform one or more display processing techniques on one or more frames generated by processing unit 120 prior to presentation by one or more displays 131. Display processor 127 may be configured to perform display processing. For example, display processor 127 may be configured to perform one or more display processing techniques on one or more frames generated by processing unit 120. One or more displays 131 may be configured to display or otherwise present the frames processed by display processor 127. In some examples, the one or more displays 131 may include one or more of the following: liquid crystal display (LCD), plasma display, organic light-emitting diode (OLED) display, projection display device, augmented reality display device, virtual reality display device, head-mounted display, or any other type of display device.
[0036] Memory (such as system memory 124) external to processing unit 120 and content encoder/decoder 122 may be accessible to processing unit 120 and content encoder/decoder 122. For example, processing unit 120 and content encoder/decoder 122 may be configured to read from and/or write to external memory (such as system memory 124). Processing unit 120 and content encoder/decoder 122 may be communicatively coupled to system memory 124 via a bus. In some examples, processing unit 120 and content encoder/decoder 122 may be communicatively coupled to each other via the bus or a different connection.
[0037] Content encoder/decoder 122 can be configured to receive graphic content from any source, such as system memory 124 and/or communication interface 126. System memory 124 can be configured to store received encoded or decoded graphic content. Content encoder/decoder 122 can be configured to receive encoded or decoded graphic content from system memory 124 and/or communication interface 126, for example, in the form of encoded pixel data. Content encoder/decoder 122 can be configured to encode or decode any graphic content.
[0038] Internal memory 121 or system memory 124 may include one or more volatile or non-volatile memories or storage devices. In some examples, internal memory 121 or system memory 124 may include RAM, SRAM, DRAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic data media or optical storage media or any other type of memory.
[0039] According to some examples, internal memory 121 or system memory 124 may be a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or propagating signal. However, the term "non-transitory" should not be construed as meaning that internal memory 121 or system memory 124 is immovable or that its contents are static. For example, system memory 124 may be removed from device 104 and moved to another device. Alternatively, system memory 124 may not be removable from device 104.
[0040] Processing unit 120 may be a central processing unit (CPU), a graphics processing unit (GPU), a general-purpose GPU (GPGPU), or any other processing unit configured to perform graphics processing. In some examples, processing unit 120 may be integrated into the motherboard of device 104. In some examples, processing unit 120 may reside on a graphics card mounted in a port on the motherboard of device 104, or may otherwise be incorporated into a peripheral device configured to interoperate with device 104. Processing unit 120 may include one or more processors, such as one or more microprocessors, GPUs, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), arithmetic logic units (ALUs), digital signal processors (DSPs), discrete logic components, software, hardware, firmware, other equivalent integrated or discrete logic circuits, or any combination thereof. If the techniques are implemented partially in software, processing unit 120 may store instructions for the software in a suitable non-transitory computer-readable storage medium (e.g., internal memory 121) and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, combinations of hardware and software, etc.) can be considered as one or more processors.
[0041] The content encoder/decoder 122 can be any processing unit configured to perform content decoding. In some examples, the content encoder/decoder 122 may be integrated into the motherboard of device 104. The content encoder/decoder 122 may include one or more processors, such as one or more microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), arithmetic logic units (ALUs), digital signal processors (DSPs), video processors, discrete logic components, software, hardware, firmware, other equivalent integrated or discrete logic circuits, or any combination thereof. If the techniques are implemented partially in software, the content encoder/decoder 122 may store instructions for the software in a suitable non-transitory computer-readable storage medium (e.g., internal memory 123) and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing (including hardware, software, combinations of hardware and software, etc.) can be considered as one or more processors.
[0042] In some aspects, the content generation system 100 may include a communication interface 126. The communication interface 126 may include a receiver 128 and a transmitter 130. The receiver 128 may be configured to perform any of the receiving functions described herein with respect to device 104. Additionally, the receiver 128 may be configured to receive information from another device, such as eye or head positioning information, rendering commands, or location information. The transmitter 130 may be configured to perform any of the transmitting functions described herein with respect to device 104. For example, the transmitter 130 may be configured to transmit information to another device, which may include a request for content. The receiver 128 and the transmitter 130 may be combined to form a transceiver 132. In such an example, the transceiver 132 may be configured to perform any of the receiving and/or transmitting functions described herein with respect to device 104.
[0043] Referring again to Figure 1, in some aspects, processing unit 120 may include a rendering target component 198 configured to obtain indications of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process. The rendering target component 198 may also be configured to select at least one RT from the plurality of RTs based on the subset of graphics surfaces associated with the at least one RT. The rendering target component 198 may also be configured to determine whether space exists in a buffer or cache for storing the at least one RT. The rendering target component 198 may also be configured to allocate the subset of graphics surfaces associated with the at least one RT to the buffer or cache based on the existence of space in the buffer or cache for storing the at least one RT. The rendering target component 198 may also be configured to remove a portion of the buffer or cache, based on the absence of space in the buffer or cache for storing the at least one RT, in order to allocate the subset of graphics surfaces associated with the at least one RT; or to determine whether an updated command buffer or an updated command list exists for the subset of graphics surfaces associated with the at least one RT. The rendering target component 198 may also be configured to store the at least one selected RT in the buffer or cache, or to refrain from storing the at least one selected RT in the buffer or cache. The rendering target component 198 may also be configured to write one or more remaining RTs from the plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the at least one selected RT. The rendering target component 198 may also be configured to output an indication of the at least one selected RT from the plurality of RTs. Although the following description may focus on graphics processing, the concepts described herein are applicable to other similar processing techniques.
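The sequence of operations attributed to the rendering target component (check for space, allocate, remove a portion of the buffer or cache when space is lacking, and write the remaining RTs to memory) can be sketched as follows. This is a hypothetical Python model under stated assumptions: the `PersistentMemory` class, the oldest-first eviction policy, the byte sizes, and the rule that evicted entries are written to system memory are all illustrative choices, not details specified by this disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class PersistentMemory:
    """Hypothetical model of a fixed-size on-chip buffer or cache."""
    capacity: int
    resident: dict = field(default_factory=dict)  # RT name -> size in bytes

    def free_space(self):
        return self.capacity - sum(self.resident.values())

    def evict_until(self, needed):
        """Remove resident entries, oldest first, until `needed` bytes are free."""
        evicted = []
        for name in list(self.resident):
            if self.free_space() >= needed:
                break
            del self.resident[name]
            evicted.append(name)
        return evicted


def place_render_targets(rts, selected, gmem):
    """Place selected RTs on chip and send everything else to system memory.

    `rts` maps RT name -> size in bytes; `selected` is the set of RT names
    chosen for on-chip storage. Returns (on_chip_names, system_memory_names).
    """
    system_memory = []
    for name, size in rts.items():
        if name in selected and size <= gmem.capacity:
            if gmem.free_space() < size:
                # No space: remove a portion of the buffer to make room.
                system_memory.extend(gmem.evict_until(size))
            gmem.resident[name] = size
        else:
            # Remaining (unselected or oversized) RTs go to system memory.
            system_memory.append(name)
    return sorted(gmem.resident), system_memory


# Assumed example: a 10-byte buffer that already holds a 6-byte surface.
gmem = PersistentMemory(capacity=10, resident={"old": 6})
on_chip, in_system_memory = place_render_targets(
    {"a": 8, "b": 3, "c": 2}, selected={"a", "c"}, gmem=gmem
)
```

In this simplified model a later selected RT can evict an earlier one if both do not fit; a practical policy would likely weigh each RT's bandwidth contribution before evicting, which this sketch does not attempt.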
[0044] As described herein, a device such as device 104 can refer to any device, apparatus, or system configured to perform one or more of the technologies described herein. For example, a device can be a server, base station, user equipment, client device, station, access point, computer (e.g., personal computer, desktop computer, laptop computer, tablet computer, computer workstation, or mainframe computer), end product, apparatus, telephone, smartphone, server, video game platform or console, handheld device (e.g., portable video game device or personal digital assistant (PDA)), wearable computing device (e.g., smartwatch, augmented reality device, or virtual reality device), non-wearable device, display or display device, television, set-top box, intermediate network device, digital media player, video streaming device, content streaming device, in-vehicle computer, any mobile device, any device configured to generate graphical content, or any device configured to perform one or more of the technologies described herein. The processes described herein may be described as being performed by a specific component (e.g., GPU), but in further embodiments, other components consistent with the disclosed embodiments (e.g., CPU) may be used to perform them.
[0045] A GPU can process various types of data or data packets within its pipeline. For example, in some aspects, a GPU can process two types of data or data packets, such as context register packets and draw call data. Context register packets can be a collection of global state information, such as information about global registers, shaders, or constant data, which can regulate how the graphics context will be handled. For example, a context register packet may include information about the color format. In some aspects of a context register packet, there may be bits indicating which workload belongs to the context register. Furthermore, there may be multiple functions or programs running simultaneously and/or in parallel. For example, a function or program may describe an operation, such as a color mode or color format. Therefore, context registers can define multiple states of the GPU.
[0046] Context states can be used to determine how individual processing units (e.g., vertex fetchers (VFDs), vertex shaders (VSs), shader processors, or geometry processors) operate and/or in which mode a processing unit operates. For this purpose, the GPU can use context registers and programming data. In some aspects, the GPU can generate workloads (e.g., vertex or pixel workloads) in the pipeline based on the context register definitions of modes or states. Certain processing units (e.g., VFDs) can use these states to determine certain functions, such as how to assemble vertices. Because these modes or states can change, the GPU may need to modify the corresponding context. Additionally, the workload corresponding to a mode or state may follow the changed mode or state.
[0047] Figure 2 Example GPU 200 is illustrated according to one or more technologies according to this disclosure. For example... Figure 2 As shown, GPU 200 includes a command processor (CP) 210, a draw call group 212, a VFD 220, a VS 222, a vertex cache (VPC) 224, a triangle setup engine (TSE) 226, a rasterizer (RAS) 228, a Z-process engine (ZPE) 230, a pixel interpolator (PI) 232, a fragment shader (FS) 234, a rendering backend (RB) 236, a level 2 (L2) cache (UCHE) 238, and system memory 240. Although Figure 2 The GPU 200 shown includes processing units 220 to 238, but the GPU 200 may include multiple additional processing units. Additionally, processing units 220 to 238 are merely examples, and any combination or order of processing units may be used in the GPU according to this disclosure. The GPU 200 also includes a command buffer 250, a context register group 260, and a context state 261.
[0048] like Figure 2 As shown, the GPU can use a CP (e.g., CP 210) or a hardware accelerator to resolve the command buffer into context register groups (e.g., context register group 260) and / or draw call data groups (e.g., draw call group 212). Subsequently, CP 210 can transfer the context register group 260 or the draw call group 212 to a processing unit or block in the GPU via a separate path. Furthermore, the command buffer 250 can alternate between different states of the context registers and draw calls. For example, the command buffer can be structured as follows: context register of context N, draw call of context N, context register of context N+1, and draw call of context N+1.
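The alternating command buffer layout described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware behavior: the `Packet` type, its `kind` tags, and the payload strings are hypothetical names introduced here, and the "separate paths" are modeled as two plain lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    kind: str      # hypothetical tag: "context" or "draw"
    context: int   # context index (N, N+1, ...)
    payload: str   # placeholder for register values or draw parameters


def split_command_buffer(buffer):
    """Route each packet to one of two paths, loosely as a CP might."""
    context_path, draw_path = [], []
    for packet in buffer:
        if packet.kind == "context":
            context_path.append(packet)
        elif packet.kind == "draw":
            draw_path.append(packet)
        else:
            raise ValueError(f"unknown packet kind: {packet.kind}")
    return context_path, draw_path


# Layout from the text: context N, draw N, context N+1, draw N+1.
command_buffer = [
    Packet("context", 0, "color_format=A"),
    Packet("draw", 0, "draw call for context 0"),
    Packet("context", 1, "color_format=B"),
    Packet("draw", 1, "draw call for context 1"),
]
context_path, draw_path = split_command_buffer(command_buffer)
```

Each path preserves the submission order of its own packets, so a draw call for context N can still be matched to the context register state of context N.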
[0049] GPUs can render images in several different ways. In some cases, GPUs can render images using direct rendering and/or tiled rendering. In a tiled rendering GPU, an image can be divided or segmented into different sections or tiles. After the image is divided, each section or tile can be rendered individually. A tiled rendering GPU can divide a computer graphics image into a grid format so that each part of the grid (i.e., a tile) is rendered individually. In some aspects, during a binning pass, the image can be divided into different bins or tiles. In some aspects, during a binning pass, a visibility stream can be constructed, in which visible primitives or draw calls can be identified. In contrast to tiled rendering, direct rendering does not divide the frame into smaller bins or tiles. Instead, in direct rendering, the entire frame is rendered at once. Additionally, some types of GPUs allow both tiled rendering and direct rendering.
[0050] In some aspects of tiled rendering, there may be multiple processing stages or passes. For example, rendering may be performed in two passes, such as a visibility or bin-visibility pass and a rendering or bin-rendering pass. During the visibility pass, the GPU may input a rendering workload, record the positions of primitives or triangles, and then determine which primitives or triangles fall into which bin or region. In some aspects of the visibility pass, the GPU may also identify or mark the visibility of each primitive or triangle in a visibility stream. During the rendering pass, the GPU may input the visibility stream and process one bin or region at a time. In some aspects, the visibility stream may be analyzed to determine which primitives or primitive vertices are visible or not visible, so that only the visible primitives or primitive vertices are processed. In this way, the GPU can reduce the unnecessary workload of processing or rendering primitives or triangles that are not visible. In some aspects, certain types of primitive geometry, such as position-only geometry, may be processed during the visibility pass. Additionally, primitives may be sorted into different bins or regions based on their position or location. In some cases, sorting primitives or triangles into different bins can be performed by determining visibility information for those primitives or triangles. For example, the GPU can determine or write the visibility information of each primitive in each bin or region, for example, in system memory. This visibility information can be used to determine or generate the visibility stream. In the rendering pass, the primitives in each bin can be rendered individually. In these instances, the visibility stream can be retrieved from memory and used to discard primitives that are not visible for that bin.
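As a rough illustration of the visibility pass, the following Python sketch sorts primitives into bins by their screen-space bounding boxes. It is a simplification under stated assumptions: the function name `bin_primitives`, the bounding-box representation, and the sample primitives are hypothetical, and bounding-box overlap is only a conservative stand-in for true visibility (no depth or occlusion test is modeled).

```python
def bin_primitives(primitives, width, height, bin_size):
    """Map each bin to the primitives whose bounding box touches it.

    primitives: dict of name -> (xmin, ymin, xmax, ymax) in pixels.
    Returns a dict of (bin_col, bin_row) -> list of primitive names,
    which plays the role of a per-bin visibility stream.
    """
    cols = -(-width // bin_size)    # ceiling division
    rows = -(-height // bin_size)
    visibility = {(c, r): [] for r in range(rows) for c in range(cols)}
    for name, (xmin, ymin, xmax, ymax) in primitives.items():
        # Clamp the bounding box to the surface.
        xmin, ymin = max(xmin, 0), max(ymin, 0)
        xmax, ymax = min(xmax, width - 1), min(ymax, height - 1)
        if xmin > xmax or ymin > ymax:
            continue  # entirely off-screen: not visible in any bin
        for r in range(ymin // bin_size, ymax // bin_size + 1):
            for c in range(xmin // bin_size, xmax // bin_size + 1):
                visibility[(c, r)].append(name)
    return visibility


prims = {
    "tri_a": (10, 10, 100, 100),    # lies inside bin (0, 0)
    "tri_b": (200, 50, 300, 120),   # straddles bins (0, 0) and (1, 0)
    "tri_c": (-50, -50, -1, -1),    # off-screen
}
vis = bin_primitives(prims, width=512, height=256, bin_size=256)
```

In the rendering pass, a bin would then process only the names listed under its own key, which is how primitives that do not touch the bin are discarded.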
[0051] Some aspects of GPUs or GPU architectures offer multiple different options for rendering (e.g., software rendering and hardware rendering). In software rendering, the driver or CPU can replicate the entire frame geometry by processing each view one time. Additionally, some states can change depending on the viewpoint. Therefore, in software rendering, the software can replicate the entire workload by changing some of the states used to render each viewpoint in the image. In some aspects, this can lead to increased overhead because the GPU may submit the same workload multiple times, once for each viewpoint in the image. In hardware rendering, the hardware or GPU may be responsible for replicating or processing the geometry for each viewpoint in the image. Therefore, the hardware can manage the replication or processing of primitives or triangles for each viewpoint in the image.
[0052] Figure 3 illustrates an example image or surface 300 that includes multiple primitives divided into multiple bins. As shown in Figure 3, the image or surface 300 includes a region 302, which includes primitives 321, 322, 323, and 324. Primitives 321, 322, 323, and 324 are divided or placed into different bins, such as bins 310, 311, 312, 313, 314, and 315. Figure 3 shows an example of tiled rendering using multiple viewpoints for primitives 321 to 324. For example, primitives 321 to 324 are in a first viewpoint 350 and a second viewpoint 351. Therefore, a GPU processing or rendering the image or surface 300 including region 302 can utilize multiple viewpoints or multi-view rendering.
[0053] As indicated herein, GPUs can use a tiled rendering architecture to reduce power consumption or save memory bandwidth. As further stated above, this rendering method divides the scene into multiple bins and includes a visibility pass that identifies the triangles visible in each bin. Therefore, in tiled rendering, the entire screen can be divided into multiple bins or tiles. The scene can then be rendered multiple times, for example, one or more times for each bin. In various aspects of graphics rendering, some graphics applications may render to a single target (i.e., a render target) once or multiple times. For example, in graphics rendering, the frame buffer in system memory can be updated multiple times. The frame buffer can be a portion of memory or random access memory (RAM) (e.g., containing a bitmap or storage) that helps store display data for the GPU. The frame buffer can also be a memory buffer containing a complete frame of data. Additionally, the frame buffer can be a logical buffer. In some aspects, updating the frame buffer can be performed in bin or tile rendering, where, as discussed above, the surface is divided into multiple bins or tiles, and each bin or tile can then be rendered individually. In addition, in tiled rendering, the frame buffer can be divided into multiple bins or tiles.
[0054] Additionally, graphics applications may construct or include multiple buffers, such as a depth buffer and/or a color buffer with diffuse colors. Furthermore, graphics applications may construct or include shadow maps at the depth or color buffers, for example, for lights. For instance, an application may run a renderer on one buffer, for example, for diffuse colors, and then move to another buffer, for example, to create shadows for different lights. Graphics applications may also combine additional information with previously saved information at the buffers, such as specular colors and/or shadows from a previous color buffer. As indicated herein, in a bin or tile rendering architecture, data may be repeatedly stored in or written to frame buffers, for example, when rendering from different types of memory. This can be referred to as resolving and unresolving the frame buffer or system memory. For example, when storing or writing to one frame buffer and then switching to another, the data or information in the frame buffer can be resolved from GPU internal memory (GMEM) at the GPU to system memory, i.e., memory in double data rate (DDR) RAM or dynamic RAM (DRAM).
[0055] In some aspects, system memory can also be system-on-chip (SoC) memory or another chip-based memory, such as on a device or smartphone, for storing data or information. System memory can also be a physical data storage device shared by the CPU and/or GPU. In some cases, system memory can be, for example, a DRAM chip on a device or smartphone. Thus, SoC memory can be a chip-based way of storing data. In some aspects, GMEM can be on-chip memory at the GPU, which can be implemented using static RAM (SRAM). Additionally, GMEM can be located on the device (e.g., a smartphone). As indicated herein, data or information can be transferred between system memory or DRAM and GMEM, for example, at the device. In some aspects, system memory or DRAM can be located at the CPU or GPU. Additionally, data can be stored at DDR or DRAM. In bin or tile rendering, a small portion of the memory can be stored at the GPU (e.g., at GMEM). In some cases, storing data at GMEM may consume a greater processing workload and/or more power compared to storing data at the frame buffer or system memory.
[0056] As indicated herein, in bin or tile rendering, different types of memory storage (e.g., system or SoC memory and GMEM or on-chip memory) may exist to store different data or information (e.g., the color or depth of a particular tile). In some aspects, the rendering data for each tile or bin may be transferred during an unresolve or resolve process. During the unresolve process, data or information may be moved from system memory to GMEM. Similarly, during the resolve process, data or information may be moved from GMEM to system memory. This process can then be repeated for the next bin or tile. In some aspects, GMEM or on-chip memory may have a limited data size. Therefore, the process of transferring rendering information from GMEM to system memory or the frame buffer can be performed on a tile-by-tile basis. For example, GMEM may be sized to store 256x256 pixels of color, which may correspond to the size of one tile. Compared to GMEM, the frame buffer or system memory may have a larger data size, for example, storing 1920x1080 pixels of color. In some aspects, the frame buffer (e.g., 1920x1080 pixels) can be divided in multiple steps based on the size of each tile (e.g., 256x256 pixels).
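Using the example sizes from the paragraph above, the number of tile-sized steps needed to cover the frame buffer can be worked out directly. The sketch below is only arithmetic on those example figures; the helper name `tile_grid` is hypothetical, and it assumes that partially covered tiles at the right and bottom edges still count as full tiles.

```python
import math


def tile_grid(fb_width, fb_height, tile_width, tile_height):
    """Return (columns, rows, total tiles) needed to cover the frame buffer."""
    cols = math.ceil(fb_width / tile_width)
    rows = math.ceil(fb_height / tile_height)
    return cols, rows, cols * rows


# 1920x1080 frame buffer, 256x256 tiles (the example sizes in the text).
cols, rows, total = tile_grid(1920, 1080, 256, 256)
print(cols, rows, total)  # prints: 8 5 40
```

Under these assumptions the frame buffer is covered by an 8 x 5 grid, so a single pass over the surface involves 40 tile-sized transfers in each direction.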
[0057] As described above, when data or information is stored in or written to system memory or a frame buffer, tiles or bins can be unresolved as the data or information is moved from system memory to GMEM. Likewise, tiles or bins can be resolved as the data or information is moved from GMEM to system memory. For example, the resolve process can transfer data or information the size of one tile (e.g., 256x256 pixels) to system memory. Aspects of this disclosure can then move to another tile and continue the unresolve/resolve process, for example by unresolving the tile from system memory to GMEM, rendering it, and then resolving it from GMEM to system memory. This process can continue until the entire frame buffer is filled. As indicated herein, the data for each tile can be moved from system memory to GMEM (i.e., the unresolve process), and then, after rendering, that data can be moved back from GMEM to system memory (i.e., the resolve process). The unresolve process is therefore the reverse data movement of the resolve process. The unresolve/resolve process can be performed because GPU memory or GMEM may be able to store less information than system memory. Once rendered, the tile data can be moved from GMEM back to the frame buffer stored in system memory. Furthermore, in some aspects, during the unresolve process, when a tile needs to be rendered at the GPU, the data stored in the frame buffer can be transferred to GMEM. Therefore, a portion of the frame buffer data can be transferred from system memory to GMEM, and after rendering based on this data, the data can be transferred back to the frame buffer in system memory. This process can be performed for each bin or tile until the entire surface has been rendered.
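The per-tile round trip between system memory and GMEM can be modeled as a simple loop that counts the bytes moved in each direction. This is a sketch under stated assumptions rather than a description of actual hardware: the function name `render_frame_tiled` is hypothetical, every tile is assumed to be transferred whole, a pixel is assumed to occupy 4 bytes, and the tile count of 40 corresponds to an 8 x 5 grid of 256x256 tiles covering 1920x1080 pixels.

```python
def render_frame_tiled(num_tiles, tile_bytes, passes=1):
    """Count bytes moved between system memory and GMEM for one frame."""
    unresolve_bytes = 0
    resolve_bytes = 0
    for _ in range(passes):
        for _ in range(num_tiles):
            unresolve_bytes += tile_bytes   # system memory -> GMEM
            # ... the tile would be rendered in GMEM here ...
            resolve_bytes += tile_bytes     # GMEM -> system memory
    return unresolve_bytes, resolve_bytes


TILE_BYTES = 256 * 256 * 4   # assumes 4 bytes per pixel
unresolved, resolved = render_frame_tiled(
    num_tiles=40, tile_bytes=TILE_BYTES, passes=3
)
print(unresolved, resolved)  # prints: 31457280 31457280
```

The traffic grows linearly with the number of passes, which is why surfaces that are revisited many times in a frame dominate system memory bandwidth.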
[0058] Additionally, in some aspects, each tile can be rendered multiple times, such that only a portion of the tile is rendered each time. Therefore, rendering data can be transferred back and forth between system memory and GMEM multiple times during the unresolve/resolve process. For example, the GPU can render one aspect of a surface or tile (e.g., the background), and this data can be stored in system memory while other aspects of the surface or tile are being rendered. This data can then be transferred back to the GPU when another part of the scene (e.g., a character) is rendered. This process can also be referred to as rendering in multiple passes. Furthermore, the GPU can render different aspects of the scene at different times. For example, the diffuse color of the scene can be rendered first, then the specular color, and then the shadows. Therefore, when rendering tiles or bins in multiple passes, the frame buffer can store data incrementally. Furthermore, while each bin or tile is being rendered, data can be transferred back and forth between system memory and GPU memory multiple times.
[0059] In some types of GPUs (e.g., bin-rendering GPUs), switching back to a previously rendered surface can involve several different operations for each bin. For example, some data for a bin (e.g., color and depth data) can be moved from a buffer (e.g., a color and depth buffer in system memory) to the GPU's internal memory for color and depth. As described above, this process can be referred to as an unresolve process. The bin or tile can then be rendered based on the data (e.g., color and depth data). The data (e.g., color and depth data) can then be moved from the GPU's internal memory for color and depth to a buffer in system memory (e.g., a color and depth buffer). As described above, this process can be referred to as a resolve process. In some cases, when unresolving a tile or bin, the entire tile can be transferred from system memory to GMEM before the tile is rendered. After rendering, the entire tile can be resolved from GMEM to system memory. Therefore, when some data for a tile is transferred between system memory and GMEM, for example, to render the tile, the data for the entire tile may be transferred. As indicated herein, both the unresolve and resolve processes may consume GPU power and performance to transfer data from system memory to GMEM and vice versa.
[0060] Figure 4 is an illustration 400 of system memory and GMEM in accordance with one or more techniques of this disclosure. As shown in Figure 4, illustration 400 includes system memory 410, system memory 420, system memory 430, system memory 440, GMEM 412, GMEM 422, GMEM 432, display content 428, unresolve process 414, rendering 424, and resolve process 434. System memory 410/420/430/440 can represent the system memory at the GPU or CPU at different times during the unresolve/resolve process. GMEM 412/422/432 can represent the GMEM at the GPU at different times during the unresolve/resolve process.
[0061] As shown in Figure 4, during the unresolve process 414, data or information for a tile can be moved from system memory 410 to GMEM 412. During rendering 424, display content 428 (e.g., the sun) can be rendered for the tile. After rendering, the data or information for display content 428 can be written to or stored in GMEM 422. After the data or information for display content 428 has been copied to and/or stored in GMEM 432, it can be moved from GMEM 432 to system memory 430 during the resolve process 434. The data or information for display content 428 can then be copied to or stored in system memory 440. Figure 4 shows that, in some aspects, only a portion of a tile (e.g., the sun) may be updated, yet the data for the entire tile is transferred from system memory to GMEM and back again. Transferring the data for the entire tile can waste a significant amount of memory bandwidth. When a portion of a tile is rendered, the entire area of the tile may not be rendered. This can also apply to certain rendering operations, such as rendering to color and depth memory. In some aspects, during bin rendering, significant portions of the data or information for a bin or tile may not be written or updated after rendering. For example, portions of GMEM may not need to be updated during rendering. As shown in Figure 4, the sun is rendered at 424, so the rest of the bin or tile does not need to be rendered.
[0062] Various aspects of graphics processing can utilize different rendering techniques. For example, forward rendering is a rendering technique that delivers each geometry one at a time through the graphics pipeline to produce the final image. Forward rendering is the standard rendering technique used by most graphics engines. In forward rendering, the geometry is supplied, broken down into vertices, and then those vertices are transformed and divided into fragments or pixels. These fragments or pixels then undergo final rendering processing before being delivered to the display. In deferred rendering, rendering is delayed until all geometry has been delivered through the graphics pipeline. Then, after all geometry has been delivered through the graphics pipeline, the final image is produced by applying shading at the end. That is, final rendering is delayed until all geometry has been delivered.
[0063] Various aspects of graphics processing can also utilize render targets (RTs), which allow a graphics scene to be rendered to an intermediate location (e.g., intermediate memory or a buffer) rather than to the frame buffer. That is, in graphics processing, a render target is an area of memory (e.g., graphics memory or computer memory) where the next frame to be displayed is drawn. For example, in three-dimensional (3D) computer graphics, a render target can be a feature of a graphics processing unit (GPU) that allows a 3D scene to be rendered to an intermediate memory buffer (e.g., a render target texture (RTT)). That is, the scene can be rendered to an intermediate memory buffer, rather than to the frame buffer or back buffer. In some aspects, the intermediate memory buffer or render target texture can be manipulated by certain shaders (e.g., pixel shaders) to apply additional effects to the final image before it is displayed.
[0064] Render targets reside in areas of dedicated memory on a graphics card or graphics processing unit (GPU). Render targets can be used to increase rendering speed at the GPU. Additionally, render targets may be referred to as back buffers, framebuffer objects, or double buffers. In three-dimensional (3D) computer graphics, render targets can be used to draw textures onto objects to help optimize the final displayed image (e.g., when the image is compiled). In some aspects of graphics processing, multiple render targets (MRTs) may exist, where different parts of a frame can be drawn onto different surfaces and then composited onto a final target. Render targets can resemble the double buffering process. For example, an image can be drawn onto a surface outside the screen (i.e., an area of memory) so that when the next frame is to be drawn to the display or screen, this can be done quickly because all drawing functions have been performed. With render targets, the area of memory utilized can reside on the graphics card and be managed by the GPU's hardware or other aspects, allowing render targets to be both fast and efficient. Render targets can also be used to optimize the rendering of objects that use images for surface textures. In some aspects, the rendering context can reside within the graphics hardware, which allows for rapid rasterization of graphical objects.
[0065] In forward rendering, certain types of computations (e.g., lighting calculations) are performed for each vertex and each fragment for every light in the visible scene. In some cases, an increase in the number of lights and/or special effects in the scene can lead to an increase in the number of computations associated with the scene. For example, the number of computations can correspond to the instructions to be executed for a pixel in the GPU. Rendering complexity can be proportional to the number of fragment shader instructions, the number of fragments, and the number of effective lights. In Big O notation, rendering complexity can be written as: O(number_of_fragment_shader_instruction * number_of_fragments * number_of_effective_lights). That is, rendering complexity can be proportional to the number of fragments shaded before the final depth test and the number of lights in the scene. In some aspects, modern graphics may require more dynamic lighting for better realism in applications/games and in virtual worlds.
[0066] In deferred rendering, certain types of calculations (e.g., lighting calculations) can be performed only on the pixels visible on the screen, reducing the total number of fragments to be shaded for lighting. Ideally, the number of shaded fragments equals the screen resolution, rather than the total number of fragments. Rendering complexity can be proportional to the number of fragment shader instructions, the screen resolution, and the number of effective lights. In Big O notation, the complexity of deferred rendering can be expressed as O(number_of_fragment_shader_instruction * screen_resolution * number_of_effective_lights). The savings in the number of fragments to be shaded translate directly into savings in raw compute cycles. Therefore, deferred rendering benefits both power and performance. However, the render targets (RTs) used in deferred rendering may need to go through multiple passes to calculate lighting and special effects in order to produce the final image.
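The two complexity expressions above can be compared numerically. The following Python sketch simply evaluates both products; the instruction count, light count, and especially the overdraw factor (how many fragments are generated per screen pixel in forward rendering) are hypothetical values chosen for illustration, not figures from this disclosure.

```python
def forward_cost(instr_per_fragment, num_fragments, num_lights):
    """Forward rendering: every generated fragment is shaded for every light."""
    return instr_per_fragment * num_fragments * num_lights


def deferred_cost(instr_per_fragment, screen_pixels, num_lights):
    """Deferred rendering: ideally only one fragment per screen pixel is shaded."""
    return instr_per_fragment * screen_pixels * num_lights


screen_pixels = 1920 * 1080
overdraw = 3                          # hypothetical fragments per pixel
fragments = screen_pixels * overdraw

fwd = forward_cost(100, fragments, 8)
dfr = deferred_cost(100, screen_pixels, 8)
ratio = fwd / dfr
print(ratio)  # prints: 3.0
```

With equal shader length and light count, the ratio reduces to the overdraw factor, which is the shading work that deferred rendering avoids. The model ignores the extra memory traffic of writing and re-reading the render targets.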
[0067] As indicated above, deferred rendering reduces the number of raw compute cycles by shading fewer fragments than forward rendering. For example, in 3D graphics, deferred rendering is a popular choice because it saves raw compute compared to forward rendering. However, deferred rendering comes at the cost of multiple render targets (RTs) and high memory bandwidth. For example, deferred rendering consumes high memory bandwidth because the render targets (RTs) may require multiple passes for certain types of graphics computations (e.g., calculations of lighting and special effects) to produce the final image. That is, deferred rendering can utilize render targets that may be associated with an increased number of certain computations (e.g., calculations of lighting and special effects). Therefore, deferred rendering may initially render to the render targets and save them to memory, but may later need to fetch the render targets from memory. Thus, deferred rendering may later introduce an increased number of deferred read and write-back instructions in the pipeline. In effect, deferred rendering may not execute the rendering instructions initially, but may save the render targets to memory and then fetch these render targets at a later time.
[0068] Therefore, deferred rendering can save bandwidth and performance up front, but it can also incur deferred bandwidth and performance costs. For example, any savings in shader instructions due to deferred rendering can come at the cost of higher bandwidth, which increases power consumption in the memory subsystem (e.g., cache, channels, and DDR). Additionally, some render targets (RTs) are more dominant than others in terms of how many times they are used to render a frame. Therefore, some RTs contribute more memory bandwidth than others. Based on the above, it may be beneficial to identify the RTs that contribute to an increase in memory bandwidth (e.g., system memory bandwidth). It may also be beneficial to reduce the amount of memory bandwidth utilized by identifying the render targets that contribute the most memory bandwidth. Furthermore, it may also be beneficial to reduce the number of read and write operations that certain render targets cause in deferred rendering.
[0069] The aspects of this disclosure can identify rendering targets that contribute to an increase in memory bandwidth (e.g., system memory bandwidth). For example, the aspects of this disclosure can reduce the amount of memory bandwidth used by identifying rendering targets that contribute to more memory bandwidth. To this end, the aspects of this disclosure can identify rendering targets that contribute to an increase in memory bandwidth compared to other rendering targets. Additionally, the aspects presented herein can reduce the number of read and write operations in deferred rendering due to certain rendering targets. That is, the aspects presented herein can identify rendering targets that contribute to an increased number of read and write operations in deferred rendering. The aspects presented herein can utilize some type of memory (e.g., persistent graphics memory) to identify rendering targets that contribute to an increase in memory bandwidth. Furthermore, the aspects presented herein can utilize some type of memory (e.g., persistent graphics memory) to identify rendering targets that contribute to an increased number of read and write operations.
[0070] In some cases, the aspects presented herein can identify and select rendering targets that contribute to increased system memory traffic or that contribute the most to system memory traffic. The aspects presented herein can then store the selected rendering targets in a local buffer for multiple uses within a scene or frame (i.e., persistent use). Therefore, the aspects presented herein can significantly reduce the amount of system memory traffic, which can improve performance at the GPU and/or reduce the amount of power utilized at the GPU. For example, the GPU herein can introduce a local buffer (e.g., persistent graphics memory (pGMEM)) to hold memory traffic internally, thereby significantly reducing traffic on the channels and DDR. By doing so, the aspects presented herein can provide performance improvements and power savings at the GPU. The performance improvements may be due to lower read/write latency and lower effective bandwidth. Furthermore, bandwidth savings can translate directly into power savings in the system-on-chip (SoC), such as SoCs with varying thermal design points. Therefore, the aspects presented herein can give the GPU a competitive advantage in both sustained performance and power consumption.
[0071] The aspects presented herein relate to deferred rendering and the corresponding high memory bandwidth, which arises because a rendering target (RT) may require multiple passes for certain computations (e.g., computations for lighting and special effects) to produce the final image. That is, the aspects presented herein can take advantage of the fact that a rendering target can undergo multiple passes and thus multiple round trips between the graphics processing unit (GPU) and system memory (SysMem). To achieve this, hardware and software can work together. First, a local buffer (e.g., persistent graphics memory (pGMEM)) can be introduced that is large enough to hold at least one of the multiple rendering targets (MRTs). The aspects presented herein can utilize an algorithm to select the best possible RT, i.e., the RT that yields the greatest memory bandwidth savings. Candidate surfaces can then be allocated in pGMEM and used as both RTs (e.g., for writing) and texture surfaces (e.g., for reading) as needed. The algorithm can also guide the driver as to when to reclaim pGMEM space for the next best candidate RT. Doing so can give the GPU a competitive advantage in terms of sustained performance and/or power consumption.
[0072] Figure 5A and Figure 5B are illustrations 500 and 550, respectively, of example rendering methods. Illustration 500 in Figure 5A depicts a rendering technique 502 that utilizes multiple rendering targets (MRTs). As shown in Figure 5A, illustration 500 includes a GPU 510 and a system memory 520 that includes multiple rendering targets (e.g., RT0, RT1, RT2, and RT3). That is, in Figure 5A, all of the rendering targets (e.g., RT0, RT1, RT2, and RT3) are stored in system memory 520. Illustration 550 in Figure 5B depicts a rendering technique 552 that utilizes MRTs together with persistent graphics memory. As shown in Figure 5B, illustration 550 includes a GPU 560, a system memory 570 including multiple rendering targets (e.g., RT1, RT2, and RT3), and a persistent graphics memory 580 including a rendering target (e.g., RT0). In Figure 5B, some of the rendering targets (e.g., RT1, RT2, and RT3) are stored in system memory 570, and at least one rendering target (e.g., RT0) is stored in persistent graphics memory 580. Illustration 550 depicts that the aspects presented herein can select a rendering target (e.g., RT0) and move that rendering target to persistent graphics memory 580. Once an RT is stored in persistent graphics memory 580, software can control this RT. The software can retain the RT for multiple uses and release it at an appropriate time. For example, the software can determine that it can evict an RT and replace it with a better-performing candidate RT. As shown in Figure 5B, the aspects presented herein can determine which rendering target to store in graphics memory (e.g., persistent graphics memory 580) and/or which rendering target will be used most persistently in graphics memory.
[0073] In some aspects, the traffic contribution from multiple passes over a rendering target can be a post-shader effect. Furthermore, the driver may not be able to determine on its own how to select the best RT. However, the driver can obtain hints about how to select the best RT, for example, while processing the commands to be submitted to the GPU's command buffer. For example, just before submitting commands, the driver can scan the command buffer (CB) for each submission queue (SubmitQ) and/or command queue (CmdQ) instruction. Additionally, the driver can attempt to select the best possible surface to allocate by constructing a dependency graph (e.g., a directed acyclic graph (DAG)). For certain application programming interfaces (APIs) (e.g., bindless graphics APIs), the driver can take into account the number of times a surface is referenced in a frame (RefCnt) by using the transitions in resource barriers (RBs) or by using surface reference counters from the GPU hardware. The driver can even use both RBs and RefCnt to improve the accuracy of, and confidence in, the selected surface. In the case of resource barriers, the driver can use the change in the view of a surface during a barrier transition. Surface reference counts can come from multiple sources; one approach is to obtain the reference counts from previous frames. In the case of APIs such as OpenGL (Open Graphics Library), surfaces are not bindless, and the driver can have complete knowledge of the surface binding information for each stage of the pipeline. Therefore, for APIs such as OpenGL, the driver may be able to build a dependency graph for the entire frame without any feedback from the GPU hardware (e.g., counter feedback).
[0074] Additionally, the aspects presented herein may include algorithms that consider multiple factors such as surface resolution, blending information, surface type (e.g., color, depth, normals, light, etc.), and any other heuristics that may be associated with memory traffic contribution. Among surfaces with similar reference counts, higher resolution surfaces tend to generate an increased amount of memory traffic. In some cases, if surfaces are to be blended, a surface may generate both read and write traffic to the system memory. Furthermore, certain types of surfaces (e.g., color surfaces) may generate more traffic compared to depth or normal surfaces with the same resolution and bits per pixel (bpp). These heuristics (e.g., heuristics coupled with reference counting) may be input to surface selection algorithms and / or may help select candidate surfaces to be included in persistent graphics memory (pGMEM).
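One way to picture how these heuristics might be combined into a single weight per surface is sketched below in Python. The formula, the `Surface` fields, and the numeric factors in `TYPE_FACTOR` are all hypothetical assumptions made for illustration; the text names the inputs (resolution, blending, surface type, reference count) but does not specify how they are weighted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    name: str
    width: int
    height: int
    bytes_per_pixel: int
    ref_count: int     # times the surface is referenced in a frame
    blended: bool      # blending implies both read and write traffic
    kind: str          # e.g., "color", "depth", "normal"


# Hypothetical per-type factors: color assumed to generate more traffic.
TYPE_FACTOR = {"color": 1.0, "depth": 0.8, "normal": 0.8}


def surface_weight(s):
    """Estimate relative memory traffic contribution (illustrative only)."""
    size_bytes = s.width * s.height * s.bytes_per_pixel
    blend_factor = 2.0 if s.blended else 1.0
    return size_bytes * s.ref_count * blend_factor * TYPE_FACTOR.get(s.kind, 1.0)


surfaces = [
    Surface("shadow_map", 1024, 1024, 4, ref_count=2, blended=False, kind="depth"),
    Surface("albedo", 1920, 1080, 4, ref_count=4, blended=True, kind="color"),
    Surface("scene_depth", 1920, 1080, 4, ref_count=4, blended=False, kind="depth"),
]
ranked = sorted(surfaces, key=surface_weight, reverse=True)
best_candidate = ranked[0].name
```

In this example the blended, full-resolution color surface ranks first, which matches the qualitative statements above about resolution, blending, and surface type.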
[0075] Figure 6 is a diagram 600 illustrating an example including a rendering target selection algorithm. More specifically, diagram 600 depicts a flowchart 602 for surface allocation in persistent graphics memory (pGMEM). As shown in Figure 6, diagram 600 includes multiple steps for surface allocation in pGMEM. Figure 6 also depicts that the driver can receive hints while processing commands to be submitted to the GPU's command buffer. A command buffer or command list can be a buffer that stores instructions or commands at the GPU. For example, a command buffer or command list can be a placeholder used to record commands for drawing and resource management. Command queues and submission queues can be queues within the GPU used to hold one or more command lists / buffers. Resource barriers or memory barriers can be synchronization commands used to manage access to the memory where a surface resides. A surface can be referred to as a read / write surface or a read-write surface. Barriers can be used to change the state of a surface accessed via GPU commands. Furthermore, Figure 6 depicts multiple input parameters that may exist for rendering target selection. For example, input parameters for rendering target selection may include surface resolution, surface type, blending, resource barriers, and / or surface reference count.
[0076] As shown in Figure 6, at 610, the algorithm can wait for a submission instruction (e.g., a submission queue (SubmitQ) instruction). At 612, the algorithm can determine whether the command is inside the active submission queue (SubmitQ). If no at 612, the algorithm returns to step 610. If yes at 612, then at 614, the algorithm can resolve the pre-allocated surface (if needed). If yes at 626, then at 630, the algorithm can determine whether a command list / buffer (CmdL / B) instruction exists. If yes at 630, then at 640, the algorithm can search for a barrier command (Cmd). At 642, the algorithm can collect reference counts (RefCnts) for each surface. At 644, the algorithm can collect resolution and surface properties (e.g., format, type, etc.). At 646, the algorithm can label each surface with weights. At 650, the algorithm can output a dependency graph (DAG). At 620, the algorithm can input a dependency graph (DAG). At 622, the algorithm selects the surface with the highest weight. At 624, the algorithm determines whether space exists in pGMEM. If no at 624, then at 626, the algorithm determines whether there is a need for a pGMEM surface. If no at 626, then at 660, the algorithm reclaims pGMEM space and then allocates a surface at 670. If yes at 624, the algorithm allocates a surface at 670. At 680, the algorithm waits for instructions from a new submission queue (SubmitQ).
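The selection and allocation portion of this flow (steps 622, 624, 660, and 670) can be sketched as follows. This is a simplified reading under stated assumptions: the dictionaries of `(weight, size)` pairs, the function name, and the reclaim policy (evict lowest-weight residents, and only those weighted below the candidate) are hypothetical and are not specified by the flowchart.

```python
def allocate_best_surface(candidates, resident, capacity):
    """candidates / resident: {name: (weight, size_bytes)}; resident is updated in place.

    Returns the name of the surface allocated to pGMEM, or None.
    """
    if not candidates:
        return None
    name = max(candidates, key=lambda n: candidates[n][0])     # step 622
    weight, size = candidates[name]
    used = sum(sz for _, sz in resident.values())

    # Plan evictions first so nothing is reclaimed unless the surface will fit.
    victims, freed = [], 0
    for victim in sorted(resident, key=lambda n: resident[n][0]):
        if used - freed + size <= capacity:                     # step 624
            break
        if resident[victim][0] >= weight:
            break  # assumed policy: never evict a more valuable surface
        victims.append(victim)
        freed += resident[victim][1]

    if used - freed + size > capacity:
        return None
    for victim in victims:                                      # step 660
        del resident[victim]
    resident[name] = (weight, size)                             # step 670
    return name


MB = 1024 * 1024
resident = {"shadow_map": (10, 8 * MB)}
candidates = {"gbuffer_albedo": (50, 8 * MB), "bloom": (5, 4 * MB)}
chosen = allocate_best_surface(candidates, resident, capacity=8 * MB)
print(chosen, sorted(resident))
```

Here the 8 MB pGMEM is full, so the lower-weight resident surface is reclaimed before the highest-weight candidate is allocated.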
[0077] Figures 7A and 7B are diagrams 700 and 750, respectively, illustrating examples including rendering methods. Diagram 700 in Figure 7A depicts a binned rendering method 702 utilizing multiple rendering targets (MRTs). For example, as shown in Figure 7A, diagram 700 includes a shader core 710, a local GMEM 720 including multiple rendering targets (e.g., RT0, RT1, RT2, and RT3), and a system memory 730 including multiple rendering targets (e.g., RT0, RT1, RT2, and RT3). Figure 7A depicts the multiple rendering targets (e.g., RT0, RT1, RT2, and RT3) being stored in local GMEM 720 and system memory 730. Diagram 750 in Figure 7B depicts a direct rendering method 752 utilizing MRTs. For example, as shown in Figure 7B, diagram 750 includes a shader core 760 and a system memory 770 including multiple rendering targets (e.g., RT0, RT1, RT2, and RT3). Figure 7B depicts the multiple rendering targets (e.g., RT0, RT1, RT2, and RT3) being stored in system memory 770. Figures 7A and 7B illustrate that, in the case of MRT, one or more surfaces (e.g., four in deferred rendering) can be rendered in a single render pass (RP), where the set of draw calls serves a similar rendering purpose. As shown in Figure 7A, in the binned rendering method 702, all RTs in the MRT iteration can be rendered to the local GMEM 720 (e.g., bGMEM). As shown in Figure 7B, in the direct rendering method 752, the RTs are rendered directly to the system memory 770.
[0078] To allocate one or more surfaces to pGMEM, the aspects presented herein may utilize a rendering technique known as hybrid rendering (HR). In hybrid rendering, the aspects presented herein may allocate one or more surfaces to pGMEM, but not all surfaces of the MRT. In hybrid rendering according to this disclosure, an RT may reside in either pGMEM or system memory (sysMEM). Furthermore, in hybrid rendering, the software may have complete control over the residency of the RT. Additionally, based on the available space in pGMEM, the driver may allocate one or more surfaces to pGMEM, and the driver may allocate the remaining surfaces to sysMEM.
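A minimal sketch of such a residency decision is shown below, assuming each RT of the MRT already carries a weight and a size in bytes. The greedy highest-weight-first order and the names used are assumptions for illustration only.

```python
def assign_residency(mrt, free_bytes):
    """mrt: list of (name, size_bytes, weight). Returns {name: "pGMEM" | "sysMEM"}."""
    residency = {}
    # Assumed policy: consider the highest-weight surfaces first.
    for name, size, _weight in sorted(mrt, key=lambda s: s[2], reverse=True):
        if size <= free_bytes:
            residency[name] = "pGMEM"
            free_bytes -= size
        else:
            residency[name] = "sysMEM"
    return residency


MB = 1024 * 1024
rt_bytes = 1920 * 1080 * 4  # one 1080p RT at 32 bpp
mrt = [("RT0", rt_bytes, 4), ("RT1", rt_bytes, 3),
       ("RT2", rt_bytes, 2), ("RT3", rt_bytes, 1)]
residency = assign_residency(mrt, free_bytes=8 * MB)
print(residency)
```

With an 8 MB pGMEM and four 1080p RTs, only RT0 fits, and RT1 to RT3 stay in system memory.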
[0079] Figures 8A and 8B are diagrams 800 and 850, respectively, illustrating examples including rendering methods. Diagram 800 in Figure 8A depicts a hybrid binned rendering method 802 utilizing multiple rendering targets (MRTs). For example, as shown in Figure 8A, diagram 800 includes a shader core 810, a local memory 820 including multiple rendering targets (e.g., RT1, RT2, and RT3), a system memory 830 including multiple rendering targets (e.g., RT1, RT2, and RT3), and a persistent graphics memory 840 including a rendering target (e.g., RT0). Figure 8A depicts some of the multiple rendering targets (e.g., RT1, RT2, and RT3) being stored in local memory 820 and system memory 830, and at least one rendering target (e.g., RT0) being stored in persistent graphics memory 840. Diagram 800 illustrates that the aspects presented herein can select a rendering target (e.g., RT0) and move that rendering target to persistent graphics memory 840. As shown in Figure 8A, in the case of hybrid rendering using the binning mode, RT0 can be rendered directly to persistent graphics memory 840, and the other three RTs (RT1, RT2, RT3) can be rendered to local memory 820 and later resolved (i.e., written back) to system memory 830. Once an RT is stored in persistent graphics memory 840, the software can control this RT. Furthermore, the software can retain an RT for multiple uses and release it at appropriate times. For example, the software can determine that it can evict an RT and replace it with a better-performing candidate RT. As shown in Figure 8A, the aspects presented herein can determine which rendering target should be stored in graphics memory (e.g., persistent graphics memory 840).
[0080] Diagram 850 in Figure 8B depicts a hybrid direct rendering method 852 utilizing MRTs. For example, as shown in Figure 8B, diagram 850 includes a shader core 860, a system memory 870 including multiple rendering targets (e.g., RT1, RT2, and RT3), and a persistent graphics memory 880 including a rendering target (e.g., RT0). Figure 8B depicts some of the multiple rendering targets (e.g., RT1, RT2, and RT3) being stored in system memory 870, and at least one rendering target (e.g., RT0) being stored in persistent graphics memory 880. Diagram 850 illustrates that the aspects presented herein can select a rendering target (e.g., RT0) and move that rendering target to persistent graphics memory 880. As shown in Figure 8B, in the case of hybrid rendering using direct mode, each of the other RTs (e.g., RT1, RT2, RT3) can bypass local memory and be rendered directly to system memory 870, while RT0 can still be rendered to persistent graphics memory 880. Furthermore, once an RT is stored in persistent graphics memory 880, the software can control this RT, and the software can retain the RT for multiple uses and release it at appropriate times. For example, the software can determine that it can evict an RT and replace it with a better-performing candidate RT. As shown in Figure 8B, the aspects presented herein can determine which rendering target should be stored in graphics memory (e.g., persistent graphics memory 880).
[0081] In some aspects, applications / games can have different resolutions, such as 1920×1080 (HD or 1080p), 2560×1440 (QHD or 1440p), 3840×2160 (4K or 2160p), etc. Even changes in aspect ratio (such as 16:9 or 16:10) can lead to variations in the supported rendering target (RT) size. In the most common pixel format (i.e., 32 bits per pixel (bpp)), the pGMEM space for one RT is approximately 8MB in a game at 1080p resolution, approximately 14MB at 1440p, and at least approximately 32MB at 4K. The aspects presented herein can utilize partial render target allocation. For example, with a fixed-size pGMEM, there may not be enough space to fully accommodate the render target. In that case, the aspects presented herein can allocate a portion of the RT in pGMEM and the remainder in sysMEM. During the command construction phase, the driver can consider the available pGMEM space and the surface format, and calculate how many rows (e.g., Y rows) the driver can store in pGMEM starting from the top left corner. A pixel can be located in pGMEM if its y-coordinate is less than or equal to the partition row Y. Otherwise, the pixel can be located in system memory. Furthermore, if the base address points to the bottom left of the image (API-specific), the driver can handle that base address accordingly during surface partitioning.
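The size figures and the row-based partition can be checked with a short calculation. The sketch below assumes a linear, uncompressed surface layout with no row padding, and treats 1MB as 1024 × 1024 bytes; the function names are hypothetical, and real surfaces may be tiled or padded.

```python
MB = 1024 * 1024


def rt_size_bytes(width, height, bpp=32):
    """Size of one uncompressed, unpadded render target in bytes."""
    return width * height * bpp // 8


def partition_row(width, height, bpp, free_bytes):
    """Index Y of the last row that fits in pGMEM, or -1 if no row fits."""
    row_bytes = width * bpp // 8
    rows = min(height, free_bytes // row_bytes)
    return rows - 1


def pixel_in_pgmem(y, partition_y):
    """A pixel is in pGMEM when its row is at or above the partition row."""
    return y <= partition_y


resolutions = {"1080p": (1920, 1080), "1440p": (2560, 1440), "4K": (3840, 2160)}
sizes_mb = {name: rt_size_bytes(w, h) / MB for name, (w, h) in resolutions.items()}
y_4k = partition_row(3840, 2160, 32, free_bytes=8 * MB)
print(sizes_mb, y_4k)
```

Under these assumptions the sizes come out at roughly 7.9MB, 14.1MB, and 31.6MB, and an 8MB pGMEM holds the top 546 rows (rows 0 to 545) of a 4K RT, with the remaining rows placed in system memory.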
[0082] Figures 9A and 9B are diagrams 900 and 950, respectively, illustrating examples including surface allocation. Diagram 900 in Figure 9A depicts a full surface allocation 902. For example, as shown in Figure 9A, diagram 900 includes a shader core 910, a system memory 930, and a persistent graphics memory 940 including a rendering target (e.g., RT0). Figure 9A depicts at least one rendering target (e.g., RT0) being stored in persistent graphics memory 940. Diagram 900 illustrates that the aspects presented herein can select a rendering target (e.g., RT0) and move that rendering target to persistent graphics memory 940. As shown in Figure 9A, in the case of full surface allocation 902, RT0 can be rendered directly to persistent graphics memory 940. Diagram 950 in Figure 9B depicts a partial surface allocation 952. For example, as shown in Figure 9B, diagram 950 includes a shader core 960, a system memory 970 including a rendering target (e.g., RT0), and a persistent graphics memory 980 including a rendering target (e.g., RT0). Figure 9B depicts at least one rendering target (e.g., RT0) being stored in system memory 970, and at least one rendering target (e.g., RT0) being stored in persistent graphics memory 980. Diagram 950 illustrates that the aspects presented herein can select a rendering target (e.g., RT0) and move that rendering target to persistent graphics memory 980. As shown in Figure 9B, in the case of partial surface allocation 952, the first half of the RT (e.g., RT0) can bypass local memory and be rendered directly to system memory 970, while the second half of RT0 can still be rendered to persistent graphics memory 980. Furthermore, once the RT is stored in persistent graphics memory 980, the software can control this RT, and the software can retain the RT for multiple uses and release it at appropriate times. For example, the software can determine that it can evict the RT and replace it with a better-performing candidate RT. As shown in Figure 9B, the aspects presented herein can determine which rendering target to store in graphics memory (e.g., persistent graphics memory 980) and / or which rendering target will be used most persistently in graphics memory.
[0083] For full surface and partial surface allocation, the aspects presented herein can perform addressing using the following logic:
if (surface.pGMEM) {
    if (pixel y-coordinate <= surface.PARTITION.Y) {
        calculate the GMEM address and send the request to GMEM;
    } else {
        calculate the SYSMEM address and send the request to SYSMEM;
    }
}
Additionally, to save memory, the aspects presented herein can utilize several other options. For example, the aspects presented herein can utilize compressed pGMEM, which buffers compressed traffic, instead of uncompressed pGMEM. The aspects presented herein can also utilize paging pGMEM, which allocates the surface as pages instead of buffers in a cache-like architecture at the expense of pGMEM regions. The aspects presented herein can also utilize pGMEM as a sub-cache, which allows pGMEM located outside the GPU but inside the LLC to function as a sub-cache. The aspects presented herein can also utilize rectangular partial surfaces, which can define a rectangle for the surface, such that a pixel is located in pGMEM when its (x, y) coordinates are inside the rectangle; otherwise, the pixel can be located in system memory. The aspects presented herein can also utilize multiple partial regions in the surface, which can define multiple lines or rectangles for the surface, so that multiple portions of the surface can be in pGMEM. The aspects presented herein can also utilize dynamic paging in pGMEM, which can define pGMEM as multiple pages / tiles (e.g., 64kB or 1MB). Furthermore, the page size can define the shape of the associated pGMEM-resident tiles. During rendering, a page from pGMEM can be allocated based on pixel location. If allocation is granted, the tile can reside in pGMEM; otherwise, the tile can reside in system memory. During allocation failures, such as when pGMEM is full, no page may be evicted. Additionally, after a tile is allocated, it can remain in pGMEM, and the driver may need to issue an explicit command to release the page. The graphics surface can be a render target, a source texture, a buffer, or any kind of memory structure that is read from (and written to) during graphics processing.
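The dynamic paging option can be sketched as a small allocator that hands out pages on first touch and never evicts on failure. This is an illustrative model only: the class name, the 128 × 128 pixel tile (which corresponds to a 64kB page at 32 bpp), and the `route` / `release` methods are assumptions, not interfaces from this disclosure.

```python
class PagedPGMEM:
    """Toy model of dynamic paging: one pGMEM page per surface tile."""

    def __init__(self, total_pages, tile_w=128, tile_h=128):
        # 128 x 128 pixels x 4 bytes = 64 kB per page (assumed 32 bpp format).
        self.free_pages = total_pages
        self.tile_w, self.tile_h = tile_w, tile_h
        self.resident = set()  # {(surface, tile_x, tile_y)}

    def route(self, surface, x, y):
        """Return which memory serves pixel (x, y), allocating a page on first touch."""
        tile = (surface, x // self.tile_w, y // self.tile_h)
        if tile in self.resident:
            return "pGMEM"
        if self.free_pages > 0:
            self.free_pages -= 1
            self.resident.add(tile)
            return "pGMEM"
        return "sysMEM"  # allocation failed: no page is evicted

    def release(self, surface):
        """Explicit release issued by the driver; returns the number of pages freed."""
        freed = {t for t in self.resident if t[0] == surface}
        self.resident -= freed
        self.free_pages += len(freed)
        return len(freed)


pgmem = PagedPGMEM(total_pages=2)
first = pgmem.route("RT0", 0, 0)       # tile (0, 0) gets a page
second = pgmem.route("RT0", 130, 0)    # tile (1, 0) gets the last page
third = pgmem.route("RT0", 300, 0)     # tile (2, 0): pGMEM is full
again = pgmem.route("RT0", 5, 5)       # tile (0, 0) is already resident
freed = pgmem.release("RT0")
print(first, second, third, again, freed)
```

The third request falls back to system memory because both pages are in use, and the pages only become available again after the explicit release.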
[0084] In some aspects, specific implementations of the aspects presented herein can yield significant benefits in both performance and power. Performance improvements can be primarily due to bandwidth savings and / or shorter memory load latency for shaders. For example, pGMEM can be located closer to the GPU shader system compared to sysMEM or the last-level cache (LLC). In addition to surface persistence, additional benefits can be achieved by discarding allocated surfaces instead of resolving / writing them back to sysMEM. A “discard” flag can be provided by the application to help the driver safely discard any surfaces not used across SubmitQ / cmdB / cmdL and / or frame boundaries. Furthermore, the aspects presented herein can include memory bandwidth savings and performance improvements due to pGMEM features. pGMEM features can be expressed in terms of performance / mm². This can provide consistent benefits when measuring performance / watt metrics. In some aspects, the pGMEM size may be limited to a specific size (e.g., 8MB) due to system-on-chip (SoC) cost considerations. However, this size can be easily scaled to a larger size (e.g., 128MB) covering multiple RTs. Furthermore, the pGMEM can reside on a single die in a multi-die package, and the software can more easily manage the pGMEM through standard stack management.
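The effect of the discard flag on write-back traffic can be illustrated with a very small model. The tuple layout and function name below are hypothetical; the sketch only counts bytes that would be resolved to system memory when surfaces leave pGMEM.

```python
def writeback_bytes(released):
    """released: iterable of (size_bytes, discard) for surfaces leaving pGMEM.

    Surfaces flagged for discard are dropped; the rest are resolved to sysMEM.
    """
    return sum(size for size, discard in released if not discard)


rt_bytes = 1920 * 1080 * 4  # one 1080p RT at 32 bpp
released = [
    (rt_bytes, True),   # intermediate RT not needed after this frame
    (rt_bytes, False),  # RT still needed, so it is written back
]
traffic = writeback_bytes(released)
print(traffic)
```

In this example, flagging one of two 1080p surfaces for discard halves the write-back traffic at release time.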
[0085] The aspects of this disclosure may include several benefits or advantages. For example, aspects of this disclosure may reduce the amount of memory bandwidth used by identifying rendering targets that contribute to increased memory bandwidth. To this end, aspects of this disclosure may identify rendering targets that contribute to an increased amount of memory bandwidth compared to other rendering targets. Furthermore, the aspects presented herein may reduce the number of read and write operations in deferred rendering due to certain rendering targets. That is, the aspects presented herein may identify rendering targets that contribute to an increased number of read and write operations in deferred rendering. The aspects presented herein may also utilize some type of memory (e.g., persistent graphics memory) to identify rendering targets that contribute to an increased amount of memory bandwidth. Furthermore, the aspects presented herein may utilize some type of memory (e.g., persistent graphics memory) to identify rendering targets that contribute to an increased number of read and write operations.
[0086] Figure 10 is a communication flowchart 1000 of graphics processing in accordance with one or more techniques of this disclosure. For example, as shown in Figure 10, diagram 1000 includes example communication between a GPU 1002 (e.g., a GPU, a cache on the GPU, a GPU component, another graphics processor, a CPU, a CPU component, or another central processing unit), a CPU / GPU component 1004 (e.g., a CPU, a cache on the CPU, a CPU component, another central processing unit, a GPU, a GPU component, or another graphics processor), and a memory 1006 (e.g., system memory, graphics memory, or memory or cache on the GPU), in accordance with one or more techniques of this disclosure.
[0087] At 1010, GPU 1002 may obtain indications of a plurality of render targets (RTs) associated with the rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process (e.g., GPU 1002 may obtain indication 1012 from CPU / GPU component 1004). Graphics surfaces may be at least one of the following: texture surfaces, static textures, read / write textures, procedural textures, vertex buffers, or frame buffers. The subset of graphics surfaces may be less than all graphics surfaces. Furthermore, the plurality of RTs may be associated with a memory location for storing a set of render pixels used for the rendering process.
[0088] At 1020, GPU 1002 may select at least one RT from a plurality of RTs based on a subset of graphics surfaces associated with at least one RT. In some aspects, selecting at least one RT may include: selecting at least one RT associated with the highest amount of memory traffic among the plurality of RTs. The subset of graphics surfaces associated with at least one RT may be a portion of the subset of graphics surfaces associated with at least one RT, and this portion of the subset of graphics surfaces associated with at least one RT may include less than the entire subset of graphics surfaces associated with at least one RT. In some cases, selecting at least one RT includes: determining whether there is space in a buffer or cache for storing this portion of the subset of graphics surfaces associated with at least one RT. Furthermore, this portion of the subset of graphics surfaces may correspond to the y-coordinate of the subset of graphics surfaces, and determining whether there is space in the buffer or cache includes: determining whether there is space in the buffer or cache for the portion of the subset of graphics surfaces corresponding to the y-coordinate. In some aspects, at least one RT may be associated with the highest amount of memory traffic among the plurality of RTs, such that at least one RT includes memory traffic higher than one or more of the remaining RTs among the plurality of RTs. In addition, the memory traffic for at least one RT may be associated with at least one of the following: surface resolution, blending information, surface type (e.g., color, depth, normal, light, etc.), usage frequency, or surface grade for a subset of the graphics surfaces associated with at least one RT.
[0089] At 1030, GPU 1002 can determine whether there is space in the buffer or cache for storing at least one RT.
[0090] At 1040, GPU 1002 can allocate a subset of graphics surfaces associated with at least one RT to a buffer or cache based on the existence of space in the buffer or cache for storing at least one RT.
[0091] At 1050, GPU 1002 may remove a portion of a buffer or cache based on the absence of space in the buffer or cache for storing at least one RT, in order to allocate a subset of graphics surfaces associated with at least one RT; or determine whether an updated command buffer or updated command list exists for the subset of graphics surfaces associated with at least one RT. If an updated command buffer or updated command list does not exist for the subset of graphics surfaces associated with at least one RT, the GPU may configure an updated command buffer or updated command list for the subset of graphics surfaces associated with at least one RT. If an updated command buffer or updated command list exists for the subset of graphics surfaces associated with at least one RT, the GPU may perform at least one of the following: identify the existence of a resource barrier for the updated command buffer or updated command list; identify the usage count for the subset of graphics surfaces associated with at least one RT; identify the surface resolution and surface format for the subset of graphics surfaces associated with at least one RT; or identify the surface level for the subset of graphics surfaces associated with at least one RT.
[0092] At 1060, GPU 1002 may store at least one selected RT in a buffer or cache, or avoid storing at least one selected RT in a buffer or cache. Avoiding storage in a buffer / cache may mean not storing in a buffer / cache or stopping storage in a buffer / cache. In some aspects, the buffer may be a local buffer in the graphics processing unit (GPU) or graphics memory in the GPU, and the cache may be a local cache in the GPU or graphics cache in the GPU.
[0093] At 1070, GPU 1002 may write one or more remaining RTs from a plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the selected at least one RT. The at least one memory may be system memory in the graphics processing unit (GPU), persistent memory in the GPU, persistent graphics memory (GMEM) in the GPU, persistent video memory in the GPU, or a persistent buffer in the GPU. In some aspects, the at least one memory may include a first memory and a second memory, and writing one or more remaining RTs to the at least one memory may include writing one or more remaining RTs from a plurality of RTs to the first memory and the second memory. Furthermore, the first memory may be system memory in the graphics processing unit (GPU), and the second memory may be persistent memory in the GPU.
[0094] At 1080, GPU 1002 may output an indication of at least one selected RT from a plurality of RTs. In some aspects, outputting an indication of at least one selected RT may include sending an indication of at least one selected RT from a plurality of RTs. For example, GPU 1002 may send indication 1082 to CPU / GPU component 1004. Furthermore, outputting an indication of at least one selected RT may include storing an indication of at least one selected RT from a plurality of RTs. For example, GPU 1002 may store indication 1084 in memory 1006.
[0095] Figure 11 is a flowchart 1100 of an example method of graphics processing in accordance with one or more techniques of this disclosure. The method may be performed by a GPU (such as an apparatus for graphics processing), a graphics processor, a CPU, a wireless communication device, a combination thereof, and / or any device capable of performing graphics processing, as used in connection with the examples of Figures 1 to 10. The methods described herein can provide a variety of benefits, such as improved resource utilization and / or power savings.
[0096] At 1102, the GPU can obtain indications of multiple render targets (RTs) associated with the rendering process, where each of the multiple RTs is associated with a subset of graphics surfaces used in the rendering process, as described in connection with the examples of Figures 1 to 10. For example, as described in 1010 of Figure 10, GPU 1002 can obtain indications of multiple render targets (RTs) associated with the rendering process, wherein each of the multiple RTs is associated with a subset of graphics surfaces used in the rendering process. Furthermore, step 1102 can be performed by the processing unit 120 in Figure 1. The graphics surface can be at least one of the following: a texture surface, a static texture, a read / write texture, a procedural texture, a vertex buffer, or a frame buffer. Furthermore, multiple RTs can be associated with a memory location for storing a set of rendered pixels used in the rendering process.
[0097] At 1104, the GPU can select at least one RT from a plurality of RTs based on a subset of graphics surfaces associated with at least one RT, as described in connection with the examples of Figures 1 to 10. For example, as described in 1020 of Figure 10, GPU 1002 can select at least one RT from a plurality of RTs based on a subset of graphics surfaces associated with at least one RT. Furthermore, step 1104 can be performed by the processing unit 120 in Figure 1. In some aspects, selecting at least one RT may include: selecting at least one RT associated with the highest amount of memory traffic among a plurality of RTs. A subset of graphics surfaces associated with at least one RT may be a portion of a subset of graphics surfaces associated with at least one RT, and this portion of the subset of graphics surfaces associated with at least one RT may include less than the entire subset of graphics surfaces associated with at least one RT. In some cases, selecting at least one RT includes: determining whether there is space in a buffer or cache for storing this portion of the subset of graphics surfaces associated with at least one RT. Furthermore, this portion of the subset of graphics surfaces may correspond to the y-coordinate of the subset of graphics surfaces, and determining whether there is space in a buffer or cache includes: determining whether there is space in a buffer or cache for a portion of the subset of graphics surfaces corresponding to the y-coordinate. In some aspects, at least one RT may be associated with the highest amount of memory traffic among a plurality of RTs, such that at least one RT includes memory traffic higher than one or more of the remaining RTs among the plurality of RTs. In addition, the memory traffic for at least one RT may be associated with at least one of the following: surface resolution, blending information, surface type (e.g., color, depth, normal, light, etc.), usage frequency, or surface grade for a subset of the graphics surfaces associated with at least one RT.
[0098] At 1112, the GPU can either store at least one selected RT in a buffer or cache, or avoid storing at least one selected RT in a buffer or cache, as described in connection with the examples of Figures 1 to 10. For example, as described in 1060 of Figure 10, GPU 1002 may store at least one selected RT in a buffer or cache, or avoid storing at least one selected RT in a buffer or cache. Furthermore, step 1112 may be performed by the processing unit 120 in Figure 1. In some aspects, the buffer may be a local buffer in the graphics processing unit (GPU) or the graphics memory in the GPU, and the cache may be a local cache in the GPU or the graphics cache in the GPU.
[0099] Figure 12 is a flowchart 1200 of an example method of graphics processing in accordance with one or more techniques of this disclosure. The method may be performed by a GPU (such as an apparatus for graphics processing), a graphics processor, a CPU, a wireless communication device, a combination thereof, and / or any device capable of performing graphics processing, as used in connection with the examples of Figures 1 to 10. The methods described herein can provide a variety of benefits, such as improved resource utilization and / or power savings.
[0100] At 1202, the GPU can obtain indications of multiple render targets (RTs) associated with the rendering process, where each of the multiple RTs is associated with a subset of graphics surfaces used in the rendering process, as described in connection with the examples of Figures 1 to 10. For example, as described in 1010 of Figure 10, GPU 1002 can obtain indications of multiple render targets (RTs) associated with the rendering process, wherein each of the multiple RTs is associated with a subset of graphics surfaces used in the rendering process. Furthermore, step 1202 can be performed by the processing unit 120 in Figure 1. The graphics surface can be at least one of the following: a texture surface, a static texture, a read / write texture, a procedural texture, a vertex buffer, or a frame buffer. Furthermore, multiple RTs can be associated with a memory location for storing a set of rendered pixels used in the rendering process.
[0101] At 1204, the GPU can select at least one RT from a plurality of RTs based on a subset of the graphics surfaces associated with at least one RT, as described in connection with the examples of Figures 1 to 10. For example, as described in 1020 of Figure 10, GPU 1002 can select at least one RT from a plurality of RTs based on a subset of graphics surfaces associated with at least one RT. Furthermore, step 1204 can be performed by the processing unit 120 in Figure 1. In some aspects, selecting at least one RT may include: selecting at least one RT associated with the highest amount of memory traffic among a plurality of RTs. A subset of graphics surfaces associated with at least one RT may be a portion of a subset of graphics surfaces associated with at least one RT, and this portion of the subset of graphics surfaces associated with at least one RT may include less than the entire subset of graphics surfaces associated with at least one RT. In some cases, selecting at least one RT includes: determining whether there is space in a buffer or cache for storing this portion of the subset of graphics surfaces associated with at least one RT. Furthermore, this portion of the subset of graphics surfaces may correspond to the y-coordinate of the subset of graphics surfaces, and determining whether there is space in a buffer or cache includes: determining whether there is space in a buffer or cache for a portion of the subset of graphics surfaces corresponding to the y-coordinate. In some aspects, at least one RT may be associated with the highest amount of memory traffic among a plurality of RTs, such that at least one RT includes memory traffic higher than one or more of the remaining RTs among the plurality of RTs. In addition, the memory traffic for at least one RT may be associated with at least one of the following: surface resolution, blending information, surface type (e.g., color, depth, normal, light, etc.), usage frequency, or surface grade for a subset of the graphics surfaces associated with at least one RT.
[0102] At 1206, the GPU can determine whether there is space in the buffer or cache for storing at least one RT, as described in connection with the examples of Figures 1 to 10. For example, as described in 1030 of Figure 10, GPU 1002 can determine whether there is space in a buffer or cache for storing at least one RT. Furthermore, step 1206 can be performed by the processing unit 120 in Figure 1.
[0103] At 1208, the GPU can allocate a subset of the graphics surfaces associated with at least one RT to the buffer or cache based on the existence of space in the buffer or cache for storing at least one RT, as described in connection with the examples of Figures 1 to 10. For example, as described in 1040 of Figure 10, GPU 1002 may allocate a subset of graphics surfaces associated with at least one RT to a buffer or cache based on the existence of space in a buffer or cache for storing at least one RT. Furthermore, step 1208 may be performed by the processing unit 120 in Figure 1.
[0104] At 1210, the GPU may remove a portion of a buffer or cache based on the absence of space in the buffer or cache for storing at least one RT, in order to allocate a subset of graphics surfaces associated with at least one RT; or determine whether an updated command buffer or an updated command list exists for the subset of graphics surfaces associated with at least one RT, as described in connection with the examples of Figures 1 to 10. For example, as described in 1050 of Figure 10, GPU 1002 may remove a portion of a buffer or cache based on the absence of space in the buffer or cache for storing at least one RT, in order to allocate a subset of graphics surfaces associated with at least one RT; or determine whether an updated command buffer or an updated command list exists for the subset of graphics surfaces associated with at least one RT. Furthermore, step 1210 may be performed by the processing unit 120 in Figure 1. If no updated command buffer or updated command list exists for a subset of graphics surfaces associated with at least one RT, the GPU may configure an updated command buffer or updated command list for the subset of graphics surfaces associated with at least one RT. If an updated command buffer or updated command list exists for the subset of graphics surfaces associated with at least one RT, the GPU may perform at least one of the following: identify the existence of a resource barrier for the updated command buffer or updated command list; identify the usage count for the subset of graphics surfaces associated with at least one RT; identify the surface resolution and surface format for the subset of graphics surfaces associated with at least one RT; or identify the surface level for the subset of graphics surfaces associated with at least one RT.
[0105] At 1212, the GPU may store the selected at least one RT in the buffer or cache, or avoid storing the selected at least one RT in the buffer or cache, as described in connection with the examples in Figures 1 to 10. For example, as described in 1060 of Figure 10, GPU 1002 may store the selected at least one RT in the buffer or cache, or avoid storing the selected at least one RT in the buffer or cache. Furthermore, step 1212 may be performed by processing unit 120 in Figure 1. In some aspects, the buffer may be a local buffer in the graphics processing unit (GPU) or a graphics memory in the GPU, and the cache may be a local cache in the GPU or a graphics cache in the GPU.
[0106] At 1214, the GPU may write one or more remaining RTs of the plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the selected at least one RT, as described in connection with the examples in Figures 1 to 10. For example, as described in 1070 of Figure 10, GPU 1002 may write one or more remaining RTs of the plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the selected at least one RT. Furthermore, step 1214 may be performed by processing unit 120 in Figure 1. The at least one memory may be system memory in the graphics processing unit (GPU), persistent memory in the GPU, persistent graphics memory (GMEM) in the GPU, persistent video memory in the GPU, or a persistent buffer in the GPU. In some aspects, the at least one memory may include a first memory and a second memory, and writing the one or more remaining RTs to the at least one memory may include writing the one or more remaining RTs of the plurality of RTs to the first memory and the second memory. Furthermore, the first memory may be system memory in the GPU, and the second memory may be persistent memory in the GPU.
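The split between 1212 and 1214 (the selected RTs stay in the buffer or cache, the remaining RTs are written to memory) can be sketched as follows. The function names, the use of plain strings as RT identifiers, and the dictionary standing in for the memories are assumptions made for illustration. The sketch also adopts one possible reading of "the first memory and the second memory", namely that each remaining RT is written to both.

```python
def partition_render_targets(render_targets, selected_names):
    """Split RTs into those kept on-chip and those written to memory.

    Order of the input list is preserved in both outputs.
    """
    selected = set(selected_names)
    kept = [rt for rt in render_targets if rt in selected]
    remaining = [rt for rt in render_targets if rt not in selected]
    return kept, remaining


def write_remaining(remaining, memories):
    """Record a write of every remaining RT to every target memory.

    `memories` maps a memory name to a list standing in for its contents
    (hypothetical model; real hardware would issue resolve or store
    operations instead).
    """
    for contents in memories.values():
        contents.extend(remaining)
    return memories
```

For example, with RTs `["albedo", "normal", "depth", "lighting"]` and `"depth"` selected, `partition_render_targets` keeps `["depth"]` on-chip and leaves the other three to be written out by `write_remaining`.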
[0107] At 1216, the GPU may output an indication of the selected at least one RT of the plurality of RTs, as described in connection with the examples in Figures 1 to 10. For example, as described in 1080 of Figure 10, GPU 1002 may output an indication of the selected at least one RT of the plurality of RTs. Furthermore, step 1216 may be performed by processing unit 120 in Figure 1. In some aspects, outputting the indication of the selected at least one RT may include sending the indication of the selected at least one RT of the plurality of RTs. Furthermore, outputting the indication of the selected at least one RT may include storing the indication of the selected at least one RT of the plurality of RTs.
[0108] In one configuration, a method or apparatus for graphics processing is provided. The apparatus may be a GPU, a graphics processor, or some other processor capable of performing graphics processing. In various aspects, the apparatus may be processing unit 120 within device 104, or may be some other hardware within device 104 or another device. The apparatus (e.g., processing unit 120) may include components for obtaining an indication of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used in the rendering process. The apparatus (e.g., processing unit 120) may also include components for selecting at least one RT from among the plurality of RTs based on the subset of graphics surfaces associated with the at least one RT. The apparatus (e.g., processing unit 120) may also include components for storing the selected at least one RT in a buffer or cache or avoiding storing the selected at least one RT in the buffer or cache. The apparatus (e.g., processing unit 120) may also include components for determining whether there is space in the buffer or cache for storing the at least one RT. The apparatus (e.g., processing unit 120) may further include components for allocating the subset of graphics surfaces associated with the at least one RT to the buffer or cache based on the existence of space in the buffer or cache for storing the at least one RT. The apparatus (e.g., processing unit 120) may further include components for removing a portion of the buffer or cache, based on the absence of space in the buffer or cache for storing the at least one RT, in order to allocate the subset of graphics surfaces associated with the at least one RT. The apparatus (e.g., processing unit 120) may further include components for determining whether an updated command buffer or an updated command list exists for the subset of graphics surfaces associated with the at least one RT. The apparatus (e.g., processing unit 120) may further include components for writing one or more remaining RTs of the plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the selected at least one RT. The apparatus (e.g., processing unit 120) may further include components for outputting an indication of the selected at least one RT of the plurality of RTs.
[0109] The subject matter described herein can be implemented to achieve one or more benefits or advantages. For example, the described graphics processing techniques can be used by a GPU, a graphics processor, or another processor capable of performing graphics processing to implement the persistent graphics memory techniques described herein. This can also be achieved at a lower cost compared to other graphics processing techniques. Furthermore, the graphics processing techniques described herein can improve or accelerate data processing or execution. In addition, the graphics processing techniques described herein can improve resource or data utilization and / or resource efficiency. Additionally, aspects of this disclosure can utilize persistent graphics memory techniques to improve memory bandwidth efficiency and / or increase processing speed at the GPU.
[0110] It should be understood that the specific order or hierarchy of the blocks in the disclosed processes / flowcharts is merely an illustration of example approaches. Based on design preferences, the specific order or hierarchy of the blocks in the processes / flowcharts can be rearranged. Furthermore, some blocks can be combined or omitted. The appended method claims present the elements of the various blocks in a sample order, but this does not imply limitation to the specific order or hierarchy presented.
[0111] The foregoing description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects. Therefore, the claims are not intended to be limited to the aspects shown herein, but should be given the full scope consistent with the language of the claims, wherein references to elements in the singular form, unless specifically stated otherwise, are not intended to mean “one and only one,” but rather “one or more.” The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
[0112] Unless otherwise specified, the term "some" means one or more, and unless the context dictates otherwise, the term "or" may be interpreted as "and / or". Combinations such as "at least one of A, B, or C", "one or more of A, B, or C", "at least one of A, B, and C", "one or more of A, B, and C", and "A, B, C, or any combination thereof" include any combination of A, B, and / or C, and may include multiple A, multiple B, or multiple C. Specifically, combinations such as "at least one of A, B, or C", "one or more of A, B, or C", "at least one of A, B, and C", "one or more of A, B, and C", and "A, B, C, or any combination thereof" may be only A, only B, only C, A and B, A and C, B and C, or A and B and C, wherein any such combination may include one or more members of A, B, or C. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known now or that later become known to those skilled in the art are expressly incorporated herein by reference and are intended to be covered by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public, whether or not such disclosure is explicitly recited in the claims. Terms such as “module,” “mechanism,” “element,” and “device” may not be a substitute for the word “component.” Therefore, no claim element is to be construed as a functional component unless the element is expressly recited using the phrase “component for.”
[0113] In one or more examples, the functionality described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, although the term "processing unit" is used throughout this disclosure, such a processing unit may be implemented in hardware, software, firmware, or any combination thereof. If any functionality, processing unit, technique, or other module described herein is implemented in software, then such functionality, processing unit, technique, or other module may be stored on or transmitted on a computer-readable medium as one or more instructions or code.
[0114] According to this disclosure, unless otherwise specified in the context, the term "or" may be understood as "and / or". Additionally, while phrases such as "one or more" or "at least one" may be used for some features disclosed herein but not others, features not using such language may be understood to have such implied meaning unless otherwise specified in the context.
[0115] In one or more examples, the functionality described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, although the term “processing unit” is used throughout this disclosure, such a processing unit may be implemented in hardware, software, firmware, or any combination thereof. If any functionality, processing unit, technique, or other module described herein is implemented in software, then the functionality, processing unit, technique, or other module described herein may be stored on or transmitted over a computer-readable medium as one or more instructions or code. A computer-readable medium may include computer data storage media and communication media, including any medium that facilitates the transfer of a computer program from one place to another. In this way, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium or (2) a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and / or data structures for implementing the techniques described herein. By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage devices, magnetic disk storage devices, or other magnetic storage devices. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, wherein disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media. A computer program product may include a computer-readable medium.
[0116] The code can be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), arithmetic logic units (ALUs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Therefore, the term "processor" as used herein can refer to any of the above-described structures or any other structure suitable for implementing the techniques described herein. Furthermore, these techniques can be fully implemented in one or more circuits or logic elements.
[0117] The techniques disclosed herein can be implemented in a wide variety of devices or apparatuses, including wireless mobile phones, integrated circuits (ICs), or IC sets (e.g., chipsets). Various components, modules, or units are described in this disclosure to emphasize functional aspects of a device configured to perform the disclosed techniques, but they do not necessarily need to be implemented by different hardware units. Rather, as described above, various units can be combined in any hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) combined with suitable software and / or firmware. Therefore, the term "processor" as used herein can refer to any of the above-described structures or any other structure suitable for implementing the techniques described herein. Furthermore, these techniques can be fully implemented in one or more circuit or logic elements.
[0118] The following aspects are merely illustrative and may be combined with other aspects or teachings described herein without limitation.
[0119] Aspect 1 is an apparatus for graphics processing, the apparatus comprising at least one processor coupled to a memory and configured, based at least in part on information stored in the memory, to: obtain an indication of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces for the rendering process; select at least one RT from the plurality of RTs based on the subset of graphics surfaces associated with the at least one RT; and store the selected at least one RT in a buffer or cache or avoid storing the selected at least one RT in the buffer or cache.
[0120] Aspect 2 is the apparatus according to aspect 1, wherein, in order to select the at least one RT, the at least one processor is configured individually or in any combination to select the at least one RT associated with the highest amount of memory traffic among the plurality of RTs.
[0121] Aspect 3 is the apparatus according to aspect 2, wherein the at least one processor is further configured, individually or in any combination, to determine whether there is space in the buffer or the cache for storing the at least one RT.
[0122] Aspect 4 is the apparatus according to aspect 3, wherein the at least one processor is further configured, individually or in any combination, to allocate the subset of the graphics surfaces associated with the at least one RT to the buffer or the cache based on the existence of space in the buffer or the cache for storing the at least one RT.
[0123] Aspect 5 is the apparatus according to aspect 3, wherein the at least one processor is further configured, individually or in any combination, to: remove a portion of the buffer or the cache based on the absence of space in the buffer or the cache for storing the at least one RT, in order to allocate the subset of the graphics surfaces associated with the at least one RT; or to determine whether there is an updated command buffer or an updated command list for the subset of the graphics surfaces associated with the at least one RT.
[0124] Aspect 6 is the apparatus according to aspect 5, wherein the updated command buffer or the updated command list does not exist for the subset of the graphics surfaces associated with the at least one RT, and wherein the at least one processor is further configured individually or in any combination to configure the updated command buffer or the updated command list for the subset of the graphics surfaces associated with the at least one RT.
[0125] Aspect 7 is the apparatus according to aspect 5, wherein the updated command buffer or the updated command list exists for the subset of the graphics surfaces associated with the at least one RT, and wherein the at least one processor is further configured individually or in any combination to perform at least one of the following: identifying the existence of a resource barrier for the updated command buffer or the updated command list; identifying a usage count for the subset of the graphics surfaces associated with the at least one RT; identifying the surface resolution and surface format for the subset of the graphics surfaces associated with the at least one RT; or identifying the surface level for the subset of the graphics surfaces associated with the at least one RT.
[0126] Aspect 8 is an apparatus according to any one of aspects 1 to 7, wherein the at least one processor is further configured, individually or in any combination, to write one or more remaining RTs of the plurality of RTs to at least one first memory, wherein the one or more remaining RTs do not include the selected at least one RT.
[0127] Aspect 9 is an apparatus according to aspect 8, wherein the at least one first memory includes a first memory and a second memory, and wherein, in order to write the one or more remaining RTs to the at least one first memory, the at least one processor is configured individually or in any combination to write the one or more remaining RTs of the plurality of RTs to the first memory and the second memory.
[0128] Aspect 10 is the apparatus according to aspect 9, wherein the first memory is system memory in a graphics processing unit (GPU), and wherein the second memory is persistent memory in the GPU.
[0129] Aspect 11 is the apparatus according to aspect 8, wherein the at least one first memory is a system memory in a graphics processing unit (GPU), a persistent memory in the GPU, a persistent graphics memory (GMEM) in the GPU, a persistent video memory in the GPU, or a persistent buffer in the GPU.
[0130] Aspect 12 is an apparatus according to any one of aspects 1 to 11, wherein the subset of the graphics surfaces associated with the at least one RT is a part of the subset of the graphics surfaces associated with the at least one RT, and wherein the part of the subset of the graphics surfaces associated with the at least one RT includes less than the entire subset of the graphics surfaces associated with the at least one RT.
[0131] Aspect 13 is the apparatus according to aspect 12, wherein, in order to select the at least one RT, the at least one processor is configured individually or in any combination to: determine whether there is space in the buffer or the cache for storing a portion of the subset of the graphics surfaces associated with the at least one RT.
[0132] Aspect 14 is an apparatus according to aspect 13, wherein the portion of the subset of the graphics surface corresponds to the y-coordinate of the subset of the graphics surface, and wherein, in order to determine whether space exists in the buffer or the cache, the at least one processor is configured individually or in any combination to determine whether space exists in the buffer or the cache for the portion of the subset of the graphics surface corresponding to the y-coordinate.
[0133] Aspect 15 is an apparatus according to any one of aspects 1 to 14, wherein the at least one RT is associated with the highest amount of memory traffic among the plurality of RTs, such that the at least one RT includes memory traffic higher than one or more of the remaining RTs among the plurality of RTs.
[0134] Aspect 16 is the apparatus according to aspect 15, wherein the memory traffic for the at least one RT is associated with at least one of the following: surface resolution, blending information, surface type, usage frequency, or surface level of the subset of the graphics surfaces associated with the at least one RT.
[0135] Aspect 17 is an apparatus according to any one of aspects 1 to 16, wherein the graphics surface is at least one of the following: a textured surface, a static texture, a read / write texture, a procedural texture, a vertex buffer, or a frame buffer.
[0136] Aspect 18 is an apparatus according to any one of aspects 1 to 17, the apparatus further comprising: at least one of an antenna or a transceiver coupled to the at least one processor, wherein, in order to obtain the indication of the plurality of RTs, the at least one processor is configured individually or in any combination to obtain the indication of the plurality of RTs via at least one of the antenna or the transceiver, and wherein the plurality of RTs is associated with a memory location for storing a set of rendered pixels for the rendering process.
[0137] Aspect 19 is an apparatus according to any one of aspects 1 to 18, wherein the buffer is a local buffer in a graphics processing unit (GPU) or a graphics memory in the GPU, and wherein the cache is a local cache in the GPU or a graphics cache in the GPU.
[0138] Aspect 20 is an apparatus according to any one of aspects 1 to 19, wherein the at least one processor is further configured, individually or in any combination, to output an indication of at least one selected RT among the plurality of RTs.
[0139] Aspect 21 is the apparatus according to aspect 20, wherein, in order to output the indication for at least one selected RT, the at least one processor is configured individually or in any combination to: send the indication for at least one selected RT of the plurality of RTs; or store the indication for at least one selected RT of the plurality of RTs.
[0140] Aspect 22 is an apparatus according to any one of aspects 1 to 21, the apparatus further comprising at least one of an antenna or a transceiver coupled to the at least one processor.
[0141] Aspect 23 is a method of graphics processing for implementing any one of aspects 1 to 21.
[0142] Aspect 24 is a device for graphics processing, the device including components for implementing any one of aspects 1 to 21.
[0143] Aspect 25 is a computer-readable medium (e.g., a non-transitory computer-readable medium) storing computer-executable code that, when executed by at least one processor, causes the at least one processor to implement any one of aspects 1 to 21.
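Two of the ideas in the aspects above lend themselves to a short sketch: selecting the RT with the highest memory traffic (aspects 2, 15, and 16) and checking how much of a surface fits when only a portion along the y-coordinate is stored (aspects 12 to 14). The code below is illustrative only. The `RenderTarget` fields, the traffic formula (pixels times bytes per pixel, doubled when blending, times uses per frame), and the choice to count whole rows starting from the top are assumptions, not formulas taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderTarget:
    """Hypothetical description of a render target's surface."""
    name: str
    width: int
    height: int
    bytes_per_pixel: int
    blended: bool        # assumed: blending implies a read plus a write
    uses_per_frame: int  # assumed stand-in for usage frequency


def estimate_traffic(rt: RenderTarget) -> int:
    """Rough bytes moved per frame (assumed formula, for illustration)."""
    accesses = 2 if rt.blended else 1
    return rt.width * rt.height * rt.bytes_per_pixel * accesses * rt.uses_per_frame


def select_highest_traffic(rts):
    """Pick the RT with the largest estimated memory traffic."""
    return max(rts, key=estimate_traffic)


def rows_that_fit(rt: RenderTarget, free_bytes: int) -> int:
    """Number of full rows of the surface that fit in `free_bytes`.

    Models storing only a y-range of the surface when the whole surface
    does not fit; the result is capped at the surface height.
    """
    row_bytes = rt.width * rt.bytes_per_pixel
    return min(rt.height, free_bytes // row_bytes)
```

Under these assumptions, a blended 1920x1080 surface at 8 bytes per pixel used twice per frame outranks an unblended 4-byte depth surface used three times, and a 1,000,000-byte budget holds 130 rows of a 1920-pixel-wide, 4-byte-per-pixel surface.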
Claims
1. An apparatus for graphics processing, the apparatus comprising: at least one memory; and at least one processor, coupled to the at least one memory and configured, individually or in any combination, based at least in part on information stored in the at least one memory, to: obtain an indication of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process; select at least one RT from the plurality of RTs based on the subset of the graphics surfaces associated with the at least one RT; and store the selected at least one RT in a buffer or cache, or avoid storing the selected at least one RT in the buffer or cache.
2. The apparatus of claim 1, wherein, in order to select the at least one RT, the at least one processor is configured, individually or in any combination, to: select the at least one RT associated with the highest amount of memory traffic among the plurality of RTs.
3. The apparatus of claim 2, wherein the at least one processor is further configured, individually or in any combination, to: determine whether there is space in the buffer or the cache for storing the at least one RT.
4. The apparatus of claim 3, wherein the at least one processor is further configured, individually or in any combination, to: allocate the subset of the graphics surfaces associated with the at least one RT to the buffer or the cache based on the existence of space in the buffer or the cache for storing the at least one RT.
5. The apparatus of claim 3, wherein the at least one processor is further configured, individually or in any combination, to: remove a portion of the buffer or the cache, based on the absence of space in the buffer or the cache for storing the at least one RT, in order to allocate the subset of the graphics surfaces associated with the at least one RT; or determine whether an updated command buffer or an updated command list exists for the subset of the graphics surfaces associated with the at least one RT.
6. The apparatus of claim 5, wherein the updated command buffer or the updated command list does not exist for the subset of the graphics surfaces associated with the at least one RT, and wherein the at least one processor is further configured, individually or in any combination, to: configure the updated command buffer or the updated command list for the subset of the graphics surfaces associated with the at least one RT.
7. The apparatus of claim 5, wherein the updated command buffer or the updated command list exists for the subset of the graphics surfaces associated with the at least one RT, and wherein the at least one processor is further configured, individually or in any combination, to perform at least one of the following: identify the existence of a resource barrier for the updated command buffer or the updated command list; identify a usage count for the subset of the graphics surfaces associated with the at least one RT; identify a surface resolution and a surface format for the subset of the graphics surfaces associated with the at least one RT; or identify a surface level for the subset of the graphics surfaces associated with the at least one RT.
8. The apparatus of claim 1, wherein the at least one processor is further configured, individually or in any combination, to: write one or more remaining RTs of the plurality of RTs to at least one first memory, wherein the one or more remaining RTs do not include the selected at least one RT.
9. The apparatus of claim 8, wherein the at least one first memory comprises a first memory and a second memory, and wherein, in order to write the one or more remaining RTs to the at least one first memory, the at least one processor is configured individually or in any combination to write the one or more remaining RTs of the plurality of RTs to the first memory and the second memory.
10. The apparatus of claim 9, wherein the first memory is system memory in a graphics processing unit (GPU), and wherein the second memory is persistent memory in the GPU.
11. The apparatus of claim 8, wherein the at least one first memory is a system memory in a graphics processing unit (GPU), a persistent memory in the GPU, a persistent graphics memory (GMEM) in the GPU, a persistent video memory in the GPU, or a persistent buffer in the GPU.
12. The apparatus of claim 1, wherein the subset of the graphics surfaces associated with the at least one RT is a part of the subset of the graphics surfaces associated with the at least one RT, and wherein the part of the subset of the graphics surfaces associated with the at least one RT includes less than the entire subset of the graphics surfaces associated with the at least one RT.
13. The apparatus of claim 12, wherein, in order to select the at least one RT, the at least one processor is configured individually or in any combination to: determine whether there is space in the buffer or the cache for storing a portion of the subset of the graphics surfaces associated with the at least one RT.
14. The apparatus of claim 13, wherein the portion of the subset of the graphics surface corresponds to the y-coordinate of the subset of the graphics surface, and wherein, in order to determine whether space exists in the buffer or the cache, the at least one processor is configured individually or in any combination to: determine whether space exists in the buffer or the cache for the portion of the subset of the graphics surface corresponding to the y-coordinate.
15. The apparatus of claim 1, wherein the at least one RT is associated with the highest amount of memory traffic among the plurality of RTs, such that the at least one RT includes memory traffic higher than one or more of the remaining RTs among the plurality of RTs.
16. The apparatus of claim 15, wherein the memory traffic for the at least one RT is associated with at least one of the following: surface resolution, blending information, surface type, usage frequency, or surface level for the subset of the graphics surfaces associated with the at least one RT.
17. The apparatus of claim 1, wherein the graphics surface is at least one of: a textured surface, a static texture, a read / write texture, a procedural texture, a vertex buffer, or a frame buffer.
18. The apparatus of claim 1, further comprising: at least one of an antenna or a transceiver coupled to the at least one processor, wherein, in order to obtain the indication of the plurality of RTs, the at least one processor is configured, individually or in any combination, to obtain the indication of the plurality of RTs via at least one of the antenna or the transceiver, and wherein the plurality of RTs is associated with a memory location for storing a set of rendered pixels for the rendering process.
19. The apparatus of claim 1, wherein the buffer is a local buffer in a graphics processing unit (GPU) or a graphics memory in the GPU, and wherein the cache is a local cache in the GPU or a graphics cache in the GPU.
20. The apparatus of claim 1, wherein the at least one processor is further configured, individually or in any combination, to: output an indication of the selected at least one RT of the plurality of RTs.
21. The apparatus of claim 20, wherein, in order to output the indication of the selected at least one RT, the at least one processor is configured, individually or in any combination, to: send the indication of the selected at least one RT of the plurality of RTs; or store the indication of the selected at least one RT of the plurality of RTs.
22. A method for graphics processing, the method comprising: obtaining an indication of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process; selecting at least one RT from the plurality of RTs based on the subset of the graphics surfaces associated with the at least one RT; and storing the selected at least one RT in a buffer or cache, or avoiding storing the selected at least one RT in the buffer or cache.
23. The method of claim 22, wherein selecting the at least one RT comprises: selecting the at least one RT associated with the highest amount of memory traffic among the plurality of RTs.
24. The method of claim 23, further comprising: determining whether there is space in the buffer or the cache for storing the at least one RT; and allocating the subset of the graphics surfaces associated with the at least one RT to the buffer or the cache based on the existence of space in the buffer or the cache for storing the at least one RT.
25. The method of claim 23, further comprising: determining whether there is space in the buffer or the cache for storing the at least one RT; and removing a portion of the buffer or the cache, based on the absence of space in the buffer or the cache for storing the at least one RT, in order to allocate the subset of the graphics surfaces associated with the at least one RT; or determining whether an updated command buffer or an updated command list exists for the subset of the graphics surfaces associated with the at least one RT.
26. The method of claim 22, further comprising: writing one or more remaining RTs of the plurality of RTs to at least one memory, wherein the one or more remaining RTs do not include the selected at least one RT, wherein the at least one memory includes a first memory and a second memory, and wherein writing the one or more remaining RTs to the at least one memory includes: writing the one or more remaining RTs of the plurality of RTs to the first memory and the second memory.
27. The method of claim 22, wherein the subset of the graphics surfaces associated with the at least one RT is a portion of the subset of the graphics surfaces associated with the at least one RT, and wherein the portion of the subset of the graphics surfaces associated with the at least one RT includes less than the entire subset of the graphics surfaces associated with the at least one RT.
28. The method of claim 27, wherein selecting the at least one RT comprises: determining whether there is space in the buffer or the cache for storing the portion of the subset of the graphics surfaces associated with the at least one RT, wherein the portion of the subset of the graphics surfaces corresponds to a y-coordinate of the subset of the graphics surfaces, and wherein determining whether there is space in the buffer or the cache includes: determining whether there is space in the buffer or the cache for the portion of the subset of the graphics surfaces corresponding to the y-coordinate.
29. An apparatus for graphics processing, the apparatus comprising: a component for obtaining indications of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process; a component for selecting at least one RT from the plurality of RTs based on the subset of the graphics surfaces associated with the at least one RT; and a component for storing the at least one selected RT in a buffer or a cache, or for refraining from storing the at least one selected RT in the buffer or the cache.
30. A computer-readable medium storing computer-executable code for graphics processing, the code, when executed by at least one processor, causing the at least one processor to: obtain indications of a plurality of rendering targets (RTs) associated with a rendering process, wherein each of the plurality of RTs is associated with a subset of graphics surfaces used for the rendering process; select at least one RT from the plurality of RTs based on the subset of the graphics surfaces associated with the at least one RT; and store the at least one selected RT in a buffer or a cache, or refrain from storing the at least one selected RT in the buffer or the cache.
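The following Python sketch is offered only as a reading aid for the selection and placement steps recited in claims 22 to 28; it is not taken from the application and does not limit the claims. All names (RenderTarget, PersistentMemory, select_rt, place, rows_that_fit), the byte-count traffic metric, and the oldest-first removal order are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RenderTarget:
    name: str
    surface_bytes: int  # size of the associated subset of graphics surfaces
    traffic_bytes: int  # assumed per-RT memory traffic estimate


@dataclass
class PersistentMemory:
    """Hypothetical model of the buffer or cache as a byte budget."""
    capacity: int
    resident: dict = field(default_factory=dict)  # RT name -> bytes held

    def free(self) -> int:
        return self.capacity - sum(self.resident.values())

    def remove_until(self, needed: int) -> bool:
        # Assumed policy: drop the oldest resident entries first.
        for name in list(self.resident):
            if self.free() >= needed:
                break
            del self.resident[name]
        return self.free() >= needed


def select_rt(rts):
    # Claim 23: pick the RT with the highest amount of memory traffic.
    return max(rts, key=lambda rt: rt.traffic_bytes)


def place(rts, mem, allow_removal=True):
    """Return (selected name, stored flag, names of the remaining RTs)."""
    selected = select_rt(rts)
    need = selected.surface_bytes
    # Claims 24 and 25: check for space, optionally free a portion first.
    fits = mem.free() >= need or (
        allow_removal and need <= mem.capacity and mem.remove_until(need)
    )
    if fits:
        mem.resident[selected.name] = need
    # Claim 26: the remaining RTs are written to other memory.
    remaining = [rt.name for rt in rts if rt is not selected]
    return selected.name, fits, remaining


def rows_that_fit(free_bytes, row_bytes, height):
    # Claims 27 and 28: keep only the rows (y range) that fit in the space.
    return min(height, free_bytes // row_bytes)


rts = [
    RenderTarget("albedo", 4_000, 10_000),
    RenderTarget("depth", 2_000, 30_000),
    RenderTarget("normal", 4_000, 20_000),
]
mem = PersistentMemory(capacity=3_000)
result = place(rts, mem)
```

In this sketch, place returns a stored flag of False when the selected RT neither fits nor can be made to fit, which corresponds to the "refrain from storing" branch of claim 22. A real driver would likely weigh more than a single traffic number when ranking RTs.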