Context merge with global events and invalid draw merge
By merging invalid draw calls and global events and sharing the context slots of the shader system, the problem of low resource utilization efficiency is solved, and more efficient resource management is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2024-08-28
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, global events and invalid drawing occupy context slots in the shader system, resulting in low resource utilization efficiency.
Invalid draw calls and global events are merged so that they share context slots in the shader system, which reduces the total amount of resources required. A high-level sequencer (HLSQ) builds tables that are used to skip state programming and to reuse fragment shader context slots of the shader processor.
This improves the resource utilization efficiency of context slots in the shader system and reduces the amount of resources required to process workloads.
Figure CN122249829A_ABST
Abstract
Description
Cross-reference to related applications
[0001] This application claims the benefit of Indian Patent Application Serial No. 202321061588, filed on September 13, 2023, entitled “CONTEXT MERGE WITH GLOBAL EVENT AND DEAD DRAW MERGE”, the entire contents of which are expressly incorporated herein by reference.
Technical Field
[0002] This disclosure relates generally to processing systems, and more specifically, to one or more techniques for graphics processing.
Background Technology
[0003] Computing devices typically perform graphics and / or display processing (e.g., utilizing a graphics processing unit (GPU), a central processing unit (CPU), a display processor, etc.) to render and display visual content. Such computing devices can include, for example, computer workstations, mobile phones (such as smartphones), embedded systems, personal computers, tablet computers, and video game consoles. A GPU is configured to execute a graphics processing pipeline that includes one or more processing stages that operate together to execute graphics processing commands and output frames. A CPU controls the operation of a GPU by issuing one or more graphics processing commands to it. Modern CPUs are typically capable of executing multiple applications concurrently, each of which may require the use of a GPU during execution. A display processor can be configured to convert digital information received from the CPU into analog values and can issue commands to a display panel to display visual content. Devices that provide content for visual presentation on a display can utilize a CPU, GPU, and / or display processor.
[0004] Currently, there is a need to improve graphics processing. For example, a context slot can be a resource group that includes context registers, constant random access memory (RAM), and other resources in the shader system. Global events and invalid (dead) draws can consume context slots in various GPU blocks, including the shader system. Therefore, there is a need to improve the resource utilization efficiency of context slots in the shader system.
Summary of the Invention
[0005] The following is a simplified summary of one or more aspects to provide a basic understanding of these aspects. This summary is not a broad overview of all anticipated aspects, nor is it intended to identify key or essential elements of all aspects, nor to describe the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that follows.
[0006] In one aspect of this disclosure, a method, computer-readable medium, and apparatus are provided. The apparatus includes: a memory; and at least one processor coupled to the memory, and at least partially based on information stored in the memory, the at least one processor being configured to: obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. The apparatus may be further configured to detect an invalid subset of draw calls and a subset of valid draw calls in the set of draw calls. The apparatus may be further configured to store information about the invalid subset of draw calls and information about the set of global events in a first context slot in a set of context slots. The apparatus may be further configured to store information about valid draw calls in the subset of valid draw calls, as well as the stored information about the invalid subset of draw calls and the stored information about the set of global events, in the first context slot. The apparatus may be further configured to process a set of workloads for valid draw calls based on the stored information about valid draw calls, the stored information about the subset of invalid draw calls, and the stored information about the set of global events.
[0007] To achieve the foregoing and related objectives, one or more aspects include the features fully described below and specifically pointed out in the claims. The following description and drawings set forth some exemplary features of one or more aspects in detail. However, these features indicate only some of the various ways in which the principles of the various aspects may be employed, and this description is intended to include all such aspects and their equivalents.
Brief Description of the Drawings
[0008] Figure 1 is a block diagram illustrating an example content generation system according to one or more techniques of this disclosure.
[0009] Figure 2 illustrates an example graphics processor according to one or more techniques of this disclosure.
[0010] Figure 3 illustrates an example image or surface according to one or more techniques of this disclosure.
[0011] Figure 4 is a diagram illustrating an example technique for providing shared constants according to one or more techniques of this disclosure.
[0012] Figure 5 is a diagram illustrating an example workflow for processing draw calls and global events according to one or more techniques of this disclosure.
[0013] Figure 6 is a diagram illustrating a context slot for storing draw calls and global events according to one or more techniques of this disclosure.
[0014] Figure 7 is a communication flow diagram illustrating example communication between a graphics processor, a CPU, and a memory according to one or more techniques of this disclosure.
[0015] Figure 8 is a flowchart of an example method of graphics processing according to one or more techniques of this disclosure.
[0016] Figure 9 is a flowchart of an example method of graphics processing according to one or more techniques of this disclosure.
Detailed Description
[0017] Various aspects of the systems, apparatuses, computer program products, and methods are described more fully below with reference to the accompanying drawings. However, this disclosure may be embodied in many different forms and should not be construed as limited to any particular structure or function presented throughout this disclosure. Rather, these aspects are provided so that this disclosure will be thorough and complete, and will fully convey the scope of this disclosure to those skilled in the art. Based on the teachings herein, those skilled in the art will understand that the scope of this disclosure is intended to cover any aspect of the systems, apparatuses, computer program products, and methods disclosed herein, whether implemented independently of or in combination with other aspects of this disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. Furthermore, the scope of this disclosure is intended to cover such an apparatus or method practiced using other structure, functionality, or structure and functionality in addition to or other than the various aspects of the disclosure set forth herein. Any aspect disclosed herein may be embodied by one or more elements of a claim.
[0018] Although various aspects are described herein, many variations and substitutions of these aspects fall within the scope of this disclosure. While some potential benefits and advantages of the aspects of this disclosure are mentioned, the scope of this disclosure is not intended to be limited to a particular benefit, use, or objective. Rather, the aspects of this disclosure are intended to be broadly applicable to different wireless technologies, system configurations, processing systems, networks, and transmission protocols, some of which are illustrated by way of example in the accompanying drawings and the description below. The detailed description and drawings are merely illustrative and not limiting of this disclosure, and the scope of this disclosure is defined by the appended claims and their equivalents.
[0019] Several aspects are presented with reference to various apparatuses and methods. These apparatuses and methods are described in detail and illustrated in the accompanying drawings by various blocks, components, circuits, processes, algorithms, etc. (collectively referred to as "elements"). These elements can be implemented using electronic hardware, computer software, or any combination thereof. Whether these elements are implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system.
[0020] For example, an element, any part of an element, or any combination of elements can be implemented as a “processing system” including one or more processors (which may also be referred to as processing units). Examples of processors include microprocessors, microcontrollers, graphics processing units (GPUs), general-purpose GPUs (GPGPUs), central processing units (CPUs), application processors, digital signal processors (DSPs), reduced instruction set computing (RISC) processors, systems-on-chip (SoCs), baseband processors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionalities described throughout this disclosure. One or more processors in the processing system can execute software. Whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise, software is to be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software components, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, etc.
[0021] The term "application" can refer to software. As described herein, one or more technologies can refer to an application (e.g., software) configured to perform one or more functions. In such examples, the application may be stored in memory (e.g., on-chip memory of a processor, system memory, or any other memory). Hardware described herein, such as a processor, may be configured to execute the application. For example, an application may be described as including code that, when executed by the hardware, causes the hardware to perform one or more technologies described herein. As an example, the hardware may access and execute code accessed from memory to perform one or more technologies described herein. In some examples, components are identified in this disclosure. In such examples, a component may be hardware, software, or a combination thereof. Each component may be a separate component or a subcomponent of a single component.
[0022] In one or more examples described herein, the described functionality can be implemented in hardware, software, or any combination thereof. If implemented in software, the functionality can be stored or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer storage media. Storage media can be any available medium accessible by a computer. By way of example, and not limitation, such computer-readable media can include random access memory (RAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), optical disc storage devices, magnetic disk storage devices, other magnetic storage devices, combinations of computer-readable media of the types described above, or any other medium that can be used to store computer-executable code in the form of instructions or data structures accessible by a computer.
[0023] As used herein, instances of the term "content" may refer to "graphic content," "image," etc., regardless of whether the term is used as an adjective, noun, or other part of speech. In some examples, as used herein, the term "graphic content" may refer to content produced by one or more processes in a graphics processing pipeline. In other examples, as used herein, the term "graphic content" may refer to content produced by a processing unit configured to perform graphics processing. In yet another example, as used herein, the term "graphic content" may refer to content produced by a graphics processing unit.
[0024] Context slots can be resource groups that include context registers, constant random access memory (RAM), and other resources in the shader system. Global events and invalid draws can consume context slots across various GPU blocks, including the shader system. Therefore, there is a need to improve the resource utilization efficiency of context slots in the shader system. The aspects presented herein enable invalid draw calls and global events to be merged within the context slots used for the shader system, reducing the total amount of resources required to process workloads at the shader. Some aspects can be based on accumulating all consecutive global events and invalid draw calls and merging them with the subsequent valid draw, such that a valid draw call, one or more global events, and one or more invalid draw calls share a fragment shader context slot in the shader system. Invalid draw calls can be identified from full draw rejection information provided by various units, such as a low-resolution Z (LRZ) unit or other units. This information can be fed to the high-level sequencer (HLSQ), where tables can be built for skipping state programming and reusing fragment shader context slots in the shader processor (SP). The vertex fetcher and decoder (VFD) can also be used to provide invalid or valid information to the SP vertex shader context slot.
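The accumulate-and-merge behavior described in paragraph [0024] can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware mechanism of the disclosure: the names `Item`, `ContextSlot`, and `merge_into_slots` are hypothetical, and the handling of trailing events that have no following valid draw is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One entry of the incoming stream (hypothetical model)."""
    kind: str            # "draw" or "event"
    name: str
    valid: bool = True   # for draws: False models a fully rejected (invalid) draw


@dataclass
class ContextSlot:
    """One context slot shared by merged items and at most one valid draw."""
    merged: List[str] = field(default_factory=list)  # global events and invalid draws
    valid_draw: Optional[str] = None


def merge_into_slots(stream: List[Item]) -> List[ContextSlot]:
    """Accumulate consecutive global events and invalid draws, then let them
    share a slot with the next valid draw."""
    slots: List[ContextSlot] = []
    pending: List[str] = []
    for item in stream:
        if item.kind == "draw" and item.valid:
            # The valid draw takes a slot and absorbs everything accumulated so far.
            slots.append(ContextSlot(merged=pending, valid_draw=item.name))
            pending = []
        else:
            pending.append(item.name)
    if pending:
        # Assumption: leftovers with no following valid draw get one slot of their own.
        slots.append(ContextSlot(merged=pending))
    return slots


stream = [
    Item("event", "E0"),
    Item("draw", "D0", valid=False),
    Item("draw", "D1"),
    Item("event", "E1"),
    Item("draw", "D2"),
    Item("draw", "D3", valid=False),
]
slots = merge_into_slots(stream)
```

In this model, six stream entries occupy three slots instead of six, which is the kind of saving the paragraph attributes to merging.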
[0025] The examples described herein may relate to the use and functionality of a graphics processing unit (GPU). As used herein, a GPU can be any type of graphics processor, and a graphics processor can be any type of processor designed or configured to process graphical content. For example, a graphics processor or GPU can be a dedicated circuit designed to process graphical content. As an additional example, a graphics processor or GPU can be a general-purpose processor configured to process graphical content.
[0026] Figure 1 is a block diagram illustrating an example content generation system 100 configured to implement one or more techniques of this disclosure. The content generation system 100 includes a device 104. Device 104 may include one or more components or circuitry for performing the various functions described herein. In some examples, one or more components of device 104 may be components of a system-on-chip (SoC). Device 104 may include one or more components configured to perform one or more techniques of this disclosure. In the illustrated example, device 104 may include a processing unit 120, a content encoder / decoder 122, and a system memory 124. In some aspects, device 104 may include a number of further components (e.g., a communication interface 126, a transceiver 132, a receiver 128, a transmitter 130, a display processor 127, and one or more displays 131). Display 131 may refer to one or more displays 131. For example, display 131 may include a single display or multiple displays, which may include a first display and a second display. The first display may be a left-eye display, and the second display may be a right-eye display. In some examples, the first and second displays may receive different frames for presentation. In other examples, the first and second displays may receive the same frames for presentation. In yet another example, the results of graphics processing may not be displayed on the device; for example, the first and second displays may not receive any frames for presentation. Instead, the frames or graphics processing results may be transferred to another device. In some aspects, this can be referred to as split rendering.
[0027] Processing unit 120 may include internal memory 121. Processing unit 120 may be configured to perform graphics processing using graphics processing pipeline 107. Content encoder / decoder 122 may include internal memory 123. In some examples, device 104 may include a processor configured to perform one or more display processing techniques on one or more frames generated by processing unit 120, and then display those frames through one or more displays 131. Although the processor in example content generation system 100 is configured as display processor 127, it should be understood that display processor 127 is one example of a processor and other types of processors, controllers, etc., may be used instead of display processor 127. Display processor 127 may be configured to perform display processing. For example, display processor 127 may be configured to perform one or more display processing techniques on one or more frames generated by processing unit 120. One or more displays 131 may be configured to display or otherwise present the frames processed by display processor 127. In some examples, one or more displays 131 may include one or more of the following: liquid crystal display (LCD), plasma display, organic light-emitting diode (OLED) display, projection display device, augmented reality display device, virtual reality display device, head-mounted display, or any other type of display device.
[0028] Memory (such as system memory 124) external to processing unit 120 and content encoder / decoder 122 may be accessible to processing unit 120 and content encoder / decoder 122. For example, processing unit 120 and content encoder / decoder 122 may be configured to read from and / or write to external memory (such as system memory 124). Processing unit 120 may be communicatively coupled to system memory 124 via a bus. In some examples, processing unit 120 and content encoder / decoder 122 may be communicatively coupled to internal memory 121 via the bus or via a different connection.
[0029] Content encoder / decoder 122 can be configured to receive graphic content from any source, such as system memory 124 and / or communication interface 126. System memory 124 can be configured to store received encoded or decoded graphic content. Content encoder / decoder 122 can be configured to receive encoded or decoded graphic content from system memory 124 and / or communication interface 126, for example, in the form of encoded pixel data. Content encoder / decoder 122 can be configured to encode or decode any graphic content.
[0030] Internal memory 121 or system memory 124 may include one or more volatile or non-volatile memories or storage devices. In some examples, internal memory 121 or system memory 124 may include RAM, static random access memory (SRAM), dynamic random access memory (DRAM), erasable programmable ROM (EPROM), EEPROM, flash memory, magnetic data media or optical storage media, or any other type of memory. According to some examples, internal memory 121 or system memory 124 may be a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or propagating signal. However, the term "non-transitory" should not be construed as meaning that internal memory 121 or system memory 124 is not removable or that its contents are static. For example, system memory 124 may be removed from device 104 and moved to another device. Alternatively, system memory 124 may not be removable from device 104.
[0031] Processing unit 120 may be a CPU, GPU, GPGPU, or any other processing unit configured to perform graphics processing. In some examples, processing unit 120 may be integrated into the motherboard of device 104. In other examples, processing unit 120 may reside on a graphics card mounted in a port on the motherboard of device 104, or may otherwise be incorporated into a peripheral device configured to interoperate with device 104. Processing unit 120 may include one or more processors, such as one or more microprocessors, GPUs, ASICs, FPGAs, arithmetic logic units (ALUs), DSPs, discrete logic components, software, hardware, firmware, other equivalent integrated or discrete logic circuits, or any combination thereof. If the technology is partially implemented in software, processing unit 120 may store instructions for software in a suitable non-transitory computer-readable storage medium (e.g., internal memory 121) and may use one or more processors to execute instructions in hardware to perform the technology of this disclosure. Any of the foregoing (including hardware, software, combinations of hardware and software, etc.) may be considered as one or more processors.
[0032] The content encoder / decoder 122 can be any processing unit configured to perform content decoding. In some examples, the content encoder / decoder 122 may be integrated into the motherboard of device 104. The content encoder / decoder 122 may include one or more processors, such as one or more microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), arithmetic logic units (ALUs), digital signal processors (DSPs), video processors, discrete logic components, software, hardware, firmware, other equivalent integrated or discrete logic circuits, or any combination thereof. If the technology is partially implemented in software, the content encoder / decoder 122 may store instructions for software in a suitable non-transitory computer-readable storage medium (e.g., internal memory 123) and may use one or more processors to execute instructions in hardware to perform the technology of this disclosure. Any of the foregoing (including hardware, software, combinations of hardware and software, etc.) can be considered as one or more processors.
[0033] In some aspects, the content generation system 100 may include a communication interface 126. The communication interface 126 may include a receiver 128 and a transmitter 130. The receiver 128 may be configured to perform any of the receiving functions described herein with respect to device 104. Additionally, the receiver 128 may be configured to receive information from another device, such as eye or head positioning information, rendering commands, and / or location information. The transmitter 130 may be configured to perform any of the transmitting functions described herein with respect to device 104. For example, the transmitter 130 may be configured to transmit information to another device, which may include a request for content. The receiver 128 and the transmitter 130 may be combined to form a transceiver 132. In such an example, the transceiver 132 may be configured to perform any of the receiving and / or transmitting functions described herein with respect to device 104.
[0034] Referring again to Figure 1, in some aspects, processing unit 120 may include a global event and invalid draw merger 198, configured to obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. In some aspects, the global event and invalid draw merger 198 may be configured to detect an invalid subset and a valid subset of draw calls within the set of draw calls. In some aspects, the global event and invalid draw merger 198 may be configured to store information about the invalid draw call subset and information about the global event set in a first context slot within a set of context slots. In some aspects, the global event and invalid draw merger 198 may be configured to store information about valid draw calls within the valid subset of draw calls, as well as the stored information about the invalid draw call subset and the stored information about the global event set, in the first context slot. In some aspects, the global event and invalid draw merger 198 may be configured to process a set of workloads for the valid draw calls based on the stored information about the valid draw calls, the stored information about the invalid draw call subset, and the stored information about the global event set. Although the following description may focus on graphics processing, the concepts described herein are applicable to other similar processing techniques.
[0035] Devices such as device 104 can refer to any device, apparatus, or system configured to perform one or more of the technologies described herein. For example, a device can be a server, base station, user equipment, client device, station, access point, computer (such as a personal computer, desktop computer, laptop computer, tablet computer, computer workstation, or mainframe computer), end product, apparatus, telephone, smartphone, server, video game platform or console, handheld device (such as a portable video game device or personal digital assistant (PDA)), wearable computing device (such as a smartwatch, augmented reality device, or virtual reality device), non-wearable device, display or display device, television, set-top box, intermediate network device, digital media player, video streaming device, content streaming device, in-vehicle computer, any mobile device, any device configured to generate graphical content, or any device configured to perform one or more of the technologies described herein. The processes described herein may be described as being performed by a specific component (e.g., GPU), but in other embodiments, other components (e.g., CPU) consistent with the disclosed embodiments may be used to perform them.
[0036] A graphics processing unit (GPU) can process various types of data or data packets within the GPU pipeline. For example, in some aspects, a GPU can process two types of data or data packets, such as context register packets and draw call data. Context register packets can be a set of global state information, such as information about global registers, shaders, or constant data, which can adjust how the graphics context will be processed. For example, a context register packet may include information about the color format. In some aspects of a context register packet, there may be one or more bits indicating which workload belongs to the context register. Additionally, multiple functions or programs can run simultaneously and / or in parallel. For example, a function or program may describe an operation, such as a color mode or color format. Therefore, context registers can define various states of the GPU.
[0037] Context states can be used to determine how individual processing units (e.g., VFDs, vertex shaders (VSs), shader processors, or geometry processors) operate and / or in which mode a processing unit operates. To do this, the graphics processor uses context registers and programming data. In some aspects, the graphics processor can generate workloads in the pipeline based on the context register definitions of modes or states, such as vertex or pixel workloads. Some processing units (e.g., VFDs) can use these states to determine certain functions, such as how to aggregate vertices. Because these modes or states can change, the graphics processor may need to modify the corresponding context. Additionally, the workload corresponding to a mode or state may follow the changed mode or state.
[0038] Figure 2 illustrates an example graphics processor 200 according to one or more techniques of this disclosure. As shown in Figure 2, the graphics processor 200 includes a command processor (CP) 210, a draw call group 212, a VFD 220, a VS 222, a vertex cache (VPC) 224, a triangle setup engine (TSE) 226, a rasterizer (RAS) 228, a Z-process engine (ZPE) 230, a pixel interpolator (PI) 232, a fragment shader (FS) 234, a rendering backend (RB) 236, an L2 cache (UCHE) 238, and system memory 240. Although Figure 2 shows the graphics processor 200 as including processing units 220 to 238, the graphics processor 200 may include a number of additional processing units. Additionally, processing units 220 to 238 are merely examples, and any combination or order of processing units may be used by the graphics processor in accordance with this disclosure. The graphics processor 200 also includes a command buffer 250, a context register group 260, and a context state 261.
[0039] As shown in Figure 2, the graphics processor can use a CP (e.g., CP 210) or a hardware accelerator to parse the command buffer into context register groups (e.g., context register group 260) and / or draw call data groups (e.g., draw call group 212). Subsequently, CP 210 can send the context register group 260 or the draw call group 212 to a processing unit or block within the graphics processor via separate paths. Furthermore, the command buffer 250 can alternate between different states of the context registers and draw calls. For example, the command buffer can hold the following information in order: the context registers of context N, the draw calls of context N, the context registers of context N+1, and the draw calls of context N+1.
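The split described in paragraph [0039] can be sketched as follows. This is a minimal model under stated assumptions: the tuple layout `(kind, context_id, payload)`, the kind strings `"ctx_reg"` and `"draw"`, and the function name `parse_command_buffer` are hypothetical and do not reflect an actual command buffer format.

```python
def parse_command_buffer(command_buffer):
    """Split an interleaved command buffer into two separate paths:
    context register entries and draw call entries.

    Each entry is assumed to be a (kind, context_id, payload) tuple.
    """
    context_registers = []  # entries destined for the context register path
    draw_calls = []         # entries destined for the draw call path
    for kind, context_id, payload in command_buffer:
        if kind == "ctx_reg":
            context_registers.append((context_id, payload))
        elif kind == "draw":
            draw_calls.append((context_id, payload))
        else:
            raise ValueError(f"unknown command kind: {kind!r}")
    return context_registers, draw_calls


# Interleaved layout: registers of context N, draws of context N,
# registers of context N+1, draws of context N+1.
command_buffer = [
    ("ctx_reg", 0, {"color_format": "RGBA8"}),
    ("draw", 0, "draw_a"),
    ("ctx_reg", 1, {"color_format": "RGB565"}),
    ("draw", 1, "draw_b"),
]
registers, draws = parse_command_buffer(command_buffer)
```

Keeping the context identifier on each entry lets a downstream block match a draw call with the register state it belongs to, even though the two travel on separate paths.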
[0040] Graphics processors (such as GPUs) can render images in a variety of different ways. In some cases, a GPU can render an image using direct rendering and / or tiled rendering. In a tiled rendering GPU, an image can be divided or separated into different parts or tiles. After the image is divided, each part or tile can be rendered individually. A tiled rendering GPU can divide a computer graphics image into a grid format, such that each part of the grid (i.e., a tile) is rendered individually. In some aspects of tiled rendering, the image can be divided into different bins or tiles during binning passes. In some aspects, a visibility stream can be constructed during binning passes, where visible primitives or draw calls can be identified. A rendering pass can be performed after a binning pass. In contrast to tiled rendering, direct rendering does not divide a frame into smaller bins or tiles. Instead, in direct rendering, the entire frame is rendered at once (i.e., without binning passes). Additionally, some types of GPUs allow both tiled rendering and direct rendering (e.g., flex rendering).
[0041] In some aspects, a graphics processor (GPU) can apply drawing or rendering processes to different bins or tiles. For example, a GPU can render a bin and perform all draws for the primitives or pixels within that bin. During bin-based rendering, the render target can be located in GPU internal memory (GMEM). In some instances, after rendering a bin, the contents of the render target can be moved to system memory, and GMEM can be freed to render the next bin. Additionally, a GPU can render another bin and perform the draws for the primitives or pixels within that bin. Thus, in some aspects, there may be a small number of bins, for example four bins, covering all the draws on a surface. Furthermore, a GPU can cycle through all the draws within a bin but perform only the draws for visible draw calls, i.e., draw calls that include visible geometry. In some aspects, a visibility stream can be generated, for example in a binning pass, to determine the visibility information for each primitive in an image or scene. For example, such a visibility stream can identify whether a primitive is visible. In some aspects, this information can be used to remove invisible primitives, such that, for example, invisible primitives are not rendered in a rendering pass. Additionally, at least some primitives that are marked as visible can be rendered in the rendering pass.
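The bin-by-bin flow in paragraph [0041] (render a bin into GMEM, move the result to system memory, free GMEM for the next bin) can be sketched as a simple loop. This is a toy software model only: `render_frame`, the `shade` callback, and the use of nested lists for GMEM and system memory are all illustrative assumptions.

```python
def render_frame(width, height, bin_w, bin_h, shade):
    """Render a frame one bin at a time.

    shade(x, y) is a hypothetical per-pixel function standing in for all
    draws that touch that pixel.
    """
    system_memory = [[None] * width for _ in range(height)]
    for by in range(0, height, bin_h):
        for bx in range(0, width, bin_w):
            # Bins on the right and bottom edges may be smaller than bin_w x bin_h.
            w = min(bin_w, width - bx)
            h = min(bin_h, height - by)
            # Render the whole bin into a small on-chip buffer (GMEM stand-in).
            gmem = [[shade(bx + x, by + y) for x in range(w)] for y in range(h)]
            # Move the finished bin to system memory.
            for y in range(h):
                system_memory[by + y][bx:bx + w] = gmem[y]
            # gmem is rebuilt on the next iteration, modelling GMEM being freed.
    return system_memory


frame = render_frame(5, 3, 2, 2, lambda x, y: x + 10 * y)
```

The result is the same as rendering the whole frame at once; the point of the model is that only one bin-sized buffer is live at any time.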
[0042] In some aspects of tiled rendering, there can be multiple processing stages or passes. For example, rendering can be performed in two passes, such as a visibility or bin-visibility pass and a rendering or bin-rendering pass. During the visibility pass, the graphics processor can input the rendering workload, record the positions of primitives or triangles, and then determine which primitives or triangles fall into which bin or region. In some aspects of the visibility pass, the graphics processor can also identify or mark the visibility of each primitive or triangle in the visibility stream. During the rendering pass, the graphics processor can input the visibility stream and process one bin or region at a time. In some aspects, the visibility stream can be analyzed to determine which primitives or primitive vertices are visible or invisible. Therefore, only the visible primitives or primitive vertices need to be processed. By doing so, the graphics processor can reduce the unnecessary workload of processing or rendering invisible primitives or triangles.
[0043] In some aspects, certain types of primitive geometry, such as position-only geometry, can be processed during a visibility pass. Additionally, primitives can be sorted into different bins or regions based on their location or position. In some instances, sorting primitives or triangles into different bins can be performed by determining visibility information for those primitives or triangles. For example, the graphics processor can determine or write visibility information for each primitive in each bin or region to, for example, system memory. This visibility information can be used to determine or generate a visibility stream. In a rendering pass, the primitives in each bin can be rendered individually. In these cases, the visibility stream can be retrieved from memory and used to remove primitives that are not visible in that bin.
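The two passes described above can be sketched as follows. This is a minimal illustration only: it assumes axis-aligned bounding boxes stand in for primitives and a list of per-bin boolean flags stands in for the visibility stream. The names `Rect`, `make_bins`, `binning_pass`, and `rendering_pass` are hypothetical and do not come from this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x0: int
    y0: int
    x1: int  # exclusive
    y1: int  # exclusive


def overlaps(a, b):
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def make_bins(width, height, bin_w, bin_h):
    """Divide the surface into a grid of bins, row by row."""
    return [Rect(x, y, min(x + bin_w, width), min(y + bin_h, height))
            for y in range(0, height, bin_h)
            for x in range(0, width, bin_w)]


def binning_pass(bins, primitives):
    """Visibility stream: one flag per primitive for every bin."""
    return [[overlaps(b, p) for p in primitives] for b in bins]


def rendering_pass(visibility):
    """Process one bin at a time and skip primitives flagged invisible."""
    rendered = []
    for bin_index, flags in enumerate(visibility):
        for prim_index, visible in enumerate(flags):
            if visible:
                rendered.append((bin_index, prim_index))
    return rendered


bins = make_bins(8, 4, 4, 4)  # two 4x4 bins
primitives = [Rect(0, 0, 2, 2), Rect(3, 1, 6, 3), Rect(6, 2, 8, 4)]
visibility = binning_pass(bins, primitives)
rendered = rendering_pass(visibility)
```

Here the middle primitive straddles both bins and is therefore rendered twice, once per bin, while the other two are each skipped in the bin they do not touch.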
[0044] Several aspects of graphics processing unit (GPU) architecture offer multiple different options for rendering (e.g., software rendering and hardware rendering). In software rendering, the driver or CPU can replicate the entire frame geometry by processing each view one at a time. Additionally, some states can change depending on the viewpoint. Therefore, in software rendering, the software can replicate the entire workload, changing only the states used for rendering each viewpoint in the image. In some aspects, this can lead to increased overhead because the same workload may be submitted to the graphics processor once for each viewpoint in the image. In hardware rendering, the hardware or GPU may be responsible for replicating or processing the geometry for each viewpoint in the image. Therefore, the hardware can manage the replication or processing of primitives or triangles for each viewpoint in the image.
[0045] Figure 3 illustrates an image or surface 300, according to one or more techniques of this disclosure, that includes multiple primitives divided into multiple bins. For example, as shown in Figure 3, the image or surface 300 includes a region 302, which includes primitives 321, 322, 323, and 324. Primitives 321, 322, 323, and 324 are divided or placed into different bins, such as bins 310, 311, 312, 313, 314, and 315. Figure 3 shows an example of tiled rendering using multiple viewpoints for primitives 321-324. For example, primitives 321-324 are in a first viewpoint 350 and a second viewpoint 351. Therefore, the graphics processor can process or render the image or surface 300 including region 302 using multi-view rendering.
[0046] Computation kernels, also referred to as “matrix multiplication computation kernels” or simply “kernels,” map one or more data outputs to fibers, which are the basic units of a “wave” (sometimes called a “warp”). A “block” of data outputs can be mapped to a set of fibers that form a wave (or warp). One or more waves can form a workgroup, which is a basic unit of computational workload. The term “workload” can refer to a collection of one or more waves. In some examples, one or both input matrices and / or the output matrix of a matrix multiplication can be larger than the resources available at the streaming processor. In some such examples, one or more of the input and / or output matrices can be split into slices (or “blocks”) to fit the computation kernel workgroup. In some examples, the size of the computation kernel workgroup can be based on the streaming processor’s physical resources. For example, the size of the computation kernel workgroup can depend on the size of the streaming processor’s general-purpose registers, the size of the streaming processor’s local buffers, and / or the resources associated with the fibers. A computation kernel associated with an invalid draw call can be referred to as an “invalid computation kernel.”
[0047] As indicated herein, GPUs or graphics processors can use a tiled rendering architecture to reduce power consumption or save memory bandwidth. As further stated above, this rendering method divides the scene into multiple bins and includes a visibility pass that identifies the visible triangles within each bin. Therefore, in tiled rendering, the entire screen can be divided into multiple bins or tiles. The scene can then be rendered multiple times, for example, once or multiple times for each bin.
[0048] In various aspects of graphics rendering, some graphics applications may render a single target (i.e., the rendering target) once or multiple times. For example, in graphics rendering, the frame buffer on system memory can be updated multiple times. The frame buffer can be part of memory or random access memory (RAM) (e.g., containing bitmaps or storage devices) to help store display data for the GPU. The frame buffer can also be a memory buffer containing a complete frame of data. Additionally, the frame buffer can be a logical buffer. In some aspects, updating the frame buffer can be performed in bin or tile rendering, where, as discussed above, the surface is divided into multiple bins or tiles, and each bin or tile can then be rendered individually. Furthermore, in tile rendering, the frame buffer can be divided into multiple bins or tiles.
[0049] As indicated herein, in some aspects, such as in bin or tiled rendering architectures, frame buffers allow data to be repeatedly stored or written to them, for example, when rendering from different types of memory. This can be referred to as resolving and unresolving the frame buffers or system memory. For example, when storing or writing to one frame buffer and then switching to another, the data or information on the frame buffer can be resolved from the GMEM at the GPU to system memory, i.e., memory in double data rate (DDR) RAM or dynamic RAM (DRAM).
[0050] In some respects, system memory can also be system-on-chip (SoC) memory or another chip-based memory, such as on a device or smartphone, used for storing data or information. System memory can also be a physical data storage device shared by the CPU and / or GPU. In some respects, system memory can be, for example, a DRAM chip on a device or smartphone. Therefore, SoC memory can be a chip-based method for storing data.
[0051] In some respects, GMEM can be on-chip memory at the graphics processor, which can be implemented using static RAM (SRAM). Alternatively, GMEM can be stored on the device (e.g., a smartphone). As indicated herein, data or information can be transferred between system memory or DRAM and GMEM, for example, at the device. In some respects, system memory or DRAM can reside at the CPU or GPU. Furthermore, data can be stored in DDR or DRAM. In some respects, such as in bin or tiled rendering, a small portion of the memory can be stored at the graphics processor, for example, at GMEM. In some cases, storing data at GMEM may utilize a larger processing workload and / or consume more power compared to storing data at the frame buffer or system memory.
[0052] As described herein, a graphics processor can use a tiled rendering architecture to reduce power consumption or save memory bandwidth. As further stated above, this rendering method divides the scene into multiple bins and includes a visibility pass that identifies the visible triangles within each bin. Therefore, in tiled rendering, the entire screen can be divided into multiple bins or tiles. The scene can then be rendered multiple times, for example, once for each bin. As noted above, the graphics processor can cycle through each draw call within a bin and execute the visible draw calls. As used herein, the term "draw call" can refer to the process by which the CPU submits data to the graphics processor and then issues a command for the graphics processor to render an object. In some respects, a draw call falling into a given bin can be considered and referred to as an "active draw call" or "valid draw call," while a draw call falling outside the bin can be considered and referred to as an "invalid draw call."
[0053] A mechanism can be provided to create backpressure on the next batch in the GPU hardware pipeline for a cluster, for example, while the performance counters of all blocks in the cluster have not yet been read and transferred to the memory interface for the current batch. To do this, a driver can provide clear boundaries in the command stream between two workloads, which define the range over which performance is measured. These boundaries can be referred to as “global events” and can be inserted after each draw call or stage. A global event may be the existing context-done event for a draw call, or it may be a newly introduced soft event for a stage. Additionally, the workload can be a draw call or one of different types of stages, such as visibility passes, resolve passes, etc.
[0054] As an example, the command processor (CP) may allow execution of a single batch per cluster at a time. Until all blocks in the cluster have completed a batch and the debug controller (DBGC) has read the performance counters for those blocks from the register backbone management (RBBM), the CP may not send the programming for the next batch to the blocks in that cluster. A GPU block may not begin executing the next batch until it receives complete programming from the CP.
[0055] After completing a workload, each block in the GPU hardware pipeline can send a copy of the global event at the end of the workload (e.g., context done for a draw call or a soft event for a stage) to the DBGC, indicating that the block has completed a workload batch, and the DBGC can begin reading the performance counters associated with that block from the RBBM. The DBGC can use its existing trace buffer to buffer the block-specific performance counters read from the RBBM before sending them to the memory interface or trace bus.
[0056] After the DBGC reads all performance counters associated with a block from the RBBM, the DBGC may send an indication to the CP stating that the DBGC has finished reading all performance counters for the corresponding block in the cluster. Once the CP receives the indication from the DBGC for all blocks in the cluster, it can release the programming of the next batch to that cluster of blocks. This process may continue for all batches.
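The handshake between the CP and the DBGC described above can be modeled in software as a simple gate. The sketch below is illustrative only and assumes a single cluster with two blocks. The class names, method names, block names, and counter values are hypothetical, and the RBBM is reduced to a dictionary of counter values.

```python
class DebugController:
    """Hypothetical model of the DBGC and its trace buffer."""

    def __init__(self, rbbm_counters):
        self.rbbm_counters = rbbm_counters  # block name -> counter value
        self.trace_buffer = []

    def on_global_event(self, block, batch):
        # Read the block's counters and buffer them before they are sent on.
        self.trace_buffer.append((batch, block, self.rbbm_counters[block]))
        return block  # indication returned to the CP


class CommandProcessor:
    """Hypothetical model of the CP gating one batch per cluster."""

    def __init__(self, blocks):
        self.blocks = frozenset(blocks)
        self.waiting_on = set()

    def can_send_next_batch(self):
        return not self.waiting_on

    def send_batch(self, batch):
        if self.waiting_on:
            raise RuntimeError("backpressure: counters not yet read for all blocks")
        self.waiting_on = set(self.blocks)
        return batch

    def on_dbgc_indication(self, block):
        self.waiting_on.discard(block)


blocks = ["block_a", "block_b"]
cp = CommandProcessor(blocks)
dbgc = DebugController({"block_a": 120, "block_b": 75})

cp.send_batch(0)
blocked_before = not cp.can_send_next_batch()
for block in blocks:
    # Each block finishes, emits its global event, and the DBGC notifies the CP.
    cp.on_dbgc_indication(dbgc.on_global_event(block, batch=0))
released_after = cp.can_send_next_batch()
```

The next batch becomes eligible only after an indication has arrived for every block in the cluster, which is the backpressure behavior the paragraphs above describe.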
[0057] As used herein, the term “context slot” refers to a set of resources associated with a shader, including context registers, constant RAM, or other resources associated with the shader. Figure 4 is a diagram 400 illustrating an example technique for providing shared constants according to one or more techniques of this disclosure. In the example, shader processing may be associated with shader code corresponding to GPU instructions, resources (e.g., textures, surfaces, etc.) that can receive data for rendering frames, and / or constant data that can be loaded by an application.
[0058] Constant data (e.g., c1 through c8) can be constant throughout a draw call and can use blocks of dedicated constant RAM 410. While the same shader can be executed from draw call to draw call to render a portion of a frame, constants 1 through 8, indicated by constant data c1 through c8, can be changed to render different items / features within the frame, as constants 1 through 8 can indicate, for example, the relationship between lighting and items / features within the frame. In some aspects, new constant data can be loaded into constant RAM 410 for each draw call. For example, a driver can be configured to update the constant data c1 through c8 on a per-draw-call / per-shader basis. Constants 1 through 8 can be stored in constant buffers (e.g., constant RAM 410), and shaders can index into constant buffers to load constants 1 through 8. In some cases, constants 1 through 8 can be stored in multiple constant buffers. By generating constant buffers, updates to constants 1 through 8 can be performed on the CPU, as the driver can be configured to perform updates.
[0059] Some graphics application programming interfaces (APIs) can be configured to generate shared constants that can be shared across different shader stages (e.g., Value1 and Value2 in diagram 400). For example, a pixel shader configured to shade pixels based on textures, lighting algorithms, etc., may have constants shared with vertex shaders configured to rasterize primitives or triangles in a frame that includes pixels. Shared constants, also known as root constants or push constants, can be pushed directly from application 402 to the graphics processor with low overhead. In some examples, application 402 may include instructions such as SetSharedConstant(position A, Value1) and SetSharedConstant(position D, Value2) to indicate shared constants using register data A through F. The shader foobar(foo) 404 from application 402 may reference shared constants based on instructions such as r0 = r1 + sharedConstantA and r2 = r3 * sharedConstantD. After executing the first draw call for application 402 based on the instruction, application 402 may change the indication of the shared constant (e.g., SetSharedConstant(position A, Value1) may be changed to SetSharedConstant(position A, Value3)) and may execute the second draw call for application 402.
[0060] Depending on the graphics API configuration, shared constants can be 32-bit constants numbering from 32 to 128 (in contrast, some hardware configurations may have up to 512 regular / non-shared constants). As an example, some shaders may use more than 512 constants.
[0061] Context register 406 can be analogous to a pipelined register used for drawing state because context register 406 can be pipelined via GPU hardware. After shared constants are loaded into context register 406 (e.g., registers A and D), the information remains the same unless / until context register 406 is updated by the driver.
[0062] The shader hardware may also include a high-level sequencer (HLSQ) that prepares information for execution by the shader processor (SP). The HLSQ executes an early preamble / preamble shader, which may be part of the shader code executed before the main body of the shader is executed. The preamble shader may include Store Shared Constants (STSC) instructions that copy shared constants (e.g., Value1 in register A and Value2 in register D) from context register 406 to constant RAM 410 (e.g., constants 2 and 3). In the example, the shader foo preamble 408 may be executed on the shader hardware based on STSC instructions such as movToConstRam(sharedConstantRegA, constantRam2) and movToConstRam(sharedConstantRegD, constantRam3).
[0063] The preamble shader may not be configured to determine the substantive information / values indicated by shared constants. Instead, the preamble shader may determine that certain registers (such as register A and register D) contain shared constants associated with information used to perform a specific draw call. Therefore, the STSC instruction of the preamble shader can be executed to copy the shared constants included in certain registers into constant RAM 410. Other instructions in the preamble shader may include loading regular / non-shared constant data into constant RAM 410. Constant RAM 410 can be used as a cache set up before the main shader is executed by the SP. Because the shader can configure shared constants for the next call of the shader while the current call of the shader is executing, pipelined operations can be performed so that the current call of the shader is not affected by configuring the next call of the shader.
[0064] The HLSQ, which moves data from context register 406 to constant storage / constant RAM 410, executes the preamble shader once before each draw call. Constant data c1 through c8 can be loaded into constant RAM 410 for a specific draw call, and on the next draw call, constant data corresponding to that call can be loaded into constant RAM 410. The preamble shader can copy register data A through F from context register 406 to constant RAM 410 based on STSC instructions.
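The shared-constant path described above can be traced with a small software model. This is a sketch under stated assumptions, not the hardware behavior: context register 406 is modeled as a dictionary keyed by the register names A through F, constant RAM 410 as an eight-entry list, and the numeric values stand in for Value1, Value2, and Value3. The Python function names mirror the SetSharedConstant and movToConstRam instructions quoted in the text but are otherwise hypothetical.

```python
context_register = {name: None for name in "ABCDEF"}  # stands in for 406
constant_ram = [0.0] * 8                              # stands in for 410


def set_shared_constant(position, value):
    """Application side: push a shared constant into a context register."""
    context_register[position] = value


def mov_to_const_ram(register_name, ram_index):
    """STSC-style copy from a context register into constant RAM."""
    constant_ram[ram_index] = context_register[register_name]


def foo_preamble():
    # The preamble copies registers without interpreting their values.
    mov_to_const_ram("A", 2)
    mov_to_const_ram("D", 3)


def foo_main(r1, r3):
    r0 = r1 + constant_ram[2]  # r0 = r1 + sharedConstantA
    r2 = r3 * constant_ram[3]  # r2 = r3 * sharedConstantD
    return r0, r2


def draw(r1, r3):
    foo_preamble()  # executed once before each draw call
    return foo_main(r1, r3)


set_shared_constant("A", 1.5)   # assumed Value1
set_shared_constant("D", 2.0)   # assumed Value2
first_draw = draw(1.0, 3.0)

set_shared_constant("A", 10.0)  # assumed Value3 replaces Value1
second_draw = draw(1.0, 3.0)
```

Because the preamble runs before every draw, the second draw picks up the new value in register A without any change to the main shader body.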
[0065] The shader compiler can be configured to determine which shared constants the shader will use and can move the determined shared constants from register data A through F into constant RAM 410. After register data A through F are loaded into the constant storage for the graphics processor, shader execution can proceed as normal, regardless of whether constants 1 through 8 are shared constants or regular constants. That is, after the shader constant state is loaded into the shader constant RAM, the shader constant state can be moved into the working constant RAM, similar to other constant data used for shader execution. By allowing the shader compiler to determine the shared constants available for shader execution, such constants can be loaded individually, rather than by loading an entire block of shared constants.
[0066] Figure 5 is a diagram 500 illustrating an example workflow for processing draw calls and global events according to one or more techniques of this disclosure.
[0067] As illustrated in Figure 5, after obtaining the set of draw calls, a first depth-testing module, such as a low-resolution Z (LRZ) module 502, can process the set of draw calls and detect valid draw calls falling within the bin and invalid draw calls falling outside the bin. The depth buffer, also referred to as a Z-buffer, can be a data buffer used to represent depth information of objects in three-dimensional (3D) space from a specific viewpoint. LRZ can be a process in which the LRZ module 502 uses a low-resolution buffer to store depth data associated with pixel blocks rather than with each pixel of each of multiple tiles. The low-resolution buffer can be a two-dimensional buffer with multiple storage locations. Each storage location in the low-resolution buffer can correspond to a pixel block. In some examples, the number of storage locations within the low-resolution buffer can be less than the number of pixels. LRZ data can be depth data of a pixel block (e.g., a 2×2 pixel block) containing the backmost depth value of that pixel block. A tile can be associated with one or more LRZ data sets. For example, given a tile that is an 8×8 pixel block, the tile may include 16 LRZ data points, each associated with a given 2×2 pixel block of the tile and each containing the backmost depth value for that 2×2 pixel block. LRZ module 502 can detect invalid draw calls that fall outside the bin and do not need to be performed / processed by SP 512, as well as valid draw calls that fall within the bin. The detection results can be sent to the HLSQ Data Parsing (DP) unit 508 or the Shader Processor Control Module (SPCM) 510, which controls SP 512, so that the detection results can be stored. The detection results can also be sent to the Early Z module 504.
[0068] HLSQ state management block 524 is responsible for managing context slots, making them available (e.g., to SP 512 and SPCM 510) for wave construction. HLSQ state management block 524 receives global event information and associated context information from CP 520 and can provide state programming information about graphics processor state programming commands. As an example, the graphics processor can execute various commands, such as commands issued to the graphics processor by the application processor (e.g., CPU). Commands that can be executed by the graphics processor may include, for example, draw call commands, graphics processor state programming commands, memory transfer commands, general computing commands, kernel execution commands, etc.
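The arithmetic in the tile example above (an 8×8 tile with 2×2 pixel blocks yields 16 LRZ entries) can be checked with a short sketch. It assumes that the value kept per block is the farthest depth in that block and that larger depth values are farther from the viewer. The function names `build_lrz` and `lrz_rejects` and the synthetic depth values are hypothetical.

```python
def build_lrz(depth, block=2):
    """Reduce a per-pixel depth tile to one value per block x block group.

    Assumes the tile dimensions are multiples of `block`.
    """
    height, width = len(depth), len(depth[0])
    return [[max(depth[y + dy][x + dx]
                 for dy in range(block) for dx in range(block))
             for x in range(0, width, block)]
            for y in range(0, height, block)]


def lrz_rejects(lrz, block_x, block_y, incoming_depth):
    """Conservative test: reject only what lies behind the whole block."""
    return incoming_depth > lrz[block_y][block_x]


# An 8x8 tile of synthetic depth values in [0, 0.875].
tile = [[(x + y) / 16 for x in range(8)] for y in range(8)]
lrz = build_lrz(tile)
entry_count = sum(len(row) for row in lrz)
```

The low-resolution buffer has a quarter as many storage locations as the tile has pixels, and a fragment is rejected early only when it is farther than every pixel already recorded for its block.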
[0069] After the set of draw calls is processed by the LRZ module 502, it can be further processed by the Early Z module 504. The Early Z module 504 can use the Early Z algorithm to perform an early depth check to remove overdrawn fragments before the graphics processor wastes time running shader work on them. The Early Z module 504 can test fragments or tiles (fragment groups) against the depth buffer (before the fragment shader or SP 512). The Early Z module 504 can detect invalid draw calls that fall outside the bin and do not need to be performed / processed by SP 512, as well as valid draw calls that fall within the bin.
[0070] The output of the Early Z module 504, which may include final invalid or final valid information about a draw call, can be fed to a render back-end (RB) sampler 506, which can be used for pixel manipulation. In some aspects, the RB sampler 506 can transmit a workload and corresponding red (R), green (G), and blue (B) (RGB) information about pixels and their positions to the HLSQ DP unit 508, which can respond to the transmitted workload and construct the workload into waves that can be sent to the SP 512. The SP 512 can process the waves and transmit them to an RB color (RB-C) or late Z module 514, which is configured to perform blending and send information to the framebuffer.
[0071] In some aspects, the output of the Early Z module 504 (which may include final invalid or final valid information about draw calls) can be fed into the HLSQ DP unit 508 or SPCM 510. Based on the final invalid or final valid information, the HLSQ DP unit 508 or SPCM 510 can store invalid draw call, valid draw call, and global event information in various context slots (e.g., in constant RAM associated with the various context slots), such that a single context slot can hold information for a valid draw call together with one or more invalid draw calls and one or more global events. Information about the various invalid draw calls, valid draw calls, and global events can be fed sequentially into the HLSQ DP unit 508 or SPCM 510 and stored sequentially in one or more context slots on that basis.
[0072] For example, in some aspects, HLSQ DP unit 508 or SPCM 510 may consider a first context slot to be full after storing a first valid draw call in the first context slot. In some aspects, draw calls or global events following the first valid draw call may be stored in subsequent context slots. In some aspects, HLSQ DP unit 508 or SPCM 510 may consider a context slot not full until a valid draw call is stored in that context slot. Therefore, HLSQ DP unit 508 or SPCM 510 may keep storing information about invalid draw calls and global events in the context slot until HLSQ DP unit 508 or SPCM 510 processes a valid draw call and stores information about the valid draw call in the context slot.
[0073] In some aspects, instead of being fed into HLSQ DP unit 508 or SPCM 510, the output of the Early Z module 504 (which may include final invalid or final valid information about draw calls) can be fed into HLSQ state management block 524. Because HLSQ state management block 524 can also receive global event information and associated context information from CP 520, in some aspects, HLSQ state management block 524 can store invalid draw call, valid draw call, and global event information. For example, in some aspects, HLSQ state management block 524 can consider a first context slot full after storing a first valid draw call in the first context slot. In some aspects, draw calls or global events following the first valid draw call can be stored in subsequent context slots. In some aspects, HLSQ state management block 524 can consider a context slot not full until a valid draw call is stored in that context slot. Therefore, the HLSQ state management block 524 may keep storing information about invalid draw calls and global events in the context slot until the HLSQ state management block 524 processes a valid draw call (e.g., the received information about the valid draw call) and stores the information about the valid draw call in the context slot.
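The slot-filling rule described above (a context slot stays open for invalid draw calls and global events and becomes full once a valid draw call is stored) can be expressed as a small state model. This is a sketch only; the class names `ContextSlot` and `SlotManager`, the string tags, and the labels are hypothetical, and the model ignores slot capacity and slot release.

```python
from dataclasses import dataclass, field

VALID = "valid_draw"
INVALID = "invalid_draw"
GLOBAL_EVENT = "global_event"


@dataclass
class ContextSlot:
    entries: list = field(default_factory=list)

    @property
    def is_full(self):
        # A slot is treated as full once it holds a valid draw call.
        return any(kind == VALID for kind, _ in self.entries)


class SlotManager:
    """Stores incoming items in order and opens a new slot only when needed."""

    def __init__(self):
        self.slots = []

    def store(self, kind, label):
        if not self.slots or self.slots[-1].is_full:
            self.slots.append(ContextSlot())
        self.slots[-1].entries.append((kind, label))
        return len(self.slots) - 1  # index of the slot that received the item


manager = SlotManager()
placements = [
    manager.store(GLOBAL_EVENT, "event0"),
    manager.store(INVALID, "draw0"),
    manager.store(INVALID, "draw1"),
    manager.store(VALID, "draw2"),
    manager.store(INVALID, "draw3"),
]
```

The first four items share one slot because the valid draw call arrives last among them, and the item that follows is placed in a new slot.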
[0074] Figure 6 is a diagram 600 illustrating context slots for storing draw calls and global events according to one or more techniques of this disclosure. For example, as illustrated in Figure 6, for the first SP, SP0 602, information 606A regarding the first draw call may indicate a valid draw call and may be stored in the first context slot 604A associated with SP0 602 (e.g., in constant RAM associated with the first context slot 604A). After storing the information 606A regarding the valid draw call in the first context slot 604A, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may consider the first context slot 604A to be full and store subsequent information in a different context slot.
[0075] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive global event information 608 after receiving information 606A regarding a valid draw call, and may store the global event information 608 in a second context slot 604B associated with SP0 602 (e.g., in constant RAM associated with the second context slot 604B). Because the second context slot 604B has not yet stored information regarding any valid draw call after storing the global event information 608, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the second context slot 604B.
[0076] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 606B regarding an invalid draw call after receiving global event information 608. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store the information 606B regarding the invalid draw call in a second context slot 604B. Because the second context slot 604B has not yet stored any information regarding a valid draw call after storing the information 606B regarding the invalid draw call, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the second context slot 604B.
[0077] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 606C regarding an invalid draw call after receiving information 606B regarding an invalid draw call. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store information 606C regarding the invalid draw call in a second context slot 604B. Because the second context slot 604B has not yet stored any information regarding a valid draw call after storing information 606C regarding the invalid draw call, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the second context slot 604B.
[0078] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 606D about a valid draw call after receiving information 606C about an invalid draw call. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store information 606D about the valid draw call in the second context slot 604B. Because the second context slot 604B now stores information about a valid draw call after storing information 606D, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may consider the second context slot 604B to be full and may not continue to store information in the second context slot 604B.
[0079] As illustrated in Figure 6, for the second SP, SP1 612, information 616A regarding the first draw call may indicate a valid draw call and may be stored in the first context slot 614A associated with SP1 612 (e.g., in constant RAM associated with the first context slot 614A). After storing the information 616A regarding the valid draw call in the first context slot 614A, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may consider the first context slot 614A to be full and store subsequent information in a different context slot.
[0080] In some respects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive global event information 618 after receiving information 616A, and may store global event information 618 in a second context slot 614B associated with SP1 612 (e.g., in constant RAM associated with the second context slot 614B). Because the second context slot 614B has not yet stored information about any valid draw calls after storing global event information 618, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the second context slot 614B.
[0081] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 616B regarding a valid draw call after receiving global event information 618. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store the information 616B regarding the valid draw call in the second context slot 614B. After storing the information 616B regarding the valid draw call in the second context slot 614B, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may consider the second context slot 614B to be full and may not store additional draw calls or global event information in the second context slot 614B.
[0082] In some respects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 616C regarding an invalid draw call after receiving information 616B regarding a valid draw call. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store the information 616C regarding the invalid draw call in a third context slot 614C associated with SP1 612. Because the third context slot 614C has not yet stored any information regarding a valid draw call after storing the information 616C regarding the invalid draw call, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the third context slot 614C.
[0083] In some respects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 616D regarding an invalid draw call after receiving information 616C regarding an invalid draw call. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store information 616D regarding the invalid draw call in a third context slot 614C associated with SP1 612. Because the third context slot 614C has not yet stored any information regarding a valid draw call after storing information 616D regarding the invalid draw call, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the third context slot 614C.
[0084] As illustrated in Figure 6, for the third SP, SP2 622, information 626A regarding the first draw call can indicate a valid draw call and can be stored in the first context slot 624A associated with SP2 622. After storing the information 626A regarding the valid draw call in the first context slot 624A, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 can consider the first context slot 624A to be full and store subsequent information in a different context slot.
[0085] In some respects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive global event information 628 after receiving information 626A, and may store global event information 628 in a second context slot 624B associated with SP2 622 (e.g., in constant RAM associated with the second context slot 624B). Because the second context slot 624B has not yet stored information about any valid draw calls after storing global event information 628, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the second context slot 624B.
[0086] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 626B regarding an invalid draw call after receiving global event information 628. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store the information 626B regarding the invalid draw call in the second context slot 624B. Because the second context slot 624B has not yet stored any information regarding a valid draw call after storing the information 626B regarding the invalid draw call in the second context slot 624B, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the second context slot 624B.
[0087] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 626C regarding a valid draw call after receiving information 626B regarding an invalid draw call. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store the information 626C regarding the valid draw call in the second context slot 624B. After storing the information 626C regarding the valid draw call in the second context slot 624B, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may consider the second context slot 624B to be full and may not store additional draw call or global event information in the second context slot 624B.
[0088] In some aspects, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may receive information 626D regarding an invalid draw call after receiving information 626C regarding a valid draw call. SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may store the information 626D regarding the invalid draw call in a third context slot 624C associated with SP2 622. Because the third context slot 624C has not yet stored any information regarding a valid draw call after storing the information 626D regarding the invalid draw call, SPCM 510, HLSQ DP unit 508, or HLSQ state management block 524 may continue to store information in the third context slot 624C.
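The slot-filling behavior walked through for SP2 622 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the hardware logic: the names `ContextSlot` and `assign_slots` and the tags "valid", "invalid", and "event" are hypothetical, and the only rule modeled is the one described above (a slot keeps accepting global events and invalid draw calls until it stores a valid draw call, after which it is treated as full).

```python
from dataclasses import dataclass, field


@dataclass
class ContextSlot:
    """Hypothetical model of one context slot of a shader processor."""
    entries: list = field(default_factory=list)
    full: bool = False  # set once a valid draw call has been stored


def assign_slots(stream):
    """Place (label, kind) items into slots in arrival order.

    kind is "valid", "invalid", or "event" (assumed tags).
    """
    slots = []
    for label, kind in stream:
        # Open a new slot when there is none yet or the current one is full.
        if not slots or slots[-1].full:
            slots.append(ContextSlot())
        slots[-1].entries.append(label)
        # A valid draw call closes the slot; events and invalid draws do not.
        if kind == "valid":
            slots[-1].full = True
    return slots


# The SP2 622 sequence described for Figure 6.
sp2_stream = [
    ("626A", "valid"),
    ("628", "event"),
    ("626B", "invalid"),
    ("626C", "valid"),
    ("626D", "invalid"),
]
sp2_slots = assign_slots(sp2_stream)
for index, slot in enumerate(sp2_slots):
    print(index, slot.entries, "full" if slot.full else "open")
```

In this sketch the five items land in three slots, corresponding to context slots 624A, 624B, and 624C, and the last slot stays open for further global events or invalid draw calls.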
[0089] Figure 7 is a communication flow diagram 700 illustrating example communication between a graphics processor, a CPU, and a memory according to one or more techniques of this disclosure.
[0090] As shown in Figure 7, diagram 700 includes example communication between a graphics processor 702 (e.g., a GPU or other graphics processor), a CPU 704 (or other central processing unit) or another graphics processor component, and a memory 706 (e.g., GMEM or SYSMEM), according to one or more techniques of this disclosure.
[0091] At 710, the graphics processor 702 may obtain (e.g., from the CPU 704) an instruction 712 for at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. In some aspects, the set of draw calls and the set of global events are configured for execution by a shader processor (SP) at the GPU (e.g., at the graphics processor 702).
[0092] At 720, the graphics processor 702 can detect an invalid subset of draw calls and a valid subset of draw calls in the set of draw calls. In some aspects, to detect the invalid and valid subsets of draw calls in the set of draw calls, the graphics processor 702 can perform a first detection at a first graphics component to determine whether each draw call in the set of draw calls is an invalid or valid draw call, and a second detection at a second graphics component to determine whether each draw call in the set of draw calls is an invalid or valid draw call. In some aspects, the first graphics component is a first depth buffer, and the second graphics component is a second depth buffer. In some aspects, the first depth buffer is a low-resolution Z-buffer, and the second depth buffer is an early Z-buffer. In some aspects, the set of draw calls is a set of compute draw calls, and to detect the invalid and valid subsets of draw calls in the set of draw calls, the graphics processor 702 can detect whether each compute draw call in the set of compute draw calls is an invalid compute draw call or a valid compute draw call. In some aspects, the invalid compute draw calls are invalid compute kernels, wherein the invalid compute kernels are associated with a subset of SPs in a set of shader processors (SPs).
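The two-stage detection can be pictured as two predicates applied to each draw call. The sketch below is illustrative only and rests on several assumptions: the disclosure does not say how the two detections are combined, so the sketch assumes a draw call is treated as invalid when either stage finds nothing that passes its depth test. The `DrawCall` fields, the functions `lrz_detect` and `early_z_detect`, the use of a single scalar depth per buffer, and the smaller-is-nearer depth convention are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class DrawCall:
    name: str
    coarse_min_depth: float  # nearest depth of the draw call (assumed coarse input)
    fragment_depths: list    # per-fragment depths (assumed fine input)


def lrz_detect(draw, coarse_buffer_depth):
    """First detection (assumed): coarse test against a low-resolution Z value."""
    return draw.coarse_min_depth < coarse_buffer_depth  # True = may be visible


def early_z_detect(draw, buffer_depth):
    """Second detection (assumed): at least one fragment passes the depth test."""
    return any(depth < buffer_depth for depth in draw.fragment_depths)


def classify(draws, coarse_buffer_depth, buffer_depth):
    """Split draw calls into (valid, invalid) name lists, preserving order."""
    valid, invalid = [], []
    for draw in draws:
        visible = (lrz_detect(draw, coarse_buffer_depth)
                   and early_z_detect(draw, buffer_depth))
        (valid if visible else invalid).append(draw.name)
    return valid, invalid


draws = [
    DrawCall("A", 0.2, [0.2, 0.3]),   # passes both checks
    DrawCall("B", 0.9, [0.9, 0.95]),  # rejected by the coarse check
    DrawCall("C", 0.6, [0.6, 0.7]),   # passes the coarse check, fails the fine check
]
valid, invalid = classify(draws, coarse_buffer_depth=0.8, buffer_depth=0.5)
print(valid, invalid)
```

Running the coarse check first mirrors the order in which the first and second detections are named; a real depth buffer holds one value per tile or pixel rather than a single scalar.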
[0093] At 730, the graphics processor 702 may store information for a subset of invalid draw calls and information for a set of global events in a first context slot of the context slot set. In some aspects, in order to store information for a subset of invalid draw calls and information for a set of global events, the graphics processor 702 may combine the information for the subset of invalid draw calls and the information for the set of global events in the first context slot.
[0094] At 740, the graphics processor 702 may store, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events. In some aspects, to store the information for the valid draw call, the graphics processor 702 may combine the information for the valid draw call with the information for the invalid subset of draw calls and the information for the set of global events in the first context slot. In some aspects, to combine the information for the valid draw call with the information in the first context slot, the graphics processor 702 may reuse the storage space in the first context slot for the information for the valid draw call.
[0095] At 770, the graphics processor 702 may perform at least one of a cache warm-up operation or a data preload operation for a valid draw call based on stored information for the valid draw call.
[0096] At 780, the graphics processor 702 may, based on the stored information, avoid performing at least one of a cache warm-up operation or a data preloading operation for the invalid subset of draw calls.
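Steps 770 and 780 together amount to a conditional prefetch: warm-up or preload work is issued only for draw calls marked valid. A minimal sketch follows, assuming that each stored entry is a (label, kind, addresses) tuple and that the cache can be modeled as a Python set of addresses. The function `warm_cache`, the tuple layout, and the addresses are hypothetical and are not taken from the disclosure.

```python
def warm_cache(slot_entries, cache):
    """Preload addresses for valid draw calls only; return the labels skipped."""
    skipped = []
    for label, kind, addresses in slot_entries:
        if kind == "valid":
            cache.update(addresses)  # warm-up / preload for the valid draw call
        else:
            skipped.append(label)    # no warm-up is issued for an invalid draw call
    return skipped


cache = set()
entries = [
    ("draw_1", "invalid", [0x100, 0x140]),
    ("draw_2", "valid", [0x200, 0x240]),
]
skipped = warm_cache(entries, cache)
print(sorted(hex(address) for address in cache), skipped)
```

Only the addresses of `draw_2` end up in the modeled cache, so no prefetch work is spent on the draw call that was detected as invalid.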
[0097] At 750, the graphics processor 702 can process the set of workloads for valid draw calls based on stored information for valid draw calls, stored information for a subset of invalid draw calls, and stored information for a set of global events.
[0098] At 760, the graphics processor 702 may output an indication of the processed set of workloads for the valid draw call. In some aspects, the graphics processor 702 may output the indication by sending it, at 764, to, for example, the CPU 704. In some aspects, the graphics processor 702 may output the indication by storing it, at 766, in a memory or cache, for example, memory 706. In some aspects, the graphics processor 702 may output the indication by sending it to at least one graphics component, such as the rendering back-end (RB) in a graphics processing unit (GPU) or the vertex positioning cache (VPC) in the GPU.
[0099] Figure 8 is a flowchart 800 of an example method for graphics processing according to one or more techniques of this disclosure. The method can be performed by an apparatus such as a graphics processing device, a GPU, a CPU, a shader processor, a wireless communication device, etc., as used in connection with the aspects of Figures 1 to 7.
[0100] At 802, the device can obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. For example, referring to Figure 5, the device may (e.g., via CP 520) obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. Furthermore, 802 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 710 in Figure 7, the graphics processor 702 may obtain an instruction 712 for at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing.
[0101] At 804, the device can detect an invalid subset of draw calls and a valid subset of draw calls within the set of draw calls. For example, referring to Figure 5, the device can (e.g., based on LRZ module 502 and early Z module 504) detect the invalid and valid subsets of draw calls in the set of draw calls. Furthermore, 804 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 720 in Figure 7, the graphics processor 702 can detect invalid and valid subsets of draw calls in the set of draw calls.
[0102] At 806, the device can store information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots. For example, referring to Figure 5, the device (e.g., by utilizing HLSQ DP unit 508) can store information for the invalid subset of draw calls (e.g., 606B and 606C) and information for the set of global events (e.g., 608) in a first context slot (e.g., 604B) of the set of context slots. Furthermore, 806 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 730 in Figure 7, the graphics processor 702 may store information for a subset of invalid draw calls and information for a set of global events in a first context slot of the set of context slots.
[0103] At 808, the device can store, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events. For example, referring to Figure 5, the device (e.g., by utilizing HLSQ DP unit 508) can store, in the first context slot, information for a valid draw call (e.g., 606D) in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events. Furthermore, 808 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 740 in Figure 7, the graphics processor 702 may store, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events.
[0104] At 810, the device can process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events. For example, referring to Figure 5, the device can (e.g., using SP 512) process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events. Furthermore, 810 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 750 in Figure 7, the graphics processor 702 can process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events.
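The five steps 802 to 810 can be strung together in a short Python sketch. It is a simplified illustration under stated assumptions rather than the claimed method: the dictionary layout of a command, the `is_valid` callback, the `run_method` name, and the "wave" workload labels are hypothetical, and the sketch handles only the case of a single first context slot that receives one valid draw call. The identifiers 608, 606B, 606C, and 606D are borrowed from the Figure 6 example cited above.

```python
def run_method(commands, is_valid):
    """Walk steps 802-810 for a command stream feeding one context slot."""
    # 802: obtain the draw calls and global events.
    items = list(commands)

    # 804: split the draw calls into invalid and valid subsets.
    draws = [c for c in items if c["type"] == "draw"]
    invalid = [d["id"] for d in draws if not is_valid(d)]
    valid = [d["id"] for d in draws if is_valid(d)]
    events = [c["id"] for c in items if c["type"] == "event"]

    # 806: store the invalid subset and the global events in the first slot.
    first_slot = {"invalid": invalid, "events": events, "valid": None}

    # 808: store one valid draw call in the same slot.
    first_slot["valid"] = valid[0] if valid else None

    # 810: process workloads for the valid draw call using the stored information.
    if first_slot["valid"] is None:
        return first_slot, []
    # Two placeholder workloads per valid draw call (arbitrary, for illustration).
    workloads = [f"{first_slot['valid']}:wave{i}" for i in range(2)]
    return first_slot, workloads


commands = [
    {"type": "event", "id": "608"},
    {"type": "draw", "id": "606B", "visible": False},
    {"type": "draw", "id": "606C", "visible": False},
    {"type": "draw", "id": "606D", "visible": True},
]
slot, workloads = run_method(commands, is_valid=lambda d: d["visible"])
print(slot)
print(workloads)
```

How the stored information for invalid draw calls and global events influences workload processing is not specified here, so the sketch only keeps that information alongside the valid draw call in the same slot.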
[0105] Figure 9 is a flowchart 900 of an example method for graphics processing according to one or more techniques of this disclosure. The method can be performed by an apparatus such as a graphics processing device, a GPU, a CPU, a shader processor, a wireless communication device, etc., as used in connection with the aspects of Figures 1 to 7.
[0106] At 902, the device can obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. For example, referring to Figure 5, the device may (e.g., via CP 520) obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. Furthermore, 902 may be performed by the processing unit 120 in Figure 1. In some aspects, the set of draw calls and the set of global events are configured for execution by a shader processor (SP) (e.g., SP 512) at the graphics processor. As another example, as described in connection with 710 in Figure 7, the graphics processor 702 may obtain an instruction 712 for at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing.
[0107] At 904, the device can detect an invalid subset of draw calls and a valid subset of draw calls within the set of draw calls. For example, referring to Figure 5, the device can (e.g., based on LRZ module 502 and early Z module 504) detect the invalid and valid subsets of draw calls in the set of draw calls. Furthermore, 904 may be performed by the processing unit 120 in Figure 1. In some aspects, as part of 904, to detect the invalid and valid subsets of draw calls in the set of draw calls, at 942, the device may perform a first detection at a first graphics component (e.g., LRZ module 502) to determine whether each draw call in the set of draw calls is an invalid or valid draw call. In some aspects, as part of 904, to detect the invalid and valid subsets of draw calls in the set of draw calls, at 944, the device may perform a second detection at a second graphics component (e.g., early Z module 504) to determine whether each draw call in the set of draw calls is an invalid or valid draw call. In some aspects, the first graphics component is a first depth buffer, and the second graphics component is a second depth buffer. In some aspects, the first depth buffer is a low-resolution Z-buffer, and the second depth buffer is an early Z-buffer. In some aspects, the set of draw calls is a set of compute draw calls, and to detect the invalid and valid subsets of draw calls within the set of draw calls, at 946, the device can detect whether each compute draw call in the set of compute draw calls is an invalid or valid compute draw call. In some aspects, the invalid compute draw calls are invalid compute kernels, wherein the invalid compute kernels are associated with a subset of SPs in a set of shader processors (SPs). Furthermore, 942, 944, and 946 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 720 in Figure 7, the graphics processor 702 can detect invalid and valid subsets of draw calls in the set of draw calls.
[0108] At 906, the device can store information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots. For example, referring to Figure 5, the device (e.g., by utilizing HLSQ DP unit 508) can store information for the invalid subset of draw calls (e.g., 606B and 606C) and information for the set of global events (e.g., 608) in a first context slot (e.g., 604B) of the set of context slots. Furthermore, 906 may be performed by the processing unit 120 in Figure 1. In some aspects, to store the information for the invalid subset of draw calls and the information for the set of global events, as part of 906, at 962, the device may combine the information for the invalid subset of draw calls with the information for the set of global events in the first context slot. For example, as illustrated in Figure 6, in context slot 604B, information for the invalid subset of draw calls (e.g., 606C and 606B) can be combined with information for the set of global events (e.g., 608). Furthermore, 962 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 730 in Figure 7, the graphics processor 702 may store information for a subset of invalid draw calls and information for a set of global events in a first context slot of the set of context slots.
[0109] In some aspects, the device can, based on the stored information, avoid performing at least one of a cache warm-up operation or a data preloading operation for the invalid subset of draw calls.
[0110] At 908, the device can store, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events. For example, referring to Figure 5, the device (e.g., by utilizing HLSQ DP unit 508) can store, in the first context slot, information for a valid draw call (e.g., 606D) in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events. Furthermore, 908 may be performed by the processing unit 120 in Figure 1. In some aspects, to store the information for the valid draw call, as part of 908, at 982, the device may combine the information for the valid draw call with the information for the invalid subset of draw calls and the information for the set of global events in the first context slot. For example, as illustrated in Figure 6, in context slot 604B, information for the valid draw call (e.g., 606D) can be combined with information for the invalid subset of draw calls (e.g., 606C and 606B) and with information for the set of global events (e.g., 608). In some aspects, as part of 982, at 984, the device can reuse storage space in the first context slot (e.g., the storage space associated with context slot 604B) for the information for the valid draw call. Furthermore, 982 and 984 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 740 in Figure 7, the graphics processor 702 may store, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events.
[0111] At 910, the device can process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events. For example, referring to Figure 5, the device can (e.g., using SP 512) process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events. Furthermore, 910 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 750 in Figure 7, the graphics processor 702 can process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events.
[0112] In some aspects, at 912, the device may output an indication of the processed set of workloads for the valid draw call. As part of 912, in some aspects, at 914, the device may send the indication of the processed set of workloads for the valid draw call to at least one graphics component. In some aspects, the at least one graphics component includes at least one of: a rendering back-end (RB) in the graphics processor (e.g., RB color (RB-C) or post-Z module 514) or a vertex positioning cache (VPC) in the graphics processor. As part of 912, in some aspects, at 916, the device may store the indication of the processed set of workloads for the valid draw call in a first memory or cache. In some aspects, the first memory is system memory or global memory, and the cache is a system cache or a global cache. Furthermore, 912, 914, and 916 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 760 in Figure 7, the graphics processor 702 can output an indication of the processed set of workloads for the valid draw call.
[0113] In some aspects, at 918, the device can perform at least one of a cache warm-up operation or a data preloading operation for the valid draw call based on the stored information for the valid draw call. For example, SP 512 can perform at least one of a cache warm-up operation or a data preloading operation for the valid draw call based on the stored information for the valid draw call. Furthermore, 918 may be performed by the processing unit 120 in Figure 1. As another example, as described in connection with 770 in Figure 7, the graphics processor 702 can perform at least one of a cache warm-up operation or a data preloading operation for the valid draw call based on the stored information for the valid draw call.
[0114] In one configuration, a method or apparatus for graphics processing is provided. The apparatus may be a graphics processing unit (e.g., a GPU), a CPU, or some other processor capable of performing graphics processing. In various aspects, the apparatus may be the processing unit 120 within device 104, or it may be some other hardware within device 104 or another device. The apparatus may include components for obtaining information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with graphics processing or computational processing. The apparatus may also include components for detecting an invalid subset of draw calls and a valid subset of draw calls within the set of draw calls. The apparatus may further include components for storing information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots. The apparatus may also include components for storing, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events. The apparatus may further include components for processing a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events. The apparatus may further include components for outputting an indication of the processed set of workloads for the valid draw call. The apparatus may further include components for sending the indication of the processed set of workloads for the valid draw call to at least one graphics component. The apparatus may further include components for storing the indication of the processed set of workloads for the valid draw call in a first memory or cache.
The apparatus may further include components for performing a first detection at a first graphics component to determine whether each draw call in the set of draw calls is an invalid or valid draw call. The apparatus may further include components for performing a second detection at a second graphics component to determine whether each draw call in the set of draw calls is an invalid or valid draw call. The apparatus may further include components for detecting whether each compute draw call in a set of compute draw calls is an invalid or valid compute draw call. The apparatus may further include components for combining the information for the invalid subset of draw calls with the information for the set of global events in the first context slot. The apparatus may further include components for avoiding, based on the stored information, performing at least one of a cache warm-up operation or a data preloading operation for the invalid subset of draw calls. The apparatus may also include components for combining the information for the valid draw call with the information for the invalid subset of draw calls and the information for the set of global events in the first context slot. The apparatus may further include components for reusing storage space in the first context slot for the information for the valid draw call. The apparatus may also include components for performing at least one of a cache warm-up operation or a data preloading operation for the valid draw call based on the stored information for the valid draw call.
[0115] It should be understood that the specific order or hierarchy of blocks / steps in the processes, flowcharts, and / or call flow diagrams disclosed herein is merely illustrative of example approaches. Based on design preferences, the specific order or hierarchy of blocks / steps in these processes, flowcharts, and / or call flow diagrams may be rearranged. Furthermore, some blocks / steps may be combined and / or omitted. Other blocks / steps may also be added. The appended method claims present the elements of the various blocks / steps in an exemplary order, but are not intended to be limited to the specific order or hierarchy presented.
[0116] The foregoing description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects. Therefore, the claims are not intended to be limited to the aspects shown herein, but should be given the full scope consistent with the language of the claims, wherein, unless specifically stated otherwise, references to elements in the singular are not intended to mean “one and only one,” but rather “one or more.” The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
[0117] Unless specifically stated otherwise, the term "some" refers to one or more, and the term "or" may be interpreted as "and / or" where the context does not dictate otherwise. Combinations such as "at least one of A, B, or C", "one or more of A, B, or C", "at least one of A, B, and C", "one or more of A, B, and C", and "A, B, C, or any combination thereof" include any combination of A, B, and / or C, and may include multiple A, multiple B, or multiple C. Specifically, combinations such as "at least one of A, B, or C", "one or more of A, B, or C", "at least one of A, B, and C", "one or more of A, B, and C", and "A, B, C, or any combination thereof" may be only A, only B, only C, A and B, A and C, B and C, or A and B and C, wherein any such combination may include one or more members of A, B, or C. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known, or later come to be known, to those skilled in the art are expressly incorporated herein by reference and are intended to be covered by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public, whether or not such disclosure is explicitly recited in the claims. The terms “module,” “mechanism,” “element,” “device,” and the like may not be a substitute for the word “component.” Therefore, no claim element is to be interpreted as a functional component unless the element is explicitly recited using the phrase “component for…”. Unless otherwise stated, the phrase “processor” may refer to “any processor in one or more processors” (e.g., one processor in one or more processors, multiple (more than one) processors in one or more processors, or all processors in one or more processors), and the phrase “memory” may refer to “any memory in one or more memories” (e.g., one memory in one or more memories, multiple (more than one) memories in one or more memories, or all memories in one or more memories).
[0118] In one or more examples, the functionality described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, although the term "processing unit" is used throughout this disclosure, such a processing unit may be implemented in hardware, software, firmware, or any combination thereof. If any functionality, processing unit, technique, or other module described herein is implemented in software, then such functionality, processing unit, technique, or other module may be stored on or transmitted on a computer-readable medium as one or more instructions or code.
[0119] Computer-readable media may include computer data storage media and communication media, including any media that facilitates the transfer of computer programs from one place to another. In this way, computer-readable media may generally correspond to: (1) a tangible computer-readable storage medium that is non-transitory; or (2) a communication medium, such as a signal or carrier wave. Data storage media may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and / or data structures for implementing the techniques described in this disclosure. By way of example and not limitation, such computer-readable media may include RAM, ROM, EEPROM, compact disc read-only memory (CD-ROM) or other optical disc storage devices, magnetic disk storage devices, or other magnetic storage devices. As used herein, magnetic disks and optical discs include compact discs (CD), laser discs, optical discs, digital versatile discs (DVD), floppy disks, and Blu-ray discs, wherein magnetic disks typically reproduce data magnetically, while optical discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media. Computer program products may include computer-readable media.
[0120] The techniques disclosed herein can be implemented in a wide variety of devices or apparatuses, including wireless mobile phones, integrated circuits (ICs), or IC sets (e.g., chipsets). Various components, modules, or units are described in this disclosure to emphasize functional aspects of a device configured to perform the disclosed techniques, but they do not necessarily need to be implemented by different hardware units. Rather, as described above, various units can be combined in any hardware unit or provided by a collection of interoperable hardware units (including one or more processors as described above) combined with suitable software and / or firmware. Therefore, the term "processor" as used herein can refer to any of the above-described structures or any other structure suitable for implementing the techniques described herein. Furthermore, these techniques can be fully implemented in one or more circuit or logic elements.
[0121] The following aspects are merely illustrative and may be combined with other aspects or teachings described herein without limitation.
[0122] Aspect 1 is a graphics processing method, the method comprising: obtaining information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with the graphics processing or computational processing; detecting an invalid subset of draw calls and a valid subset of draw calls in the set of draw calls; storing information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots; storing, in the first context slot, information for a valid draw call in the valid subset of draw calls together with the stored information for the invalid subset of draw calls and the stored information for the set of global events; and processing a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events.
[0123] Aspect 2 may be combined with aspect 1 and also includes: outputting an indication of the processed set of workloads used for the effective drawing call.
[0124] Aspect 3 may be combined with aspect 2 and includes: outputting the indication to the processed set of workloads for the valid drawing call includes sending the indication to at least one graphics component to the processed set of workloads for the valid drawing call.
[0125] Aspect 4 may be combined with aspect 3 and includes: the at least one graphics component includes at least one of the following: a rendering back-end (RB) in the graphics processor or a vertex positioning cache (VPC) in the graphics processor.
[0126] Aspect 5 may be combined with any of aspects 2 to 4 and includes: outputting the indication of the processed set of workloads for the valid draw call includes: storing the indication of the processed set of workloads for the valid draw call in a first memory or a cache.
[0127] Aspect 6 may be combined with aspect 5 and includes: the first memory is system memory or global memory, and wherein the cache is system cache or global cache.
[0128] Aspect 7 may be combined with any of aspects 1 to 6 and includes: the set of draw calls and the set of global events are configured to be executed by a shader processor (SP) at a graphics processor.
[0129] Aspect 8 may be combined with any of aspects 1 to 7 and includes: detecting the invalid subset of draw calls and the valid subset of draw calls in the set of draw calls includes: performing a first detection at a first graphics component to determine whether each draw call in the set of draw calls is an invalid draw call or a valid draw call; and performing a second detection at a second graphics component to determine whether each draw call in the set of draw calls is an invalid draw call or a valid draw call.
[0130] Aspect 9 may be combined with aspect 8 and includes: the first graphics component is a first depth buffer, and the second graphics component is a second depth buffer.
[0131] Aspect 10 may be combined with aspect 9 and includes: the first depth buffer is a low-resolution Z-buffer, and the second depth buffer is an early Z-buffer.
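As a non-limiting illustration of aspects 8 to 10, the following Python sketch applies a coarse test against a low-resolution Z-buffer followed by a per-pixel early-Z test. This disclosure does not specify the test itself, so the sketch assumes a "less than" depth comparison, a low-resolution buffer that stores the farthest depth of each tile, and a draw call represented as a list of `(x, y, depth)` fragments; the function names are hypothetical. For brevity the second detection runs only when the first does not already mark the draw call invalid.

```python
def build_lrz(depth, tile):
    """Low-resolution Z-buffer: farthest depth within each tile x tile block."""
    height, width = len(depth), len(depth[0])
    return [
        [
            max(
                depth[y][x]
                for y in range(ty, min(ty + tile, height))
                for x in range(tx, min(tx + tile, width))
            )
            for tx in range(0, width, tile)
        ]
        for ty in range(0, height, tile)
    ]


def lrz_rejects(fragments, lrz, tile):
    """First detection: every fragment lies at or behind its tile's farthest depth."""
    return all(z >= lrz[y // tile][x // tile] for x, y, z in fragments)


def early_z_rejects(fragments, depth):
    """Second detection: every fragment fails the per-pixel depth test."""
    return all(z >= depth[y][x] for x, y, z in fragments)


def classify(fragments, depth, lrz, tile):
    if lrz_rejects(fragments, lrz, tile):
        return "invalid (first detection)"
    if early_z_rejects(fragments, depth):
        return "invalid (second detection)"
    return "valid"


# 4x4 depth buffer at depth 0.5, with one farther pixel at (0, 0).
depth = [[0.5] * 4 for _ in range(4)]
depth[0][0] = 0.9
lrz = build_lrz(depth, tile=2)

draws = {
    "A": [(3, 3, 0.6)],  # behind everything in its tile
    "B": [(1, 1, 0.7)],  # passes the coarse test, fails per pixel
    "C": [(0, 0, 0.7)],  # in front of the stored depth at (0, 0)
}
for name, fragments in draws.items():
    print(name, classify(fragments, depth, lrz, tile=2))
# prints:
# A invalid (first detection)
# B invalid (second detection)
# C valid
```

The coarse test is conservative: it can only reject a draw call whose fragments are all behind the farthest depth of their tiles, which is why draw call B needs the finer second detection.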
[0132] Aspect 11 may be combined with any of aspects 1 to 10 and includes: the set of draw calls is a set of compute draw calls, and detecting the invalid subset of draw calls and the valid subset of draw calls in the set of draw calls includes: detecting whether each compute draw call in the set of compute draw calls is an invalid compute draw call or a valid compute draw call.
[0133] Aspect 12 may be combined with aspect 11 and includes: the invalid compute draw call is an invalid compute kernel, and wherein the invalid compute kernel is associated with a subset of SPs in a set of shader processors (SPs).
[0134] Aspect 13 may be combined with any one of aspects 1 to 12 and includes: storing the information for the subset of invalid draw calls and the information for the set of global events includes: combining the information for the subset of invalid draw calls and the information for the set of global events in the first context slot.
[0135] Aspect 14 may be combined with any of aspects 1 to 13 and includes: avoiding the execution of at least one of a cache warm-up operation or a data preloading operation for the subset of invalid draw calls based on the stored information.
[0136] Aspect 15 may be combined with any one of aspects 1 to 14 and includes: storing the information for the valid draw call includes: combining the information for the valid draw call with the information for the subset of invalid draw calls and the information for the global event set in the first context slot.
[0137] Aspect 16 may be combined with aspect 15 and includes: combining the information for the valid draw call with the information for the subset of invalid draw calls and the information for the global event set includes: reusing storage space in the first context slot for the information for the valid draw call.
[0138] Aspect 17 may be combined with any of aspects 1 to 16 and includes: performing at least one of a cache warm-up operation or a data preloading operation for the valid draw call based on the stored information for the valid draw call.
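As a non-limiting illustration of aspects 14 and 17, the following Python sketch decides, from the information stored in a single context slot, which resources to warm up or preload and which to leave untouched. The slot layout (a dictionary with `dead_draws`, `global_events`, and `valid_draw` keys), the `resources_by_draw` mapping, the resource names, and the function `plan_warm_up` are assumptions made for illustration only.

```python
def plan_warm_up(slot, resources_by_draw):
    """Return (resources to warm up, resources skipped) for one context slot."""
    skipped = []
    for draw_id in slot["dead_draws"]:
        # Invalid draws contribute nothing to the output, so their
        # resources are never fetched ahead of time.
        skipped.extend(resources_by_draw.get(draw_id, []))
    # Only the valid draw in the slot triggers warm-up or preloading.
    warm = list(resources_by_draw.get(slot["valid_draw"], []))
    return warm, skipped


slot = {"dead_draws": [0, 1], "global_events": ["cache_flush"], "valid_draw": 2}
resources_by_draw = {0: ["tex_a"], 1: ["tex_b", "ubo_b"], 2: ["tex_c"]}

warm, skipped = plan_warm_up(slot, resources_by_draw)
print("warm up:", warm)
print("skipped:", skipped)
# prints:
# warm up: ['tex_c']
# skipped: ['tex_a', 'tex_b', 'ubo_b']
```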
[0139] Aspect 18 may be combined with any of aspects 1 to 17 and includes: the set of workloads for the valid draw call includes at least one of a set of graphics workloads or a set of compute workloads.
[0140] Aspect 19 is an apparatus for graphics processing, the apparatus including at least one processor coupled to a memory and configured to implement the method according to any one of aspects 1 to 18.
[0141] Aspect 20 may be combined with aspect 19 and includes: the apparatus is a wireless communication device.
[0142] Aspect 21 is an apparatus for graphics processing, the apparatus comprising means for implementing the method according to any one of aspects 1 to 18.
[0143] Aspect 22 is a computer-readable medium (e.g., a non-transitory computer-readable medium) storing computer-executable code that, when executed by at least one processor, causes the at least one processor to implement the method according to any one of aspects 1 to 18.
[0144] Various aspects have been described herein. These and other aspects are within the scope of the following claims.
Claims
1. An apparatus for graphics processing, the apparatus comprising: a memory; and a processor coupled to the memory and configured, based on information stored in the memory, to: obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with the graphics processing or compute processing; detect an invalid subset of draw calls and a valid subset of draw calls in the set of draw calls; store information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots; store information for a valid draw call in the valid subset of draw calls, together with the stored information for the invalid subset of draw calls and the stored information for the set of global events, in the first context slot; and process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events.
2. The apparatus of claim 1, wherein the processor is further configured to: output an indication of the processed set of workloads for the valid draw call.
3. The apparatus of claim 2, wherein, to output the indication of the processed set of workloads for the valid draw call, the processor is configured to: send the indication of the processed set of workloads for the valid draw call to at least one graphics component.
4. The apparatus of claim 3, wherein the at least one graphics component comprises at least one of: a rendering back-end (RB) in a graphics processor or a vertex positioning cache (VPC) in the graphics processor.
5. The apparatus of claim 2, wherein, to output the indication of the processed set of workloads for the valid draw call, the processor is configured to: store the indication of the processed set of workloads for the valid draw call in a first memory or a cache.
6. The apparatus of claim 5, wherein the first memory is a system memory or a global memory, and wherein the cache is a system cache or a global cache.
7. The apparatus of claim 1, wherein the set of draw calls and the set of global events are configured to be executed by a shader processor (SP) at a graphics processor.
8. The apparatus of claim 1, wherein, to detect the invalid subset of draw calls and the valid subset of draw calls in the set of draw calls, the processor is configured to: perform a first detection at a first graphics component to determine whether each draw call in the set of draw calls is an invalid draw call or a valid draw call; and perform a second detection at a second graphics component to determine whether each draw call in the set of draw calls is an invalid draw call or a valid draw call.
9. The apparatus of claim 8, wherein the first graphics component is a first depth buffer, and the second graphics component is a second depth buffer.
10. The apparatus of claim 9, wherein the first depth buffer is a low-resolution Z-buffer, and the second depth buffer is an early Z-buffer.
11. The apparatus of claim 1, wherein the set of draw calls is a set of compute draw calls, and wherein, to detect the invalid subset of draw calls and the valid subset of draw calls in the set of draw calls, the processor is configured to: detect whether each compute draw call in the set of compute draw calls is an invalid compute draw call or a valid compute draw call.
12. The apparatus of claim 11, wherein the invalid compute draw call is an invalid compute kernel, and wherein the invalid compute kernel is associated with a subset of SPs in a set of shader processors (SPs).
13. The apparatus of claim 1, wherein, to store the information for the invalid subset of draw calls and the information for the set of global events, the processor is configured to: combine, in the first context slot, the information for the invalid subset of draw calls with the information for the set of global events.
14. The apparatus of claim 1, wherein the processor is further configured to: refrain, based on the stored information, from performing at least one of a cache warm-up operation or a data preloading operation for the invalid subset of draw calls.
15. The apparatus of claim 1, wherein, to store the information for the valid draw call, the processor is configured to: combine, in the first context slot, the information for the valid draw call with the information for the invalid subset of draw calls and the information for the set of global events.
16. The apparatus of claim 15, wherein, to combine the information for the valid draw call with the information for the invalid subset of draw calls and the information for the set of global events, the processor is configured to: reuse storage space in the first context slot for the information for the valid draw call.
17. The apparatus of claim 1, wherein the processor is further configured to: perform, based on the stored information for the valid draw call, at least one of a cache warm-up operation or a data preloading operation for the valid draw call.
18. The apparatus of claim 1, wherein the set of workloads for the valid draw call includes at least one of a set of graphics workloads or a set of compute workloads.
19. The apparatus of claim 1, wherein the apparatus is a wireless communication device.
20. A method for graphics processing, the method comprising: obtaining information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with the graphics processing or compute processing; detecting an invalid subset of draw calls and a valid subset of draw calls in the set of draw calls; storing information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots; storing information for a valid draw call in the valid subset of draw calls, together with the stored information for the invalid subset of draw calls and the stored information for the set of global events, in the first context slot; and processing a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events.
21. The method of claim 20, further comprising: outputting an indication of the processed set of workloads for the valid draw call.
22. The method of claim 21, wherein outputting the indication of the processed set of workloads for the valid draw call comprises: sending the indication of the processed set of workloads for the valid draw call to at least one graphics component.
23. The method of claim 22, wherein the at least one graphics component comprises at least one of: a rendering back-end (RB) in a graphics processor or a vertex positioning cache (VPC) in the graphics processor.
24. The method of claim 21, wherein outputting the indication of the processed set of workloads for the valid draw call comprises: storing the indication of the processed set of workloads for the valid draw call in a first memory or a cache.
25. The method of claim 24, wherein the first memory is a system memory or a global memory, and wherein the cache is a system cache or a global cache.
26. The method of claim 20, wherein the set of draw calls and the set of global events are configured to be executed by a shader processor (SP) at a graphics processor.
27. The method of claim 20, wherein detecting the invalid subset of draw calls and the valid subset of draw calls in the set of draw calls comprises: performing a first detection at a first graphics component to determine whether each draw call in the set of draw calls is an invalid draw call or a valid draw call; and performing a second detection at a second graphics component to determine whether each draw call in the set of draw calls is an invalid draw call or a valid draw call.
28. The method of claim 27, wherein the first graphics component is a first depth buffer, and the second graphics component is a second depth buffer.
29. The method of claim 28, wherein the first depth buffer is a low-resolution Z-buffer, and the second depth buffer is an early Z-buffer.
30. A computer-readable medium storing computer-executable code for graphics processing, the code, when executed by at least one processor, causing the at least one processor to: obtain information about at least one of a set of draw calls or a set of global events, wherein the set of draw calls and the set of global events are associated with the graphics processing or compute processing; detect an invalid subset of draw calls and a valid subset of draw calls in the set of draw calls; store information for the invalid subset of draw calls and information for the set of global events in a first context slot of a set of context slots; store information for a valid draw call in the valid subset of draw calls, together with the stored information for the invalid subset of draw calls and the stored information for the set of global events, in the first context slot; and process a set of workloads for the valid draw call based on the stored information for the valid draw call, the stored information for the invalid subset of draw calls, and the stored information for the set of global events.