Spacetime resampling with decoupled shading and reuse

What is AI technical title?
AI technical title is built by PatSnap AI team. It summarizes the technical point description of the patent document.
By decoupled shading and reuse techniques, combined with spatiotemporal resampling of the reservoir, and optimizing the ray tracing process, the problem of high resource consumption in computer graphics rendering is solved, achieving efficient rendering and improved image quality.

CN115205093BActive Publication Date: 2026-06-12NVIDIA CORP

View PDF 2 Cites 0 Cited by

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NVIDIA CORP
Filing Date: 2022-04-01
Publication Date: 2026-06-12

Application Information

Patent Timeline

01 Apr 2022

Application

12 Jun 2026

Publication

CN115205093B

IPC: G06T1/20; G06T15/06; G06T15/80

CPC: G06T1/20; G06T15/80; G06T15/06; G06T15/005; G06T15/506; G06T15/205

AI Tagging

Application Domain

Processor architectures/configuration 3D-image rendering

Explore More Agents

Novelty Search
Search existing technologies and assess novelty
↗
FTO
Analyze whether a product may infringe others' patents
↗
Design FTO
Check prior-design risk for exterior design
↗
Drafting
Draft patent application text based on a technical solution
↗
Find Solutions with TRIZ
Generate feasible solution to solve your technical challenge
↗

Similar Technology Patents

Get free access to AI patent search and analysis

Check patentability, review prior art and ask IP Agent with full patent context.

AI Technical Summary

⚠Technical Problem

Existing computer graphics ray tracing technology consumes a lot of time, memory and computing resources when rendering virtual scenes, and it is difficult to efficiently handle a large number of light sources, resulting in low rendering efficiency.

⚗Method used

By employing decoupled shading and reuse techniques, the shading and reuse processes are separated through a reservoir-based spatiotemporal resampling (ReSTIR) rendering pipeline. Light samples are selected using Monte Carlo integration and importance sampling (RIS), and light samples from spatially and temporally neighboring pixels are combined to optimize ray tracing budget and visibility determination.

🎯Benefits of technology

It improves rendering efficiency and image quality, reduces the consumption of computing resources, reduces rendering bias, enhances image visibility and stability, and adapts to rendering needs under different hardware conditions.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN115205093B_ABST

Patent Text Reader

Abstract

Spatiotemporal resampling with decoupled shading and reuse is disclosed, specifically apparatuses, systems, and techniques for rendering computer graphics are disclosed. In at least one embodiment, a first one or more lights are selected from a plurality of lights in a virtual scene of a frame to be rendered into a graphic, and a second one or more lights are selected from a plurality of lights used to render one or more pixels in at least one of a previous frame or a current frame. Pixels of the current frame are rendered using the first one or more lights and the second one or more lights, and a light is selected from the first one or more lights and the second one or more lights for reuse in rendering a subsequent frame.

Need to check novelty before this filing date? Find Prior Art

Description

[0001] Claiming priority

[0002] This application claims the benefit of U.S. Provisional Application No. 63 / 170,832, filed April 5, 2021, entitled “Spatiotemporal resampling with decoupled coloring and reuse,” the entire contents of which are incorporated herein by reference. Technical Field

[0003] At least one embodiment relates to computer graphics. For example, at least one embodiment relates to a processor or computing system for rendering graphical images using the various novel techniques described herein. Background Technology

[0004] The handling of light in computer graphics can consume significant amounts of time, memory, processing power, and other computational resources. For example, techniques such as ray tracing can be memory- and computationally intensive. In many cases, the inclusion of numerous lights in a virtual scene can make efficient rendering difficult. Therefore, it is essential to improve the techniques used to handle light in computer graphics. Attached Figure Description

[0005] Figure 1 An example of a system employing spatiotemporal resampling with decoupled coloring and reuse, according to at least one embodiment, is shown;

[0006] Figure 2 An example of decoupled coloring and reuse according to at least one embodiment is shown;

[0007] Figure 3 The shading aspects of a decoupled shading flow according to at least one embodiment are shown;

[0008] Figure 4 Several aspects of a fusion kernel for performing decoupling and reuse, according to at least one embodiment, are illustrated;

[0009] Figure 5 A three-kernel variant for performing decoupling and reuse, according to at least one embodiment, is shown;

[0010] Figure 6 Additional examples of kernels for performing decoupling and reuse, according to at least one embodiment, are shown;

[0011] Figure 7 An example process for rendering a depiction of a virtual scene using decoupled reuse and shading according to at least one embodiment is shown;

[0012] Figure 8 An exemplary data center according to at least one embodiment is shown;

[0013] Figure 9 A processing system according to at least one embodiment is shown;

[0014] Figure 10 A computer system according to at least one embodiment is shown;

[0015] Figure 11 A system according to at least one embodiment is shown;

[0016] Figure 12 An exemplary integrated circuit according to at least one embodiment is shown;

[0017] Figure 13 A computing system according to at least one embodiment is shown;

[0018] Figure 14 An APU according to at least one embodiment is shown;

[0019] Figure 15 A CPU according to at least one embodiment is shown;

[0020] Figure 16 An exemplary accelerator integration slice according to at least one embodiment is shown;

[0021] Figures 17A-17B An exemplary graphics processor according to at least one embodiment is shown;

[0022] Figure 18A A graphics core according to at least one embodiment is shown;

[0023] Figure 18B A GPGPU according to at least one embodiment is shown;

[0024] Figure 19A A parallel processor according to at least one embodiment is shown;

[0025] Figure 19B A processing cluster according to at least one embodiment is shown;

[0026] Figure 19C A graphics multiprocessor according to at least one embodiment is shown;

[0027] Figure 20 A graphics processor according to at least one embodiment is shown;

[0028] Figure 21 A processor according to at least one embodiment is shown;

[0029] Figure 22 A processor according to at least one embodiment is shown;

[0030] Figure 23A graphics processor core according to at least one embodiment is shown;

[0031] Figure 24 A PPU according to at least one embodiment is shown;

[0032] Figure 25 A GPC according to at least one embodiment is shown;

[0033] Figure 26 A streaming multiprocessor according to at least one embodiment is illustrated;

[0034] Figure 27 A software stack of a programming platform according to at least one embodiment is shown;

[0035] Figure 28 The illustration shows an embodiment according to at least one of the embodiments. Figure 27 The CUDA implementation of the software stack;

[0036] Figure 29 The illustration shows an embodiment according to at least one of the embodiments. Figure 27 The ROCm implementation of the software stack;

[0037] Figure 30 The illustration shows an embodiment according to at least one of the embodiments. Figure 27 The OpenCL implementation of the software stack;

[0038] Figure 31 Software supported by a programming platform according to at least one embodiment is shown;

[0039] Figure 32 The illustration shows an embodiment of at least one of the following: Figure 27-30 Compiled code executed on the programming platform;

[0040] Figure 33 The illustration shows an embodiment of at least one of the following: Figure 27-30 More detailed compiled code executed on the programming platform;

[0041] Figure 34 This illustrates the transformation of source code before compilation, according to at least one embodiment;

[0042] Figure 35A A system configured to compile and execute CUDA source code using different types of processing units, according to at least one embodiment, is shown;

[0043] Figure 35B The diagram illustrates a configuration, according to at least one embodiment, to compile and execute using a CPU and a CUDA-enabled GPU. Figure 35A The system of CUDA source code;

[0044] Figure 35C The diagram illustrates a configuration, according to at least one embodiment, to compile and execute using a CPU and a GPU with CUDA disabled. Figure 35A The system of CUDA source code;

[0045] Figure 36 The diagram illustrates a method according to at least one embodiment. Figure 35C An example kernel converted by the CUDA to HIP conversion tool;

[0046] Figure 37 A more detailed description is provided according to at least one embodiment. Figure 35C GPUs without CUDA enabled; and

[0047] Figure 38 This illustrates how threads of an exemplary CUDA grid, according to at least one embodiment, are mapped to... Figure 37 Different computing units; and

[0048] Figure 39 This paper illustrates how to migrate existing CUDA code to data-parallel C++ code according to at least one embodiment. Detailed Implementation

[0049] In the example, the method for rendering computer graphics scenes incorporates reservoir-based lighting techniques, such as Reservoir-based Spatiotemporal Importance Resampling (ReSTIR), which uses techniques to decouple the shading and reuse portions of the ReSTIR rendering pipeline.

[0050] In at least one embodiment, the ReSTIR rendering pipeline uses Monte Carlo integration to render consecutive frames of computer graphics. These techniques may involve using an estimator based on importance sampling (“RIS”), sometimes described as resampling, or simply as resampling techniques. In at least one embodiment, ReSTIR expands the resampling by iteratively applying RIS. This iteration allows the reuse of samples used to render other pixels in both the spatial and temporal domains to improve the sample of the current pixel. Here, the spatial domain refers to pixels near or adjacent to the pixel being rendered. The temporal domain refers to the most recently rendered pixel in the scene, such as pixels rendered to generate previous frames of computer graphics.

[0051] The ReSTIR rendering pipeline efficiently generates multiple lighting samples for the pixels to be rendered. Due to the trade-off between efficiently selecting samples and finding the best possible sample, these samples may be of low quality. However, one of these samples is selected via RIS to aggregate the contributions of the candidate samples. Therefore, the RIS sample tends to have higher quality than the individual candidates.

[0052] In at least one embodiment, the selected per-pixel candidate is combined with light samples from previous frames and light samples used to render nearby pixels. This combination also uses RIS (Reference-Induced Lighting) to aggregate samples of the current pixel, samples aggregated in previous frames, and samples associated with nearby pixels. This typically results in a better sample than using only the candidate for the current pixel. It is presumed that spatially and temporally adjacent pixels have similar lighting. This provides additional context for the current lighting and provides one (or a small number) of light samples that can be used to shade the current pixel.

[0053] A non-decoupled ReSTIR pipeline can run as a whole, producing a single light (or a small number of lights) that is randomly the best sample for rendering a given pixel. This light sample can then be used to shading the pixel and also reused in future frames. However, shading and reuse have different objectives. For shading, the objective is to produce the best-looking image for the current frame. For reuse, the objective is to forward the most useful data to improve the results in future frames. Therefore, decoupling shading and reuse allows these objectives to be directed or optimized separately.

[0054] The objectives for colorization can include several different factors. These may include the need for visibility to obtain pixel shadows, maximizing the number of colorized samples, improving quality, and minimizing zero-value pixels in noisy images. Generally, more colorized samples are usually better than fewer. For colorization, quality is often more important than bias because pixel colors can be reused, preventing bias from compounding over time and exploding in subsequent frames. Furthermore, noisy images with fewer zero-value pixels are generally visually better than noisy images with randomly scattered zero-value pixels. Reducing zero-value pixels can also improve the functionality of denoising algorithms.

[0055] The objectives used for reuse can also include a variety of factors, including those different from those used for shading. For example, reusing visibility determination can be difficult to achieve efficiently, especially when bias needs to be avoided. However, many benefits from reuse can come from the overall sample average rather than the number of individual samples. For example, for the purpose of reusing samples, one light might be far better than 10. Bias is also more important in reuse than in shading. Bias can be fed forward in time, and thus can explode numerically over a small number of frames. This can be detrimental to image quality. Furthermore, there is no problem with reusing zero-value pixels. In the context of reuse, a zero-value pixel can simply indicate that there are no good samples to be reused and that different samples should be obtained.

[0056] In at least one embodiment, the decoupled ReSTIR pipeline improves the utilization of ray tracing. For illustration, first consider an example of a non-decoupled ReSTIR pipeline that evaluates two rays per pixel. The visibility of a selected candidate ray is tested, and another test is performed during the shading of the final ray sample for the current pixel. However, for a fast (biased) renderer, the second visibility sample is not spatially reused, as this could lead to bias explosion. Furthermore, the second ray has a reasonable chance (e.g., 1 / 3) of replicating the first shadow ray. Additionally, there may be visibility information already available from samples in the previous frame. This can be reused in the current frame. The final ray during shading will then have a high chance of replicating (e.g., 2 / 3). The decoupled ReSTIR pipeline can utilize these observations to improve efficiency.

[0057] For a given pixel, there may exist M c 8 to 32 candidate lights, M associated with previous frames T Individual light (typically one, but potentially more), and M associated with neighboring pixels. n There are 1 to 3 light rays (typically from 1 to 3). Therefore, in this example, there could be 10 to 36 samples. This is typically too many samples to color. However, after resampling, three types of rays can exist: a candidate ray, a temporal sample, and a spatial sample. In at least one embodiment, the decoupled ReSTIR uses three visibility queries: a candidate ray for RIS selection, a ray for reuse from a previous frame, and a ray for reuse from neighboring pixels. In at least one embodiment, some of these queries can reuse previous shadow rays to reduce the ray budget.

[0058] In at least one embodiment, the system implementing decoupled ReSTIR includes at least one processor and at least one memory. The memory includes instructions that cause the system to select one or more lights from a first group of lights associated with a virtual scene. The scene may contain many lights, possibly thousands or more. For a given pixel, in at least one embodiment, RIS is used to select one or more lights.

[0059] The memory of this example system also includes instructions for the system to select one or more lights from a second set. This set of lights is selected from those lights sampled to render pixels that are spatially or temporally adjacent to the current pixel. Here, spatial proximity refers to pixels near the current pixel, including, but not limited to, pixels adjacent to the current pixel. Temporal proximity refers to pixels from frames that are temporally close to the current frame (such as previous frames). In at least one embodiment of this example system, this set of lights includes lights selected based on their association with spatially adjacent pixels and another light selected based on its association with temporally adjacent pixels.

[0060] Then, by executing these instructions, the system selects a light from one or more of the first and second groups of lights for use in rendering one or more pixels in subsequent frames of the graphics. This light is provided as a candidate for use in subsequent frames. In at least one embodiment, in subsequent frames, the light is selected based on its association with pixels that are temporally or spatially neighboring.

[0061] Executing the instructions further causes the system to render the pixels of a frame of the graphic by determining the shading of at least one or more of the first group of lights and one or more of the second group of lights relative to the pixels. In at least one embodiment, three samples are used: one sample selected from a general candidate pool via RIS, one sample from spatially neighboring pixels, and one sample from temporally neighboring pixels. Note that, as described in the previous paragraph, the process of selecting samples for reuse is decoupled from the shading.

[0062] In at least one embodiment, the example system reuses one or more visibility determinations to balance the ray tracing budget. Here, visibility determination involves determining whether or how much light emitted from a light source can illuminate the pixel in question. Visibility determination can be used in conjunction with shading to determine how a pixel should be rendered based on which light contributes to its appearance. Thus, in at least one embodiment, the shading of a pixel is determined based on visibility determination. These can utilize considerable time and computational resources, but in some cases, decoupled ReSTIR can reuse visibility determinations. For example, some embodiments may reuse visibility determinations made for light used to render pixels in temporally adjacent frames (e.g., previous frames), or reuse visibility determinations made for light used to render pixels in spatially adjacent frames.

[0063] In at least one embodiment, the visibility determinations used for shading are adjusted based on the system's available computational capacity. When more capacity is available (e.g., when the system has relatively high-end hardware and a larger ray tracing budget), fewer determinations can be reused. Conversely, for lower-end hardware and a lower ray tracing budget, more visibility determinations can be reused.

[0064] For decoupled reuse determination, the visibility determination result can be stored for use in subsequent frames or for other pixels within the current frame. In at least one embodiment, the reservoir for storing illumination information is extended to include flags or other data indicating whether one or more lights associated with the reservoir are determined to be visible. The reservoir generally refers to a data structure that stores information representing at least one sampled light. In at least one embodiment, the reservoir contains information about one or more lights, the probability that the light was selected from the original light pool, and the number of samples arriving at the light.

[0065] For decoupled reuse determination, a stochastic process is used to select the light to be used to render pixels in subsequent frames of the graphics. A stochastic process is one that is at least partially based on randomness. An example of a stochastic process is RIS, as explained in more detail here. For decoupled reuse, one or more lights to be considered as candidates for reuse can be obtained by resampling from the lights associated with the virtual scene. Other lights can be drawn by resampling from the lights used to render pixels in previous frames, or by resampling from the lights associated with pixels adjacent to the pixel being rendered.

[0066] Figure 1 An example of a system employing decoupled shading and spatiotemporal resampling with reuse, according to at least one embodiment, is shown. In example 100, computing device 102 generates graphics output to drive a display on screen 108. In at least one embodiment, the graphics output includes a series of images constituting frames of a video, which are displayed to depict an animated virtual scene 120. The virtual scene 120 includes a simulated or computer-generated environment, such as a landscape, buildings, a playing field, or other area. The virtual scene 120 may sometimes be referred to as or include a virtual environment. A virtual environment may be associated with data structures, graphics assets, and other data that defines the content and structure of the virtual environment. For example, in at least one embodiment, the virtual scene is based on a virtual environment that includes a wireframe model of a landscape, various textures and objects residing within the scene, etc. The virtual environment may also include lights placed at various locations within the scene. To render frames of the virtual scene 120, the system determines how each pixel within a frame will be presented to the observer.

[0067] In Example 100, computing device 102 uses graphics pipeline 104 and graphics card 106 to generate graphics output. In at least one embodiment, graphics card 106 includes one or more processors, such as graphics processing units. In at least one embodiment, graphics pipeline 104 includes software, hardware, or a combination of software and hardware for generating graphics output. Graphics pipeline 104 can be configured according to a multi-stage process (e.g., including...) Figure 1 The process (stages 110-118 depicted within the graphics pipeline 104) generates graphical output. Note that although stages 110-118 are depicted as a sequence in Example 100, some embodiments may omit some of the depicted stages 110-118, perform some of the operations in a different order than depicted (such as in parallel), or otherwise... Figure 1 In addition to those described herein, there are other stages or operations.

[0068] The graphics pipeline 104 may include software, hardware, or a combination of both to implement a multi-stage process for converting application data into graphics data suitable for display on screen 108 (with or without certain post-pipeline steps). For example, the graphics pipeline 104 may generate video data frames, which may then be converted into signals to drive the display of the frame on screen 108. In at least one embodiment, these stages may include an application stage 110, a geometry stage 112, a transformation stage 114, a lighting and shading stage 116, and a rasterization and texturing stage 118.

[0069] One or more of stages 110-118 may utilize a light sampling algorithm to incorporate lighting effects into the rendering of the virtual scene. In at least one embodiment, the light sampling algorithm includes the ReSTIR algorithm, in which shading and reuse are decoupled.

[0070] In some cases, there may be a large number of such lights, which can present many challenges when rendering a depiction of a virtual scene 120. Handling numerous lights is a difficult problem in computer graphics, especially for ray tracing-based algorithms. For example, one approach to rendering a virtual scene would be to evaluate all light sources in the scene for each shading point. However, increasing the light count also increases the number and complexity of rays to be traced, and consequently, the rendering process's time, computational resources, and complexity.

[0071] In at least one embodiment, a subset of light is selected from a list 122 of all light in the virtual scene 120. This subset may be selected using a process that incorporates randomness, and the selected subset is stored in a portion of memory. From this subset, one or more general samples 124 of light from the virtual scene are selected and stored in a reservoir. In at least one embodiment, the reservoir includes one or more sampled light and information indicating further statistical properties of the one or more samples. In at least one embodiment, these properties include information indicating the relative importance of the light samples stored in the reservoir. For example, the reservoir may include information describing the illumination provided by the samples, the sum of weights, and the amount of light considered to obtain the samples.

[0072] A general sample 124 can be selected by performing a random sampling technique (such as RIS) on the light from the list of light 122. In at least one embodiment, random or stochastic sampling refers to using one or more random, pseudo-random, or quasi-random processes to select light from a set of light (such as the list of light 122 associated with the virtual scene 120) or from a subset of such light. In at least one embodiment, the random process selects light based on a probability proportional to the light intensity, such that brighter light is more likely to be selected than darker light.

[0073] In at least one embodiment, light refers to a virtual lighting source. In at least one embodiment, this may include a light source that emits or reflects light. Light may be associated with attributes including the position and intensity values of the light within the virtual scene. For example, light may be associated with x, y, z values indicating the position of the light within the virtual scene, and values indicating how bright the light is. Light may also be associated with additional attributes, such as parameters describing intensity, color, diffusion pattern, etc. As used herein, depending on the context, the term light may also refer to data describing virtual light, such as data representing the properties and parameters of virtual light.

[0074] In at least one embodiment, the graphics pipeline 104 uses general samples 124 to render pixels of a virtual scene. In at least one embodiment, the computing device 102 renders pixels of a graphics frame by identifying reservoirs of general samples 124 associated with pixels and performing shading or other operations based on light information stored in the reservoirs.

[0075] In at least one embodiment, the graphics pipeline 104 uses decoupled shading and reuses pixels rendering the virtual scene. In decoupled shading, pixels are rendered using general, temporal, and spatial samples 126, and new or reused visibility determinations may be incorporated, depending on factors that may include, but are not limited to, available computational capacity and desired image quality. Meanwhile, decoupled reuse may also rely on general, temporal, and spatial samples 126, but the samples used for reuse may be selected based on criteria independent of or separate from those used for shading.

[0076] Figure 2 An example of decoupled shading and reuse according to at least one embodiment is shown. As depicted in Example 200, the ReSTIR pipeline merge determines the decoupled shading flow from reuse.

[0077] In at least one embodiment, the process 216 for selecting samples for reuse is based on samples selected from general candidates 202, previous frame samples 208, and neighboring pixel samples 210. Three resampling passes 204, 212, and 214 may exist, one for each of the three types of candidates. These are general candidates 202, temporally neighboring samples 208, and spatially neighboring samples 210. In embodiments, spatial and temporal RIS passes may be excluded, particularly if only one spatial sample and one temporal sample are used. Another RIS pass may combine the three light types to select samples for reuse in future frames or by neighboring pixels.

[0078] Temporally neighboring samples 208 can be optionally selected via RIS-based resampling 212, and spatially neighboring samples 210 can be optionally selected via resampling 214, which can also be RIS-based. Among these three candidate regions, RIS-based resampling 206 is applied to select the final sample for reuse. The coloring process 218 is independent of the reuse process 216 because it does not depend on the samples selected for reuse 216.

[0079] Figure 3 The shading aspect of a decoupled shading flow according to at least one embodiment is illustrated. As depicted in Example 300, in this embodiment, with Figure 2 The coloring process 218, equivalent to coloring process 318, uses three visibility queries, as shown in elements 302, 304, and 306. The first visibility query 302 is performed for candidates selected from the general pool. The second visibility query 304 is performed for temporal samples, and the third visibility query 306 is performed for spatial samples.

[0080] In at least one embodiment, the visibility query 304 for time samples is reused. This can add bias by illuminating the shadows of rapidly moving geometry, but in many cases, the result of such bias is largely imperceptible.

[0081] In at least one embodiment, the visibility query 306 of the spatial sample is reused. This can improve efficiency, but should be used with caution. Always reusing the determination of the spatial sample can mitigate shadows and lead to a significant quality degradation. Furthermore, reusing the visibility determination only when neighbors are within a few pixels can mitigate shadows near the boundary.

[0082] In at least one embodiment, when visibility is reused, the visibility data is stored in the reservoir for transmission to future frames. For example... Figure 3 As shown, visibility information 308 from the coloring process 318 can be provided and stored together with the sample selected for reuse.

[0083] Figure 4 Several aspects of a fused kernel in a decoupled shader stream are illustrated according to at least one embodiment. A kernel (sometimes called a compute kernel) is a routine, function, or process executed by a processor (e.g., a GPU or PPU). A kernel may also sometimes be described as a compute shader. In at least one embodiment, the kernel is executed as a discrete unit by the GPU or PPU, typically in parallel with many other kernels.

[0084] As depicted in Example 400, certain portions of the decoupled ReSTIR pipeline can be executed within a single kernel 402. In at least one embodiment, adjacent samples are reused from a temporal buffer rather than from the current frame. Thus, the resampling 410 for reuse is drawn from general candidates 404, previous frame samples 406, and spatially adjacent previous frame samples 408 drawn from the same frame as the temporally adjacent samples. This avoids using global barriers to reuse samples drawn from the current frame and makes data dependencies only relevant to the last frame. Therefore, in at least one embodiment, fusion is aided by removing intra-frame dependencies (e.g., spatial samples in the current frame) and instead using spatially adjacent samples from previous frames.

[0085] Figure 5 A three-kernel variant for performing decoupling and reuse according to at least one embodiment is illustrated. In example 500, the decoupled ReSTIR rendering pipeline includes a first kernel 502 that performs candidate generation and selection via RIS, a second kernel 504 that tracks shadow rays, and a third kernel 506 that shades pixels and performs decoupling resampling.

[0086] Figure 6Additional examples of kernels for performing decoupled shading and reuse according to at least one embodiment are shown. In example 600, two potential strategies are depicted. In the first, a decoupled ReSTIR rendering pipeline is implemented using a single kernel 602. In the second, two kernels are used, wherein a first kernel 604 selects candidates from a general pool, and a second kernel 606 combines ray tracing, shading, and reuse computations.

[0087] Figure 7 An example process for rendering a virtual scene using decoupled reuse and shading according to at least one embodiment is shown. Although the example process 700 is depicted as a series of steps or operations, it will be understood that various embodiments of the depicted process may include modified or reordered steps or operations, or certain steps or operations may be omitted unless explicitly stated or logically required, such as when the output of one step or operation is used as the input of another step or operation.

[0088] An embodiment of Example Process 700 can be implemented by any suitable system, such as a personal computer, smartphone, tablet computer, system-on-a-chip (“SoC”), microprocessor, graphics card, graphics processing unit, parallel processing unit, etc.

[0089] At 702, in at least one embodiment, the system generates a geometry buffer or g-buffer. The geometry buffer includes a data structure that stores data about the geometry and materials of the scene. This may include information about the location, orientation, texture, color, specularity, albedo characteristics, diffuse characteristics, and reflective properties of surfaces within the scene.

[0090] At point 704, in at least one embodiment, the system collects initial samples from a list of lights associated with the virtual scene. The initial samples can be selected from a list of all lights in the scene using random, pseudo-random, or quasi-random techniques.

[0091] At 706, in at least one embodiment, the system computes a probability distribution function of light based at least in part on the potential contribution of each light to the rendering of a pixel. Thus, although a random process is used to select the light, the selection tends to favor those lights that have the greatest visual impact on the rendering of the pixel.

[0092] At point 708, in at least one embodiment, the system generates a reservoir by selecting one or more lights from an initial sample using a probability distribution function. The reservoir represents the statistical properties of the selected light.

[0093] At 710, in at least one embodiment, the system performs RIS-based resampling of light from a general candidate pool. In at least one embodiment, this is done from a subset of the general candidate pool.

[0094] At 712, in at least one embodiment, the system performs spatial resampling to select candidate lights for rendering pixels spatially adjacent to the current pixel. In at least one embodiment, sampling is performed from a previous frame or otherwise temporally adjacent frames to avoid intra-frame dependencies.

[0095] At 714, in at least one embodiment, the system performs temporal resampling to select candidate lights for rendering pixels in a previous frame or otherwise temporally adjacent frames.

[0096] At point 716, in at least one embodiment, the system selects samples to be reused from general samples, spatial samples, and temporal samples.

[0097] At point 718, in at least one embodiment, the system uses one or more of a general sample, a spatial sample, and a temporal sample to render the current pixel. In at least one embodiment, one of each type is used. Rendering a pixel may include: ray tracing each sample to determine its visibility to the selected pixel, or in some cases reusing previous visibility determinations.

[0098] In at least one embodiment, light from one or more pixels used to render the graphics in subsequent frames is stored at a lower frequency or pixel resolution than light from one or more pixels used to render the graphics in the current frame. For example, pixels intended for reuse in subsequent frames may be resampled and stored less frequently, or may be stored with less detail.

[0099] The embodiments described herein offer various advantages. First, they decouple shading from reuse. This allows shading of more light samples than is reused. As described in more detail below, this improves shading quality at little or no cost, aside from potential changes in the ray budget. Second, the embodiments shading one of each type of light sample, such as per-pixel candidates, temporal samples, and spatial samples. As computational power potentially increases in the future, the embodiments can use an increased number of shading samples, such as more spatial samples. It should be noted that the original ReSTIR merely shading the final samples that no longer have an identifiable type (e.g., candidate, temporal, or spatial). Third, decoupled shading allows visibility and shading to be computed in a variety of different ways. In one approach, three shadow rays per pixel (one for each sample type) are used. This provides high image quality despite using more rays than a non-decoupled ReSTIR. In another approach, the visibility of temporal samples is reused. This has the same ray count as a non-decoupled ReSTIR and also provides high image quality. In yet another approach, the visibility of temporal samples is always reused, and the visibility of spatial samples is sometimes reused. This can significantly degrade image quality, but it should be done in a controlled and parameterized manner. Fourth, visibility can be stored as a single bit in the reservoir, allowing visibility to be optionally reused in future frames, for example, reused for shading instead of resampling. Finally, the decoupling of ReSTIR steps can be combined into a reduced number of kernels, or a single kernel, to improve performance.

[0100] The embodiments disclosed herein can be used in a variety of applications, devices, and environments, including but not limited to those explicitly described herein. Embodiments of the techniques described herein can be used to render complex graphical scenes, such as those that can be generated in video games, special effects, computer animation, computer-aided design, etc.

[0101] In various embodiments, the techniques described herein are applied to non-graphics applications and problem spaces that share characteristics similar to rendering or ray tracing. For example, the embodiments described herein may be adapted for simulating the effects of acoustic emissions, radio emissions, or other similar situations involving multiple transmitters whose effects are to be simulated. In at least one embodiment, a reservoir of transmitters is created by sampling from a set of transmitters. The reservoir is used on a simulation basis to predict the effects of one or more emission impacts from the sampled transmitters on the surface. In subsequent units of the simulation, this reservoir is merged with another reservoir. Various aspects of this technique can be further understood from the example embodiments described herein with respect to graphics rendering techniques.

[0102] Given techniques such as ReSTIR or other sampling-based algorithms, embodiments using decoupled coloring and reuse can be further understood. In at least one embodiment, these techniques include converting RIS into a streaming algorithm using weighted reservoir sampling (“WRS”) with resampled importance sampling (“RIS”). In at least one embodiment, this includes updating the reservoir with sequentially generated candidates xi and corresponding weights, as shown in the following algorithm:

[0103] 1. foreach pixel q∈Image do

[0104] 2.|Image[q]←shadePixel(RIS(q),q)

[0105] 3. function RIS(q)

[0106] 4.|Reservoir r

[0107] 5.|for i←1 to M do

[0108] 6.||generate x i ~p

[0109] 7.||

[0110] 8.|

[0111] 9.|return r

[0112] 10.function shadePixel(Reservoir r,q)

[0113] 11.|return f q (ry)·rW

[0114] The algorithm generates candidates at each pixel q and uses the target probability distribution function. They are resampled. There can be correlations between the target probability distribution functions in neighboring pixels. For example, if shadowless lighting is used... Spatial proximity can lead to similar geometry and bidirectional scattering distribution function factors between adjacent pixels. In at least one embodiment, the correlation between “similar” pixels is utilized by generating and storing candidate samples and their weights for each pixel and reusing the computation performed at adjacent pixels by combining the candidates of each pixel with neighboring candidates using a second pass. Weight computation can occur during the first pass, and therefore this reuse of candidates from neighbors can be computationally cheaper than generating an equivalent number of new candidates.

[0115] However, this method may require storing data for each reused candidate. To avoid this problem, embodiments may employ a technique to combine multiple reservoirs without needing to access their input streams. In at least one embodiment, the reservoir state information includes the currently selected sample y and the weights w of all candidates seen so far. sum The sum of . To combine the two reservoirs, the y of each reservoir can be treated as having a weight w. sum Fresh samples are taken and fed as input to a new reservoir. This result can be similar to reservoir sampling already performed on a combined input stream of two reservoirs, but considering access based on the current state of each reservoir, the result can involve constant time and avoid storing or retrieving elements of any input stream. In at least one embodiment, the input streams of any number of reservoirs are combined in this manner as depicted in the following algorithm:

[0116] Input: Reservoirs r i to combine

[0117] Output:A combined reservoir equivalent to the concatenated inputstreams of(r1,...,r k )

[0118] 1.function combineReservoirs(q, r1, r2,..., r k )

[0119] 2.|Reservoir s

[0120] 3. | for each r∈{r1, ..., r2} k}do

[0121] 4.||

[0122] 5.|sM←r1.M+r2.M+…r k .M

[0123] 6.|

[0124] 7.|return s

[0125] The algorithm described above illustrates the combination of input streams from k reservoirs. This is to address the input streams from different target distributions. The samples of the next neighboring pixels q′ that were then resampled were adjusted using a factor. The samples are reweighted to account for regions that may be oversampled or undersampled in neighboring pixels compared to the current pixel. The resulting item... It can be written more concisely as

[0126] To perform spatial reuse, the embodiment can use RIS(q) to generate M candidates for each pixel q, and store the resulting reservoir in a buffer the size of the image. Each pixel can then select k of its neighbors and combine its reservoir with itself using a reservoir combination algorithm (such as one described above). The cost per pixel can be O(k+M), but each pixel effectively sees k·M candidates. This spatial reuse can be repeated using the output of the previous reuse pass as input. Performing n iterations requires O(nk+M) computations, but effectively produces k candidates per pixel. n There are M candidates, assuming that different neighboring pixels are used in each step.

[0127] To perform temporal reuse, note that images can be rendered as part of an animation sequence. In this case, previous or subsequent frames can provide additional candidates for reuse. After rendering a frame, the implementation can store a final reservoir of each pixel for reuse in the next frame. If frames are rendered sequentially and their reservoirs are fed forward, a frame can combine candidates not only with those of previous frames, but also with all or many previous frames in the sequence, which can improve image quality.

[0128] Another possibility involves using only visible samples. Even with an infinite number of candidates, RIS cannot achieve noise-free rendering. While the distribution of samples approximates the target PDF as M increases. but The sampling of the integral f is not perfect. In fact, Typically set to a path contribution with no shadows, this means that as M increases, noise due to visibility may begin to dominate. Visibility noise can be severe in large scenes. To address this issue, embodiments may also perform visibility reuse. Before performing spatial or temporal reuse, embodiments may evaluate the visibility of selected samples y of the reservoir for each pixel. If y is occluded, the reservoir can be discarded. This means that occluded samples may not propagate to neighboring pixels, and if visibility is locally coherent, the final sample generated by spatial resampling is likely to be unoccluded.

[0129] In at least one embodiment, the algorithm first generates and resamples M independent per-pixel light candidates. Visibility testing can be performed on the samples from this step, and occluded samples can be discarded. The embodiment can then combine the selected samples from the reservoir of each pixel with the output of the previous frame determined using backprojection. The embodiment can perform n rounds of spatial reuse to utilize information from the pixels' neighbors. The embodiment can then colorize the image and forward the final reservoir to the next frame. This method is described in pseudocode as follows:

[0130] Input:Image-sized buffer contain the previous frame's reservoirs

[0131] Output:The current frame's reservoirs

[0132] 1.function reservoirReuse(prevFrameReservoirs)

[0133] 2.|reservoirs←new Array[ImageSize]

[0134] 3.| / / generate initial candidates

[0135] 4. | for each pixel q∈Image do

[0136] 5.||reservoirs[q]←RIS(q)

[0137] 6.| / / evaluate visibility for initial candidates

[0138] 7. | for each pixel q∈Image do

[0139] 8.||if shadowed(reservoirs[q].y)then

[0140] 9.|||reservoirs[q].W←0

[0141] 10. | / / temporal reuse

[0142] 11.|foreach pixel q∈Image do

[0143] 12.||q′←pickTemporalNeighbor(q)

[0144] 13.||reservoirs[q]←

[0145] 14.||combineReservoirs(q,reservoirs[q],prevFrameReservoirs[q′])

[0146] 15. | / / spatial reuse

[0147] 16. | for iteration i ← 1tondo

[0148] 17.||foreach pixel q∈Image do

[0149] 18.||Q←pickSpatialNeighbors(q)

[0150] 19.||

[0151] 20.||

[0152] 21.| / / compute pixel color

[0153] 22.|foreach pixel q∈Image do

[0154] 23.||Image[q]←shadePixel(reservoirs[q], q)

[0155] 24. | return reservoirs

[0156] In the following description, numerous specific details are set forth in order to provide a more thorough understanding of at least one embodiment. However, it will be apparent to those skilled in the art that the inventive concept can be practiced without one or more of these specific details.

[0157] Data Center

[0158] Figure 8 An example data center 800 according to at least one embodiment is shown. In at least one embodiment, the data center 800 includes, but is not limited to, a data center infrastructure layer 810, a framework layer 820, a software layer 830, and an application layer 840.

[0159] In at least one embodiment, such as Figure 8As shown, the data center infrastructure layer 810 may include a resource coordinator 812, grouped computing resources 814, and node computing resources (“nodes CR”) 816(1)-816(N), where “N” represents any complete positive integer. In at least one embodiment, nodes CR 816(1)-816(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field-programmable gate arrays (“FPGAs”), data processing units (DPUs) in network devices, graphics processors, etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid-state drives or disk drives), network input / output (“NW I / O”) devices, network switches, virtual machines (“VMs”), power modules, and cooling modules, etc. In at least one embodiment, one or more nodes CR 816(1)-816(N) may be servers having one or more of the aforementioned computing resources.

[0160] In at least one embodiment, the grouped computing resources 814 may include individual groups (not shown) of node CRs housed in one or more racks, or a plurality of racks (also not shown) housed in data centers in various geographical locations. The individual groups of node CRs within the grouped computing resources 814 may include computing, networking, memory, or storage resources that can be configured or allocated to support groups of one or more workloads. In at least one embodiment, several node CRs, including CPUs or processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, the one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.

[0161] In at least one embodiment, resource coordinator 812 may configure or otherwise control one or more nodes CR816(1)-816(N) and / or grouped computing resources 814. In at least one embodiment, resource coordinator 812 may include a software design infrastructure (“SDI”) management entity for data center 800. In at least one embodiment, resource coordinator 812 may include hardware, software, or some combination thereof.

[0162] In at least one embodiment, such as Figure 8As shown, the framework layer 820 includes, but is not limited to, a job scheduler 832, a configuration manager 834, a resource manager 836, and a distributed file system 838. In at least one embodiment, the framework layer 820 may include a framework of software 852 supporting the software layer 830 and / or one or more applications 842 of the application layer 840. In at least one embodiment, the software 852 or application 842 may respectively include web-based service software or applications, such as services or applications provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, the framework layer 820 may be, but is not limited to, a free and open-source software web application framework, such as Apache Spark, which can utilize the distributed file system 838 for large-scale data processing (e.g., "big data"). TM (Hereinafter referred to as "Spark"). In at least one embodiment, the job scheduler 832 may include a Spark driver to facilitate the scheduling of workloads supported by various layers of the data center 800. In at least one embodiment, the configuration manager 834 may be able to configure different layers, such as the software layer 830 and the framework layer 820, which includes Spark and a distributed file system 838 for supporting large-scale data processing. In at least one embodiment, the resource manager 836 is able to manage cluster or group computing resources mapped to or allocated to support the distributed file system 838 and the job scheduler 832. In at least one embodiment, the cluster or group computing resources may include grouped computing resources 814 on the data center infrastructure layer 810. In at least one embodiment, the resource manager 836 may coordinate with the resource coordinator 812 to manage these mapped or allocated computing resources.

[0163] In at least one embodiment, the software 852 included in the software layer 830 may include software used by at least a portion of nodes CR816(1)-816(N), grouped computing resources 814, and / or the distributed file system 838 of the framework layer 820. One or more types of software may include, but are not limited to, Internet web page search software, email virus scanning software, database software, and streaming video content software.

[0164] In at least one embodiment, one or more applications 842 included in the application layer 840 may include one or more types of applications used by at least a portion of nodes CR816(1)-816(N), grouped computing resources 814, and / or the distributed file system 838 of the framework layer 820. One or more types of applications may include, but are not limited to, CUDA applications.

[0165] In at least one embodiment, any of the configuration manager 834, resource manager 836, and resource coordinator 812 can implement any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, self-modification actions can mitigate potentially poor configuration decisions by data center operators of data center 800 and can prevent underutilization and / or poor performance of the data center.

[0166] Computer-based systems

[0167] The following figures present, but are not limited to, exemplary computer-based systems that can be used to implement at least one embodiment.

[0168] Figure 9 A processing system 900 according to at least one embodiment is illustrated. In at least one embodiment, the system 900 includes one or more processors 902 and one or more graphics processors 908, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 902 or processor cores 907. In at least one embodiment, the processing system 900 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.

[0169] In at least one embodiment, the processing system 900 may include or be integrated into a server-based gaming platform, including a game console, mobile game console, handheld game console, or online game console, which are game and media consoles. In at least one embodiment, the processing system 900 is a mobile phone, smartphone, tablet computing device, or mobile internet device. In at least one embodiment, the processing system 900 may also include components coupled to or integrated into a wearable device, such as a smartwatch, smart glasses, augmented reality, or virtual reality device. In at least one embodiment, the processing system 900 is a television or set-top box device having one or more processors 902 and a graphical interface generated by one or more graphics processors 908.

[0170] In at least one embodiment, one or more processors 902 each include one or more processor cores 907 to process instructions that, when executed, perform operations against the system and user software. In at least one embodiment, each of the one or more processor cores 907 is configured to process a particular instruction set 909. In at least one embodiment, the instruction set 909 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computation via Very Long Instruction Word (VLIW). In at least one embodiment, the plurality of processor cores 907 may each process a different instruction set 909, which may include instructions that facilitate the emulation of other instruction sets. In at least one embodiment, the processor cores 907 may also include other processing devices, such as digital signal processors (DSPs).

[0171] In at least one embodiment, processor 902 includes cache memory 904. In at least one embodiment, processor 902 may have a single internal cache or multiple levels of internal caches. In at least one embodiment, the cache memory is shared among various components of processor 902. In at least one embodiment, processor 902 also uses an external cache (e.g., a Level 3 (L3) cache or a last-level cache (LLC)) (not shown), which can be shared among processor cores 907 using known cache coherence techniques. In at least one embodiment, processor 902 further includes a register file 906, which may include different types of registers (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers) for storing different types of data. In at least one embodiment, register file 906 may include general-purpose registers or other registers.

[0172] In at least one embodiment, one or more processors 902 are coupled to one or more interface buses 910 to transmit communication signals, such as address, data, or control signals, between the processor 902 and other components in the system 900. In at least one embodiment, the interface bus 910 may be a processor bus, such as a version of the Direct Media Interface (DMI) bus. In at least one embodiment, the interface bus 910 is not limited to the DMI bus and may include one or more peripheral component interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. In at least one embodiment, the processor 902 includes an integrated memory controller 916 and a platform controller hub 930. In at least one embodiment, the memory controller 916 facilitates communication between storage devices and other components of the processing system 900, while the platform controller hub (PCH) 930 provides connectivity to input / output (I / O) devices via a local I / O bus.

[0173] In at least one embodiment, storage device 920 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or a device with suitable performance for use as processor memory. In at least one embodiment, storage device 920 may be used as system memory of processing system 900 to store data 922 and instructions 921 for use when one or more processors 902 execute an application or process. In at least one embodiment, memory controller 916 is also coupled to an optional external graphics processor 912, which may communicate with one or more graphics processors 908 of processor 902 to perform graph and media operations. In at least one embodiment, display device 911 may be connected to processor 902. In at least one embodiment, display device 911 may include one or more internal display devices, such as those in mobile electronic devices or portable computer devices, or external display devices connected via a display interface (e.g., DisplayPort). In at least one embodiment, display device 911 may include a head-mounted display (HMD), such as a stereoscopic display device for virtual reality (VR) or augmented reality (AR) applications.

[0174] In at least one embodiment, the platform controller hub 930 enables peripheral devices to connect to the storage device 920 and the processor 902 via a high-speed I / O bus. In at least one embodiment, the I / O peripheral devices include, but are not limited to, an audio controller 946, a network controller 934, a firmware interface 928, a wireless transceiver 926, a touch sensor 925, and a data storage device 924 (e.g., a hard disk drive, flash memory, etc.). In at least one embodiment, the data storage device 924 may be connected via a memory interface (e.g., SATA) or via a peripheral bus, such as a peripheral component interconnect bus (e.g., PCI, PCIe). In at least one embodiment, the touch sensor 925 may include a touchscreen sensor, a pressure sensor, or a fingerprint sensor. In at least one embodiment, the wireless transceiver 926 may be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver, such as a 3G, 4G, or LTE transceiver. In at least one embodiment, the firmware interface 928 enables communication with the system firmware and may be, for example, a Unified Extensible Firmware Interface (UEFI). In at least one embodiment, the network controller 934 may enable network connectivity to a wired network. In at least one embodiment, a high-performance network controller (not shown) is coupled to an interface bus 910. In at least one embodiment, the audio controller 946 is a multi-channel high-definition audio controller. In at least one embodiment, the processing system 900 includes an optional legacy I / O controller 940 for coupling legacy (e.g., Personal System 2 (PS / 2)) devices to the processing system 900. In at least one embodiment, the platform controller hub 930 may also be connected to one or more Universal Serial Bus (USB) controllers 942 that connect input devices, such as a keyboard and mouse combination 943, a camera 944, or other USB input devices.

[0175] In at least one embodiment, instances of the memory controller 916 and platform controller hub 930 may be integrated into a discrete external graphics processor, such as external graphics processor 912. In at least one embodiment, the platform controller hub 930 and / or the memory controller 916 may be external to one or more processors 902. For example, in at least one embodiment, the processing system 900 may include an external memory controller 916 and a platform controller hub 930, which may be configured as a memory controller hub and a peripheral controller hub in a system chipset communicating with the processor 902.

[0176] Figure 10A computer system 1000 according to at least one embodiment is illustrated. In at least one embodiment, the computer system 1000 may be a system having interconnected devices and components, a System-on-a-Chip (SoC), or some combination thereof. In at least one embodiment, the computer system 1000 is formed by a processor 1002, which may include execution units for executing instructions. In at least one embodiment, the computer system 1000 may include, but is not limited to, components such as the processor 1002, which employs execution units including logic to execute algorithms for process data. In at least one embodiment, the computer system 1000 may include a processor, such as one available from Intel Corporation of Santa Clara, California. Processor family, Xeon™ XScale™ and / or StrongARM™ Core TM or A microprocessor may be used, although other systems (including PCs, engineering workstations, set-top boxes, etc.) with other microprocessors may also be used. In at least one embodiment, computer system 1000 may execute a version of the Windows operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software, and / or graphical user interfaces may also be used.

[0177] In at least one embodiment, the computer system 1000 can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol (IP) devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor (“DSP”), a system-on-a-chip (SoC), a network computer (“NetPC”), a set-top box, a network hub, a wide area network (“WAN”) switch, or any other system capable of executing one or more instructions according to at least one embodiment.

[0178] In at least one embodiment, the computer system 1000 may include, but is not limited to, a processor 1002, which may include, but is not limited to, one or more execution units 1008 configured to execute a Computational Unified Device Architecture (“CUDA”). (Developed by NVIDIA Corporation, Santa Clara, California) In at least one embodiment, the CUDA program is at least a part of a software application written in the CUDA programming language. In at least one embodiment, the computer system 1000 is a single-processor desktop or server system. In at least one embodiment, the computer system 1000 may be a multiprocessor system. In at least one embodiment, the processor 1002 may include, but is not limited to, a CISC microprocessor, a RISC microprocessor, a VLIW microprocessor, a processor implementing instruction set combinations, or any other processor device, such as a digital signal processor. In at least one embodiment, the processor 1002 may be coupled to a processor bus 1010, which can transmit data signals between the processor 1002 and other components in the computer system 1000.

[0179] In at least one embodiment, processor 1002 may include, but is not limited to, a Level 1 (“L1”) internal cache memory (“cache”) 1004. In at least one embodiment, processor 1002 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory may reside external to processor 1002. In at least one embodiment, processor 1002 may include a combination of internal and external caches. In at least one embodiment, register file 1006 may store different types of data in various registers, including but not limited to integer registers, floating-point registers, status registers, and instruction pointer registers.

[0180] In at least one embodiment, an execution unit 1008, including but not limited to logic for performing integer and floating-point operations, is also located within the processor 1002. The processor 1002 may also include a microcode (“ucode”) read-only memory (“ROM”) for storing microcode of certain macro instructions. In at least one embodiment, the execution unit 1008 may include logic for processing a packaged instruction set 1009. In at least one embodiment, by including the packaged instruction set 1009 in the instruction set of the general-purpose processor 1002, along with the associated circuitry for executing the instructions, packaged data in the general-purpose processor 1002 can be used to perform operations used by numerous multimedia applications. In at least one embodiment, many multimedia applications can be executed more quickly and efficiently by using the full width of the processor’s data bus to perform operations on the packaged data, which may eliminate the need to transfer smaller data units on the processor’s data bus to perform one or more operations on a data element at a time.

[0181] In at least one embodiment, the execution unit 1008 may also be used in a microcontroller, embedded processor, graphics device, DSP, and other types of logic circuitry. In at least one embodiment, the computer system 1000 may include, but is not limited to, the memory 1020. In at least one embodiment, the memory 1020 may be implemented as a DRAM device, SRAM device, flash memory device, or other storage device. The memory 1020 may store instructions 1019 and / or data 1021 represented by data signals that can be executed by the processor 1002.

[0182] In at least one embodiment, the system logic chip may be coupled to the processor bus 1010 and the memory 1020. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub (“MCH”) 1016, and the processor 1002 may communicate with the MCH 1016 via the processor bus 1010. In at least one embodiment, the MCH 1016 may provide a high-bandwidth memory path 1018 to the memory 1020 for instruction and data storage, as well as for storage of graphics commands, data, and textures. In at least one embodiment, the MCH 1016 may initiate data signals between the processor 1002, the memory 1020, and other components in the computer system 1000, and bridge data signals between the processor bus 1010, the memory 1020, and the system I / O 1022. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, MCH 1016 can be coupled to memory 1020 via high-bandwidth memory path 1018, and graphics / video card 1012 can be coupled to MCH 1016 via Accelerated Graphics Port (“AGP”) interconnect 1014.

[0183] In at least one embodiment, the computer system 1000 may use system I / O 1022 as a proprietary hub interface bus to couple MCH 1016 to I / O controller hub (“ICH”) 1030. In at least one embodiment, ICH 1030 may provide direct connectivity to certain I / O devices via a local I / O bus. In at least one embodiment, the local I / O bus may include, but is not limited to, a high-speed I / O bus for connecting peripheral devices to memory 1020, chipset, and processor 1002. Examples may include, but are not limited to, an audio controller 1029, a firmware hub (“Flash BIOS”) 1028, a wireless transceiver 1026, data storage 1024, a conventional I / O controller 1023 and keyboard interface including user input 1025, a serial expansion port 1027 (e.g., USB), and a network controller 1034. Data storage 1024 may include a hard disk drive, floppy disk drive, CD-ROM device, flash memory device, or other mass storage device.

[0184] In at least one embodiment, Figure 10 A system comprising interconnected hardware devices or "chips" is shown. In at least one embodiment, Figure 10 An exemplary SoC can be shown. In at least one embodiment, Figure 10 The devices shown can be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of system 1000 are interconnected using a compute fast link (CXL) interconnect.

[0185] Figure 11 A system 1100 according to at least one embodiment is illustrated. In at least one embodiment, system 1100 is an electronic device utilizing processor 1110. In at least one embodiment, system 1100 may be, for example, but not limited to, a laptop computer, tower server, rack server, blade server, edge device communicatively coupled to one or more local or cloud service providers, laptop computer, desktop computer, tablet computer, mobile device, telephone, embedded computer, or any other suitable electronic device.

[0186] In at least one embodiment, system 1100 may include, but is not limited to, processor 1110 communicatively coupled to any suitable number or type of components, peripherals, modules, or devices. In at least one embodiment, processor 1110 is coupled using a bus or interface, such as I... 2C-bus, System Management Bus (“SMBus”), Low Pin Count (LPC) bus, Serial Peripheral Interface (“SPI”), High Definition Audio (“HDA”) bus, Serial Advanced Technology Accessory (“SATA”) bus, USB (versions 1, 2, and 3) or Universal Asynchronous Receiver / Transmitter (“UART”) bus. In at least one embodiment, Figure 11 A system is illustrated, comprising interconnected hardware devices or "chips". In at least one embodiment, Figure 11 An exemplary SoC can be shown. In at least one embodiment, Figure 11 The device shown can be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, Figure 11 One or more components are interconnected using Computational Fast Link (CXL) interconnects.

[0187] In at least one embodiment, Figure 11 This may include a display 1124, a touchscreen 1125, a touchpad 1130, a near-field communication unit (“NFC”) 1145, a sensor hub 1140, a thermal sensor 1146, a fast chipset (“EC”) 1135, a trusted platform module (“TPM”) 1138, a BIOS / firmware / flash memory (“BIOS, FW Flash”) 1122, a DSP 1160, a solid-state drive (“SSD”) or hard disk drive (“HDD”) 1120, a wireless local area network unit (“WLAN”) 1150, a Bluetooth unit 1152, a wireless wide area network unit (“WWAN”) 1156, a global positioning system (GPS) 1155, a camera (“USB 3.0 camera”) 1154 (e.g., a USB 3.0 camera), or a low-power double data rate (“LPDDR”) memory unit (“LPDDR3”) 1115 implemented in, for example, the LPDDR3 standard. These components may each be implemented in any suitable manner.

[0188] In at least one embodiment, other components may be communicatively coupled to processor 1110 via the components discussed above. In at least one embodiment, accelerometer 1141, ambient light sensor (“ALS”) 1142, compass 1143, and gyroscope 1144 may be communicatively coupled to sensor hub 1140. In at least one embodiment, thermal sensor 1139, fan 1137, keyboard 1136, and touchpad 1130 may be communicatively coupled to EC 1135. In at least one embodiment, speaker 1163, earphone 1164, and microphone (“mic”) 1165 may be communicatively coupled to audio unit (“audio codec and Class D amplifier”) 1162, which in turn may be communicatively coupled to DSP 1160. In at least one embodiment, audio unit 1162 may include, for example, but not limited to, audio encoder / decoder (“codec”) and Class D amplifier. In at least one embodiment, SIM card (“SIM”) 1157 may be communicatively coupled to WWAN unit 1156. In at least one embodiment, components such as WLAN unit 1150, Bluetooth unit 1152, and WWAN unit 1156 can be implemented as next-generation form factor (NGFF).

[0189] Figure 12 An exemplary integrated circuit 1200 according to at least one embodiment is illustrated. In at least one embodiment, the exemplary integrated circuit 1200 is a SoC (System-on-a-Chip) that can be fabricated using one or more IP cores. In at least one embodiment, the integrated circuit 1200 includes one or more application processors 1205 (e.g., CPU, DPU), at least one graphics processor 1210, and may additionally include an image processor 1215 and / or a video processor 1220, any of which may be a modular IP core. In at least one embodiment, the integrated circuit 1200 includes peripheral or bus logic, which includes a USB controller 1225, a UART controller 1230, an SPI / SDIO controller 1235, and an I... 2 S / I 2 C controller 1240. In at least one embodiment, integrated circuit 1200 may include display device 1245 coupled to one or more of high-definition multimedia interface (HDMI) controller 1250 and mobile industrial processor interface (MIPI) display interface 1255. In at least one embodiment, storage may be provided by flash memory subsystem 1260, including flash memory and flash memory controller. In at least one embodiment, a memory interface may be provided via memory controller 1265 for accessing SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits also include embedded security engine 1270.

[0190] Figure 13A computing system 1300 according to at least one embodiment is illustrated. In at least one embodiment, the computing system 1300 includes a processing subsystem 1301 having one or more processors 1302 and a system memory 1304 communicating via an interconnect path that may include a memory hub 1305. In at least one embodiment, the memory hub 1305 may be a separate component within a chipset assembly or may be integrated within one or more processors 1302. In at least one embodiment, the memory hub 1305 is coupled to an I / O subsystem 1311 via a communication link 1306. In at least one embodiment, the I / O subsystem 1311 includes an I / O hub 1307 that enables the computing system 1300 to receive input from one or more input devices 1308. In at least one embodiment, the I / O hub 1307 may enable a display controller, included in one or more processors 1302, for providing output to one or more display devices 1310A. In at least one embodiment, one or more display devices 1310A coupled to the I / O hub 1307 may include local, internal, or embedded display devices.

[0191] In at least one embodiment, the processing subsystem 1301 includes one or more parallel processors 1312 coupled to the memory hub 1305 via a bus or other communication link 1313. In at least one embodiment, the communication link 1313 may be one of many standards-based communication link technologies or protocols, such as, but not limited to, PCIe, or may be a vendor-specific communication interface or communication architecture. In at least one embodiment, the one or more parallel processors 1312 form a computationally concentrated parallel or vector processing system that may include a large number of processing cores and / or processing clusters, such as a multi-core integrated (MIC) processor. In at least one embodiment, the one or more parallel processors 1312 form a graphics processing subsystem capable of outputting pixels to one or more display devices 1310A coupled via an I / O hub 1307. In at least one embodiment, the one or more parallel processors 1312 may also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 1310B.

[0192] In at least one embodiment, system storage unit 1314 may be connected to I / O hub 1307 to provide a storage mechanism for computing system 1300. In at least one embodiment, I / O switch 1316 may be used to provide an interface mechanism to enable connectivity between I / O hub 1307 and other components, such as network adapter 1318 and / or wireless network adapter 1319 that may be integrated into the platform, and various other devices that may be added via one or more additional devices 1320. In at least one embodiment, network adapter 1318 may be an Ethernet adapter or another wired network adapter. In at least one embodiment, wireless network adapter 1319 may include one or more Wi-Fi, Bluetooth, NFC, or other network devices comprising one or more radios.

[0193] In at least one embodiment, the computing system 1300 may include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, etc., and may also be connected to the I / O hub 1307. In at least one embodiment, for Figure 13 The communication paths that interconnect the various components can be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect) based protocols (e.g., PCIe), or other bus or point-to-point communication interfaces and / or protocols (e.g., NVLink high-speed interconnect or interconnect protocols).

[0194] In at least one embodiment, one or more parallel processors 1312 include circuitry optimized for graphics and video processing (including, for example, video output circuitry) and constitute a graphics processing unit (GPU). In at least one embodiment, one or more parallel processors 1312 include circuitry optimized for general-purpose processing. In at least one embodiment, components of the computing system 1300 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, one or more parallel processors 1312, memory hub 1305, processor 1302, and I / O hub 1307 may be integrated into a system-on-a-chip (SoC) integrated circuit. In at least one embodiment, components of the computing system 1300 may be integrated into a single package to form a system-in-package (SIP) configuration. In at least one embodiment, at least a portion of the components of the computing system 1300 may be integrated into a multi-chip module (MCM) that can interconnect with other MCMs to a modular computing system. In at least one embodiment, the I / O subsystem 1311 and display device 1310B are omitted from the computing system 1300.

[0195] Processing system

[0196] The following figures illustrate, but are not limited to, exemplary processing systems that can be used to implement at least one embodiment.

[0197] Figure 14 An accelerated processing unit (“APU”) 1400 according to at least one embodiment is illustrated. In at least one embodiment, the APU 1400 was developed by AMD Inc. of Santa Clara, California. In at least one embodiment, the APU 1400 can be configured to execute applications, such as CUDA programs. In at least one embodiment, the APU 1400 includes, but is not limited to, a core complex 1410, a graphics complex 1440, an architecture 1460, an I / O interface 1470, a memory controller 1480, a display controller 1492, and a multimedia engine 1494. In at least one embodiment, the APU 1400 can be, but is not limited to, any combination of any number of core complexes 1410, any number of graphics complexes 1440, any number of display controllers 1492, and any number of multimedia engines 1494. For illustrative purposes, multiple instances of similar objects are indicated herein by reference numerals, wherein the reference numerals identify the object, and the numbers in parentheses identify the desired instances.

[0198] In at least one embodiment, the core complex 1410 is a CPU, the graphics complex 1440 is a GPU, and the APU 1400 is a processing unit that integrates, but is not limited to, the core complex 1410 and the graphics complex 1440 onto a single chip. In at least one embodiment, some tasks may be assigned to the core complex 1410, while other tasks may be assigned to the graphics complex 1440. In at least one embodiment, the core complex 1410 is configured to execute main control software associated with the APU 1400, such as an operating system. In at least one embodiment, the core complex 1410 is the main processor of the APU 1400, which controls and coordinates the operation of other processors. In at least one embodiment, the core complex 1410 issues commands to control the operation of the graphics complex 1440. In at least one embodiment, the core complex 1410 may be configured to execute host executable code derived from CUDA source code, and the graphics complex 1440 may be configured to execute device executable code derived from CUDA source code.

[0199] In at least one embodiment, the core complex 1410 includes, but is not limited to, cores 1420(1)-1420(4) and L3 cache 1430. In at least one embodiment, the core complex 1410 may include, but is not limited to, any combination of any number of cores 1420 and any number and type of cache. In at least one embodiment, the cores 1420 are configured to execute instructions of a specific instruction set architecture (“ISA”). In at least one embodiment, each core 1420 is a CPU core.

[0200] In at least one embodiment, each core 1420 includes, but is not limited to, a fetch / decode unit 1422, an integer execution engine 1424, a floating-point execution engine 1426, and an L2 cache 1428. In at least one embodiment, the fetch / decode unit 1422 fetches instructions, decodes these instructions, generates micro-operations, and dispatches individual micro-instructions to the integer execution engine 1424 and the floating-point execution engine 1426. In at least one embodiment, the fetch / decode unit 1422 may simultaneously dispatch one micro-instruction to the integer execution engine 1424 and another micro-instruction to the floating-point execution engine 1426. In at least one embodiment, the integer execution engine 1424 performs operations not limited to integer and memory operations. In at least one embodiment, the floating-point engine 1426 performs operations not limited to floating-point and vector operations. In at least one embodiment, the fetch-decode unit 1422 dispatches micro-instructions to a single execution engine, which replaces both the integer execution engine 1424 and the floating-point execution engine 1426.

[0201] In at least one embodiment, each core 1420(i) can access an L2 cache 1428(i) included in core 1420(i), where i is an integer representing a specific instance of core 1420. In at least one embodiment, each core 1420 included in core complex 1410(j) is connected to other cores 1420 included in core complex 1410(j) via an L3 cache 1430(j) included in core complex 1410(j), where j is an integer representing a specific instance of core complex 1410. In at least one embodiment, a core 1420 included in core complex 1410(j) can access all L3 caches 1430(j) included in core complex 1410(j), where j is an integer representing a specific instance of core complex 1410. In at least one embodiment, the L3 cache 1430 may include, but is not limited to, any number of slices.

[0202] In at least one embodiment, the graphics complex 1440 can be configured to perform computational operations in a highly parallel manner. In at least one embodiment, the graphics complex 1440 is configured to perform graphics pipeline operations, such as drawing commands, pixel operations, geometric calculations, and other operations associated with rendering an image to a display. In at least one embodiment, the graphics complex 1440 is configured to perform graphics-independent operations. In at least one embodiment, the graphics complex 1440 is configured to perform both graphics-related and graphics-independent operations.

[0203] In at least one embodiment, the graphics complex 1440 includes, but is not limited to, any number of computing units 1450 and an L2 cache 1442. In at least one embodiment, the computing units 1450 share the L2 cache 1442. In at least one embodiment, the L2 cache 1442 is partitioned. In at least one embodiment, the graphics complex 1440 includes, but is not limited to, any number of computing units 1450 and any number (including zero) and type of cache. In at least one embodiment, the graphics complex 1440 includes, but is not limited to, any number of dedicated graphics hardware.

[0204] In at least one embodiment, each computing unit 1450 includes, but is not limited to, any number of SIMD units 1452 and shared memory 1454. In at least one embodiment, each SIMD unit 1452 implements a SIMD architecture and is configured to execute operations in parallel. In at least one embodiment, each computing unit 1450 may execute any number of thread blocks, but each thread block executes on a single computing unit 1450. In at least one embodiment, a thread block includes, but is not limited to, any number of execution threads. In at least one embodiment, a workgroup is a thread block. In at least one embodiment, each SIMD unit 1452 executes a different warp. In at least one embodiment, a warp is a group of threads (e.g., 16 threads), where each thread in the warp belongs to a single thread block and is configured to process different datasets based on a single instruction set. In at least one embodiment, prediction can be used to disable one or more threads in a warp. In at least one embodiment, a channel is a thread. In at least one embodiment, a work item is a thread. In at least one embodiment, a wavefront is a warp. In at least one embodiment, different wavefronts in a thread block can be synchronized together and communicate via shared memory 1454.

[0205] In at least one embodiment, structure 1460 is a system interconnect that facilitates data and control transfers across core complex 1410, graphics complex 1440, I / O interface 1470, memory controller 1480, display controller 1492, and multimedia engine 1494. In at least one embodiment, in addition to or instead of structure 1460, APU 1400 may also include, but is not limited to, any number and type of system interconnects that facilitate data and control transfers across any number and type of components that may be directly or indirectly linked, either internally or externally to APU 1400. In at least one embodiment, I / O interface 1470 represents any number and type of I / O interface (e.g., PCI, PCI-Extended (“PCI-X”), PCIe, Gigabit Ethernet (“GBE”), USB, etc.). In at least one embodiment, various types of peripheral devices are coupled to I / O interface 1470. In at least one embodiment, the peripheral device coupled to the I / O interface 1470 may include, but is not limited to, a keyboard, mouse, printer, scanner, joystick or other types of game controllers, media recording devices, external storage devices, network interface cards, etc.

[0206] In at least one embodiment, the display controller AMD92 displays images on one or more display devices (e.g., liquid crystal display (LCD) devices). In at least one embodiment, the multimedia engine 240 includes, but is not limited to, any number and type of multimedia-related circuitry, such as video decoders, video encoders, image signal processors, etc. In at least one embodiment, the memory controller 1480 facilitates data transfer between the APU 1400 and the unified system memory 1490. In at least one embodiment, the core complex 1410 and the graphics complex 1440 share the unified system memory 1490.

[0207] In at least one embodiment, the APU 1400 implements a memory subsystem, which includes, but is not limited to, any number and type of memory controllers 1480 and memory devices (e.g., shared memory 1454) that may be dedicated to a single component or shared among multiple components. In at least one embodiment, the APU 1400 implements a cache subsystem, which includes, but is not limited to, one or more cache memories (e.g., L2 cache 1528, L3 cache 1430, and L2 cache 1442), each cache memory may be component-private or shared among any number of components (e.g., core 1420, core complex 1410, SIMD unit 1452, compute unit 1450, and graphics complex 1440).

[0208] Figure 15A CPU 1500 according to at least one embodiment is illustrated. In at least one embodiment, the CPU 1500 was developed by AMD Inc. of Santa Clara, California. In at least one embodiment, the CPU 1500 can be configured to execute an application. In at least one embodiment, the CPU 1500 is configured to execute host control software, such as an operating system. In at least one embodiment, the CPU 1500 issues commands to control the operation of an external GPU (not shown). In at least one embodiment, the CPU 1500 can be configured to execute host executable code derived from CUDA source code, and the external GPU can be configured to execute device executable code derived from such CUDA source code. In at least one embodiment, the CPU 1500 includes, but is not limited to, any number of core complexes 1510, architectures 1560, I / O interfaces 1570, and memory controllers 1580.

[0209] In at least one embodiment, the core complex 1510 includes, but is not limited to, cores 1520(1)-1520(4) and L3 cache 1530. In at least one embodiment, the core complex 1510 may include, but is not limited to, any combination of any number of cores 1520 and any number and type of cache. In at least one embodiment, the cores 1520 are configured to execute instructions of a specific ISA. In at least one embodiment, each core 1520 is a CPU core.

[0210] In at least one embodiment, each core 1520 includes, but is not limited to, a fetch / decode unit 1522, an integer execution engine 1524, a floating-point execution engine 1526, and an L2 cache 1528. In at least one embodiment, the fetch / decode unit 1522 fetches instructions, decodes these instructions, generates micro-operations, and dispatches individual micro-instructions to the integer execution engine 1524 and the floating-point execution engine 1526. In at least one embodiment, the fetch / decode unit 1522 may simultaneously dispatch one micro-instruction to the integer execution engine 1524 and another micro-instruction to the floating-point execution engine 1526. In at least one embodiment, the integer execution engine 1524 performs operations not limited to integer and memory operations. In at least one embodiment, the floating-point engine 1526 performs operations not limited to floating-point and vector operations. In at least one embodiment, the fetch-decode unit 1522 dispatches micro-instructions to a single execution engine, which replaces both the integer execution engine 1524 and the floating-point execution engine 1526.

[0211] In at least one embodiment, each core 1520(i) can access an L2 cache 1528(i) included in core 1520(i), where i is an integer representing a specific instance of core 1520. In at least one embodiment, each core 1520 included in core complex 1510(j) is connected to other cores 1520 in core complex 1510(j) via an L3 cache 1530(j) included in core complex 1510(j), where j is an integer representing a specific instance of core complex 1510. In at least one embodiment, a core 1520 included in core complex 1510(j) can access all L3 caches 1530(j) included in core complex 1510(j), where j is an integer representing a specific instance of core complex 1510. In at least one embodiment, the L3 cache 1530 may include, but is not limited to, any number of slices.

[0212] In at least one embodiment, structure 1560 is a system interconnect that facilitates data and control transfers across core complexes 1510(1)-1510(N) (where N is a positive integer), I / O interface 1570, and memory controller 1580. In at least one embodiment, in addition to or instead of structure 1560, CPU 1500 may also include, but is not limited to, any number and type of system interconnects that facilitate data and control transfers across any number and type of components that may be directly or indirectly linked, either inside or outside CPU 1500. In at least one embodiment, I / O interface 1570 represents any number and type of I / O interfaces (e.g., PCI, PCI-X, PCIe, GBE, USB, etc.). In at least one embodiment, various types of peripheral devices are coupled to I / O interface 1570. In at least one embodiment, peripheral devices coupled to I / O interface 1570 may include, but are not limited to, displays, keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, etc.

[0213] In at least one embodiment, memory controller 1580 facilitates data transfer between CPU 1500 and system memory 1590. In at least one embodiment, core complex 1510 and graphics complex 1540 share system memory 1590. In at least one embodiment, CPU 1500 implements a memory subsystem, which includes, but is not limited to, any number and type of memory controllers 1580 and memory devices that may be dedicated to a component or shared among multiple components. In at least one embodiment, CPU 1500 implements a cache subsystem, which includes, but is not limited to, one or more cache memories (e.g., L2 cache 1528 and L3 cache 1530), each cache memory may be component-private or shared among any number of components (e.g., core 1520 and core complex 1510).

[0214] Figure 16 An exemplary accelerator integration slice 1690 according to at least one embodiment is illustrated. As used herein, a "slice" includes a designated portion of the processing resources of an accelerator integrated circuit. In at least one embodiment, the accelerator integrated circuit provides cache management, memory access, environment management, and interrupt management services for multiple graphics processing engines among multiple graphics acceleration modules. Each graphics processing engine may comprise a separate GPU. Optionally, the graphics processing engine may include different types of graphics processing engines within the GPU, such as graphics execution units, media processing engines (e.g., video encoders / decoders), samplers, and blit engines. In at least one embodiment, a graphics acceleration module may be a GPU having multiple graphics processing engines. In at least one embodiment, the graphics processing engines may be individual GPUs integrated on a general-purpose package, line card, or chip.

[0215] The application's effective address space 1682 within system memory 1614 stores process element 1683. In one embodiment, process element 1683 is stored in response to a GPU call 1681 from an application 1680 executing on processor 1607. Process element 1683 contains the processing state of the corresponding application 1680. A job descriptor (WD) 1684 contained in process element 1683 may be a single job requested by the application or may contain pointers to job queues. In at least one embodiment, WD 1684 is a pointer to a job request queue in the application's effective address space 1682.

[0216] The graphics acceleration module 1646 and / or the various graphics processing engines may be shared by all or some processes in the system. In at least one embodiment, infrastructure may be included for establishing a processing state and sending the WD 1684 to the graphics acceleration module 1646 to begin operation in a virtualized environment.

[0217] In at least one embodiment, a dedicated process programming model is used for the implementation. In this model, a single process owns the graphics acceleration module 1646 or an individual graphics processing engine. Since the graphics acceleration module 1646 is owned by a single process, the hypervisor initializes the accelerator integrated circuit for the owned partition, and the operating system initializes the accelerator integrated circuit for the owned partition when the graphics acceleration module 1646 is allocated.

[0218] During operation, the WD fetch unit 1691 in the accelerator integrated slice 1690 fetches the next WD 1684, which includes instructions for the work to be performed by one or more graphics processing engines of the graphics acceleration module 1646. Data from the WD 1684 can be stored in register 1645 and used by the memory management unit (MMU) 1639, interrupt management circuitry 1647, and / or environment management circuitry 1648, as shown. For example, one embodiment of the MMU 1639 includes segment / page roaming circuitry for accessing segment / page tables 1686 within the OS virtual address space 1685. The interrupt management circuitry 1647 can handle interrupt events (INT) 1692 received from the graphics acceleration module 1646. When performing graph operations, the effective address 1693 generated by the graphics processing engine is translated into an actual address by the MMU 1639.

[0219] In one embodiment, the same register set 1645 is copied for each graphics processing engine and / or graphics acceleration module 1646 and can be initialized by the hypervisor or operating system. Each of these copied registers can be included in the accelerator integration slice 1690. Exemplary registers that can be initialized by the hypervisor are shown in Table 1.

[0220] Table 1 – Registers for Supervisor Initialization

[0221]

[0222]

[0223] Table 2 shows exemplary registers that can be initialized by the operating system.

[0224] Table 2 – Operating System Initialization Registers

[0225] 1 Process and thread identification 2 Valid Address (EA) Environment Save / Restore Pointer 3 Virtual Address (VA) accelerator utilization record pointer 4 Virtual address (VA) stores segment table pointers 5 mask of authority 6 Job descriptor

[0226] In one embodiment, each WD 1684 is specific to a particular graphics acceleration module 1646 and / or a particular graphics processing engine. It contains all the information required for the graphics processing engine to perform its work or to do so, or it may be a pointer to a memory location where the application has established a command queue for the work to be done.

[0227] Figures 17A-17B An exemplary graphics processor according to at least one embodiment herein is illustrated. In at least one embodiment, any exemplary graphics processor may be manufactured using one or more IP cores. In addition to the illustrations, other logic and circuitry may be included in at least one embodiment, including additional graphics processor / cores, peripheral interface controllers, or general-purpose processor cores. In at least one embodiment, the exemplary graphics processor is used within a System-on-a-Chip (SoC).

[0228] Figure 17A An exemplary graphics processor 1710 of a SoC integrated circuit according to at least one embodiment is shown, which can be manufactured using one or more IP cores. Figure 17B An additional exemplary graphics processor 1740 of a SoC integrated circuit according to at least one embodiment is shown, which can be manufactured using one or more IP cores. In at least one embodiment, Figure 17A The graphics processor 1710 is a low-power graphics processor core. In at least one embodiment, Figure 17B The graphics processor 1740 is a higher-performance graphics processor core. In at least one embodiment, each graphics processor 1710, 1740 may be... Figure 12 A variant of the 1210 graphics processor.

[0229] In at least one embodiment, the graphics processor 1710 includes a vertex processor 1705 and one or more fragment processors 1715A-1715N (e.g., 1715A, 1715B, 1715C, 1715D to 1715N-1 and 1715N). In at least one embodiment, the graphics processor 1710 can execute different shader programs via separate logic, such that the vertex processor 1705 is optimized to perform operations for the vertex shader program, while one or more fragment processors 1715A-1715N perform fragment (e.g., pixel) shading operations for fragments or pixels or shader programs. In at least one embodiment, the vertex processor 1705 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. In at least one embodiment, the fragment processors 1715A-1715N use the primitive and vertex data generated by the vertex processor 1705 to generate framebuffers for display on a display device. In at least one embodiment, the fragment processors 1715A-1715N are optimized to execute fragment shader programs as provided in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs provided in the Direct 3D API.

[0230] In at least one embodiment, the graphics processor 1710 additionally includes one or more MMUs 1720A-1720B, caches 1725A-1725B, and circuit interconnects 1730A-1730B. In at least one embodiment, one or more MMUs 1720A-1720B provide a virtual-to-physical address mapping for the graphics processor 1710, including for vertex processors 1705 and / or fragment processors 1715A-1715N, which can reference vertex or image / texture data stored in memory, in addition to vertex or image / texture data stored in one or more caches 1725A-1725B. In at least one embodiment, one or more MMUs 1720A-1720B can be synchronized with other MMUs within the system, including with... Figure 12 One or more application processors 1205, image processors 1215, and / or video processors 1220 are associated with one or more MMUs, enabling each processor 1205-1220 to participate in a shared or unified virtual memory system. In at least one embodiment, one or more circuit interconnects 1730A-1730B enable the graphics processor 1710 to connect to other IP cores within the SoC via the SoC's internal bus or via a direct connection.

[0231] In at least one embodiment, the graphics processor 1740 includes Figure 17AThe graphics processor 1710 includes one or more MMUs 1720A-1720B, caches 1725A-1725B, and circuit interconnects 1730A-1730B. In at least one embodiment, the graphics processor 1740 includes one or more shader cores 1755A-1755N (e.g., 1755A, 1755B, 1755C, 1755D, 1755E, 1755F, to 1755N-1 and 1755N) that provide a unified shader core architecture, wherein a single core or type of core can execute all types of programmable shader code, including shader program code for implementing vertex shaders, fragment shaders, and / or compute shaders. In at least one embodiment, the number of shader cores may vary. In at least one embodiment, the graphics processor 1740 includes an inter-core task manager 1745 that acts as a thread dispatcher to assign execution threads to one or more shader cores 1755A-1755N and a tile unit 1758 to accelerate tile-based rendering operations, wherein rendering operations of a scene are subdivided in image space, for example, to take advantage of local spatial consistency within the scene or to optimize the use of internal caches.

[0232] Figure 18A A graphics core 1800 according to at least one embodiment is shown. In at least one embodiment, the graphics core 1800 may include... Figure 12 The graphics processor 1210 is located within it. In at least one embodiment, the graphics core 1800 may be... Figure 17B The graphics core 1800 uses a unified shader core 1755A-1755N. In at least one embodiment, the graphics core 1800 includes a shared instruction cache 1802, texture units 1818, and cache / shared memory 1820, which are common to execution resources within the graphics core 1800. In at least one embodiment, the graphics core 1800 may include multiple slices 1801A-1801N or partitions of each core, and the graphics processor may include multiple instances of the graphics core 1800. Slices 1801A-1801N may include supporting logic, including local instruction caches 1804A-1804N, thread schedulers 1806A-1806N, thread dispatchers 1808A-1808N, and a set of registers 1810A-1810N. In at least one embodiment, slices 1801A-1801N may include a set of additional function units (AFU) 1812A-1812N, floating-point units (FPU) 1814A-1814N, integer arithmetic logic units (ALU) 1816A-1816N, address calculation units (ACU) 1813A-1813N, double-precision floating-point units (DPFPU) 1815A-1815N, and matrix processing units (MPU) 1817A-1817N.

[0233] In one embodiment, the FPU 1814A-1814N can perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while the DPFPU 1815A-1815N can perform double-precision (64-bit) floating-point operations. In at least one embodiment, the ALU 1816A-1816N can perform variable-precision integer operations with 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed-precision operations. In at least one embodiment, the MPU 1817A-1817N can also be configured for mixed-precision matrix operations, including half-precision floating-point operations and 8-bit integer operations. In at least one embodiment, the MPU 1817A-1817N can perform various matrix operations to accelerate CUDA programs, including enabling accelerated Generalized Matrix-to-Matrix Multiplication (GEMM). In at least one embodiment, the AFU 1812A-1812N can perform additional logical operations not supported by floating-point or integer units, including trigonometric operations (e.g., Sine, Cosine, etc.).

[0234] Figure 18B A general-purpose graphics processing unit (GPGPU) 1830 is illustrated in at least one embodiment. In at least one embodiment, the GPGPU 1830 is highly parallel and suitable for deployment on a multi-chip module. In at least one embodiment, the GPGPU 1830 can be configured to enable highly parallel computational operations to be performed by a GPU array. In at least one embodiment, the GPGPU 1830 can be directly linked to other instances of the GPGPU 1830 to create a multi-GPU cluster to improve execution time for CUDA programs. In at least one embodiment, the GPGPU 1830 includes a host interface 1832 for connection to a host processor. In at least one embodiment, the host interface 1832 is a PCIe interface. In at least one embodiment, the host interface 1832 can be a vendor-specific communication interface or communication structure. In at least one embodiment, the GPGPU 1830 receives commands from the host processor and uses a global scheduler 1834 to assign execution threads associated with those commands to a set of compute clusters 1836A-1836H. In at least one embodiment, computing clusters 1836A-1836H share cache memory 1838. In at least one embodiment, cache memory 1838 can be used as an advanced cache of cache memory within computing clusters 1836A-1836H.

[0235] In at least one embodiment, the GPGPU 1830 includes memory 1844A-1844B coupled to the computing cluster 1836A-1836H via a set of memory controllers 1842A-1842B. In at least one embodiment, memory 1844A-1844B may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory.

[0236] In at least one embodiment, computing clusters 1836A-1836H each include a set of graphics cores, such as Figure 18A The graphics core 1800 may include various types of integer and floating-point logic units, capable of performing computational operations at various precisions, including computations suitable for CUDA programs. For example, in at least one embodiment, at least a subset of the floating-point units in each computing cluster 1836A-1836H may be configured to perform 16-bit or 32-bit floating-point operations, while different subsets of the floating-point units may be configured to perform 64-bit floating-point operations.

[0237] In at least one embodiment, multiple instances of the GPGPU 1830 can be configured to operate as a computing cluster. The computing clusters 1836A-1836H can implement any technically feasible communication technology for synchronization and data exchange. In at least one embodiment, the multiple instances of the GPGPU 1830 communicate via a host interface 1832. In at least one embodiment, the GPGPU 1830 includes an I / O hub 1839 that couples the GPGPU 1830 to a GPU link 1840, enabling direct connection to other instances of the GPGPU 1830. In at least one embodiment, the GPU link 1840 is coupled to a dedicated GPU-to-GPU bridge, enabling communication and synchronization among the multiple instances of the GPGPU 1830. In at least one embodiment, the GPU link 1840 is coupled to a high-speed interconnect for sending and receiving data to and from other GPGPUs or parallel processors. In at least one embodiment, the multiple instances of the GPGPU 1830 reside in a separate data processing system and communicate via a network device accessible via the host interface 1832. In at least one embodiment, the GPU link 1840 may be configured to connect to a host processor, supplementing or replacing the host interface 1832. In at least one embodiment, the GPGPU 1830 may be configured to execute CUDA programs.

[0238] Figure 19AA parallel processor 1900 according to at least one embodiment is shown. In at least one embodiment, various components of the parallel processor 1900 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or FPGAs.

[0239] In at least one embodiment, the parallel processor 1900 includes a parallel processing unit 1902. In at least one embodiment, the parallel processing unit 1902 includes an I / O unit 1904 that enables communication with other devices, including other instances of the parallel processing unit 1902. In at least one embodiment, the I / O unit 1904 can be directly connected to other devices. In at least one embodiment, the I / O unit 1904 is connected to other devices using a hub or switch interface (e.g., a memory hub 1905). In at least one embodiment, the connection between the memory hub 1905 and the I / O unit 1904 forms a communication link. In at least one embodiment, the I / O unit 1904 is connected to a host interface 1906 and a memory crossbar switch 1916, wherein the host interface 1906 receives commands for performing processing operations, and the memory crossbar switch 1916 receives commands for performing memory operations.

[0240] In at least one embodiment, when host interface 1906 receives a command buffer via I / O unit 1904, host interface 1906 can direct work operations to execute those commands to front end 1908. In at least one embodiment, front end 1908 is coupled to scheduler 1910, which is configured to assign commands or other work items to processing array 1912. In at least one embodiment, scheduler 1910 ensures that processing array 1912 is correctly configured and in an active state before assigning tasks to processing array 1912. In at least one embodiment, scheduler 1910 is implemented via firmware logic executed on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 1910 can be configured to perform complex scheduling and work assignment operations at both coarse and fine granular levels, enabling fast preemption and context switching of threads executing on processing array 1912. In at least one embodiment, host software can demonstrate workloads scheduled on processing array 1912 via one of multiple graphics processing doorbells. In at least one embodiment, the workload can then be automatically distributed on the processing array 1912 by the scheduler 1910 logic within the microcontroller, which includes the scheduler 1910.

[0241] In at least one embodiment, the processing array 1912 may include up to "N" processing clusters (e.g., clusters 1914A, 1914B to 1914N). In at least one embodiment, each cluster 1914A-1914N of the processing array 1912 may execute a large number of concurrent threads. In at least one embodiment, the scheduler 1910 may use various scheduling and / or work allocation algorithms to allocate work to the clusters 1914A-1914N of the processing array 1912, which may vary depending on the workload generated by each type of program or computation. In at least one embodiment, scheduling may be handled dynamically by the scheduler 1910, or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by the processing array 1912. In at least one embodiment, different clusters 1914A-1914N of the processing array 1912 may be assigned to process different types of programs or to perform different types of computations.

[0242] In at least one embodiment, the processing array 1912 can be configured to perform various types of parallel processing operations. In at least one embodiment, the processing array 1912 is configured to perform general-purpose parallel computing operations. For example, in at least one embodiment, the processing array 1912 may include logic for performing processing tasks, including filtering video and / or audio data, performing modeling operations, including physical operations, and performing data transformations.

[0243] In at least one embodiment, the processing array 1912 is configured to perform parallel graphics processing operations. In at least one embodiment, the processing array 1912 may include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, the processing array 1912 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 1902 may transfer data from system memory via I / O unit 1904 for processing. In at least one embodiment, during processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 1922) and then written back to system memory.

[0244] In at least one embodiment, when the parallel processing unit 1902 is used to perform graph processing, the scheduler 1910 may be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations among the multiple clusters 1914A-1914N of the processing array 1912. In at least one embodiment, portions of the processing array 1912 may be configured to perform different types of processing. For example, in at least one embodiment, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to generate a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 1914A-1914N may be stored in a buffer to allow intermediate data to be transferred between the clusters 1914A-1914N for further processing.

[0245] In at least one embodiment, the processing array 1912 may receive processing tasks to be executed via a scheduler 1910, which receives commands defining the processing tasks from a front end 1908. In at least one embodiment, the processing task may include an index of data to be processed, such as surface (patch) data, raw data, vertex data, and / or pixel data, as well as state parameters and commands defining how the data is processed (e.g., what program to execute). In at least one embodiment, the scheduler 1910 may be configured to acquire an index corresponding to a task, or may receive an index from the front end 1908. In at least one embodiment, the front end 1908 may be configured to ensure that the processing array 1912 is configured to be active before initiating a workload specified by an incoming command buffer (e.g., a batch buffer, push buffer, etc.).

[0246] In at least one embodiment, each of one or more instances of the parallel processing unit 1902 may be coupled to the parallel processor memory 1922. In at least one embodiment, the parallel processor memory 1922 may be accessed via a memory crossbar switch 1916, which may receive memory requests from the processing array 1912 and the I / O unit 1904. In at least one embodiment, the memory crossbar switch 1916 may be accessed via a memory interface 1918. In at least one embodiment, the memory interface 1918 may include a plurality of partition units (e.g., partition units 1920A, 1920B to 1920N), each of which may be coupled to a portion (e.g., a memory cell) of the parallel processor memory 1922. In at least one embodiment, the plurality of partition units 1920A-1920N are configured to be equal to the number of memory units, such that the first partition unit 1920A has a corresponding first memory unit 1924A, the second partition unit 1920B has a corresponding memory unit 1924B, and the Nth partition unit 1920N has a corresponding Nth memory unit 1924N. In at least one embodiment, the number of partition units 1920A-1920N may not be equal to the number of memory devices.

[0247] In at least one embodiment, memory cells 1924A-1924N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In at least one embodiment, memory cells 1924A-1924N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). In at least one embodiment, rendering targets such as frame buffers or texture maps may be stored across memory cells 1924A-1924N, allowing partitioning cells 1920A-1920N to write portions of each rendering target in parallel, to efficiently utilize the available bandwidth of the parallel processor memory 1922. In at least one embodiment, local instances of the parallel processor memory 1922 may be excluded to facilitate a unified memory design that combines system memory with local cache memory.

[0248] In at least one embodiment, any of the clusters 1914A-1914N of the processing array 1912 can process data to be written to any memory cell 1924A-1924N within the parallel processor memory 1922. In at least one embodiment, the memory crossbar switch 1916 can be configured to transfer the output of each cluster 1914A-1914N to any partition cell 1920A-1920N or another cluster 1914A-1914N, and the clusters 1914A-1914N can perform further processing operations on the output. In at least one embodiment, each cluster 1914A-1914N can communicate with the memory interface 1918 via the memory crossbar switch 1916 to read from or write to various external storage devices. In at least one embodiment, the memory crossbar switch 1916 has a connection to a memory interface 1918 for communication with I / O unit 1904, and a connection to a local instance of parallel processor memory 1922, thereby enabling processing units within different processing clusters 1914A-1914N to communicate with system memory or other memory not local to parallel processing unit 1902. In at least one embodiment, the memory crossbar switch 1916 may use virtual channels to separate traffic flows between clusters 1914A-1914N and partition units 1920A-1920N.

[0249] In at least one embodiment, multiple instances of the parallel processing unit 1902 may be provided on a single insert card, or multiple insert cards may be interconnected. In at least one embodiment, different instances of the parallel processing unit 1902 may be configured to interoperate, even if the different instances have different numbers of processing cores, different numbers of local parallel processor memories, and / or other configuration differences. For example, in at least one embodiment, some instances of the parallel processing unit 1902 may include higher-precision floating-point units relative to other instances. In at least one embodiment, a system combining one or more instances of the parallel processing unit 1902 or the parallel processor 1900 may be implemented in various configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and / or embedded systems.

[0250] Figure 19B A processing cluster 1994 according to at least one embodiment is illustrated. In at least one embodiment, the processing cluster 1994 is included within a parallel processing unit. In at least one embodiment, the processing cluster 1994 is... Figure 19AAn instance of one of the processing clusters 1914A-1914N. In at least one embodiment, the processing cluster 1994 can be configured to execute a number of threads in parallel, wherein the term "thread" refers to an instance of a specific program executing on a particular set of input data. In at least one embodiment, a Single Instruction Multiple Data (SIMD) instruction issuing technique is used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In at least one embodiment, a Single Instruction Multiple Threading (SIMT) technique is used to support the parallel execution of a large number of generally synchronous threads, which uses a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster 1994.

[0251] In at least one embodiment, the operation of the processing cluster 1994 can be controlled by a pipeline manager 1932 that assigns processing tasks to the SIMT parallel processors. In at least one embodiment, the pipeline manager 1932... Figure 19A The scheduler 1910 receives instructions and manages the execution of these instructions via the graphics multiprocessor 1934 and / or texture unit 1936. In at least one embodiment, the graphics multiprocessor 1934 is an exemplary instance of a SIMT parallel processor. However, in at least one embodiment, the processing cluster 1994 may include various types of SIMT parallel processors with different architectures. In at least one embodiment, the processing cluster 1994 may include one or more instances of the graphics multiprocessor 1934. In at least one embodiment, the graphics multiprocessor 1934 can process data, and the data cross switch 1940 can be used to distribute the processed data to one of a number of possible destinations, including other shader units. In at least one embodiment, the pipeline manager 1932 can facilitate the distribution of processed data by specifying the destination of the processed data to be distributed via the data cross switch 1940.

[0252] In at least one embodiment, each graphics multiprocessor 1934 within the processing cluster 1994 may include the same set of functional execution logic (e.g., arithmetic logic units, load-memory units (LSUs), etc.). In at least one embodiment, the functional execution logic may be configured in a pipelined manner, wherein new instructions may be issued before previous instructions complete. In at least one embodiment, the functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, shift operations, and computation of various algebraic functions. In at least one embodiment, the same functional unit hardware may be used to perform different operations, and any combination of functional units may exist.

[0253] In at least one embodiment, instructions transmitted to the processing cluster 1994 constitute threads. In at least one embodiment, a group of threads executed across a set of parallel processing engines is a thread group. In at least one embodiment, the thread group executes programs on different input data. In at least one embodiment, each thread within the thread group may be assigned to a different processing engine within the graphics multiprocessor 1934. In at least one embodiment, the thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 1934. In at least one embodiment, when the number of threads included in the thread group is less than the number of processing engines, one or more processing engines may be idle during a loop that is processing the thread group. In at least one embodiment, the thread group may also include more threads than the number of processing engines within the graphics multiprocessor 1934. In at least one embodiment, when the thread group includes more threads than the number of processing engines within the graphics multiprocessor 1934, processing can be performed in consecutive clock cycles. In at least one embodiment, multiple thread groups can be executed simultaneously on the graphics multiprocessor 1934.

[0254] In at least one embodiment, the graphics multiprocessor 1934 includes an internal cache memory for performing load and store operations. In at least one embodiment, the graphics multiprocessor 1934 may forgo the internal cache and use a cache memory within the processing cluster 1994 (e.g., L1 cache 1948). In at least one embodiment, each graphics multiprocessor 1934 may also access partition units (e.g., Figure 19A The L2 cache is located within partition units 1920A-1920N, which are shared among all processing clusters 1994 and can be used to transfer data between threads. In at least one embodiment, the graphics multiprocessor 1934 can also access off-chip global memory, which may include one or more of local parallel processor memory and / or system memory. In at least one embodiment, any memory outside of the parallel processing unit 1902 can be used as global memory. In at least one embodiment, the processing cluster 1994 includes multiple instances of the graphics multiprocessor 1934, which can share common instructions and data that can be stored in the L1 cache 1948.

[0255] In at least one embodiment, each processing cluster 1994 may include an MMU 1945 configured to map virtual addresses to physical addresses. In at least one embodiment, one or more instances of the MMU 1945 may reside in Figure 19AThe memory interface 1918 is located within the MMU. In at least one embodiment, the MMU 1945 includes a set of page table entries (PTEs) for mapping virtual addresses to physical addresses of tiles (more information about tiles is discussed) and optionally to cache line indices. In at least one embodiment, the MMU 1945 may include an address translation back buffer (TLB) or a cache that may reside within the graphics multiprocessor 1934, L1 cache 1948, or processing cluster 1994. In at least one embodiment, physical addresses are processed to allocate surface data access locality for efficient request interleaving between partition units. In at least one embodiment, cache line indices may be used to determine whether a request for a cache line is a hit or a miss.

[0256] In at least one embodiment, the processing cluster 1994 can be configured such that each graphics multiprocessor 1934 is coupled to a texture unit 1936 to perform texture mapping operations, which may involve determining texture sample locations, reading texture data, and filtering texture data. In at least one embodiment, texture data is read as needed from an internal texture L1 cache (not shown) or from an L1 cache within the graphics multiprocessor 1934, and texture data is also retrieved from an L2 cache, local parallel processor memory, or system memory. In at least one embodiment, each graphics multiprocessor 1934 outputs a processed task to a data crossbar switch 1940 to provide the processed task to another processing cluster 1994 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via a memory crossbar switch 1916. In at least one embodiment, a pre-raster operation unit (preROP) 1942 is configured to receive data from the graphics multiprocessor 1934 and direct the data to a ROP unit, which may be associated with a partitioning unit (e.g., [missing information]). Figure 19A The PreROP1942 unit is located together with the partition units 1920A-1920N. In at least one embodiment, the PreROP1942 unit can perform optimizations for color blending, organize pixel color data, and perform address translation.

[0257] Figure 19C A graphics multiprocessor 1996 according to at least one embodiment is illustrated. In at least one embodiment, the graphics multiprocessor 1996 is... Figure 19BThe graphics multiprocessor 1934 is included. In at least one embodiment, the graphics multiprocessor 1996 is coupled to the pipeline manager 1932 of the processing cluster 1994. In at least one embodiment, the graphics multiprocessor 1996 has an execution pipeline including, but not limited to, an instruction cache 1952, an instruction unit 1954, an address mapping unit 1956, a register file 1958, one or more GPGPU cores 1962, and one or more LSUs 1966. The GPGPU cores 1962 and LSUs 1966 are coupled to cache memory 1972 and shared memory 1970 via memory and cache interconnect 1968.

[0258] In at least one embodiment, instruction cache 1952 receives a stream of instructions to be executed from pipeline manager 1932. In at least one embodiment, instructions are cached in instruction cache 1952 and dispatched to instruction unit 1954 for execution. In one embodiment, instruction unit 1954 may dispatch instructions as thread groups (e.g., thread bundles), assigning each thread of the thread group to a different execution unit within GPGPU core 1962. In at least one embodiment, instructions can access any local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 1956 may be used to translate addresses in the unified address space into different memory addresses that can be accessed by LSU 1966.

[0259] In at least one embodiment, register file 1958 provides a set of registers for the functional units of graphics multiprocessor 1996. In at least one embodiment, register file 1958 provides temporary storage for operands of data paths connected to functional units of graphics multiprocessor 1996 (e.g., GPGPU core 1962, LSU 1966). In at least one embodiment, register file 1958 is partitioned among each functional unit, such that a dedicated portion of register file 1958 is allocated to each functional unit. In at least one embodiment, register file 1958 is partitioned among different thread groups being executed by graphics multiprocessor 1996.

[0260] In at least one embodiment, each of the GPGPU cores 1962 may include an FPU and / or an ALU for executing instructions of the graph multiprocessor 1996. The GPGPU cores 1962 may be architecturally similar or may differ in architecture. In at least one embodiment, a first portion of the GPGPU core 1962 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU core includes a double-precision FPU. In at least one embodiment, the FPU may implement the IEEE 754-2008 standard for floating-point arithmetic or enable variable-precision floating-point arithmetic. In at least one embodiment, the graphics multiprocessor 1996 may additionally include one or more fixed-function or special-function units to perform specific functions, such as copying rectangles or pixel blending operations. In at least one embodiment, one or more of the GPGPU cores 1962 may also include fixed-function or special-function logic.

[0261] In at least one embodiment, the GPGPU core 1962 includes SIMD logic capable of executing a single instruction on multiple sets of data. In at least one embodiment, the GPGPU core 1962 can physically execute SIMD4, SIMD8, and SIMD9 instructions, and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, the SIMD instructions for the GPGPU core can be generated by a shader compiler at compile time, or automatically generated when executing a program written and compiled for a Single Program Multiple Data (SPMD) or SIMT architecture. In at least one embodiment, multiple threads of a program configured for a SIMT execution model can be executed using a single SIMD instruction. For example, in at least one embodiment, eight SIMD threads performing the same or similar operations can be executed in parallel using a single SIMD8 logic unit.

[0262] In at least one embodiment, the memory and cache interconnect 1968 is an interconnect network connecting each functional unit of the graphics multiprocessor 1996 to the register file 1958 and the shared memory 1970. In at least one embodiment, the memory and cache interconnect 1968 is a cross-switch interconnect that allows the LSU 1966 to perform load and store operations between the shared memory 1970 and the register file 1958. In at least one embodiment, the register file 1958 can operate at the same frequency as the GPGPU core 1962, resulting in very low latency for data transfer between the GPGPU core 1962 and the register file 1958. In at least one embodiment, the shared memory 1970 can be used to enable communication between threads executing on functional units within the graphics multiprocessor 1996. In at least one embodiment, the cache memory 1972 can be used, for example, as a data cache to cache texture data communicated between functional units and texture units 1936. In at least one embodiment, the shared memory 1970 can also be used as a program-managed cache. In at least one embodiment, in addition to the data automatically cached in cache memory 1972, the thread executing on GPGPU core 1962 can also programmatically store data in shared memory.

[0263] In at least one embodiment, a parallel processor or GPGPU, as described herein, is communicatively coupled to the host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. In at least one embodiment, the GPU may be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., high-speed interconnects such as PCIe or NVLink). In at least one embodiment, the GPU may be integrated with the core on the same package or chip and communicatively coupled to the core via an internal processor bus / interconnect (i.e., within the package or chip). In at least one embodiment, regardless of how the GPU is connected, the processor core may assign work to the GPU in the form of a sequence of commands / instructions contained in the WD. In at least one embodiment, the GPU then uses dedicated circuitry / logic to efficiently process these commands / instructions.

[0264] Figure 20A graphics processor 2000 according to at least one embodiment is illustrated. In at least one embodiment, the graphics processor 2000 includes a ring interconnect 2002, a pipeline front end 2004, a media engine 2037, and graphics cores 2080A-2080N. In at least one embodiment, the ring interconnect 2002 couples the graphics processor 2000 to other processing units, including other graphics processors or one or more general-purpose processor cores. In at least one embodiment, the graphics processor 2000 is one of many processors integrated within a multi-core processing system.

[0265] In at least one embodiment, the graphics processor 2000 receives multiple batches of commands via a ring interconnect 2002. In at least one embodiment, the input commands are interpreted by a command stream converter 2003 in a pipeline front-end 2004. In at least one embodiment, the graphics processor 2000 includes scalable execution logic to perform 3D geometry processing and media processing via graphics cores 2080A-2080N. In at least one embodiment, for 3D geometry processing commands, the command stream converter 2003 provides commands to the geometry pipeline 2036. In at least one embodiment, for at least some media processing commands, the command stream converter 2003 provides commands to a video front-end 2034, which is coupled to a media engine 2037. In at least one embodiment, the media engine 2037 includes a video quality engine (VQE) 2030 for video and image post-processing, and a multi-format encoding / decoding (MFX) engine 2033 for providing hardware-accelerated media data encoding and decoding. In at least one embodiment, the geometry pipeline 2036 and the media engine 2037 each generate an execution thread for thread execution resources provided by at least one graphics core 2080A.

[0266] In at least one embodiment, the graphics processor 2000 includes scalable thread execution resources characterized by modular graphics cores 2080A-2080N (sometimes referred to as core slices), each modular core having multiple sub-cores 2050A-2050N, 2060A-2060N (sometimes referred to as core sub-slices). In at least one embodiment, the graphics processor 2000 may have any number of graphics cores 2080A to 2080N. In at least one embodiment, the graphics processor 2000 includes a graphics core 2080A having at least a first sub-core 2050A and a second sub-core 2060A. In at least one embodiment, the graphics processor 2000 is a low-power processor having a single sub-core (e.g., 2050A). In at least one embodiment, the graphics processor 2000 includes multiple graphics cores 2080A-2080N, each graphics core including a set of first sub-cores 2050A-2050N and a set of second sub-cores 2060A-2060N. In at least one embodiment, each of the first sub-cores 2050A-2050N includes at least a first set of execution units (EUs) 2052A-2052N and media / texture samplers 2054A-2054N. In at least one embodiment, each of the second sub-cores 2060A-2060N includes at least a second set of execution units 2062A-2062N and samplers 2064A-2064N. In at least one embodiment, each sub-core 2050A-2050N and 2060A-2060N shares a set of shared resources 2070A-2070N. In at least one embodiment, the shared resources include shared cache memory and pixel operation logic.

[0267] Figure 21 A processor 2100 is illustrated according to at least one embodiment. In at least one embodiment, the processor 2100 may include, but is not limited to, logic circuitry for executing instructions. In at least one embodiment, the processor 2100 can execute instructions, including x86 instructions, ARM instructions, special-purpose instructions for ASICs, etc. In at least one embodiment, the processor 2110 may include registers for storing packaged data, such as the 64-bit wide MMX™ registers in an Intel microprocessor enabled by MMX technology in Santa Clara, California. In at least one embodiment, the MMX registers available in integer and floating-point forms can operate with packaged data elements accompanied by SIMD and Streaming SIMD Extensions (“SSE”) instructions. In at least one embodiment, a 128-bit wide XMM register associated with SSE2, SSE3, SSE4, AVX, or later (generally referred to as “SSEx”) technologies can hold such packaged data operands. In at least one embodiment, the processor 2110 can execute instructions to accelerate CUAD programs.

[0268] In at least one embodiment, processor 2100 includes an ordered front end (“front end”) 2101 to fetch instructions to be executed and prepare instructions for later use in the processor pipeline. In at least one embodiment, front end 2101 may include several units. In at least one embodiment, instruction prefetcher 2126 fetches instructions from memory and provides the instructions to instruction decoder 2128, which in turn decodes or interprets the instructions. For example, in at least one embodiment, instruction decoder 2128 decodes the received instructions into one or more operations, so-called “micro-instructions” or “micro-operations” (also referred to as “micro-operations” or “micro-instructions”), for execution. In at least one embodiment, instruction decoder 2128 parses the instructions into opcodes and corresponding data and control fields, which can be used by the microarchitecture to perform the operations. In at least one embodiment, trace cache 2130 may assemble the decoded micro-instructions into a program-ordered sequence or trace in micro-instruction queue 2134 for execution. In at least one embodiment, when trace cache 2130 encounters complex instructions, microcode ROM 2132 provides the micro-instructions required to complete the operation.

[0269] In at least one embodiment, some instructions may be converted into a single micro-operation, while others require several micro-operations to complete the entire operation. In at least one embodiment, if more than four micro-instructions are required to complete an instruction, the instruction decoder 2128 may access the microcode ROM 2132 to execute the instruction. In at least one embodiment, an instruction may be decoded into a small number of micro-instructions for processing at the instruction decoder 2128. In at least one embodiment, if multiple micro-instructions are required to complete an operation, the instructions may be stored in the microcode ROM 2132. In at least one embodiment, the trace cache 2130 references an entry point programmable logic array (“PLA”) to determine the correct micro-instruction pointer for reading a microcode sequence from the microcode ROM 2132 to complete one or more instructions, according to at least one embodiment. In at least one embodiment, after the microcode ROM 2132 has completed the micro-operation ordering of the instructions, the machine front end 2101 may resume fetching micro-operations from the trace cache 2130.

[0270] In at least one embodiment, an out-of-order execution engine (“out-of-order engine”) 2103 can prepare instructions for execution. In at least one embodiment, the out-of-order execution logic has multiple buffers to smooth and reorder the instruction flow to optimize performance as instructions descend the pipeline and are scheduled for execution. The out-of-order execution engine 2103 includes, but is not limited to, an allocator / register renamer 2140, a memory microinstruction queue 2142, an integer / floating-point microinstruction queue 2144, a memory scheduler 2146, a fast scheduler 2102, a slow / general-purpose floating-point scheduler (“slow / general-purpose FP scheduler”) 2104, and a simple floating-point scheduler (“simple FP scheduler”) 2106. In at least one embodiment, the fast scheduler 2102, the slow / general-purpose floating-point scheduler 2104, and the simple floating-point scheduler 2106 are also collectively referred to as “microinstruction schedulers 2102, 2104, 2106”. The allocator / register renamer 2140 allocates the machine buffers and resources required for the sequential execution of each microinstruction. In at least one embodiment, allocator / register renaming unit 2140 renames logical registers to entries in a register file. In at least one embodiment, allocator / register renaming unit 2140 also assigns entries for each microinstruction in one of two microinstruction queues, memory microinstruction queue 2142 for memory operations and integer / floating-point microinstruction queue 2144 for non-memory operations, preceding memory scheduler 2146 and microinstruction schedulers 2102, 2104, 2106. In at least one embodiment, microinstruction schedulers 2102, 2104, 2106 determine when they are ready to execute a microinstruction based on the readiness of their dependent input register operand sources and the availability of the execution resource microinstructions that need to be completed. In at least one embodiment, fast scheduler 2102 of at least one embodiment can schedule on each half of the master clock cycle, while slow / general-purpose floating-point scheduler 2104 and simple floating-point scheduler 2106 can schedule once per master processor clock cycle. In at least one embodiment, microinstruction schedulers 2102, 2104, and 2106 arbitrate the scheduling ports to schedule microinstructions for execution.

[0271] In at least one embodiment, execution block 2111 includes, but is not limited to, integer register file / branch network 2108, floating-point register file / branch network (“FP register file / branch network”) 2110, address generation units (“AGU”) 2112 and 2114, fast arithmetic logic units (“fast ALU”) 2116 and 2118, slow ALU 2120, floating-point ALU (“FP”) 2122, and floating-point movement unit (“FP movement”) 2124. In at least one embodiment, integer register file / branch network 2108 and floating-point register file / bypass network 2110 are also referred to herein as “register files 2108, 2110”. In at least one embodiment, AGUS 2112 and 2114, fast ALU 2116 and 2118, slow ALU 2120, floating-point ALU 2122, and floating-point movement unit 2124 are also referred to herein as "execution units 2112, 2114, 2116, 2118, 2120, 2122, and 2124". In at least one embodiment, the execution block may include, but is not limited to, any number (including zero) and type of register files, branch networks, address generation units, and execution units (in any combination).

[0272] In at least one embodiment, register files 2108, 2110 may be arranged between microinstruction schedulers 2102, 2104, 2106 and execution units 2112, 2114, 2116, 2118, 2120, 2122, and 2124. In at least one embodiment, integer register file / tribute network 2108 performs integer operations. In at least one embodiment, floating-point register file / tribute network 2110 performs floating-point operations. In at least one embodiment, each of register files 2108, 2110 may include, but is not limited to, a tribute network that can bypass or forward recently completed results not yet written to the register file to a new dependent object. In at least one embodiment, register files 2108, 2110 may communicate data with each other. In at least one embodiment, integer register file / tribute network 2108 may include, but is not limited to, two separate register files, one register file for low-order 32-bit data and a second register file for high-order 32-bit data. In at least one embodiment, the floating-point register file / branch network 2110 may include, but is not limited to, entries with a width of 128 bits, since floating-point instructions typically have operands with a width of 64 to 128 bits.

[0273] In at least one embodiment, execution units 2112, 2114, 2116, 2118, 2120, 2122, and 2124 can execute instructions. In at least one embodiment, register files 2108 and 2110 store integer and floating-point data operation values that the microinstructions need to execute. In at least one embodiment, processor 2100 may include, but is not limited to, any number of execution units 2112, 2114, 2116, 2118, 2120, 2122, and 2124, and combinations thereof. In at least one embodiment, floating-point ALU 2122 and floating-point move unit 2124 can perform floating-point, MMX, SIMD, AVX, and SSE or other operations, including specialized machine learning instructions. In at least one embodiment, floating-point ALU 2122 may include, but is not limited to, a 64-bit multiplication-64-bit floating-point divider to perform division, square root, and remainder micro-operations. In at least one embodiment, floating-point hardware can be used to process instructions involving floating-point values. In at least one embodiment, ALU operations can be passed to fast ALUs 2116 and 2118. In at least one embodiment, fast ALUs 2116 and 2118 can perform fast operations with an effective delay of half a clock cycle. In at least one embodiment, most complex integer operations are routed to slow ALU 2120, because slow ALU 2120 can include, but is not limited to, integer execution hardware for long-latency type operations, such as multipliers, shifters, flag logic, and branching. In at least one embodiment, memory load / store operations can be performed by ALUs 2112 and 2114. In at least one embodiment, fast ALU 2116, fast ALU 2118, and slow ALU 2120 can perform integer operations on 64-bit data operands. In at least one embodiment, fast ALU 2116, fast ALU 2118, and slow ALU 2120 can be implemented to support various data bit sizes including 16, 32, 128, 256, etc. In at least one embodiment, the floating-point ALU 2122 and the floating-point movement unit 2124 can be implemented to support a range of operands with various bit widths. In at least one embodiment, the floating-point ALU 2122 and the floating-point movement unit 2124 can operate on 128-bit wide packaged data operands in conjunction with SIMD and multimedia instructions.

[0274] In at least one embodiment, microinstruction schedulers 2102, 2104, and 2106 schedule dependent operations before the parent load completes execution. In at least one embodiment, since microinstructions can be speculatively scheduled and executed within processor 2100, processor 2100 may also include logic for handling memory misses. In at least one embodiment, if a data load miss occurs in the data cache, there may be a dependent operation running in the pipeline that temporarily deprives the scheduler of the correct data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, it may be necessary to replay dependent operations and may allow independent operations to be completed. In at least one embodiment, the scheduler and replay mechanism of at least one embodiment of the processor may also be designed to capture instruction sequences for text string comparison operations.

[0275] In at least one embodiment, the term "register" may refer to an onboard processor storage location that can be used as part of an instruction that identifies operands. In at least one embodiment, a register may be one that can be used externally to the processor (from a programmer's perspective). In at least one embodiment, a register may not be limited to a particular type of circuit. Rather, in at least one embodiment, a register may store data, provide data, and perform the functions described herein. In at least one embodiment, the registers described herein may be implemented using a variety of different techniques via circuitry within the processor, such as dedicated physical registers, dynamically allocated physical registers renamed using register renaming, a combination of dedicated and dynamically allocated physical registers, etc. In at least one embodiment, an integer register stores 32-bit integer data. The register file of at least one embodiment also includes eight multimedia SIMD registers for encapsulating data.

[0276] Figure 22 A processor 2200 according to at least one embodiment is illustrated. In at least one embodiment, the processor 2200 includes, but is not limited to, one or more processor cores (cores) 2202A-2202N, an integrated memory controller 2214, and an integrated graphics processor 2208. In at least one embodiment, the processor 2200 may include additional cores up to and including additional processor cores 2202N, indicated by dashed boxes. In at least one embodiment, each processor core 2202A-2202N includes one or more internal cache units 2204A-2204N. In at least one embodiment, each processor core may also access one or more units 2206 of a shared cache.

[0277] In at least one embodiment, internal cache units 2204A-2204N and shared cache unit 2206 represent a cache memory hierarchy within processor 2200. In at least one embodiment, cache memory units 2204A-2204N may include at least one level of instruction and data within each processor core, and one or more levels of cache in a shared intermediate cache, such as L2, L3, L4, or other levels of cache, wherein the highest level of cache is classified as LLC before external memory. In at least one embodiment, cache coherence logic maintains coherence between the various cache units 2206 and 2204A-2204N.

[0278] In at least one embodiment, the processor 2200 may further include a group of one or more bus controller units 2216 and a system agent core 2210. In at least one embodiment, the one or more bus controller units 2216 manage a group of peripheral buses, such as one or more PCI or PCI Express buses. In at least one embodiment, the system agent core 2210 provides management functions for various processor components. In at least one embodiment, the system agent core 2210 includes one or more integrated memory controllers 2214 to manage access to various external memory devices (not shown).

[0279] In at least one embodiment, one or more processor cores 2202A-2202N include support for multi-threaded concurrent processing. In at least one embodiment, system agent core 2210 includes components for coordinating and operating processor cores 2202A-2202N during multi-threaded processing. In at least one embodiment, system agent core 2210 may additionally include a power control unit (PCU) including logic and components for regulating one or more power states of processor cores 2202A-2202N and graphics processor 2208.

[0280] In at least one embodiment, processor 2200 further includes graphics processor 2208 to perform graphics processing operations. In at least one embodiment, graphics processor 2208 is coupled to a shared cache unit 2206 and a system proxy core 2210 including one or more integrated memory controllers 2214. In at least one embodiment, system proxy core 2210 further includes a display controller 2211 for driving graphics processor output to one or more coupled displays. In at least one embodiment, display controller 2211 may also be a separate module coupled to graphics processor 2208 via at least one interconnect, or it may be integrated within graphics processor 2208.

[0281] In at least one embodiment, ring-based interconnect unit 2212 is used to couple internal components of processor 2200. In at least one embodiment, alternative interconnect units, such as point-to-point interconnects, switched interconnects, or other technologies, may be used. In at least one embodiment, graphics processor 2208 is coupled to ring interconnect 2212 via I / O link 2213.

[0282] In at least one embodiment, I / O link 2213 represents at least one of a variety of I / O interconnects, including packaged I / O interconnects that facilitate communication between various processor components and high-performance embedded memory module 2218 (e.g., eDRAM module). In at least one embodiment, each of processor cores 2202A-2202N and graphics processor 2208 uses embedded memory module 2218 as a shared LLC.

[0283] In at least one embodiment, processor cores 2202A-2202N are homogeneous cores executing a common instruction set architecture. In at least one embodiment, processor cores 2202A-2202N are heterogeneous in terms of the instruction set architecture (ISA), with one or more processor cores 2202A-2202N executing a common instruction set, while one or more other processor cores 2202A-2202N execute a common instruction set or a subset of a different instruction set. In at least one embodiment, processor cores 2202A-2202N are heterogeneous in terms of microarchitecture, with one or more cores having relatively high power consumption coupled to one or more power cores having lower power consumption. In at least one embodiment, processor 2200 can be implemented on one or more chips or implemented as a SoC integrated circuit.

[0284] Figure 23 A graphics processing unit (GPU) core 2300 according to at least one embodiment described is illustrated. In at least one embodiment, the GPU core 2300 is included within a GPU core array. In at least one embodiment, the GPU core 2300 (sometimes referred to as a core slice) may be one or more GPU cores within a modular GPU. In at least one embodiment, the GPU core 2300 is an example of a GPU core slice, and the GPU described herein may include multiple GPU core slices based on target power and performance envelopes. In at least one embodiment, each GPU core 2300 may include a fixed-function block 2330, also referred to as a sub-slice, coupled to a plurality of sub-cores 2301A-2301F, which includes modular blocks of general-purpose and fixed-function logic.

[0285] In at least one embodiment, the fixed-function block 2330 includes a geometry / fixed-function pipeline 2336, which, for example, may be shared by all sub-cores of the graphics processor 2300 in a lower-performance and / or lower-power graphics processor implementation. In at least one embodiment, the geometry / fixed-function pipeline 2336 includes a 3D fixed-function pipeline, a video front-end unit, a thread generator and a thread dispatcher, and a unified return buffer manager that manages a unified return buffer.

[0286] In at least one embodiment, fixed function block 2330 further includes a graphics SoC interface 2337, a graphics microcontroller 2338, and a media pipeline 2339. The graphics SoC interface 2337 provides an interface between the graphics core 2300 and other processor cores in the SoC integrated circuit system. In at least one embodiment, the graphics microcontroller 2338 is a programmable subprocessor configurable to manage various functions of the graphics processor 2300, including thread dispatch, scheduling, and preemption. In at least one embodiment, the media pipeline 2339 includes logic that facilitates decoding, encoding, preprocessing, and / or post-processing of multimedia data, including image and video data. In at least one embodiment, the media pipeline 2339 implements media operations via requests for computation or sampling logic within subcores 2301-2301F.

[0287] In at least one embodiment, the SoC interface 2337 enables the graphics core 2300 to communicate with a general-purpose application processor core (e.g., a CPU) and / or other components within the SoC, including memory hierarchy elements such as shared LLC memory, system RAM, and / or embedded on-chip or packaged DRAM. In at least one embodiment, the SoC interface 2337 also enables communication with fixed-function devices within the SoC (e.g., a camera imaging pipeline) and enables the use and / or implementation of global memory atoms that can be shared between the graphics core 2300 and the CPU within the SoC. In at least one embodiment, the SoC interface 2337 also implements power management control for the graphics core 2300 and enables interfacing between the clock domain of the graphics core 2300 and other clock domains within the SoC. In at least one embodiment, the SoC interface 2337 enables the reception of command buffers from a command stream converter and a global thread dispatcher, configured to provide commands and instructions to each of one or more graphics cores within the graphics processor. In at least one embodiment, when a media operation is to be performed, commands and instructions can be dispatched to media pipeline 2339, or when a graph processing operation is to be performed, they can be assigned to geometry and fixed function pipelines (e.g., geometry and fixed function pipelines 2336 and 2314).

[0288] In at least one embodiment, the graphics microcontroller 2338 may be configured to perform various scheduling and management tasks on the graphics core 2300. In at least one embodiment, the graphics microcontroller 2338 may perform graph and / or computation workload scheduling on various graphics parallel engines within the execution unit (EU) arrays 2302A-2302F, 2304A-2304F in the subcores 2301A-2301F. In at least one embodiment, host software executing on the CPU core of the SoC including the graphics core 2300 may submit a workload of one of a plurality of graphics processor doorbells, which invokes scheduling operations on the appropriate graphics engine. In at least one embodiment, the scheduling operation includes determining which workload should be run next, submitting the workload to a command stream converter, preempting existing workloads running on the engine, monitoring the progress of the workload, and notifying the host software when the workload is completed. In at least one embodiment, the graphics microcontroller 2338 may also facilitate a low-power or idle state of the graphics core 2300, thereby providing the graphics core 2300 with the ability to save and restore registers across low-power state transitions within the graphics core 2300, independent of the operating system and / or the graphics driver software on the system.

[0289] In at least one embodiment, the graphics core 2300 may have more or fewer subcores than the illustrated subcores 2301A-2301F, up to N modular subcores. For each group of N subcores, in at least one embodiment, the graphics core 2300 may further include shared functional logic 2310, shared and / or cache memory 2312, geometry / fixed-function pipeline 2314, and additional fixed-function logic 2316 to accelerate various graphics and computational processing operations. In at least one embodiment, the shared functional logic 2310 may include logic units (e.g., samplers, mathematical and / or inter-thread communication logic) that can be shared by each of the N subcores within the graphics core 2300. The shared and / or cache memory 2312 may be an LLC of the N subcores 2301A-2301F within the graphics core 2300, and may also be used as shared memory accessible by multiple subcores. In at least one embodiment, a geometry / fixed function pipeline 2314 may be included to replace the geometry / fixed function pipeline 2336 within the fixed function block 2330, and may include the same or similar logic units.

[0290] In at least one embodiment, the graphics core 2300 includes additional fixed-function logic 2316, which may include various fixed-function acceleration logics for use by the graphics core 2300. In at least one embodiment, the additional fixed-function logic 2316 includes additional geometry pipelines for use in position-only shading. In position-only shading, there are at least two geometry pipelines, and in the full geometry pipeline and culling pipeline within the geometry / fixed-function pipelines 2316, 2336, it is an additional geometry pipeline that can be included in the additional fixed-function logic 2316. In at least one embodiment, the culling pipeline is a trimmed version of the full geometry pipeline. In at least one embodiment, the full pipeline and the culling pipeline can execute different instances of the application, each with a separate environment. In at least one embodiment, position-only shading can hide long culling runs of discarded triangles, thereby allowing shading to be completed earlier in some cases. For example, in at least one embodiment, the culling pipeline logic in the additional fixed-function logic 2316 can execute the position shader in parallel with the main application and typically generates critical results faster than the full pipeline because the culling pipeline acquires and occludes the positional attributes of vertices without performing rasterization and rendering pixels to the framebuffer. In at least one embodiment, the culling pipeline can use the generated critical results to compute visibility information for all triangles, regardless of whether those triangles were culled. In at least one embodiment, the full pipeline (which may be referred to as the replay pipeline in this case) can consume visibility information to skip culled triangles and only occlude the visible triangles that are ultimately passed to the rasterization stage.

[0291] In at least one embodiment, the additional fixed-function logic 2316 may also include general target processing acceleration logic, such as fixed-function matrix multiplication logic, for implementing a decelerated CUAD program.

[0292] In at least one embodiment, each graphics subcore 2301A-2301F includes a set of execution resources that can be used to perform graph, media, and computational operations in response to requests from the graphics pipeline, media pipeline, or shader program. In at least one embodiment, the graphics subcore 2301A-2301F includes multiple EU arrays 2302A-2302F, 2304A-2304F, thread dispatch and inter-thread communication (TD / IC) logic 2303A-2303F, 3D (e.g., texture) samplers 2305A-2305F, media samplers 2306A-2306F, shader processors 2307A-2307F, and shared local memory (SLM) 2308A-2308F. Each of the EU arrays 2302A-2302F and 2304A-2304F contains multiple execution units, which are GUGPUs capable of servicing graphics, media, or computational operations, performing floating-point and integer / fixed-point logic operations, including graphics, media, or computational shader programs. In at least one embodiment, the TD / IC logic 2303A-2303F performs local thread dispatch and thread control operations for the execution units within the subcore and facilitates communication between threads executing on the execution units of the subcore. In at least one embodiment, the 3D samplers 2305A-2305F can read data associated with textures or other 3D graphics into memory. In at least one embodiment, the 3D samplers can read texture data differently based on the sampling state and texture format configured and associated with a given texture. In at least one embodiment, the media samplers 2306A-2306F can perform similar read operations based on the type and format associated with the media data. In at least one embodiment, each graphics subcore 2301A-2301F may alternatively include a unified 3D and media sampler. In at least one embodiment, threads executing on execution units within each subcore 2301A-2301F may utilize shared local memory 2308A-2308F within each subcore, enabling threads executing within a thread group to use a common pool of on-chip memory for execution.

[0293] Figure 24A parallel processing unit (“PPU”) 2400 according to at least one embodiment is illustrated. In at least one embodiment, the PPU 2400 is configured with machine-readable code that, if executed by the PPU 2400, causes the PPU 2400 to perform some or all of the processes and techniques described herein. In at least one embodiment, the PPU 2400 is a multi-threaded processor implemented on one or more integrated circuit devices and utilizes multi-threading as a latency-hiding technique designed to process computer-readable instructions (also known as machine-readable instructions or simple instructions) that are executed in parallel on multiple threads. In at least one embodiment, a thread refers to an execution thread and is an instance of a set of instructions configured to be executed by the PPU 2400. In at least one embodiment, the PPU 2400 is a graphics processing unit (“GPU”) configured to implement a graphics rendering pipeline for processing three-dimensional (“3D”) graphics data to generate two-dimensional (“2D”) image data for display on a display device, such as an LCD device. In at least one embodiment, the PPU 2400 is used to perform computations, such as linear algebra operations and machine learning operations. Figure 24 An example parallel processor is shown for illustrative purposes only and should be interpreted as a non-limiting example of a processor architecture implemented in at least one embodiment.

[0294] In at least one embodiment, one or more PPUs 2400 are configured to accelerate high-performance computing (“HPC”), data center, and machine learning applications. In at least one embodiment, one or more PPUs 2400 are configured to accelerate CUDA programs. In at least one embodiment, the PPU 2400 includes, but is not limited to, I / O unit 2406, front-end unit 2410, scheduler unit 2412, job allocation unit 2414, hub 2416, crossbar switch (“Xbar”) 2420, one or more general-purpose processing clusters (“GPC”) 2418, and one or more partitioning units (“memory partitioning units”) 2422. In at least one embodiment, the PPU 2400 is connected to a host processor or other PPU 2400 via one or more high-speed GPU interconnects (“GPU interconnects”) 2408. In at least one embodiment, the PPU 2400 is connected to a host processor or other peripheral devices via a system bus or interconnect 2402. In one embodiment, the PPU 2400 is connected to a local memory including one or more memory devices (“memory”) 2404. In at least one embodiment, the memory device 2404 includes, but is not limited to, one or more dynamic random access memory (“DRAM”) devices. In at least one embodiment, the one or more DRAM devices are configured and / or configurable as a high bandwidth memory (“HBM”) subsystem, and multiple DRAM dies are stacked within each device.

[0295] In at least one embodiment, the high-speed GPU interconnect 2408 may refer to a wire-based multi-channel communication link used by the system for scaling, and includes one or more PPUs 2400s (“CPUs”) coupled with one or more CPUs, supporting cache coherency between the PPUs 2400s and the CPUs, as well as CPU master control. In at least one embodiment, the high-speed GPU interconnect 2408 transmits data and / or commands to other units of the PPU 2400, such as one or more copy engines, video encoders, video decoders, power management units, and / or other components, via a hub 2416. Figure 24 Other components that may not be explicitly shown.

[0296] In at least one embodiment, the I / O unit 2406 is configured to access the host processor via the system bus 2402. Figure 24(Not shown) Sending and receiving communications (e.g., commands, data). In at least one embodiment, I / O unit 2406 communicates directly with the host processor via system bus 2402 or via one or more intermediate devices (e.g., memory bridges). In at least one embodiment, I / O unit 2406 may communicate with one or more other processors (e.g., one or more PPUs 2400) via system bus 2402. In at least one embodiment, I / O unit 2406 implements a PCIe interface for communication via the PCIe bus. In at least one embodiment, I / O unit 2406 implements an interface for communication with external devices.

[0297] In at least one embodiment, I / O unit 2406 decodes packets received via system bus 2402. In at least one embodiment, at least some packets represent commands configured to cause PPU 2400 to perform various operations. In at least one embodiment, I / O unit 2406 sends the decoded commands to various other units of PPU 2400 as specified by the commands. In at least one embodiment, the commands are sent to front-end unit 2410 and / or to hub 2416 or other units of PPU 2400, such as one or more copy engines, video encoders, video decoders, power management units, etc. Figure 24 (Not explicitly shown). In at least one embodiment, I / O unit 2406 is configured to route communication between various logical units of PPU 2400.

[0298] In at least one embodiment, a program executed by the host processor encodes a command stream in a buffer that provides a workload to the PPU 2400 for processing. In at least one embodiment, the workload includes instructions and data to be processed by those instructions. In at least one embodiment, the buffer is a region in memory accessible (e.g., read / write) by both the host processor and the PPU 2400—the host interface unit can be configured to access a buffer in system memory connected to the system bus 2402 via memory requests transmitted through the system bus 2402 via the I / O unit 2406. In at least one embodiment, the host processor writes a command stream to the buffer and then sends a pointer indicating the start of the command stream to the PPU 2400, such that the front-end unit 2410 receives pointers to one or more command streams and manages one or more command streams, reads commands from the command streams, and forwards the commands to the respective units of the PPU 2400.

[0299] In at least one embodiment, front-end unit 2410 is coupled to scheduler unit 2412, which configures various GPCs 2418 to process tasks defined by one or more command streams. In at least one embodiment, scheduler unit 2412 is configured to track status information related to the various tasks managed by scheduler unit 2412, wherein the status information may indicate which GPC 2418 a task is assigned to, whether the task is active or inactive, the priority associated with the task, etc. In at least one embodiment, scheduler unit 2412 manages multiple tasks executed on one or more GPCs 2418.

[0300] In at least one embodiment, scheduler unit 2412 is coupled to job allocation unit 2414, which is configured to dispatch tasks for execution on GPC 2418. In at least one embodiment, job allocation unit 2414 tracks multiple scheduled tasks received from scheduler unit 2412 and manages a pool of pending tasks and an active task pool for each GPC 2418. In at least one embodiment, the pool of pending tasks includes multiple time slots (e.g., 32 time slots) containing tasks assigned to a particular GPC 2418; the active task pool may include multiple time slots (e.g., 4 time slots) for tasks actively processed by GPC 2418, such that as one of the GPCs 2418 completes its execution, that task is evicted from the active task pool of the GPC 2418, and one of other tasks is selected from the pool of pending tasks and scheduled for execution on the GPC 2418. In at least one embodiment, if an active task is idle on GPC 2418, for example while waiting for data dependency resolution, the active task is evicted from GPC 2418 and returned to the task pool, while another task in the task pool is selected and scheduled to be executed on GPC 2418.

[0301] In at least one embodiment, the work allocation unit 2414 communicates with one or more GPCs 2418 via XBar 2420. In at least one embodiment, XBar 2420 is an interconnect network that couples a plurality of units of PPU 2400 to other units of PPU 2400, and can be configured to couple the work allocation unit 2414 to a specific GPC 2418. In at least one embodiment, other units of one or more PPUs 2400 can also be connected to XBar 2420 via hub 2416.

[0302] In at least one embodiment, tasks are managed by scheduler unit 2412 and assigned to one of GPCs 2418 by job allocation unit 2414. GPCs 2418 are configured to process tasks and produce results. In at least one embodiment, results may be consumed by other tasks in GPCs 2418, routed to different GPCs 2418 via XBar 2420, or stored in memory 2404. In at least one embodiment, results may be written to memory 2404 via partitioning unit 2422, which implements a memory interface for writing data to or reading data from memory 2404. In at least one embodiment, results may be transferred to another PPU 2400 or CPU via high-speed GPU interconnect 2408. In at least one embodiment, the PPU 2400 includes, but is not limited to, U partitioning units 2422, which is equal to the number of separate and distinct memory devices 2404 coupled to the PPU 2400.

[0303] In at least one embodiment, the host processor executes a driver core that implements an application programming interface (API) that enables one or more applications executing on the host processor to schedule operations for execution on the PPU 2400. In one embodiment, multiple computing applications are executed concurrently by the PPU 2400, and the PPU 2400 provides isolation, Quality of Service (“QoS”), and independent address spaces for the multiple computing applications. In at least one embodiment, an application generates instructions (e.g., in the form of API calls) that cause the driver core to generate one or more tasks for execution by the PPU 2400, and the driver core outputs the tasks to one or more streams processed by the PPU 2400. In at least one embodiment, each task includes one or more associated thread groups, which may be referred to as a warp. In at least one embodiment, a warp includes multiple associated threads (e.g., 32 threads) that can be executed in parallel. In at least one embodiment, a cooperating thread may refer to multiple threads, including instructions for performing tasks and exchanging data via shared memory.

[0304] Figure 25 A GPC2500 according to at least one embodiment is shown. In at least one embodiment, the GPC 2500 is Figure 24The GPC 2418. In at least one embodiment, each GPC 2500 includes, but is not limited to, multiple hardware units for processing tasks, and each GPC 2500 includes, but is not limited to, a pipeline manager 2502, a pre-raster operation unit (“PROP”) 2504, a raster engine 2508, a work assignment crossbar switch (“WDX”) 2516, a memory management unit (“MMU”) 2518, one or more data processing clusters (“DPC”) 2506, and any suitable combination of components.

[0305] In at least one embodiment, the operation of GPC 2500 is controlled by pipeline manager 2502. In at least one embodiment, pipeline manager 2502 manages the configuration of one or more DPCs 2506 to handle tasks assigned to GPC 2500. In at least one embodiment, pipeline manager 2502 configures at least one of one or more DPCs 2506 to implement at least a portion of the graphics rendering pipeline. In at least one embodiment, DPC 2506 is configured to execute vertex shader programs on programmable streaming multiprocessor (“SM”) 2514. In at least one embodiment, pipeline manager 2502 is configured to route packets received from the work allocation unit to appropriate logic units within GPC 2500, and in at least one embodiment, some packets may be routed to fixed-function hardware units in PROP 2504 and / or raster engine 2508, while other packets may be routed to DPC 2506 for processing by raw engine 2512 or SM 2514. In at least one embodiment, pipeline manager 2502 configures at least one of DPCs 2506 to implement a neural network model and / or computation pipeline. In at least one embodiment, pipeline manager 2502 configures at least one of DPCs 2506 to execute at least a portion of a CUDA program.

[0306] In at least one embodiment, the PROP unit 2504 is configured to route data generated by the raster engine 2508 and DPC 2506 to the raster operation (“ROP”) unit in the partition unit, for example, in conjunction with the above. Figure 24Memory partitioning unit 2422, etc., are described in more detail. In at least one embodiment, PROP unit 2504 is configured to perform optimizations for color blending, organize pixel data, perform address translation, etc. In at least one embodiment, raster engine 2508 includes, but is not limited to, multiple fixed-function hardware units configured to perform various raster operations, and in at least one embodiment, raster engine 2508 includes, but is not limited to, a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, a tile aggregation engine, and any suitable combination thereof. In at least one embodiment, the setup engine receives the transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices; the plane equations are transmitted to the coarse raster engine to generate coverage information of basic primitives (e.g., x, y coverage masks of tiles); the output of the coarse raster engine is transmitted to the culling engine, in which fragments associated with primitives that fail the z-test are culled, and transmitted to the clipping engine, in which fragments located outside the view frustum are clipped. In at least one embodiment, the cropped and culled fragments are passed to a fine raster engine to generate properties of pixel fragments based on a planar equation generated by the settings engine. In at least one embodiment, the output of the raster engine 2508 includes fragments that will be processed by any suitable entity (e.g., by a fragment shader implemented within the DPC 2506).

[0307] In at least one embodiment, each DPC 2506 included in the GPC 2500 includes, but is not limited to, an M-pipeline controller (“MPC”) 2510; a primitive engine 2512; one or more SMs 2514; and any suitable combination thereof. In at least one embodiment, the MPC 2510 controls the operation of the DPC 2506, routing packets received from the pipeline manager 2502 to the appropriate units within the DPC 2506. In at least one embodiment, packets associated with vertices are routed to the primitive engine 2512, which is configured to retrieve vertex attributes associated with vertices from memory; conversely, packets associated with shader programs may be sent to the SM 2514.

[0308] In at least one embodiment, the SM 2514 includes, but is not limited to, a programmable streaming processor configured to process tasks represented by multiple threads. In at least one embodiment, the SM 2514 is multithreaded and configured to execute multiple threads (e.g., 32 threads) from a specific thread group concurrently, and implements a Single Instruction, Multiple Data (“SIMD”) architecture, wherein each thread in a group of threads (e.g., a thread bundle) is configured to process a different dataset based on the same instruction set. In at least one embodiment, all threads in the thread group execute the same instructions. In at least one embodiment, the SM 2514 implements a Single Instruction, Multiple Thread (“SIMT”) architecture, wherein each thread in a group of threads is configured to process a different dataset based on the same instruction set, but wherein individual threads in the thread group are allowed to diverge during execution. In at least one embodiment, a program counter, call stack, and execution state are maintained for each thread bundle, thereby achieving concurrency between the thread bundle and serial execution within the thread bundle when threads in the thread bundle diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, thereby ensuring equal concurrency among all threads within and between thread bundles. In at least one embodiment, an execution state is maintained for each individual thread, and threads executing the same instructions can be converged and executed in parallel to improve efficiency. The following is in conjunction with... Figure 26 At least one embodiment of SM 2514 is described in more detail.

[0309] In at least one embodiment, the MMU 2518 is integrated with the GPC 2500 and memory partitioning unit (e.g., Figure 24 The MMU 2518 provides an interface between partition units 2422 and provides virtual address to physical address translation, memory protection, and memory request arbitration. In at least one embodiment, the MMU 2518 provides one or more translation back buffers (“TLBs”) for performing virtual address to physical address translation in memory.

[0310] Figure 26 A streaming multiprocessor (“SM”) 2600 according to at least one embodiment is illustrated. In at least one embodiment, the SM 2600 is Figure 25SM 2514. In at least one embodiment, SM 2600 includes, but is not limited to, instruction cache 2602; one or more scheduler units 2604; register file 2608; one or more processing cores (“cores”) 2610; one or more special function units (“SFUs”) 2612; one or more load / store units (“LSUs”) 2614; interconnect network 2616; shared memory / Level 1 (“L1”) cache 2618; and any suitable combination thereof. In at least one embodiment, the work allocation unit schedules tasks to execute on a general-purpose processing cluster (“GPC”) of parallel processing units (“PPUs”), and each task is assigned to a specific data processing cluster (“DPC”) within the GPC, and if the task is associated with a shader program, the task is assigned to one of the SMs 2600. In at least one embodiment, scheduler unit 2604 receives tasks from the work allocation unit and manages instruction scheduling for one or more thread blocks assigned to the SM 2600. In at least one embodiment, scheduler unit 2604 schedules thread blocks to execute as thread bundles of parallel threads, wherein each thread block is assigned at least one thread bundle. In at least one embodiment, each thread bundle executes a thread. In at least one embodiment, scheduler unit 2604 manages multiple different thread blocks, assigns thread bundles to different thread blocks, and then dispatches instructions from multiple different cooperative groups to various functional units (e.g., processing core 2610, SFU 2612, and LSU 2614) in each clock cycle.

[0311] In at least one embodiment, a "cooperative group" can refer to a programming model used to organize groups of communicating threads, allowing developers to express the granularity at which threads are communicating, thereby enabling richer and more efficient parallel decompositions. In at least one embodiment, the cooperative startup API supports synchronization between thread blocks to execute parallel algorithms. In at least one embodiment, the API of a conventional programming model provides a single, simple construct for synchronizing cooperative threads: a barrier across all threads in a thread block (e.g., the `syncthreads()` function). However, in at least one embodiment, programmers can define thread groups at a granularity smaller than that of thread blocks and synchronize within the defined groups to achieve higher performance, design flexibility, and software reuse in the form of a set of group-wide functional interfaces. In at least one embodiment, cooperative groups enable programmers to explicitly define thread groups at the sub-block and multi-block granularity and perform set operations, such as synchronizing threads within the cooperative group. In at least one embodiment, the sub-block granularity is as small as that of a single thread. In at least one embodiment, the programming model supports clean composition across software boundaries, allowing library and utility functions to be safely synchronized in their local environment without having to make assumptions about convergence. In at least one embodiment, the cooperative group primitives enable new patterns of cooperative parallelism, including but not limited to producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire thread block mesh.

[0312] In at least one embodiment, dispatch unit 2606 is configured to send instructions to one or more functional units, and scheduler unit 2604 includes, but is not limited to, two dispatch units 2606 that enable two different instructions from the same thread bundle to be dispatched in each clock cycle. In at least one embodiment, each scheduler unit 2604 includes a single dispatch unit 2606 or additional dispatch units 2606.

[0313] In at least one embodiment, each SM 2600 includes, but is not limited to, a register file 2608 that provides a set of registers for functional units of the SM 2600. In at least one embodiment, the register file 2608 is partitioned between each functional unit, thereby allocating a dedicated portion of the register file 2608 for each functional unit. In at least one embodiment, the register file 2608 is partitioned between different thread bundles executed by the SM 2600, and the register file 2608 provides temporary storage for operands in data paths connected to functional units. In at least one embodiment, each SM 2600 includes, but is not limited to, a plurality of L processing cores 2610. In at least one embodiment, the SM 2600 includes, but is not limited to, a large number (e.g., 128 or more) of different processing cores 2610. In at least one embodiment, each processing core 2610 includes, but is not limited to, a fully pipelined, single-precision, double-precision, and / or mixed-precision processing unit, which includes, but is not limited to, a floating-point arithmetic logic unit and an integer arithmetic logic unit. In at least one embodiment, the floating-point arithmetic logic unit implements the IEEE 754-2008 standard for floating-point arithmetic. In at least one embodiment, the processing core 2610 includes, but is not limited to, 64 single-precision (32-bit) floating-point cores, 64 integer cores, 32 double-precision (64-bit) floating-point cores and 8 tensor cores.

[0314] In at least one embodiment, the tensor core is configured to perform matrix operations. In at least one embodiment, one or more tensor cores are included in the processing core 2610. In at least one embodiment, the tensor core is configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inference. In at least one embodiment, each tensor core operates on a 4×4 matrix and performs matrix multiplication and accumulation operations D = A×B + C, where A, B, C, and D are 4×4 matrices.

[0315] In at least one embodiment, matrix multiplication inputs A and B are 16-bit floating-point matrices, and accumulation matrices C and D are either 16-bit or 32-bit floating-point matrices. In at least one embodiment, the Tensor Core performs 32-bit floating-point accumulation on the 16-bit floating-point input data. In at least one embodiment, the 16-bit floating-point multiplication uses 64 operations to obtain a full-precision product, which is then accumulated with other intermediate multiplications using 32-bit floating-point addition to perform a 4x4x4 matrix multiplication. In at least one embodiment, the Tensor Core is used to perform matrix operations on larger two-dimensional or higher-dimensional matrices composed of these smaller components. In at least one embodiment, an API (such as the CUDA-C++ API) exposes specialized matrix loading, matrix multiplication and accumulation, and matrix storage operations to efficiently utilize the Tensor Core from CUDA-C++ programs. In at least one embodiment, at the CUDA level, the thread bundle level interface assumes a 16×16 matrix spanning all 32 thread bundle threads.

[0316] In at least one embodiment, each SM 2600 includes, but is not limited to, M SFUs 2612 that perform special functions (e.g., attribute evaluation, inverse square root, etc.). In at least one embodiment, the SFUs 2612 include, but are not limited to, tree traversal units configured to traverse hierarchical tree data structures. In at least one embodiment, the SFUs 2612 include, but are not limited to, texture units configured to perform texture map filtering operations. In at least one embodiment, the texture unit is configured to load texture maps (e.g., a 2D array of texture pixels) from memory and sample the texture maps to produce sampled texture values for use by a shader program executed by the SM 2600. In at least one embodiment, the texture maps are stored in shared memory / L1 cache 2618. In at least one embodiment, the texture unit uses mip maps (e.g., texture maps with different levels of detail) to implement texture operations (such as filtering operations). In at least one embodiment, each SM 2600 includes, but is not limited to, two texture units.

[0317] In at least one embodiment, each SM 2600 includes, but is not limited to, N LSUs 2614 that implement load and store operations between the shared memory / L1 cache 2618 and the register file 2608. In at least one embodiment, each SM 2600 includes, but is not limited to, an interconnect network 2616 that connects each functional unit to the register file 2608, and the LSUs 2614 that connect to both the register file 2608 and the shared memory / L1 cache 2618. In at least one embodiment, the interconnect network 2616 is a crossbar switch that can be configured to connect any functional unit to any register in the register file 2608 and to connect the LSUs 2614 to memory locations in both the register file 2608 and the shared memory / L1 cache 2618.

[0318] In at least one embodiment, the shared memory / L1 cache 2618 is an array of on-chip memory that, in at least one embodiment, allows data storage and communication between the SM 2600 and the primitive engine, as well as between threads within the SM 2600. In at least one embodiment, the shared memory / L1 cache 2618 includes, but is not limited to, a storage capacity of 128KB and is located on the path from the SM 2600 to the partition unit. In at least one embodiment, the shared memory / L1 cache 2618 is used for cache reads and writes. In at least one embodiment, one or more of the shared memory / L1 cache 2618, the L2 cache, and the memory are backup storage.

[0319] In at least one embodiment, combining data caching and shared memory functionality into a single memory block provides improved performance for both types of memory access. In at least one embodiment, the capacity is used by programs that do not use shared memory or is used as a cache; for example, if shared memory is configured to use half its capacity, texture and load / store operations can use the remaining capacity. According to at least one embodiment, integration within the shared memory / L1 cache 2618 enables the shared memory / L1 cache 2618 to be used as a high-throughput pipeline for streaming data, while providing high-bandwidth and low-latency access to frequently reused data. In at least one embodiment, a simpler configuration can be used compared to graphics processing when configured for general-purpose parallel computing. In at least one embodiment, a simpler programming model is created by bypassing fixed-function GPUs. In at least one embodiment, in a general-purpose parallel computing configuration, the work allocation unit directly allocates and distributes blocks of threads to the DPC. In at least one embodiment, threads within a block execute the same program, using unique thread IDs in computation to ensure each thread produces a unique result, using an SM 2600 to execute the program and perform computations, using a shared memory / L1 cache 2618 for communication between threads, and using an LSU 2614 to read and write global memory via the shared memory / L1 cache 2618 and memory partitioning units. In at least one embodiment, when configured for general-purpose parallel computing, the SM 2600 writes commands to the scheduler unit 2604 that can be used to start new work on the DPC.

[0320] In at least one embodiment, the PPU is included in or coupled to a desktop computer, laptop computer, tablet computer, server, supercomputer, smartphone (e.g., wireless, handheld device), PDA, digital camera, vehicle, head-mounted display, handheld electronic device, etc. In at least one embodiment, the PPU is implemented on a single semiconductor substrate. In at least one embodiment, the PPU is included in a system-on-a-chip (“SoC”) along with one or more other devices (e.g., additional PPUs, memory, RISC CPU, MMU, digital-to-analog converter (“DAC”), etc.).

[0321] In at least one embodiment, the PPU may be included on a graphics card that includes one or more storage devices. The graphics card may be configured to connect to a PCIe slot on a desktop computer motherboard. In at least one embodiment, the PPU may be an integrated GPU (“iGPU”) included in the motherboard's chipset.

[0322] Software architecture for general-purpose computing

[0323] The following figures illustrate, but are not limited to, exemplary software constructions for implementing at least one embodiment.

[0324] Figure 27 A software stack of a programming platform according to at least one embodiment is illustrated. In at least one embodiment, the programming platform is a platform for accelerating computational tasks by utilizing hardware on a computing system. In at least one embodiment, software developers can access the programming platform through libraries, compiler instructions, and / or extensions to programming languages. In at least one embodiment, the programming platform may be, but is not limited to, CUDA, Radeon Open Computing Platform (“ROCm”), OpenCL (OpenCL developed by Khronosgroup). TM ), SYCL or Intel One API.

[0325] In at least one embodiment, the software stack 2700 of the programming platform provides an execution environment for the application 2701. In at least one embodiment, the application 2701 may include any computer software capable of being launched on the software stack 2700. In at least one embodiment, the application 2701 may include, but is not limited to, artificial intelligence (“AI”) / machine learning (“ML”) applications, high-performance computing (“HPC”) applications, virtual desktop infrastructure (“VDI”) or data center workloads.

[0326] In at least one embodiment, application 2701 and software stack 2700 run on hardware 2707. In at least one embodiment, hardware 2707 may include one or more GPUs, CPUs, FPGAs, AI engines, and / or other types of computing devices supporting a programming platform. In at least one embodiment, such as using CUDA, software stack 2700 may be vendor-specific and compatible only with devices from a specific vendor. In at least one embodiment, such as using OpenCL, software stack 2700 may be used with devices from different vendors. In at least one embodiment, hardware 2707 includes a host connected to one or more devices that can be accessed via application programming interface (API) calls to perform computational tasks. In at least one embodiment, compared to the host within hardware 2707, which may include, but is not limited to, a CPU (but may also include computing devices) and its memory, devices within hardware 2707 may include, but are not limited to, GPUs, FPGAs, AI engines, or other computing devices (but may also include CPUs) and their memory.

[0327] In at least one embodiment, the software stack 2700 of the programming platform includes, but is not limited to, multiple libraries 2703, a runtime 2705, and a device kernel driver 2706. In at least one embodiment, each library 2703 may include data and programming code that can be used by a computer program and utilized during software development. In at least one embodiment, library 2703 may include, but is not limited to, pre-written code and subroutines, classes, values, type specifications, configuration data, documentation, help data, and / or message templates. In at least one embodiment, library 2703 includes functions optimized for execution on one or more types of devices. In at least one embodiment, library 2703 may include, but is not limited to, functions for performing mathematical, deep learning, and / or other types of operations on the device. In at least one embodiment, library 2703 is associated with a corresponding API 2702, which may include one or more APIs that expose functions implemented in library 2703.

[0328] In at least one embodiment, application 2701 is written as source code, which is compiled into executable code, as follows: Figures 32-34 This will be discussed in more detail. In at least one embodiment, the executable code of application 2701 may run at least partially on an execution environment provided by software stack 2700. In at least one embodiment, during the execution of application 2701, code that needs to run on the device (compared to the host) may be obtained. In this case, in at least one embodiment, runtime 2705 may be invoked to load and start the necessary code on the device. In at least one embodiment, runtime 2705 may include any technically feasible runtime system capable of supporting the execution of application 2701.

[0329] In at least one embodiment, runtime 2705 is implemented as one or more runtime libraries associated with a corresponding API (shown as API 2704). In at least one embodiment, one or more such runtime libraries may include, but are not limited to, functions for memory management, execution control, device management, error handling, and / or synchronization, etc. In at least one embodiment, memory management functions may include, but are not limited to, functions for allocating, dealing with, and copying device memory, and for transferring data between host memory and device memory. In at least one embodiment, execution control functions may include, but are not limited to, functions for launching functions on the device (sometimes referred to as "kernels" when the function is a global function that can be called from the host), and functions for setting attribute values in buffers maintained by the runtime library for a given function to be executed on the device.

[0330] In at least one embodiment, the runtime library and the corresponding API 2704 can be implemented in any technically feasible manner. In at least one embodiment, one (or any number of) APIs may expose a low-level set of functions for fine-grained control of the device, while another (or any number of) APIs may expose such a higher-level set of functions. In at least one embodiment, a high-level runtime API can be built on top of the low-level APIs. In at least one embodiment, one or more runtime APIs may be language-specific APIs layered on top of language-independent runtime APIs.

[0331] In at least one embodiment, device kernel driver 2706 is configured to facilitate communication with the underlying device. In at least one embodiment, device kernel driver 2706 can provide APIs such as API 2704 and / or low-level functions upon which other software depends. In at least one embodiment, device kernel driver 2706 can be configured to compile intermediate representation (“IR”) code into binary code at runtime. In at least one embodiment, for CUDA, device kernel driver 2706 can compile non-hardware-specific parallel thread execution (“PTX”) IR code into binary code for a specific target device (cached compiled binary code), sometimes also referred to as “final” code. In at least one embodiment, doing so allows the final code to run on the target device, which may not exist when the source code was initially compiled into PTX code. Alternatively, in at least one embodiment, the device source code can be compiled into binary code offline, without requiring device kernel driver 2706 to compile the IR code at runtime.

[0332] Figure 28 The illustration shows an embodiment according to at least one of the embodiments. Figure 27 The software stack 2700 is a CUDA implementation. In at least one embodiment, the CUDA software stack 2800 on which an application 2801 can be launched includes a CUDA library 2803, a CUDA runtime 2805, a CUDA driver 2807, and a device kernel driver 2808. In at least one embodiment, the CUDA software stack 2800 executes on hardware 2809, which may include a CUDA-enabled GPU developed by NVIDIA Corporation of Santa Clara, California.

[0333] In at least one embodiment, application 2801, CUDA runtime 2805, and device kernel driver 2808 can respectively perform functions similar to those of application 2701, runtime 2705, and device kernel driver 2706, in combination with the above. Figure 27The CUDA driver 2807 is described in at least one embodiment. In at least one embodiment, the CUDA driver API 2807 includes a library (libcuda.so) implementing the CUDA driver API 2806. In at least one embodiment, similar to the CUDA runtime API 2804 implemented by the CUDA runtime library (cudart), the CUDA driver API 2806 may expose, but is not limited to, functions for memory management, execution control, device management, error handling, synchronization, and / or graphics interoperability. In at least one embodiment, the CUDA driver API 2806 differs from the CUDA runtime API 2804 in that the CUDA runtime API 2804 simplifies device code management by providing implicit initialization, context (similar to processes) management, and module (similar to dynamically loaded libraries) management. In contrast to the high-level CUDA runtime API 2804, in at least one embodiment, the CUDA driver API 2806 is a low-level API that provides finer-grained control over the device, particularly regarding context and module loading. In at least one embodiment, the CUDA driver API 2806 may expose functions for context management that are not exposed by the CUDA runtime API 2804. In at least one embodiment, the CUDA driver API 2806 is also language-independent and supports, in addition to the CUDA runtime API 2804, OpenCL, for example. Furthermore, in at least one embodiment, development libraries, including the CUDA runtime 2805, can be considered separate from the driver components, including the user-mode CUDA driver 2807 and the kernel-mode device driver 2808 (sometimes also referred to as the "display" driver).

[0334] In at least one embodiment, CUDA library 2803 may include, but is not limited to, mathematical libraries, deep learning libraries, parallel algorithm libraries, and / or signal / image / video processing libraries, which parallel computing applications (e.g., application 2801) may utilize. In at least one embodiment, CUDA library 2803 may include mathematical libraries, such as the cuBLAS library, which is an implementation of basic linear algebra subroutines (“BLAS”) for performing linear algebra operations; the cuFFT library for computing the Fast Fourier Transform (“FFT”); and the cuRAND library for generating random numbers, etc. In at least one embodiment, CUDA library 2803 may include deep learning libraries, such as the cuDNN library for primitives of deep neural networks and the TensorRT platform for high-performance deep learning inference, etc.

[0335] Figure 29 The illustration shows an embodiment according to at least one of the embodiments. Figure 27The software stack 2700 is a ROCm implementation. In at least one embodiment, the ROCm software stack 2900 on which application 2901 can be launched includes a language runtime 2903, a system runtime 2905, a thunk 2907, and a ROCm kernel driver 2908. In at least one embodiment, the ROCm software stack 2900 executes on hardware 2909, which may include a ROCm-enabled GPU developed by AMD Inc. of Santa Clara, California.

[0336] In at least one embodiment, application 2901 can perform the above-described combination. Figure 27 The discussed application 2701 has similar functionality. Additionally, in at least one embodiment, the language runtime 2903 and system runtime 2905 can perform functions combined with the above. Figure 27 The runtime 2705 discussed has similar functionality. In at least one embodiment, the language runtime 2903 and the system runtime 2905 differ in that the system runtime 2905 is a language-independent runtime that implements the ROCr System Runtime API 2904 and utilizes the Heterogeneous System Architecture (“HSA”) runtime API. In at least one embodiment, the HSA runtime API is a thin-user mode API that exposes interfaces for accessing and interacting with the AMD GPU, including functions for memory management, kernel execution control dispatched by the architecture, error handling, system and agent information, and runtime initialization and shutdown, etc. In at least one embodiment, compared to the system runtime 2905, the language runtime 2903 is an implementation of a language-specific runtime API 2902 layered on top of the ROCr System Runtime API 2904. In at least one embodiment, the language runtime API may include, but is not limited to, the Portable Heterogeneous Computing Interface (“HIP”) language runtime API, the Heterogeneous Computing Compiler (“HCC”) language runtime API, or the OpenCL API, etc. In particular, the HIP language is an extension of the C++ programming language, a functionally similar version with CUDA mechanisms, and in at least one embodiment, the HIP language runtime API includes elements combined with the above. Figure 28 The discussion focuses on functions similar to those in CUDA runtime API 2804, such as those used for memory management, execution control, device management, error handling, and synchronization.

[0337] In at least one embodiment, the thunk (ROCt) 2907 is an interface 2906 that can be used to interact with the underlying ROCm driver 2908. In at least one embodiment, the ROCm driver 2908 is a ROCk driver, which is a combination of an AMD GPU driver and an HSA kernel driver (amdkfd). In at least one embodiment, the AMD GPU driver is a device kernel driver for GPUs developed by AMD, which performs the above-described combination. Figure 27 The device kernel driver 2706 discussed has similar functionality. In at least one embodiment, the HSA kernel driver is a driver that allows different types of processors to share system resources more efficiently via hardware features.

[0338] In at least one embodiment, various libraries (not shown) may be included in the ROCm software stack 2900 above the language runtime 2903, and provide integration with the above. Figure 28 The discussed CUDA library 2803 has similar functionality. In at least one embodiment, various libraries may include, but are not limited to, mathematical, deep learning, and / or other libraries, such as the hipBLAS library which implements functions similar to CUDA cuBLAS, the rocFFT library which is similar to CUDA cuFFT for computing FFT, etc.

[0339] Figure 30 The illustration shows an embodiment according to at least one of the embodiments. Figure 27 The software stack 2700 is an OpenCL implementation. In at least one embodiment, the OpenCL software stack 3000 on which the application 3001 can be launched includes an OpenCL framework 3010, an OpenCL runtime 3006, and a driver 3007. In at least one embodiment, the OpenCL software stack 3000 executes on non-vendor-specific hardware 2809. In at least one embodiment, because devices developed by different vendors support OpenCL, specific OpenCL drivers may be required for interoperability with hardware from such vendors.

[0340] In at least one embodiment, the application 3001, the OpenCL runtime 3006, the device kernel driver 3007, and the hardware 3008 can respectively execute the above-described combination. Figure 27 The application 2701, runtime 2705, device kernel driver 2706, and hardware 2707 discussed have similar functionality. In at least one embodiment, application 3001 also includes an OpenCL kernel 3002 with code that will be executed on the device.

[0341] In at least one embodiment, OpenCL defines a "platform" that allows a host to control devices connected to that host. In at least one embodiment, the OpenCL framework provides a platform-level API and a runtime API, shown as Platform API 3003 and Runtime API 3005. In at least one embodiment, Runtime API 3005 uses a context to manage the execution of the kernel on the device. In at least one embodiment, each identified device can be associated with a respective context, which Runtime API 3005 can use to manage the device's command queue, program objects and kernel objects, shared memory objects, etc. In at least one embodiment, Platform API 3003 discloses functions that allow the device context to select and initialize devices, submit work to devices via command queues, and enable data transfers to and from devices, etc. Additionally, in at least one embodiment, the OpenCL framework provides various built-in functions (not shown), including mathematical functions, relational functions, and image processing functions, etc.

[0342] In at least one embodiment, compiler 3004 is also included in the OpenCL framework 3010. In at least one embodiment, the source code may be compiled offline before the application is executed or compiled online during the execution of the application. Unlike CUDA and ROCm, the OpenCL application in at least one embodiment may be compiled online by compiler 3004, which is included to represent any number of compilers that can be used to compile source code and / or IR code (e.g., Standard Portable Intermediate Representation (“SPIR-V”) code) into binary code. Alternatively, in at least one embodiment, the OpenCL application may be compiled offline before execution of such an application.

[0343] Figure 31 Software supported by a programming platform according to at least one embodiment is illustrated. In at least one embodiment, the programming platform 3104 is configured to support various programming models 3103, middleware and / or libraries 3102, and frameworks 3101 that the application 3100 may depend on. In at least one embodiment, the application 3100 may be an AI / ML application implemented using, for example, a deep learning framework (e.g., MXNet, PyTorch, or TensorFlow), which may depend on libraries such as cuDNN, the NVIDIA Collective Communications Library (“NCCL”), and / or the NVIDIA Developer Data Loading Library (“DALI”) CUDA library to provide accelerated computation on the underlying hardware.

[0344] In at least one embodiment, the programming platform 3104 can be a combination of the above-described components. Figure 28 , Figure 29 and Figure 30 One of the described CUDA, ROCm, or OpenCL platforms. In at least one embodiment, the programming platform 3104 supports multiple programming models 3103, which are abstractions of the underlying computing system that allow for the expression of algorithms and data structures. In at least one embodiment, the programming model 3103 may expose features of the underlying hardware to improve performance. In at least one embodiment, the programming model 3103 may include, but is not limited to, CUDA, HIP, OpenCL, C++ Accelerated Massive Parallelism (“C++AMP”), Open Multiprocessing (“OpenMP”), Open Accelerator (“OpenACC”), and / or Vulcan Compute.

[0345] In at least one embodiment, the library and / or middleware 3102 provides an abstract implementation of the programming model 3104. In at least one embodiment, such a library includes data and programming code that can be used by a computer program and utilized during software development. In at least one embodiment, in addition to those available from the programming platform 3104, such middleware also includes software that provides services to applications. In at least one embodiment, the library and / or middleware 3102 may include, but is not limited to, cuBLAS, cuFFT, cuRAND, and other CUDA libraries, or rocBLAS, rocFFT, rocRAND, and other ROCm libraries. Additionally, in at least one embodiment, the library and / or middleware 3102 may include NCCL and ROCm communication collection library (“RCCL”) libraries, which provide communication routines for GPUs, the MIOpen library for deep learning acceleration, and / or intrinsic libraries for linear algebra, matrix and vector operations, geometric transformations, numerical solvers, and related algorithms.

[0346] In at least one embodiment, the application framework 3101 depends on libraries and / or middleware 3102. In at least one embodiment, each application framework 3101 is a software framework for implementing a standard structure of application software. Returning to the AI / ML example discussed above, in at least one embodiment, AI / ML applications can be implemented using frameworks such as Caffe, Caffe2, TensorFlow, Keras, PyTorch, or the MxNet deep learning framework.

[0347] Figure 32 Compilation code according to at least one embodiment is shown to be used in Figure 27-30The application is executed on one of the programming platforms. In at least one embodiment, compiler 3201 receives source code 3200, which includes both host code and device code. In at least one embodiment, compiler 3201 is configured to convert source code 3200 into host executable code 3202 for execution on a host and device executable code 3203 for execution on a device. In at least one embodiment, source code 3200 may be compiled offline before executing the application or compiled online during application execution.

[0348] In at least one embodiment, source code 3200 may include code in any programming language supported by compiler 3201, such as C++, C, Fortran, etc. In at least one embodiment, source code 3200 may be included in a single-source file, which has a mixture of host code and device code, and indicates the location of the device code therein. In at least one embodiment, the single-source file may be a .cu file including CUDA code or a .hip.cpp file including HIP code. Alternatively, in at least one embodiment, source code 3200 may include multiple source code files instead of a single source file, in which the host code and device code are separate.

[0349] In at least one embodiment, compiler 3201 is configured to compile source code 3200 into host executable code 3202 for execution on a host and device executable code 3203 for execution on a device. In at least one embodiment, compiler 3201 performs operations including resolving source code 3200 into an abstract system tree (AST), performing optimizations, and generating executable code. In at least one embodiment where source code 3200 comprises a single source file, compiler 3201 may separate device code and host code within such a single source file, compile the device code and host code into device executable code 3203 and host executable code 3202 respectively, and link the device executable code 3203 and host executable code 3202 together in a single file, as described below. Figure 33 To be discussed in more detail.

[0350] In at least one embodiment, the host executable code 3202 and the device executable code 3203 can be in any suitable format, such as binary code and / or IR code. In the case of CUDA, in at least one embodiment, the host executable code 3202 may include native object code, while the device executable code 3203 may include code in a PTX intermediate representation. In at least one embodiment, in the case of ROCm, both the host executable code 3202 and the device executable code 3203 can include target binary code.

[0351] Figure 33 It is compiled code according to at least one embodiment to be used in Figure 27-30 A more detailed illustration is provided on one of the programming platforms. In at least one embodiment, compiler 3301 is configured to receive source code 3300, compile source code 3300, and output executable file 3310. In at least one embodiment, source code 3300 is a single source file, such as a .cu file, a .hip.cpp file, or a file of other formats, which includes both host code and device code. In at least one embodiment, compiler 3301 may be, but is not limited to, an NVIDIA CUDA compiler (“NVCC”) for compiling CUDA code in .cu files, or an HCC compiler for compiling HIP code in .hip.cpp files.

[0352] In at least one embodiment, compiler 3301 includes compiler front-end 3302, host compiler 3305, device compiler 3306, and linker 3309. In at least one embodiment, compiler front-end 3302 is configured to separate device code 3304 from host code 3303 in source code 3300. In at least one embodiment, device code 3304 is compiled by device compiler 3306 into device executable code 3308, which, as described, may include binary code or IR code. In at least one embodiment, host code 3303 is compiled separately by host compiler 3305 into host executable code 3307. In at least one embodiment, for NVCC, host compiler 3305 may be, but is not limited to, a general-purpose C / C++ compiler that outputs native object code, while device compiler 3306 may be, but is not limited to, a low-level virtual machine (“LLVM”) based compiler that forks the LLVM compiler infrastructure and outputs PTX code or binary code. In at least one embodiment, for HCC, both the host compiler 3305 and the device compiler 3306 can be, but are not limited to, LLVM-based compilers that output target binary code.

[0353] In at least one embodiment, after compiling source code 3300 into host executable code 3307 and device executable code 3308, linker 3309 links the host and device executable codes 3307 and 3308 together in executable file 3310. In at least one embodiment, the native object code of the host and PTX or the binary code of the device can be linked together in an executable and linkable format (“ELF”) file, which is a container format for storing object code.

[0354] Figure 34The illustration shows the transformation of source code prior to compilation, according to at least one embodiment. In at least one embodiment, source code 3400 is passed via a transformation tool 3401, which transforms source code 3400 into transformed source code 3402. In at least one embodiment, a compiler 3403 is used to compile the transformed source code 3402 into host executable code 3404 and device executable code 3305, a process similar to that of compiler 3201 compiling source code 3200 into host executable code 3202 and device executable code 3203, as described above. Figure 32 The subject of discussion.

[0355] In at least one embodiment, the transformation performed by the transformation tool 3401 is used to port source code 3400 to perform in an environment different from where it was originally intended to run. In at least one embodiment, the transformation tool 3401 may include, but is not limited to, a HIP converter for “hipify” CUDA code for a CUDA platform into HIP code that can be compiled and executed on the ROCm platform. In at least one embodiment, the transformation of source code 3400 may include: parsing source code 3400 and converting calls to APIs provided by one programming model (e.g., CUDA) into corresponding calls to APIs provided by another programming model (e.g., HIP), as combined below. Figure 35A and Figure 36 This will be discussed in more detail. Returning to the example of porting CUDA code, in at least one embodiment, calls to the CUDA runtime API, CUDA driver API, and / or CUDA libraries can be translated into corresponding HIP API calls. In at least one embodiment, the automatic translation performed by the translation tool 3401 may sometimes be incomplete, requiring additional manual intervention to fully port the source code 3400.

[0356] Configure GPUs for general-purpose computing

[0357] The following figures illustrate, but are not limited to, exemplary architectures for compiling and executing computational source code according to at least one embodiment.

[0358] Figure 35AA system 3500 is shown, configured to compile and execute CUDA source code 3510 using different types of processing units according to at least one embodiment. In at least one embodiment, system 3500 includes, but is not limited to, CUDA source code 3510, CUDA compiler 3550, host executable code 3570(1), host executable code 3570(2), CUDA device executable code 3584, CPU 3590, CUDA-enabled GPU 3594, GPU 3592, CUDA to HIP conversion tool 3520, HIP source code 3530, HIP compiler driver 3540, HCC 3560, and HCC device executable code 3582.

[0359] In at least one embodiment, CUDA source code 3510 is a collection of human-readable code in the CUDA programming language. In at least one embodiment, CUDA code is human-readable code in the CUDA programming language. In at least one embodiment, the CUDA programming language is an extension of the C++ programming language, which includes, but is not limited to, defining device code and mechanisms for distinguishing between device code and host code. In at least one embodiment, device code is source code that can be executed in parallel on a device after compilation. In at least one embodiment, the device may be a processor optimized for parallel instruction processing, such as a CUDA-enabled GPU 3590, GPU 3592, or another GPGPU. In at least one embodiment, host code is source code that can be executed on a host after compilation. In at least one embodiment, the host is a processor optimized for sequential instruction processing, such as CPU 3590.

[0360] In at least one embodiment, the CUDA source code 3510 includes, but is not limited to, any number (including zero) of global functions 3512, any number (including zero) of device functions 3514, any number (including zero) of host functions 3516, and any number (including zero) of host / device functions 3518. In at least one embodiment, the global functions 3512, device functions 3514, host functions 3516, and host / device functions 3518 can be mixed in the CUDA source code 3510. In at least one embodiment, each global function 3512 can be executed on a device and can be called from a host. Therefore, in at least one embodiment, one or more of the global functions 3512 can serve as entry points for a device. In at least one embodiment, each global function 3512 is a kernel. In at least one embodiment, and in a technique called dynamic parallelism, one or more global functions 3512 define a kernel that can be executed on a device and can be called from such a device. In at least one embodiment, the kernel is executed in parallel N times (where N is any positive integer) by N different threads on the device during execution.

[0361] In at least one embodiment, each device function 3514 executes on a device and can only be called from such a device. In at least one embodiment, each host function 3516 executes on a host and can only be called from such a host. In at least one embodiment, each host / device function 3516 defines both a host version of a function that is executable on a host and can only be called from such a host, and a device version of a function that is executable on a device and can only be called from such a device.

[0362] In at least one embodiment, CUDA source code 3510 may also include, but is not limited to, any number of calls to any number of functions defined by CUDA runtime API 3502. In at least one embodiment, CUDA runtime API 3502 may include, but is not limited to, any number of functions executed on the host for allocating and dealing device memory, transferring data between host memory and device memory, managing a system with multiple devices, etc. In at least one embodiment, CUDA source code 3510 may also include, but is not limited to, any number of calls to any number of functions specified in any number of other CUDA APIs. In at least one embodiment, a CUDA API may be any API designed to be used by CUDA code. In at least one embodiment, a CUDA API includes, but is not limited to, CUDA runtime API 3502, CUDA driver APIs, APIs for any number of CUDA libraries, etc. In at least one embodiment, and relative to CUDA runtime API 3502, the CUDA driver API is a lower-level API but can provide finer-grained control over devices. In at least one embodiment, examples of CUDA libraries include, but are not limited to, cuBLAS, cuFFT, cURAND, cuDNN, etc.

[0363] In at least one embodiment, CUDA compiler 3550 compiles input CUDA code (e.g., CUDA source code 3510) to generate host executable code 3570(1) and CUDA device executable code 3584. In at least one embodiment, CUDA compiler 3550 is an NVCC. In at least one embodiment, host executable code 3570(1) is a compiled version of the host code included in the input source code executable on CPU 3590. In at least one embodiment, CPU 3590 can be any processor optimized for sequential instruction processing.

[0364] In at least one embodiment, CUDA device executable code 3584 is a compiled version of device code included in input source code executable on a CUDA-enabled GPU 3594. In at least one embodiment, CUDA device executable code 3584 includes, but is not limited to, binary code. In at least one embodiment, CUDA device executable code 3584 includes, but is not limited to, IR code, such as PTX code, which is further compiled at runtime by a device driver into binary code for a specific target device (e.g., a CUDA-enabled GPU 3594). In at least one embodiment, the CUDA-enabled GPU 3594 can be any processor optimized for parallel instruction processing and supporting CUDA. In at least one embodiment, the CUDA-enabled GPU 3594 was developed by NVIDIA Corporation of Santa Clara, California.

[0365] In at least one embodiment, the CUDA-to-HIP conversion tool 3520 is configured to convert CUDA source code 3510 into functionally similar HIP source code 3530. In at least one embodiment, the HIP source code 3530 is a collection of human-readable code in a HIP programming language. In at least one embodiment, the HIP code is human-readable code in a HIP programming language. In at least one embodiment, the HIP programming language is an extension of the C++ programming language, including but not limited to a functionally similar version of the CUDA mechanism, used to define device code and distinguish between device code and host code. In at least one embodiment, the HIP programming language may include a subset of the functionality of the CUDA programming language. In at least one embodiment, for example, the HIP programming language includes, but is not limited to, a mechanism for defining global functions 3512; however, such a HIP programming language may lack support for dynamic parallelism, therefore, the global functions 3512 defined in the HIP code can only be called from the host.

[0366] In at least one embodiment, the HIP source code 3530 includes, but is not limited to, any number (including zero) of global functions 3512, any number (including zero) of device functions 3514, any number (including zero) of host functions 3516, and any number (including zero) of host / device functions 3518. In at least one embodiment, the HIP source code 3530 may also include any number of calls to any number of functions specified in the HIP runtime API 3532. In one embodiment, the HIP runtime API 3532 includes, but is not limited to, functionally similar versions of a subset of functions included in the CUDA runtime API 3502. In at least one embodiment, the HIP source code 3530 may also include any number of calls to any number of functions specified in any number of other HIP APIs. In at least one embodiment, the HIP API may be any API designed for use by HIP code and / or ROCm. In at least one embodiment, the HIP API includes, but is not limited to, the HIP runtime API 3532, the HIP driver API, APIs for any number of HIP libraries, APIs for any number of ROCm libraries, etc.

[0367] In at least one embodiment, the CUDA-to-HIP conversion tool 3520 converts each kernel call in CUDA code from CUDA syntax to HIP syntax, and converts any number of other CUDA calls in the CUDA code into any number of other functionally similar HIP calls. In at least one embodiment, a CUDA call is a call to a function specified in the CUDA API, and a HIP call is a call to a function specified in the HIP API. In at least one embodiment, the CUDA-to-HIP conversion tool 3520 converts any number of calls to functions specified in the CUDA runtime API 3502 into any number of calls to functions specified in the HIP runtime API 3532.

[0368] In at least one embodiment, the CUDA to HIP conversion tool 3520 is a tool called hipify-perl that performs a text-based conversion process. In at least one embodiment, the CUDA to HIP conversion tool 3520 is a tool called hipify-clang that, compared to hipify-perl, performs a more complex and robust conversion process involving parsing the CUDA code using clang (a compiler front-end) and then converting the resulting symbols. In at least one embodiment, in addition to the modifications performed by the CUDA to HIP conversion tool 3520, correctly converting the CUDA code into HIP code may require further modifications (e.g., manual editing).

[0369] In at least one embodiment, the HIP compiler driver 3540 determines a target device 3546 and then configures a compiler compatible with the target device 3546 to compile the front end of the HIP source code 3530. In at least one embodiment, the target device 3546 is a processor optimized for parallel instruction processing. In at least one embodiment, the HIP compiler driver 3540 can determine the target device 3546 in any technically feasible manner.

[0370] In at least one embodiment, if the target device 3546 is CUDA compatible (e.g., a CUDA-enabled GPU 3594), the HIP compiler driver 3540 generates HIP / NVCC compilation commands 3542. In at least one embodiment and in conjunction with... Figure 35B In more detail, the HIP / NVCC compilation command 3542 configures the CUDA compiler 3550 to compile the HIP source code 3530 using, but not limited to, a HIP-to-CUDA translation header and a CUDA runtime library. In at least one embodiment and in response to the HIP / NVCC compilation command 3542, the CUDA compiler 3550 generates host executable code 3570(1) and CUDA device executable code 3584.

[0371] In at least one embodiment, if the target device 3546 is incompatible with CUDA, the HIP compiler driver 3540 generates HIP / HCC compilation commands 3544. In at least one embodiment and as in conjunction with... Figure 35C In more detail, HIP / HCC compilation command 3544 configures HCC 3560 to compile HIP source code 3530 using the HCC header and HIP / HCC runtime library. In at least one embodiment and in response to HIP / HCC compilation command 3544, HCC 3560 generates host executable code 3570(2) and HCC device executable code 3582. In at least one embodiment, HCC device executable code 3582 is a compiled version of device code executable on GPU 3592 contained in HIP source code 3530. In at least one embodiment, GPU 3592 may be any processor optimized for parallel instruction processing, CUDA incompatible, and HCC compatible. In at least one embodiment, GPU 3592 was developed by AMD Inc. of Santa Clara, California. In at least one embodiment, GPU 3592 is a GPU 3592 without CUDA enabled.

[0372] For illustrative purposes only, Figure 35AThe document describes three different processes that, in at least one embodiment, can be implemented to compile CUDA source code 3510 for execution on CPU 3590 and various devices. In at least one embodiment, a direct CUDA process compiles CUDA source code 3510 for execution on CPU 3590 and a CUDA-enabled GPU 3594 without converting CUDA source code 3510 to HIP source code 3530. In at least one embodiment, an indirect CUDA process converts CUDA source code 3510 to HIP source code 3530 and then compiles the HIP source code 3530 for execution on CPU 3590 and a CUDA-enabled GPU 3594. In at least one embodiment, a CUDA / HCC process converts CUDA source code 3510 to HIP source code 3530 and then compiles the HIP source code 3530 for execution on CPU 3590 and GPU 3592.

[0373] The direct CUDA flow, which can be implemented in at least one embodiment, can be depicted by dashed lines and a series of bubble comments A1-A3. In at least one embodiment, and as indicated by bubble comment A1, CUDA compiler 3550 receives CUDA source code 3510 and CUDA compilation command 3548 that configures CUDA compiler 3550 to compile CUDA source code 3510. In at least one embodiment, the CUDA source code 3510 used in the direct CUDA flow is written in a CUDA programming language based on a programming language other than C++ (e.g., C, Fortran, Python, Java, etc.). In at least one embodiment, and in response to CUDA compilation command 3548, CUDA compiler 3550 generates host executable code 3570(1) and CUDA device executable code 3584 (indicated by bubble comment A2). In at least one embodiment, and as indicated by bubble comment A3, host executable code 3570(1) and CUDA device executable code 3584 can be executed on CPU 3590 and CUDA-enabled GPU 3594, respectively. In at least one embodiment, the CUDA device executable code 3584 includes, but is not limited to, binary code. In at least one embodiment, the CUDA device executable code 3584 includes, but is not limited to, PTX code, and is further compiled at runtime into binary code for a specific target device.

[0374] The indirect CUDA process, which can be implemented in at least one embodiment, can be described by dashed lines and a series of bubble comments B1-B6. In at least one embodiment, and as shown in bubble comment B1, CUDA-to-HIP conversion tool 3520 receives CUDA source code 3510. In at least one embodiment, and as shown in bubble comment B2, CUDA-to-HIP conversion tool 3520 converts CUDA source code 3510 into HIP source code 3530. In at least one embodiment, and as shown in bubble comment B3, HIP compiler driver 3540 receives HIP source code 3530 and determines whether the target device 3546 has CUDA enabled.

[0375] In at least one embodiment and as shown in bubble note B4, the HIP compiler driver 3540 generates HIP / NVCC compilation commands 3542 and sends both the HIP / NVCC compilation commands 3542 and the HIP source code 3530 to the CUDA compiler 3550. In at least one embodiment and as shown in combination Figure 35B In more detail, HIP / NVCC compilation command 3542 configures CUDA compiler 3550 to compile HIP source code 3530 using, but not limited to, a HIP-to-CUDA translation header and a CUDA runtime library. In at least one embodiment and in response to HIP / NVCC compilation command 3542, CUDA compiler 3550 generates host executable code 3570(1) and CUDA device executable code 3584 (indicated by bubble comment B5). In at least one embodiment and as shown by bubble comment B6, host executable code 3570(1) and CUDA device executable code 3584 can be executed on CPU 3590 and CUDA-enabled GPU 3594, respectively. In at least one embodiment, CUDA device executable code 3584 includes, but is not limited to, binary code. In at least one embodiment, CUDA device executable code 3584 includes, but is not limited to, PTX code, and is further compiled at runtime into binary code for a specific target device.

[0376] The CUDA / HCC flow, which can be implemented in at least one embodiment, can be described by solid lines and a series of bubble comments C1-C6. In at least one embodiment, and as shown in bubble comment C1, the CUDA to HIP conversion tool 3520 receives CUDA source code 3510. In at least one embodiment, and as shown in bubble comment C2, the CUDA to HIP conversion tool 3520 converts the CUDA source code 3510 into HIP source code 3530. In at least one embodiment, and as shown in bubble comment C3, the HIP compiler driver 3540 receives the HIP source code 3530 and determines that the target device 3546 does not have CUDA enabled.

[0377] In at least one embodiment, the HIP compiler driver 3540 generates HIP / HCC compilation commands 3544 and sends both the HIP / HCC compilation commands 3564 and the HIP source code 3530 to the HCC 3560 (indicated by bubble comment C4). In at least one embodiment and as in combination Figure 35C In more detail, HIP / HCC compilation command 3564 configures HCC 3560 to compile HIP source code 3530 using, but not limited to, the HCC header and HIP / HCC runtime library. In at least one embodiment and in response to HIP / HCC compilation command 3544, HCC 3560 generates host executable code 3570(2) and HCC device executable code 3582 (indicated by bubble comment C5). In at least one embodiment and as shown by bubble comment C6, host executable code 3570(2) and HCC device executable code 3582 can be executed on CPU 3590 and GPU 3592, respectively.

[0378] In at least one embodiment, after converting CUDA source code 3510 to HIP source code 3530, the HIP compiler driver 3540 can then be used to generate executable code for a CUDA-enabled GPU 3594 or GPU 3592 without re-executing CUDA to the HIP conversion tool 3520. In at least one embodiment, the CUDA to HIP conversion tool 3520 converts CUDA source code 3510 to HIP source code 3530 and then stores it in memory. In at least one embodiment, the HIP compiler driver 3540 then configures HCC 3560 to generate host executable code 3570(2) and HCC device executable code 3582 based on the HIP source code 3530. In at least one embodiment, the HIP compiler driver 3540 then configures CUDA compiler 3550 to generate host executable code 3570(1) and CUDA device executable code 3584 based on the stored HIP source code 3530.

[0379] Figure 35B The diagram illustrates a configuration, according to at least one embodiment, to compile and execute using a CPU 3590 and a CUDA-enabled GPU 3594. Figure 35A The system 3504 includes, but is not limited to, CUDA source code 3510, CUDA to HIP conversion tool 3520, HIP source code 3530, HIP compiler driver 3540, CUDA compiler 3550, host executable code 3570(1), CUDA device executable code 3584, CPU 3590 and CUDA-enabled GPU 3594.

[0380] In at least one embodiment and as previously mentioned herein Figure 35A As described, the CUDA source code 3510 includes, but is not limited to, any number (including zero) of global functions 3512, any number (including zero) of device functions 3514, any number (including zero) of host functions 3516, and any number (including zero) of host / device functions 3518. In at least one embodiment, the CUDA source code 3510 also includes, but is not limited to, any number of calls to any number of functions specified in any number of CUDA APIs.

[0381] In at least one embodiment, the CUDA to HIP conversion tool 3520 converts CUDA source code 3510 into HIP source code 3530. In at least one embodiment, the CUDA to HIP conversion tool 3520 converts each kernel call in the CUDA source code 3510 from CUDA syntax to HIP syntax, and converts any number of other CUDA calls in the CUDA source code 3510 into any number of other functionally similar HIP calls.

[0382] In at least one embodiment, the HIP compiler driver 3540 determines that the target device 3546 is CUDA enabled and generates HIP / NVCC compilation commands 3542. In at least one embodiment, the HIP compiler driver 3540 then configures the CUDA compiler 3550 via the HIP / NVCC compilation commands 3542 to compile HIP source code 3530. In at least one embodiment, as part of configuring the CUDA compiler 3550, the HIP compiler driver 3540 provides access to a HIP-to-CUDA translation header 3552. In at least one embodiment, the HIP-to-CUDA translation header 3552 translates an arbitrary number of mechanisms (e.g., functions) specified in an arbitrary number of HIP APIs into an arbitrary number of mechanisms specified in an arbitrary number of CUDA APIs. In at least one embodiment, the CUDA compiler 3550 uses the HIP-to-CUDA translation header 3552 in conjunction with a CUDA runtime library 3554 corresponding to the CUDA runtime API 3502 to generate host executable code 3570(1) and CUDA device executable code 3584. In at least one embodiment, host executable code 3570(1) and CUDA device executable code 3584 can then be executed on CPU 3590 and CUDA-enabled GPU 3594, respectively. In at least one embodiment, CUDA device executable code 3584 includes, but is not limited to, binary code. In at least one embodiment, CUDA device executable code 3584 includes, but is not limited to, PTX code, and is further compiled at runtime into binary code for a specific target device.

[0383] Figure 35C A system 3506 according to at least one embodiment is shown, the system 3506 being configured to compile and execute using a CPU 3590 and a GPU 3592 with CUDA disabled. Figure 35A The CUDA source code 3510. In at least one embodiment, the system 3506 includes, but is not limited to, the CUDA source code 3510, the CUDA to HIP conversion tool 3520, the HIP source code 3530, the HIP compiler driver 3540, the HCC 3560, the host executable code 3570(2), the HCC device executable code 3582, the CPU 3590, and the GPU 3592.

[0384] In at least one embodiment, and as previously mentioned herein Figure 35AAs described, the CUDA source code 3510 includes, but is not limited to, any number (including zero) of global functions 3512, any number (including zero) of device functions 3514, any number (including zero) of host functions 3516, and any number (including zero) of host / device functions 3518. In at least one embodiment, the CUDA source code 3510 also includes, but is not limited to, any number of calls to any number of functions specified in any number of CUDA APIs.

[0385] In at least one embodiment, the CUDA to HIP conversion tool 3520 converts CUDA source code 3510 into HIP source code 3530. In at least one embodiment, the CUDA to HIP conversion tool 3520 converts each kernel call in the CUDA source code 3510 from CUDA syntax to HIP syntax, and converts any number of other CUDA calls in the source code 3510 into any number of other functionally similar HIP calls.

[0386] In at least one embodiment, the HIP compiler driver 3540 then determines that the target device 3546 is not CUDA enabled and generates a HIP / HCC compilation command 3544. In at least one embodiment, the HIP compiler driver 3540 then configures the HCC 3560 to execute the HIP / HCC compilation command 3544, thereby compiling the HIP source code 3530. In at least one embodiment, the HIP / HCC compilation command 3544 configures the HCC 3560 to use, but is not limited to, the HIP / HCC runtime library 3558 and the HCC header 3556 to generate host executable code 3570(2) and HCC device executable code 3582. In at least one embodiment, the HIP / HCC runtime library 3558 corresponds to the HIP runtime API 3532. In at least one embodiment, the HCC header 3556 includes, but is not limited to, any number and type of interoperability mechanisms for HIP and HCC. In at least one embodiment, host executable code 3570(2) and HCC device executable code 3582 can be executed on CPU 3590 and GPU 3592, respectively.

[0387] Figure 36 The diagram illustrates a method according to at least one embodiment. Figure 35CAn exemplary kernel is converted by the CUDA to HIP conversion tool 3520. In at least one embodiment, the CUDA source code 3510 divides the overall problem that a given kernel is designed to solve into relatively coarse subproblems that can be solved independently using thread blocks. In at least one embodiment, each thread block includes, but is not limited to, any number of threads. In at least one embodiment, each subproblem is divided into relatively small pieces that can be solved in parallel by the threads within the thread block. In at least one embodiment, threads within a thread block can cooperate by sharing data through shared memory and by coordinating memory accesses through synchronized execution.

[0388] In at least one embodiment, CUDA source code 3510 organizes thread blocks associated with a given kernel into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks. In at least one embodiment, each thread block includes, but is not limited to, any number of threads, and the grid includes, but is not limited to, any number of thread blocks.

[0389] In at least one embodiment, the kernel is a function in the device code defined using the "__global__" declaration specifier. In at least one embodiment, CUDA kernel startup syntax 3610 is used to specify the size of the mesh on which the kernel is executed for a given kernel call, as well as the associated flow. In at least one embodiment, CUDA kernel startup syntax 3610 is specified as "KernelName <<<GridSize,BlockSize,SharedMemorySize,Stream> >>(KernelArguments);". In at least one embodiment, the execution configuration syntax is a "<<<...>>>" construct, which is inserted between the kernel name ("KernelName") and the bracketed list of kernel parameters ("KernelArguments"). In at least one embodiment, the CUDA kernel boot syntax 3610 includes, but is not limited to, the CUDA boot function syntax instead of the execution configuration syntax.

[0390] In at least one embodiment, "GridSize" is of type dim3 and specifies the size and dimensions of the grid. In at least one embodiment, type dim3 is a CUDA-defined structure, which includes, but is not limited to, unsigned integers x, y, and z. In at least one embodiment, if z is not specified, z defaults to 1. In at least one embodiment, if y is not specified, y defaults to 1. In at least one embodiment, the number of thread blocks in the grid is equal to the product of GridSize.x, GridSize.y, and GridSize.z. In at least one embodiment, "BlockSize" is of type dim3 and specifies the size and dimensions of each thread block. In at least one embodiment, the number of threads per thread block is equal to the product of BlockSize.x, BlockSize.y, and BlockSize.z. In at least one embodiment, each thread executing the kernel has a unique thread ID, which can be accessed within the kernel via a built-in variable (e.g., "threadIdx").

[0391] In at least one embodiment, regarding CUDA kernel startup syntax 3610, "SharedMemorySize" is an optional parameter that specifies the number of bytes dynamically allocated for each thread block in shared memory for a given kernel call, excluding statically allocated memory. In at least one embodiment and regarding CUDA kernel startup syntax 3610, SharedMemorySize defaults to zero. In at least one embodiment and regarding CUDA kernel startup syntax 3610, "stream" is an optional parameter that specifies an associated stream and defaults to zero to specify a default stream. In at least one embodiment, a stream is a sequence of commands executed sequentially (which may be issued by different host threads). In at least one embodiment, different streams may execute commands out of order or simultaneously relative to each other.

[0392] In at least one embodiment, CUDA source code 3510 includes, but is not limited to, kernel definitions and a main function for the exemplary kernel "MatAdd". In at least one embodiment, the main function is host code executed on the host and includes, but is not limited to, kernel calls that cause the kernel MatAdd to execute on the device. In at least one embodiment, as shown, the kernel MatAdd adds two matrices A and B of size NxN, where N is a positive integer, and stores the result in matrix C. In at least one embodiment, the main function defines the threadsPerBlock variable as 16x16 and the numBlocks variable as N / 16 x N / 16. In at least one embodiment, the main function then specifies the kernel call "MatAdd<<<numBlocks,threadsPerBlock> >>(A, B, C);”. In at least one embodiment, and in accordance with CUDA kernel startup syntax 3610, a grid of thread blocks of size N / 16 × N / 16 is used to execute the kernel MatAdd, where each thread block is 16 × 16. In at least one embodiment, each thread block comprises 256 threads, creating a grid with enough blocks to have one thread per matrix element, and each thread in the grid executes the kernel MatAdd to perform a pairwise addition.

[0393] In at least one embodiment, while converting CUDA source code 3510 into HIP source code 3530, the CUDA-to-HIP conversion tool 3520 converts each kernel call in the CUDA source code 3510 from CUDA kernel startup syntax 3610 into HIP kernel startup syntax 3620, and converts any number of other CUDA calls in the source code 3510 into any number of other functionally similar HIP calls. In at least one embodiment, the HIP kernel startup syntax 3620 is specified as “hipLaunchKernelGGL(KernelName,GridSize,BlockSize,SharedMemorySize,Stream,KernelArguments);”. In at least one embodiment, each of KernelName, GridSize, BlockSize, ShareMemorySize, Stream, and KernelArguments has the same meaning in the HIP kernel startup syntax 3620 as it does in the CUDA kernel startup syntax 3610 (as previously described herein). In at least one embodiment, the parameters SharedMemorySize and Stream are required in HIP kernel startup syntax 3620, but optional in CUDA kernel startup syntax 3610.

[0394] In at least one embodiment, in addition to the kernel call that causes the kernel MatAdd to execute on the device, Figure 36 The part of HIP source code 3530 described in the text is related to Figure 36 The portion of the CUDA source code 3510 depicted is identical. In at least one embodiment, the kernel MatAdd is defined in the HIP source code 3530, having the same "__global__" declaration specifier as the kernel MatAdd defined in the CUDA source code 3510. In at least one embodiment, the kernel call in the HIP source code 3530 is "hipLaunchKernelGGL(MatAdd, numBlocks, threadsPerBlock, 0, 0, A, B, C);", while the corresponding kernel call in the CUDA source code 3510 is "MatAdd <<<numBlocks,threadsPerBlock> >>(A, B, C);”.

[0395] Figure 37 A more detailed description is provided according to at least one embodiment. Figure 35C The GPU 3592 is a CUDA-unenabled GPU. In at least one embodiment, the GPU 3592 was developed by AMD Inc. of Santa Clara, USA. In at least one embodiment, the GPU 3592 can be configured to perform computational operations in a highly parallel manner. In at least one embodiment, the GPU 3592 is configured to perform graphics pipeline operations, such as drawing commands, pixel operations, geometric calculations, and other operations associated with rendering images to a display. In at least one embodiment, the GPU 3592 is configured to perform graphics-independent operations. In at least one embodiment, the GPU 3592 is configured to perform both graphics-related and graphics-independent operations. In at least one embodiment, the GPU 3592 can be configured to execute device code included in HIP source code 3530.

[0396] In at least one embodiment, the GPU 3592 includes, but is not limited to, any number of programmable processing units 3720, an command processor 3710, an L2 cache 3722, a memory controller 3770, a DMA engine 3780(1), a system memory controller 3782, a DMA engine 3780(2), and a GPU controller 3784. In at least one embodiment, each programmable processing unit 3720 includes, but is not limited to, a workload manager 3730 and any number of compute units 3740. In at least one embodiment, the command processor 3710 reads commands from one or more command queues (not shown) and distributes the commands to the workload manager 3730. In at least one embodiment, for each programmable processing unit 3720, the associated workload manager 3730 distributes work to the compute units 3740 included in the programmable processing unit 3720. In at least one embodiment, each compute unit 3740 can execute any number of thread blocks, but each thread block executes on a single compute unit 3740. In at least one embodiment, the workgroup is a thread block.

[0397] In at least one embodiment, each computing unit 3740 includes, but is not limited to, any number of SIMD units 3750 and shared memory 3760. In at least one embodiment, each SIMD unit 3750 implements a SIMD architecture and is configured to perform operations in parallel. In at least one embodiment, each SIMD unit 3750 includes, but is not limited to, a vector ALU 3752 and a vector register file 3754. In at least one embodiment, each SIMD unit 3750 executes a different thread bundle. In at least one embodiment, a thread bundle is a group of threads (e.g., 16 threads), where each thread in the thread bundle belongs to a single thread block and is configured to process different datasets based on a single instruction set. In at least one embodiment, prediction can be used to disable one or more threads in a thread bundle. In at least one embodiment, a channel is a thread. In at least one embodiment, a work item is a thread. In at least one embodiment, a wavefront is a thread bundle. In at least one embodiment, different wavefronts in a thread block can be synchronized together and communicate via shared memory 3760.

[0398] In at least one embodiment, the programmable processing unit 3720 is referred to as a "shading engine". In at least one embodiment, in addition to the computing unit 3740, each programmable processing unit 3720 also includes, but is not limited to, any number of dedicated graphics hardware. In at least one embodiment, each programmable processing unit 3720 includes, but is not limited to, any number (including zero) of geometry processors, any number (including zero) of rasterizers, any number (including zero) of rendering backends, a workload manager 3730, and any number of computing units 3740.

[0399] In at least one embodiment, compute units 3740 share an L2 cache 3722. In at least one embodiment, the L2 cache 3722 is partitioned. In at least one embodiment, all compute units 3740 in the GPU 3592 have access to the GPU memory 3790. In at least one embodiment, a memory controller 3770 and a system memory controller 3782 facilitate data transfer between the GPU 3592 and the host, and a DMA engine 3780(1) enables asynchronous memory transfers between the GPU 3592 and the host. In at least one embodiment, a memory controller 3770 and a GPU controller 3784 facilitate data transfers between the GPU 3592 and other GPUs 3592, and a DMA engine 3780(2) enables asynchronous memory transfers between the GPU 3592 and other GPUs 3592.

[0400] In at least one embodiment, GPU 3592 includes, but is not limited to, any number and type of system interconnects that facilitate data and control transfers between any number and type of directly or indirectly linked components, either internally or externally to GPU 3592. In at least one embodiment, GPU 3592 includes, but is not limited to, any number and type of I / O interfaces (e.g., PCIe) coupled to any number and type of peripheral devices. In at least one embodiment, GPU 3592 may include, but is not limited to, any number (including zero) of display engines and any number (including zero) of multimedia engines. In at least one embodiment, GPU 3592 implements a memory subsystem that includes, but is not limited to, any number and type of memory controllers (e.g., memory controller 3770 and system memory controller 3782) and memory devices dedicated to a component or shared among multiple components (e.g., shared memory 3760). In at least one embodiment, GPU 3592 implements a cache subsystem that includes, but is not limited to, one or more cache memories (e.g., L2 cache 3722), each cache memory being private or shared among any number of components (e.g., SIMD unit 3750, compute unit 3740, and programmable processing unit 3720).

[0401] Figure 38 This illustrates how threads of an exemplary CUDA mesh 3820, according to at least one embodiment, are mapped to... Figure 37Different computational units 3740. In at least one embodiment, and for illustrative purposes only, grid 3820 has a GridSize of BX multiplied by BY multiplied by 1 and a BlockSize of TX multiplied by TY multiplied by 1. Therefore, in at least one embodiment, grid 3820 includes, but is not limited to, (BX*BY) thread blocks 3830, each thread block 3830 including, but not limited to, (TX*TY) threads 3840. Threads 3840 in Figure 38 It is depicted as a curved arrow.

[0402] In at least one embodiment, grid 3820 is mapped to programmable processing unit 3720(1), which includes, but is not limited to, computing units 3740(1)-3740(C). In at least one embodiment, and as shown, (BJ*BY) thread block 3830 is mapped to computing unit 3740(1), and the remaining thread blocks 3830 are mapped to computing unit 3740(2). In at least one embodiment, each thread block 3830 may include, but is not limited to, any number of thread bundles, and each thread bundle is mapped to... Figure 37 Different SIMD units 3750.

[0403] In at least one embodiment, the thread bundles in a given thread block 3830 can be synchronized together and communicate via shared memory 3760 included in the associated computing unit 3740. For example, and in at least one embodiment, the thread bundles in thread block 3830(BJ, 1) can be synchronized together and communicate via shared memory 3760(1). For example, and in at least one embodiment, the thread bundles in thread block 3830(BJ+1, 1) can be synchronized together and communicate via shared memory 3760(2).

[0404] Figure 39This document illustrates how to migrate existing CUDA code to data-parallel C++ code according to at least one embodiment. Data-parallel C++ (DPC++) can refer to an open, standards-based alternative to a single-architecture proprietary language that allows developers to reuse code across hardware targets (CPUs and accelerators, such as GPUs and FPGAs) and also perform custom tweaks for specific accelerators. DPC++ uses similar and / or the same C and C++ constructs as ISO C++, which developers may be familiar with. DPC++ incorporates the Khronos Group's standard SYCL to support data parallelism and heterogeneous programming. SYCL stands for Cross-Platform Abstraction Layer, built on the underlying concepts, portability, and efficiency of OpenCL, enabling code for heterogeneous processors to be written in a "single-source" style using standard C++. SYCL enables single-source development, where C++ template functions can contain both host code and device code to build complex algorithms accelerated using OpenCL, and then reuse them throughout the source code for different types of data.

[0405] In at least one embodiment, a DPC++ compiler is used to compile DPC++ source code that can be deployed across various hardware targets. In at least one embodiment, the DPC++ compiler is used to generate DPC++ applications that can be deployed across various hardware targets, and DPC++ compatibility tools are used to migrate CUDA applications to multi-platform programs in DPC++. In at least one embodiment, the DPC++ basic toolkit includes: a DPC++ compiler for deploying applications across various hardware targets; DPC++ libraries for improving productivity and performance on CPUs, GPUs, and FPGAs; DPC++ compatibility tools for migrating CUDA applications to multi-platform applications; and any suitable combination thereof.

[0406] In at least one embodiment, the DPC++ programming model simplifies one or more aspects related to programming CPUs and accelerators by using modern C++ features to express parallelism with a programming language called Data Parallel C++. The DPC++ programming language can be used for code reuse against hosts (e.g., CPUs) and accelerators (e.g., GPUs or FPGAs) using a single-source language, and clearly communicates execution and memory dependencies. Mappings within the DPC++ code can be used to translate applications to run on the hardware or set of hardware devices that best accelerate workloads. Even on platforms without available accelerators, the host can be used to simplify the development and debugging of device code.

[0407] In at least one embodiment, CUDA source code 3900 is provided as input to the DPC++ compatibility tool 3902 to generate human-readable DPC++ 3904. In at least one embodiment, the human-readable DPC++ 3904 includes inline comments generated by the DPC++ compatibility tool 3902, which guide developers on how and / or where to modify the DPC++ code to complete the coding and tune it to the desired performance 3906, thereby generating DPC++ source code 3908.

[0408] In at least one embodiment, CUDA source code 3900 is or includes a collection of human-readable source code in the CUDA programming language. In at least one embodiment, CUDA source code 3900 is human-readable source code using the CUDA programming language. In at least one embodiment, the CUDA programming language is an extension of the C++ programming language, which includes, but is not limited to, mechanisms for defining device code and distinguishing between device code and host code. In at least one embodiment, device code is source code that, after compilation, can be executed on a device (e.g., a GPU or FPGA) and may include one or more parallelizable workflows that can be executed on one or more processor cores of the device. In at least one embodiment, the device may be a processor optimized for parallel instruction processing, such as a CUDA-enabled GPU, GPU, or another GPGPU. In at least one embodiment, host code is source code that, after compilation, can be executed on a host machine. In at least one embodiment, some or all of the host code and device code can be executed in parallel across CPU and GPU / FPGA. In at least one embodiment, the host is a processor optimized for sequential instruction processing, such as a CPU. Figure 39 The CUDA source code 3900 described is consistent with what is discussed elsewhere in this document.

[0409] In at least one embodiment, DPC++ compatibility tool 3902 refers to an executable tool, program, application, or any other suitable type of tool for facilitating the migration of CUDA source code 3900 to DPC++ source code 3908. In at least one embodiment, DPC++ compatibility tool 3902 is a command-line based code migration tool that can be used as part of the DPC++ toolkit for porting existing CUDA sources to DPC++. In at least one embodiment, DPC++ compatibility tool 3902 converts some or all of the source code of a CUDA application from CUDA to DPC++ and generates a result file, at least partially written in DPC++, referred to as human-readable DPC++ 3904. In at least one embodiment, human-readable DPC++ 3904 includes comments generated by DPC++ compatibility tool 3902 to indicate where user intervention may be required. In at least one embodiment, user intervention is necessary when CUDA source code 3900 calls a CUDA API that does not have a DPC++ API; other examples requiring user intervention will be discussed in more detail later.

[0410] In at least one embodiment, the workflow for migrating CUDA source code 3900 (e.g., an application or a portion thereof) includes creating one or more build database files; migrating CUDA to DPC++ using a DPC++ compatibility tool 3902; completing the migration and verifying its correctness to generate DPC++ source code 3908; and compiling the DPC++ source code 3908 using a DPC++ compiler to generate a DPC++ application. In at least one embodiment, the compatibility tool provides a utility that intercepts commands used during Makefile execution and stores them in the build database files. In at least one embodiment, the files are stored in JSON format. In at least one embodiment, intercepted build commands translate Makefile commands into DPC compatibility commands.

[0411] In at least one embodiment, intercept-build is a utility script that intercepts the build process to capture build options, macro definitions, and include paths, and writes this data to a build database file. In at least one embodiment, the build database file is a JSON file. In at least one embodiment, the DPC++ compatibility tool 3902 parses the build database and applies options when migrating input sources. In at least one embodiment, the use of intercept-build is optional but strongly recommended for Make or CMake-based environments. In at least one embodiment, the migration database includes commands, directories, and files: commands may include necessary build flags; directories may include paths to header files; and files may include paths to CUDA files.

[0412] In at least one embodiment, the DPC++ compatibility tool 3902 migrates CUDA code (e.g., applications) written in CUDA to DPC++ by generating DPC++ as much as possible. In at least one embodiment, the DPC++ compatibility tool 3902 is available as part of a toolkit. In at least one embodiment, the DPC++ toolkit includes an intercept-build tool. In at least one embodiment, the intercept-build tool creates a build database that captures build commands to migrate CUDA files. In at least one embodiment, the DPC++ compatibility tool 3902 uses the build database generated by the intercept-build tool to migrate CUDA code to DPC++. In at least one embodiment, non-CUDA C++ code and files are migrated as is. In at least one embodiment, the DPC++ compatibility tool 3902 generates human-readable DPC++ 3904, which may be DPC++ code, such as that generated by the DPC++ compatibility tool 3902, that cannot be compiled by the DPC++ compiler and requires additional pipelines to verify incorrectly migrated code portions, and may involve manual intervention, such as intervention by a developer. In at least one embodiment, the DPC++ compatibility tool 3902 provides hints or tools embedded in the code to help developers manually migrate additional code that cannot be migrated automatically. In at least one embodiment, the migration is a one-time activity for a source file, project, or application.

[0413] In at least one embodiment, the DPC++ compatibility tool 39002 is capable of successfully migrating all portions of CUDA code to DPC++, and can simply include optional steps for manually verifying and tuning the performance of the generated DPC++ source code. In at least one embodiment, the DPC++ compatibility tool 3902 directly generates DPC++ source code 3908 that is compiled by the DPC++ compiler, without requiring or utilizing manual intervention to modify the DPC++ code generated by the DPC++ compatibility tool 3902. In at least one embodiment, the DPC++ compatibility tool generates compilable DPC++ code that developers can selectively tune based on performance, readability, maintainability, and various other considerations, or any combination thereof.

[0414] In at least one embodiment, one or more CUDA source files are migrated to DPC++ source files, at least in part, using DPC++ compatibility tool 3902. In at least one embodiment, the CUDA source code includes one or more header files, which may include CUDA header files. In at least one embodiment, the CUDA source files include text that can be printed.<cuda.h> header files and<stdio.h> Header file. In at least one embodiment, a portion of the vector addition kernel CUDA source file may be written as or related to:

[0415]

[0416]

[0417] In at least one embodiment, and in conjunction with the CUDA source files presented above, the DPC++ compatibility tool 3902 parses the CUDA source code and replaces the header files with appropriate DPC++ and SYCL header files. In at least one embodiment, the DPC++ header files include helper declarations. In CUDA, there is a concept of thread IDs; correspondingly, in DPC++ or SYCL, there is a local identifier for each element.

[0418] In at least one embodiment, and in relation to the CUDA source file presented above, there are two vectors A and B, which are initialized and the result of vector addition is placed into vector C as part of VectorAddKernel(). In at least one embodiment, as part of migrating CUDA code to DPC++ code, the DPC++ compatibility tool 3902 converts the CUDA thread ID used to index worker elements to the SYCL standard addressing of the worker elements via the local ID. In at least one embodiment, the DPC++ code generated by the DPC++ compatibility tool 3902 can be optimized—for example, by reducing the dimension of nd_item, thereby increasing memory and / or processor utilization.

[0419] In at least one embodiment, and in conjunction with the CUDA source files presented above, memory allocation is migrated. In at least one embodiment, relying on SYCL concepts such as platform, device, context, and queue, cudaMalloc() is migrated to a unified shared memory SYCL call malloc_device() to which the device and context are passed. In at least one embodiment, the SYCL platform may have multiple devices (e.g., host and GPU devices); a device may have multiple queues to which jobs can be submitted; each device may have a context; and a context may have multiple devices and manage shared memory objects.

[0420] In at least one embodiment, and in conjunction with the CUDA source files presented above, the `main()` function invokes or calls `VectorAddKernel()` to add two vectors A and B and store the result in vector C. In at least one embodiment, the CUDA code calling `VectorAddKernel()` is replaced by DPC++ code to submit the kernel to the command queue for execution. In at least one embodiment, the command group handler `cgh` passes the data submitted to the queue, synchronization, and computation, and `parallel_for` is called for multiple global elements and multiple work items in the workgroup that call `VectorAddKernel()`.

[0421] In at least one embodiment, and in conjunction with the CUDA source files presented above, CUDA calls to copy device memory and then free memory for vectors A, B, and C are migrated to corresponding DPC++ calls. In at least one embodiment, C++ code (e.g., standard ISO C++ code for printing vectors of floating-point variables) is migrated as is without modification by the DPC++ compatibility tool 3902. In at least one embodiment, the DPC++ compatibility tool 3902 modifies the CUDA API used for memory setup and / or host calls to execute the kernel on an accelerated device. In at least one embodiment, and in conjunction with the CUDA source files presented above, the corresponding human-readable DPC++ 3904 (e.g., compilable) is written as or related to:

[0422]

[0423]

[0424]

[0425] In at least one embodiment, human-readable DPC++ 3904 refers to the output generated by the DPC++ compatibility tool 3902 and can be optimized in one or another. In at least one embodiment, the human-readable DPC++ 3904 generated by the DPC++ compatibility tool 3902 can be manually edited by developers after migration to make it more maintainable, performant, or for other considerations. In at least one embodiment, the DPC++ code generated by the DPC++ compatibility tool 39002 (e.g., publicly available DPC++) can be optimized by removing duplicate calls to get_current_device() and / or get_default_context() for each malloc_device() call. In at least one embodiment, the DPC++ code generated above uses a 3D nd_range, which can be refactored to use only a single dimension, thereby reducing memory usage. In at least one embodiment, developers can manually edit the DPC++ code generated by the DPC++ compatibility tool 3902 to replace the use of unified shared memory with accessors. In at least one embodiment, the DPC++ compatibility tool 3902 has the option to change how it migrates CUDA code to DPC++ code. In at least one embodiment, the DPC++ compatibility tool 3902 is verbose because it uses a generic template to migrate CUDA code to DPC++ code, which is suitable for a wide range of situations.

[0426] In at least one embodiment, the CUDA to DPC++ migration workflow includes the following steps: preparing the migration using an intercept-build script; performing the migration of the CUDA project to DPC++ using the DPC++ compatibility tool 3902; manually reviewing and editing the source files of the migration to ensure their integrity and correctness; and compiling the final DPC++ code to generate the DPC++ application. In at least one embodiment, manual review of the DPC++ source code may be required in one or more scenarios, including but not limited to: the migrated API not returning error codes (CUDA code can return error codes that can subsequently be used by the application, but SYCL uses exceptions to report errors, therefore error codes are not used to expose errors); DPC++ does not support CUDA compute capability-related logic; statements cannot be deleted. In at least one embodiment, scenarios requiring manual intervention in the DPC++ code may include, but are not limited to: replacing error code logic with (*,0) code or commenting it out; equivalent DPC++ APIs being unavailable; CUDA compute capability-related logic; hardware-related APIs (clock()); APIs lacking unsupported features; performing time measurement logic; handling built-in vector type conflicts; migrating the cuBLAS API; and more.

[0427] In at least one embodiment, one or more techniques described herein utilize an API programming model. In at least one embodiment, the oneAPI programming model refers to a programming model for interacting with different computing accelerator architectures. In at least one embodiment, oneAPI refers to an application programming interface (API) designed to interact with various computing accelerator architectures. In at least one embodiment, the oneAPI programming model utilizes the DPC++ programming language. In at least one embodiment, the DPC++ programming language refers to a high-level language used for data-parallel programming productivity. In at least one embodiment, the DPC++ programming language is at least partially based on the C and / or C++ programming languages. In at least one embodiment, the oneAPI programming model is one such programming model as those developed by Intel Corporation of Santa Clara, California.

[0428] In at least one embodiment, oneAPI and / or the oneAPI programming model are used to interact with various accelerators, GPUs, processors, and / or their variants and architectures. In at least one embodiment, oneAPI includes a set of libraries that implement various functions. In at least one embodiment, oneAPI includes at least the oneAPI DPC++ library, the oneAPI Math Kernel library, the oneAPI Data Analysis library, the oneAPI Deep Neural Network library, the oneAPI Collection Communication library, the oneAPI Thread Building Block library, the oneAPI Video Processing library, and / or their variants.

[0429] In at least one embodiment, the oneAPI DPC++ library (also known as oneDPL) is a library that implements algorithms and functions to accelerate DPC++ kernel programming. In at least one embodiment, oneDPL implements one or more Standard Template Library (STL) functions. In at least one embodiment, oneDPL implements one or more parallel STL functions. In at least one embodiment, oneDPL provides a set of library classes and functions, such as parallel algorithms, iterators, function object classes, range-based APIs, and / or their variants. In at least one embodiment, oneDPL implements one or more classes and / or functions from the C++ Standard Library. In at least one embodiment, oneDPL implements one or more random number generator functions.

[0430] In at least one embodiment, the oneAPI math kernel library (also known as oneMKL) is a library that implements various optimized and parallelized routines for the various mathematical functions and / or operations. In at least one embodiment, oneMKL implements one or more Basic Linear Algebra Subroutines (BLAS) and / or Linear Algebra Encapsulations (LAPACK) dense linear algebra routines. In at least one embodiment, oneMKL implements one or more sparse BLAS linear algebra routines. In at least one embodiment, oneMKL implements one or more random number generators (RNGs). In at least one embodiment, oneMKL implements one or more vector mathematics (VM) routines for performing mathematical operations on vectors. In at least one embodiment, oneMKL implements one or more Fast Fourier Transform (FFT) functions.

[0431] In at least one embodiment, the oneAPI data analytics library (also known as oneDAL) is a library that implements various data analytics applications and distributed computing. In at least one embodiment, oneDAL implements various algorithms for preprocessing, transformation, analysis, modeling, verification, and decision-making for data analysis in batch, online, and distributed processing modes. In at least one embodiment, oneDAL implements various C++ and / or Java APIs and various connectors to one or more data sources. In at least one embodiment, oneDAL implements the DPC++ API extension to the traditional C++ interface and enables GPUs to be used for various algorithms.

[0432] In at least one embodiment, the oneAPI deep neural network library (also referred to as oneDNN) is a library that implements various deep learning functions. In at least one embodiment, oneDNN implements various neural networks, machine learning and deep learning functions, algorithms and / or variations thereof.

[0433] In at least one embodiment, the oneAPI collection communication library (also known as oneCCL) is a library for implementing various applications of deep learning and machine learning workloads. In at least one embodiment, oneCCL is built on top of lower-level communication middleware such as Message Passing Interface (MPI) and libfabrics. In at least one embodiment, oneCCL enables a set of deep learning-specific optimizations, such as prioritization, persistent operations, out-of-order execution, and / or variations thereof. In at least one embodiment, oneCCL implements various CPU and GPU functionalities.

[0434] In at least one embodiment, the oneAPI thread building block library (also referred to as oneTBB) is a library that implements various parallelization processes for various applications. In at least one embodiment, oneTBB is used for task-based shared parallel programming on a host machine. In at least one embodiment, oneTBB implements a general-purpose parallel algorithm. In at least one embodiment, oneTBB implements a concurrent container. In at least one embodiment, oneTBB implements a scalable memory allocator. In at least one embodiment, oneTBB implements a work-stealing task scheduler. In at least one embodiment, oneTBB implements low-level synchronization primitives. In at least one embodiment, oneTBB is compiler-independent and can be used on various processors, such as GPUs, PPUs, CPUs, and / or variants thereof.

[0435] In at least one embodiment, the oneAPI video processing library (also known as oneVPL) is a library for accelerating video processing in one or more applications. In at least one embodiment, oneVPL implements various video decoding, encoding, and processing functions. In at least one embodiment, oneVPL implements various functions for media pipelines on CPUs, GPUs, and other accelerators. In at least one embodiment, oneVPL implements device discovery and selection for media-centric and video analytics workloads. In at least one embodiment, oneVPL implements API primitives for zero-copy buffer sharing.

[0436] In at least one embodiment, the oneAPI programming model utilizes the DPC++ programming language. In at least one embodiment, the DPC++ programming language is a programming language that is, but is not limited to, a functionally similar version of the CUDA mechanism that defines device code and distinguishes between device code and host code. In at least one embodiment, the DPC++ programming language may include a subset of the functionality of a CUDA programming language. In at least one embodiment, the oneAPI programming model is used with the DPC++ programming language to perform one or more CUDA programming model operations.

[0437] It should be noted that while the example embodiments described herein may relate to the CUDA programming model, the techniques described herein can be used with any suitable programming model, such as HIP, oneAPI, and / or variations thereof.

[0438] At least one embodiment of this disclosure may be described in view of the following terms:

[0439] 1. A system comprising:

[0440] At least one processor;

[0441] At least one memory, comprising instructions, wherein, in response to execution of the instructions by the at least one processor, the system at least:

[0442] Select one or more lights from a set of lights associated with the virtual scene of the frame to be rendered as graphics;

[0443] Select a second light from at least one of one or more lights from a pixel in a previous frame used to render the graphics, or from at least one light from one or more lights associated with a pixel that is spatially close to the pixel.

[0444] Select at least one light from the first one or more lights and the second one or more lights to render pixels in subsequent frames of the graphic; and

[0445] The first one or more lights and the second one or more lights are used to color the pixels of the frame of the graphic.

[0446] 2. The system as described in Clause 1, wherein the shading is determined at least in part based on the visibility determination of light for one or more pixels in the previous frame used to render the graphics.

[0447] 3. The system as described in Clause 1 or 2, wherein the shading is determined at least in part based on the reuse of visibility determination of light used to render one or more pixels spatially adjacent to the pixel.

[0448] 4. The system of any one of clauses 1-3, wherein the at least one memory further comprises instructions that, in response to execution by the at least one processor, cause the system to at least:

[0449] The reuse visibility determination is based at least in part on the available computing capacity of the system.

[0450] 5. The system of any one of clauses 1-4, wherein the at least one memory further comprises instructions that, in response to execution by the at least one processor, cause the system to at least:

[0451] The visibility determination of the light is stored for one or more pixels in subsequent frames used to render the graphics.

[0452] 6. The system of any one of Clauses 1-5, wherein at least one of a lower frequency or pixel resolution than that of the pixels in the frame used to render the graphics is used to store the light for rendering one or more pixels in a subsequent frame.

[0453] 7. The system of any one of Clauses 1-6, wherein light for rendering one or more pixels in subsequent frames of a graphic is selected, at least in part, based on a random process.

[0454] 8. The system of any one of Clauses 1-7, wherein the first one or more lights are selected based at least in part on resampling from the plurality of lights associated with the virtual scene.

[0455] 9. The system of any one of clauses 1-8, wherein the at least one memory further comprises instructions that, in response to execution by the at least one processor, cause the system to at least:

[0456] The second one or more lights are selected based at least in part on resampling from one or more lights of a pixel in a previous frame used to render the graphics, or from at least one of the one or more lights associated with a pixel spatially adjacent to the pixel.

[0457] 10. A method comprising:

[0458] Select the first one or more lights from a virtual scene of a frame to be rendered as graphics;

[0459] Select a second one or more lights from a plurality of lights associated with one or more pixels in at least one of the frames or previous frames;

[0460] Select at least one light from the first one or more lights and the second one or more lights to render one or more pixels in subsequent frames of the graphic; and

[0461] The pixels of a frame of the graphic are rendered at least in part based on the first one or more lights and the second one or more lights.

[0462] 11. The method of Clause 10, further comprising rendering the pixel by reusing at least a visibility determination of light used to render the one or more pixels in the previous frame.

[0463] 12. The method as described in Clause 10 or 11 further includes rendering the pixel by reusing at least the visibility determination of light associated with the one or more pixels adjacent to the pixel.

[0464] 13. The method as described in any one of clauses 10-12, further comprising:

[0465] Determine the computational capacity of the frames available for rendering the graphics; and

[0466] The reuse of visibility determination is adjusted, at least in part, based on the determined computational capacity.

[0467] 14. The method as described in any one of clauses 10-13, further comprising:

[0468] The visibility determination of the light is stored for use in rendering one or more pixels in subsequent frames of the graphic.

[0469] 15. The method as described in any one of Clauses 10-14, further comprising:

[0470] The light is selected to render one or more pixels in subsequent frames of the graphic, based at least in part on a random process and the contribution of the selected light to the appearance of the pixel.

[0471] 16. The method of any one of clauses 10-15, wherein the first one or more lights are resampled from the lights in the virtual scene.

[0472] 17. The method of any one of clauses 10-16, wherein the second or more lights are resampled from the lights of a previously selected subsequent frame for rendering the graphics.

[0473] 18. A non-transitory computer-readable storage medium comprising instructions that, in response to execution by at least one processor of a computing device, cause the computing device to at least:

[0474] Select the first one or more lights from a virtual scene of a frame to be rendered as graphics;

[0475] Select a second light or more from a plurality of lights used to render one or more pixels of a previous frame or at least one of the frames;

[0476] Selecting one or more pixels of light from the first one or more lights and the second one or more lights for rendering the graphics in subsequent frames; and

[0477] The pixels of the frame of the graphic are colored based at least in part on the first one or more lights and the second one or more lights.

[0478] 19. The non-transitory computer-readable storage medium as described in Clause 18, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least:

[0479] The visibility determination of light used to render one or more pixels in the previous frame is reused.

[0480] 20. The non-transitory computer-readable storage medium as described in Clause 18 or 19, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least:

[0481] The pixel is tinted using the visibility determination of light used to render one or more pixels adjacent to the pixel.

[0482] 21. The non-transitory computer-readable storage medium of any one of clauses 18-20, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least:

[0483] The determination of reuse visibility for coloring the pixels is based at least in part on a measurement of available computing capacity.

[0484] 22. The non-transitory computer-readable storage medium of any one of clauses 18-21, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least:

[0485] The visibility determination of the light is stored for use in rendering one or more pixels in subsequent frames of the graphics.

[0486] 23. The non-transitory computer-readable storage medium of any one of clauses 18-22, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least:

[0487] The light is selected for rendering one or more pixels in subsequent frames of the graphics based at least in part on a random process that is weighted according to the corresponding light's contribution to the illumination of the pixel.

[0488] 24. The non-transitory computer-readable storage medium as described in any one of Clauses 18-23, wherein a first one or more lights are resampled from a plurality of lights in a virtual scene.

[0489] 25. A non-transitory computer-readable storage medium as described in any one of Clauses 18-24, wherein a second one or more lights are resampled from lights used to render one or more pixels in at least one of the frames or previous frames.

[0490] Other variations are within the spirit of this disclosure. Therefore, although the disclosed technology is readily adaptable to various modifications and alternative constructions, certain embodiments thereof are illustrated in the accompanying drawings and have been described in detail above. However, it should be understood that the disclosure is not intended to be limited to one or more specific forms disclosed, but rather, it is intended to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of this disclosure as defined in the appended claims.

[0491] Unless otherwise stated or obviously contradicted by the context, the terms “a,” “an,” and “the,” and similar references, used in the context of describing the disclosed embodiments (particularly in the context of the appended claims), should be interpreted as encompassing both singular and plural forms, rather than as definitions of the terms. Unless otherwise stated, the terms “comprising,” “having,” “including,” and “containing” should be interpreted as open-ended terms (meaning “including, but not limited to”). The term “connection” (referring to a physical connection where not modified) should be interpreted as partially or wholly contained, attached to, or joined together, even with some intervention. Unless otherwise indicated herein, references to numerical ranges herein are intended only as a way of abbreviating each individual value falling within that range, and each individual value is incorporated into the specification as if it were separately described herein. Unless otherwise indicated or contradicted by the context, the use of the terms “set” (e.g., “item set”) or “subset” should be interpreted as a non-empty set comprising one or more members. Furthermore, unless otherwise indicated or contradicted by the context, the term “subset” of the corresponding set does not necessarily mean an appropriate subset of the corresponding set, but rather that the subset and the corresponding set can be equal.

[0492] Unless otherwise explicitly stated or clearly contradicted by the context, connective phrases such as “at least one of A, B, and C” or “at least one of A, B, and C” are understood in the context to generally refer to items, terms, etc., which can be A or B or C, or any non-empty subset of the set A, B, and C. For example, in an illustrative example of a set with three members, the connective phrases “at least one of A, B, and C” and “at least one of A, B, and C” refer to any of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Therefore, such connective language is generally not intended to imply that some embodiments require the presence of at least one of A, at least one of B, and at least one of C. Additionally, unless otherwise stated or contradicted by the context, the term “multiple” indicates a plural state (e.g., “multiple items” means multiple items). The number of items in a multiple item is at least two, but may be more if explicitly indicated or indicated by the context. Furthermore, unless otherwise stated or clearly understood from the context, the phrase “based on” means “at least partially based on” rather than “based on only”.

[0493] Unless otherwise indicated herein or clearly contradicted by the context, the operations of the processes described herein may be performed in any suitable order. In at least one embodiment, processes such as those described herein (or variations thereof and / or combinations thereof) are executed under the control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more application programs) that is executed jointly on one or more processors via hardware or a combination thereof. In at least one embodiment, the code is stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transient signals (e.g., propagating transient electrical or electromagnetic transmissions) but includes non-transitory data storage circuitry (e.g., buffers, caches, and queues). In at least one embodiment, code (e.g., executable code or source code) is stored on one or more non-transitory computer-readable storage media (or other memory for storing executable instructions) on which executable instructions are stored, which, when executed by one or more processors of a computer system (e.g., as a result of execution), cause the computer system to perform the operations described herein. In at least one embodiment, the set of non-transitory computer-readable storage media comprises multiple non-transitory computer-readable storage media, and one or more of the individual non-transitory storage media lack all the code, but the multiple non-transitory computer-readable storage media collectively store all the code. In at least one embodiment, the executable instructions are executed such that different instructions are executed by different processors; for example, the non-transitory computer-readable storage media store the instructions, and the main central processing unit (“CPU”) executes some instructions while the graphics processing unit (“GPU”) executes other instructions. In at least one embodiment, different components of the computer system have separate processors, and the different processors execute different subsets of the instructions.

[0494] Therefore, in at least one embodiment, the computer system is configured to implement one or more services that perform the operations of the processes described herein, either individually or collectively, and such a computer system is configured with suitable hardware and / or software to enable the implementation of the operations. Furthermore, the computer system implementing at least one embodiment of this disclosure is a single device, and in another embodiment it is a distributed computer system comprising multiple devices operating in different ways, such that the distributed computer system performs the operations described herein, and that a single device does not perform all the operations.

[0495] The use of any and all examples or exemplary language (e.g., “such as”) provided herein is intended only to better illustrate embodiments of this disclosure and does not constitute a limitation on the scope of the disclosure unless otherwise required. No language in the specification should be construed as indicating that any unclaimed element is essential to the practice of the disclosure.

[0496] All references cited in this article, including publications, patent applications and patents, are incorporated herein by reference as if each reference were individually and specifically indicated to be incorporated herein by reference and the entire contents of which are described herein.

[0497] The terms “coupled” and “connected”, and their derivatives, may be used in the specification and claims. It should be understood that these terms may not be intended to be synonyms with each other. Rather, in certain examples, “connected” or “coupled” may be used to indicate that two or more elements are in direct or indirect physical or electrical contact with each other. “Coupled” may also mean that two or more elements are not in direct contact with each other, but still cooperate or interact with each other.

[0498] Unless otherwise expressly stated, it will be understood that throughout this specification, terms such as “processing,” “computing,” “determining,” etc., refer to the actions and / or processes of a computer or computing system or similar electronic computing device that process and / or convert data represented as physical quantities (e.g., electrons) in the registers and / or memory of the computing system into other data represented as physical quantities in the memory, registers, or other such information storage, transmission, or display devices of the computing system.

[0499] In a similar manner, the term "processor" can refer to any device or part of memory that processes electronic data from registers and / or memory and converts that electronic data into other electronic data that can be stored in registers and / or memory. As a non-limiting example, a "processor" can be a CPU or a GPU. A "computing platform" can include one or more processors. As used herein, a "software" process can include, for example, software and / or hardware entities that perform work over time, such as tasks, threads, and intelligent agents. Similarly, each process can refer to multiple processes that execute instructions sequentially or intermittently, sequentially, or in parallel. The terms "system" and "method" are used interchangeably herein, provided that a system can embody one or more methods, and a method can be considered a system.

[0500] In at least one embodiment, an arithmetic logic unit is a set of combinational logic circuits that takes one or more inputs to produce a result. In at least one embodiment, a processor uses an arithmetic logic unit to implement mathematical operations, such as addition, subtraction, or multiplication. In at least one embodiment, an arithmetic logic unit is used to implement logical operations, such as logical AND / OR or XOR. In at least one embodiment, an arithmetic logic unit is stateless and is made of physical switching components (such as semiconductor transistors) arranged to form logic gates. In at least one embodiment, an arithmetic logic unit may internally operate as a stateful logic circuit with an associated clock. In at least one embodiment, an arithmetic logic unit may be configured as an asynchronous logic circuit with an internal state not maintained in an associated register set. In at least one embodiment, an arithmetic logic unit is used by a processor to combine operands stored in one or more registers of the processor and produce an output that can be stored by the processor in another register or memory location.

[0501] In at least one embodiment, as a result of processing instructions retrieved by the processor, the processor presents one or more inputs or operands to the arithmetic logic unit (ALU), such that the ALU produces a result at least in part based on instruction codes provided to the ALU of the inputs. In at least one embodiment, the instruction codes provided by the processor to the ALU are at least in part based on instructions executed by the processor. In at least one embodiment, combinational logic in the ALU processes the inputs and produces an output, which is placed on a bus within the processor. In at least one embodiment, the processor selects a destination register, memory location, output device, or output storage location on the output bus, such that timing the processor causes the result produced by the ALU to be sent to the desired location.

[0502] This document refers to the process of acquiring, obtaining, receiving, or inputting analog or digital data into a subsystem, computer system, or computer-implemented machine. The process of acquiring, obtaining, receiving, or inputting analog and digital data can be accomplished in various ways, such as by receiving data as a parameter to a function call or a call to an application programming interface (API). In some implementations, the process of acquiring, obtaining, receiving, or inputting analog or digital data can be accomplished by transmitting data via a serial or parallel interface. In another implementation, the process of acquiring, obtaining, receiving, or inputting analog or digital data can be accomplished by transmitting data from a providing entity to an acquiring entity via a computer network. Reference can also be made to providing, outputting, transmitting, sending, or presenting analog or digital data. In various examples, the process of providing, o...

Claims

1. A graphics rendering method, comprising: Select the first one or more lights from a virtual scene of a frame to be rendered as graphics; Select a second one or more lights from a plurality of lights associated with one or more pixels in at least one of the frames or previous frames; Select at least one light from the first one or more lights and the second one or more lights for rendering one or more pixels in subsequent frames of the graphics; as well as The pixels of a frame of the graphic are rendered at least in part based on the first one or more lights and the second one or more lights.

2. The method according to claim 1, further comprising: The pixel is rendered by reusing at least the visibility determination made for the light used to render the one or more pixels in the previous frame.

3. The method according to claim 1, further comprising: The pixel is rendered by reusing at least the visibility determination of light associated with one or more neighboring pixels.

4. The method according to claim 1, further comprising: Determine the computational capacity of the frames that can be used to render the graphics; as well as The reuse of visibility determination is adjusted, at least in part, based on the determined computational capacity.

5. The method according to claim 1, further comprising: The visibility determination of the light is stored for one or more pixels in subsequent frames used to render the graphics.

6. The method according to claim 1, further comprising: The light used to render one or more pixels in subsequent frames of the graphic is selected, at least in part, based on a random process and the contribution of the selected light to the appearance of the pixel.

7. The method of claim 1, wherein the first one or more lights are resampled from the lights in the virtual scene.

8. The method of claim 1, wherein the second or more lights are resampled from the lights of a previously selected subsequent frame for rendering the graphics.

9. A graphics rendering system, comprising: At least one processor; At least one memory, comprising instructions, wherein, in response to execution of the instructions by the at least one processor, the system at least: Select one or more lights from a set of lights associated with the virtual scene of the frame to be rendered as graphics; Select a second one or more lights from at least one of the lights of a pixel in a previous frame used to render the graphics, or from at least one of the lights of a pixel spatially adjacent to the pixel. Select at least one light from the first one or more lights and the second one or more lights to render the pixels in subsequent frames; as well as The first one or more lights and the second one or more lights are used to color the pixels of the frame of the graphic.

10. The system of claim 9, wherein the shading is determined at least in part based on the visibility determination of light for one or more pixels in the previous frame used to render the graphics.

11. The system of claim 9, wherein the shading is determined at least in part based on the visibility determination of light used to render one or more pixels spatially adjacent to the pixel.

12. The system of claim 9, wherein the at least one memory further comprises instructions that, in response to execution by the at least one processor, cause the system to at least: The reuse visibility determination is based at least in part on the available computing capacity of the system.

13. The system of claim 9, wherein the at least one memory further comprises instructions that, in response to execution by the at least one processor, cause the system to at least: The visibility determination of the light is stored for one or more pixels in subsequent frames used to render the graphics.

14. The system of claim 9, wherein the light for rendering one or more pixels in a subsequent frame is stored using at least one of a lower frequency or pixel resolution than the pixels of the frame used to render the graphics.

15. The system of claim 9, wherein the light is selected, at least in part, based on a random process for rendering one or more pixels in subsequent frames of the graphics.

16. The system of claim 9, wherein the first one or more lights are selected based at least in part on resampling from the plurality of lights associated with the virtual scene.

17. The system of claim 9, wherein the at least one memory further comprises instructions, the instructions being responsive to execution by the at least one processor, causing the system to at least: The second one or more lights are selected based at least in part on resampling from one or more lights of a pixel in a previous frame used to render the graphics, or from at least one of the one or more lights associated with a pixel spatially adjacent to the pixel.

18. A non-transitory computer-readable storage medium comprising instructions that, in response to execution by at least one processor of a computing device, cause the computing device to at least: Select the first one or more lights from a virtual scene of a frame to be rendered as graphics; Select a second light or more from a plurality of lights used to render one or more pixels of a previous frame or at least one of the frames; Selecting one or more pixels of light from the first one or more lights and the second one or more lights for rendering the graphics in subsequent frames; and The pixels of the frame of the graphic are colored based at least in part on the first one or more lights and the second one or more lights.

19. The non-transitory computer-readable storage medium of claim 18, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least: The visibility determination made for the light used to render one or more pixels in the previous frame is reused.

20. The non-transitory computer-readable storage medium of claim 18, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least: The pixel is tinted using the visibility determination of light used to render one or more pixels adjacent to the pixel.

21. The non-transitory computer-readable storage medium of claim 18, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least: The determination of reuse visibility for coloring the pixels is based at least in part on measurements of available computing capacity.

22. The non-transitory computer-readable storage medium of claim 18, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least: The visibility determination of the light is stored for one or more pixels in subsequent frames used to render the graphics.

23. The non-transitory computer-readable storage medium of claim 18, further comprising instructions that, in response to execution by at least one processor of the computing device, cause the computing device to at least: The light source for one or more pixels in subsequent frames used to render the graphics is selected based at least in part on a random process influenced by a weighted average of the corresponding light source, the weighting of which is proportional to the contribution of the corresponding light source to the illumination of the pixel.

24. The non-transitory computer-readable storage medium of claim 18, wherein the first one or more lights are resampled from a plurality of lights in the virtual scene.

25. The non-transitory computer-readable storage medium of claim 18, wherein the second one or more lights are resampled from the lights used to render at least one or more pixels of the frame or a previous frame.