Image enhancement using one or more neural networks
By combining low-resolution generation with upscaling, deep learning supersampling, and neural network approximation techniques, the problems of resource-intensive and timing-constrained high-resolution image generation are addressed, achieving efficient and high-quality image generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NVIDIA CORP
- Filing Date
- 2021-09-02
- Publication Date
- 2026-06-30
AI Technical Summary
When generating high-resolution images and video content, existing technologies are resource-intensive and struggle to meet timing requirements, especially on resource-constrained devices, resulting in limited image quality.
By employing a low-resolution generation and upscaling approach, combined with deep learning supersampling (DLSS) and post-processing techniques, images are first generated at low resolution and visual effects are applied. Then, these effects are approximated by parameterized functions and neural networks and applied to high-resolution images, reducing processing requirements and improving efficiency.
It significantly reduces the demand for graphics processing hardware, improves the efficiency and quality of high-resolution image generation, meets timing requirements, and ensures the visual consistency of high-resolution images.
Figure CN114140333B_ABST
Abstract
Description
Technical Field
[0001] At least one embodiment relates to processing resources for performing and facilitating artificial intelligence. For example, at least one embodiment relates to a processor or computing system for training neural networks according to the various novel techniques described herein.
Background Technology
[0002] Image and video content is increasingly being generated and displayed at higher resolutions and on higher-quality displays. The methods used to generate this content at these higher resolutions are often very resource-intensive, which can be problematic for devices with limited resources. Furthermore, video content often needs to be displayed at a target or minimum frame rate, making it difficult to generate high-resolution content at such frame rates. Typically, the quality of the resulting content is constrained by these and other limitations.
Attached Figure Description
[0003] Various embodiments according to this disclosure will be described with reference to the accompanying drawings, in which:
[0004] Figures 1A and 1B show images generated according to at least one embodiment;
[0005] Figures 2A and 2B show a method for image post-processing according to at least one embodiment;
[0006] Figure 3 shows components of a system for providing generated image content according to at least one embodiment;
[0007] Figure 4 shows a process for enhancing an image according to at least one embodiment;
[0008] Figure 5 shows a process for applying visual effects to an image according to at least one embodiment;
[0009] Figure 6A shows inference and/or training logic according to at least one embodiment;
[0010] Figure 6B shows inference and/or training logic according to at least one embodiment;
[0011] Figure 7 shows an example data center system according to at least one embodiment;
[0012] Figure 8 shows a computer system according to at least one embodiment;
[0013] Figure 9 shows a computer system according to at least one embodiment;
[0014] Figure 10 shows a computer system according to at least one embodiment;
[0015] Figure 11 shows a computer system according to at least one embodiment;
[0016] Figure 12A shows a computer system according to at least one embodiment;
[0017] Figure 12B shows a computer system according to at least one embodiment;
[0018] Figure 12C shows a computer system according to at least one embodiment;
[0019] Figure 12D shows a computer system according to at least one embodiment;
[0020] Figures 12E and 12F show a shared programming model according to at least one embodiment;
[0021] Figure 13 shows an exemplary integrated circuit and an associated graphics processor according to at least one embodiment;
[0022] Figures 14A-14B show an exemplary integrated circuit and an associated graphics processor according to at least one embodiment;
[0023] Figures 15A-15B show additional exemplary graphics processor logic according to at least one embodiment;
[0024] Figure 16 shows a computer system according to at least one embodiment;
[0025] Figure 17A shows a parallel processor according to at least one embodiment;
[0026] Figure 17B shows a partitioning unit according to at least one embodiment;
[0027] Figure 17C shows a processing cluster according to at least one embodiment;
[0028] Figure 17D shows a graphics multiprocessor according to at least one embodiment;
[0029] Figure 18 shows a multi-graphics processing unit (GPU) system according to at least one embodiment;
[0030] Figure 19 shows a graphics processor according to at least one embodiment;
[0031] Figure 20 shows the microarchitecture of a processor according to at least one embodiment;
[0032] Figure 21 shows a deep learning application processor according to at least one embodiment;
[0033] Figure 22 shows an example neuromorphic processor according to at least one embodiment;
[0034] Figures 23 and 24 show at least a portion of a graphics processor according to at least one embodiment;
[0035] Figure 25 shows at least a portion of a graphics processor core according to at least one embodiment;
[0036] Figures 26A-26B show at least a portion of a graphics processor core according to at least one embodiment;
[0037] Figure 27 shows a parallel processing unit ("PPU") according to at least one embodiment;
[0038] Figure 28 shows a general-purpose processing cluster ("GPC") according to at least one embodiment;
[0039] Figure 29 shows a memory partition unit of a parallel processing unit ("PPU") according to at least one embodiment;
[0040] Figure 30 shows a streaming multiprocessor according to at least one embodiment;
[0041] Figure 31 is an example data flow diagram for an advanced computing pipeline according to at least one embodiment;
[0042] Figure 32 is a system diagram of an example system for training, adapting, instantiating, and deploying machine learning models in an advanced computing pipeline, according to at least one embodiment;
[0043] Figure 33A shows a data flow diagram illustrating a process for training a machine learning model according to at least one embodiment; and
[0044] Figure 33B is an example illustration of a client-server architecture for enhancing annotation tools using a pre-trained annotation model, according to at least one embodiment.
Detailed Implementation
[0045] In at least one embodiment, the image or video generation system can generate image or video content at a first resolution, which may be a resolution lower than one or more target output resolutions. In at least one embodiment, this may correspond to the lower-resolution image 100 shown in Figure 1A. In at least one embodiment, this lower-resolution generation can reduce processing requirements, enabling faster content generation on devices with fewer resources (e.g., processor or memory). In at least one embodiment, the lower-resolution content can then be upscaled to generate a higher-resolution image 110 for output at a target resolution. In at least one embodiment, this can include scaling from 1080p or 1440p resolution to 4K or 8K resolution. In at least one embodiment, the upscaling can also accept aliased input, such as image 100, and produce an upscaled image 110 that is both upsampled and anti-aliased. In at least one embodiment, this can be performed using a process such as Deep Learning Super Sampling (DLSS) 2.0 from NVIDIA. In at least one embodiment, this upscaling and anti-aliasing processing can produce a high-quality, high-resolution output image 110 based on the low-resolution aliased image 100, which can significantly reduce the requirements for video cards or other graphics processing hardware.
[0046] In at least one embodiment, at least some amount of post-processing can be used to add effects to, enhance, or augment the resulting image. In at least one embodiment, this may include, for example, the effects and enhancements shown in the post-processed output image 120 of Figure 1B. In at least one embodiment, post-processing may add effects such as bloom 122 around a bright object, lens flare 124, motion blur 126, or other effects, such as color correction, sharpening, filtering, chromatic aberration, lens distortion, chromatic glitch, or the addition of interface elements. In at least one embodiment, post-processing may also enhance the image, for example, by adding graphical user interface (GUI) or head-up display (HUD) elements (such as health data 128, clock element 130, ammunition status 132, or other such information). In at least one embodiment, it is desired that the added content has the same resolution, sharpness, and image quality as the rest of the image.
[0047] In at least one embodiment, post-processing can be performed on the high-resolution anti-aliased image 110. However, in at least one embodiment, this may require more processing or resource capacity than is available, or may fail to meet timing requirements, especially for higher resolutions or frame rates. In at least one embodiment, post-processing can be applied to the lower-resolution image 100 before upscaling and anti-aliasing, but this may result in a lower-than-desired quality of the output image, especially for higher resolutions.
[0048] In at least one embodiment, enhancements or other visual effects may be applied to the lower-resolution version of the image, and information about these visual effects is subsequently applied to the higher-resolution version, such as in the process 200 illustrated in Figure 2A. In at least one embodiment, an initial image 202 may be generated, for example, by a rendering engine or a game engine. In at least one embodiment, the initial image output by the rendering engine may have a base color set for all pixels based on the determined objects or other content of the image, together with lighting, shading, or other effects that may be determined using ray tracing or other such processes. In at least one embodiment, this may be an aliased image with jagged edges (that is, an image without anti-aliasing applied) with a resolution lower than at least one target output resolution. In at least one embodiment, the initial image is generated at a lower resolution to reduce memory requirements and improve rendering speed. In at least one embodiment, this may include frames of video game content rendered in real time during a game session. In at least one embodiment, the initial image 202 may be scaled to generate a higher-resolution upscaled image 204, for example, for the target output resolution. In at least one embodiment, the upscaling process may involve upsampling and anti-aliasing performed in DLSS as described above. In at least one embodiment, a lower-resolution image 206 is also generated. In at least one embodiment, the lower-resolution image may correspond to the generated image 202, a downscaled version of image 202, or a downscaled version of the high-resolution image 204. In at least one embodiment, the advantage of downscaling the high-resolution image 204 is that anti-aliasing, as well as potentially other effects or processing, has already been applied to the image.
In at least one embodiment, the high-resolution image 204 and the low-resolution image 206 can be generated in parallel, and both are anti-aliased, which can allow for faster generation of the low-resolution image 206 to reduce the overall latency of post-processing.
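As a minimal sketch of deriving a low-resolution image from an already anti-aliased high-resolution image, the following averages non-overlapping pixel blocks. The box filter, the function name `box_downscale`, and the integer scale factor are illustrative assumptions; the text does not prescribe a particular downscaling method.

```python
import numpy as np

def box_downscale(u: np.ndarray, factor: int) -> np.ndarray:
    """Average non-overlapping factor x factor blocks of an (H, W, C) image."""
    h, w, c = u.shape
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    # Split each spatial axis into (block index, offset inside block).
    blocks = u.reshape(h // factor, factor, w // factor, factor, c)
    # Averaging over the two offset axes leaves one value per block.
    return blocks.mean(axis=(1, 3))

# Hypothetical 4x4 RGB image standing in for the high-resolution image.
u = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
Du = box_downscale(u, 2)
print(Du.shape)
```

Because every low-resolution pixel is an average of already anti-aliased pixels, the downscaled image keeps the anti-aliasing of the high-resolution image.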
[0049] In at least one embodiment, one or more visual effects that would otherwise be applied to the high-resolution image 204 can instead be applied to the low-resolution image 206. In at least one embodiment, this can include any suitable visual effects, such as those mentioned or suggested herein, including bloom, lens flare, color adjustment, etc. In at least one embodiment, this can result in a lower-resolution, anti-aliased image 208 with these visual effects applied. In at least one embodiment, applying these visual effects to the low-resolution image, whether at the per-pixel level or to groups of pixels, can significantly reduce processing requirements compared to applying these effects to the high-resolution image 204. In at least one embodiment, the pixel variations of the low-resolution image 208 can be determined, and these variations can be applied or approximated to the corresponding pixels of the upscaled image 204 to produce a high-resolution anti-aliased image 210 with these visual effects applied. In at least one embodiment, this can be accomplished using various methods, such as by performing pixel mapping and making corresponding adjustments, through one or more convolutions, or by using one or more neural networks that can take such visual effect data from the lower-resolution image 208 and the upscaled image 204 as input and can infer a high-resolution output image 210 with the effect data applied. In at least one embodiment, this can be performed using any suitable neural network, such as a convolutional neural network (CNN), which is trained to generate images using the provided visual effects or enhancement data.
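The simplest of the methods listed above, pixel mapping with corresponding adjustments, can be sketched as follows. This is an illustration under stated assumptions rather than the claimed method: the function name `transfer_effect_delta`, the nearest-neighbour mapping from low-resolution to high-resolution pixels, the additive form of the adjustment, and the [0, 1] value range are all hypothetical choices.

```python
import numpy as np

def transfer_effect_delta(u, Du, y, factor):
    """Add the low-resolution per-pixel change (y - Du) to the matching
    high-resolution pixels of u, using nearest-neighbour pixel mapping."""
    delta = y - Du  # change introduced by the visual effect at low resolution
    # Each low-resolution pixel maps to a factor x factor block of u.
    delta_up = np.repeat(np.repeat(delta, factor, axis=0), factor, axis=1)
    return np.clip(u + delta_up, 0.0, 1.0)

rng = np.random.default_rng(0)
u = rng.random((4, 4, 3)) * 0.5                   # stand-in upscaled image
Du = u.reshape(2, 2, 2, 2, 3).mean(axis=(1, 3))   # 2x box-downscaled copy
y = Du + 0.25                                     # hypothetical effect: uniform brightening
out = transfer_effect_delta(u, Du, y, 2)
```

An additive transfer like this keeps the high-resolution detail of `u` but cannot represent effects whose strength depends on the pixel value, which is one motivation for the parameterized functions described next.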
[0050] In at least one embodiment, the visual effect data can be approximated using an enhancement function, for example, it can take the form of an appropriately parameterized function, including linear or polynomial functions. In at least one embodiment, the application of one or more visual effects can result in changes to individual pixels (or groups of pixels) in a low-resolution image, and these changes can be represented by the parameters of the enhancement function. In at least one embodiment, the image enhancement function can take the following form:
[0051]
[0052] where $\alpha_{ij}$, $\beta_{ij}$, and $\gamma_{ij}$ are parameters that can be estimated using a local window, and the remaining term denotes a blurred version of $u_{ij}$. In at least one embodiment, i and j represent the two-dimensional coordinates of an individual pixel. In at least one embodiment, as shown in view 250 of Figure 2B, u corresponds to a high-resolution image, while Du represents a downscaled (or low-resolution) version of that image. In at least one embodiment, the image may be downscaled using any suitable downscaling network, algorithm, or method discussed or suggested herein. In at least one embodiment, a downsampling operation may be performed, which may include various processing functions that incorporate downsampling. Downsampling may be performed to any suitable size, scale, or resolution, and may be similar to, smaller than, or larger than the original rendered image. In at least one embodiment, these values may be configured or selected based on factors such as one or more performance objectives or image characteristics. In at least one embodiment, the blurred term will be a matrix, rather than the value of a single pixel, that can represent the blurriness of this low-resolution, post-processed image. In at least one embodiment, using this term allows for the use of a high-resolution version of the image, rather than a blurred image, so that pixels can be borrowed from this low-resolution post-processed frame when constructing a higher-resolution post-processed output.
[0053] In at least one embodiment, the initial low-resolution, jagged image (L) 252 is generated by a rendering engine. In at least one embodiment, this initial image 252 can be received from the rendering engine and stored in a buffer until processing is possible. In at least one embodiment, the image can be retrieved from the buffer and provided as input to a scale conversion module 254 or process, such as DLSS, which can generate a higher-resolution, upsampled, anti-aliased image (u) 256. In at least one embodiment, this higher-resolution image can be temporarily stored in another buffer. In at least one embodiment, the scale conversion module 254 (or a separate module) can also generate a downscaled anti-aliased image (Du) 258 (e.g., as part of a linear downscaling process), which can represent a low-resolution (e.g., downsampled) version of (u) 256. In at least one embodiment, the downscaling resolution is the same as the resolution of the initially generated image 252, which allows the post-processing to be performed in game engines that may not support post-processing at different resolutions. In at least one embodiment, one or more visual effects can be applied to the lower-resolution image 258 using at least one effects module 260 or process. In at least one embodiment, this module or process may be provided by a separate system or service. In at least one embodiment, the result of applying the visual effect will be a low-resolution image (y) 262 with these effects or other post-processing applied. In at least one embodiment, the enhancement approximation module 264 may take both low-resolution images 258 and 262 with and without the visual effect applied, respectively, as input. In at least one embodiment, the approximation module 264 may compare these images Du and y to approximate a parameterized function f(Du) = y. 
In at least one embodiment, any change at a pixel position between Du and y can be represented or approximated by a set of parameters (e.g., α, β, γ) of the function. In at least one embodiment, there may be any number of parameters in such a function, which can be determined as the best approximation of those changes at the individual pixel positions in the low-resolution image. In at least one embodiment, other methods, such as convolution or neural networks, may be used to determine the parameter values. In at least one embodiment, other data or information representing changes due to the visual effect, such as pixel-specific change information or other such values, may also be provided or utilized.
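One way to approximate f(Du) = y can be sketched with a closed-form least-squares fit. The sketch assumes a simplified two-parameter affine form, y ≈ α·Du + γ, with the parameters shared inside each square window of a single-channel image; the function name `fit_affine_per_window`, the two-parameter form, and the small `eps` regularizer are illustrative assumptions, not the function given in the text.

```python
import numpy as np

def fit_affine_per_window(Du, y, window, eps=1e-8):
    """Least-squares fit of y ~ alpha * Du + gamma, with one (alpha, gamma)
    pair shared by all pixels of each window x window block."""
    h, w = Du.shape
    if h % window or w % window:
        raise ValueError("image size must be divisible by window")
    gh, gw = h // window, w // window
    x = Du.reshape(gh, window, gw, window)
    t = y.reshape(gh, window, gw, window)
    mean_x = x.mean(axis=(1, 3))
    mean_t = t.mean(axis=(1, 3))
    cov_xt = (x * t).mean(axis=(1, 3)) - mean_x * mean_t
    var_x = (x * x).mean(axis=(1, 3)) - mean_x * mean_x
    alpha = cov_xt / (var_x + eps)    # slope of the per-window regression
    gamma = mean_t - alpha * mean_x   # intercept of the per-window regression
    return alpha, gamma

rng = np.random.default_rng(0)
Du = rng.random((8, 8))
y = 1.5 * Du + 0.1                    # hypothetical effect: contrast gain plus offset
alpha, gamma = fit_affine_per_window(Du, y, window=4)
```

The `eps` term keeps the division stable in flat windows where the variance of Du is near zero, at the cost of a slight bias in the fitted slope.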
[0054] In at least one embodiment, these estimated parameters may be provided to an application process 266 or module that uses these values to modify a high-resolution, anti-aliased image (u) 256 to produce a high-resolution output image (y*) 268 that appears as if the visual effects were applied to the higher-resolution image 256. In at least one embodiment, this may involve interpolating these estimated parameter values into a function f(u) of the higher-resolution image 256 so that the effect-related changes are applied to individual pixels or groups of pixels in the higher-resolution image. In at least one embodiment, this approach can provide a relatively effective and accurate estimate of the true f(u) that would otherwise be applied directly to the higher-resolution image. In at least one embodiment, the per-pixel function may be applied to any or all pixels of the higher-resolution image 256. In at least one embodiment, the quality of this approximation may depend at least in part on the resolution difference between the lower-resolution image (y) 262 and the higher-resolution image (y*) 268, since going from 1440p to 4K may provide a more accurate and sharper visual effect or image enhancement than going from 1080p to 8K. In at least one embodiment, this produces an inferred, approximate, or estimated version of what the higher-resolution image would look like if these visual effects were applied directly to it. In at least one embodiment, this higher-resolution, anti-aliased image 268 (with the approximate visual effects applied) can then be provided as an output at the target resolution. In at least one embodiment, there may be multiple output images with different resolutions or aspect ratios, and estimated parameters from one or more lower-resolution images can be used to approximate these effects in one or more of these higher-resolution output images.
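The application step can be sketched as evaluating the fitted function on the high-resolution image with upsampled parameters. The sketch assumes a two-parameter affine form, y* = α·u + γ, parameter grids that divide the image size evenly, and nearest-neighbour parameter upsampling; the name `apply_upsampled_params` and these choices are hypothetical, and a smoother interpolation could be substituted.

```python
import numpy as np

def apply_upsampled_params(u, alpha, gamma):
    """Evaluate y* = alpha_up * u + gamma_up, where the coarse parameter grids
    are upsampled to the resolution of u by nearest-neighbour repetition."""
    h, w = u.shape
    gh, gw = alpha.shape
    if h % gh or w % gw:
        raise ValueError("image size must be a multiple of the parameter grid")
    ry, rx = h // gh, w // gw
    alpha_up = np.repeat(np.repeat(alpha, ry, axis=0), rx, axis=1)
    gamma_up = np.repeat(np.repeat(gamma, ry, axis=0), rx, axis=1)
    return alpha_up * u + gamma_up

u = np.ones((4, 4))                           # stand-in high-resolution image
alpha = np.array([[1.0, 2.0], [3.0, 4.0]])    # hypothetical fitted gains
gamma = np.full((2, 2), 0.5)                  # hypothetical fitted offsets
y_star = apply_upsampled_params(u, alpha, gamma)
```

Because only the parameters are upsampled while `u` itself is used at full resolution, the output retains the sharpness of the high-resolution image.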
[0055] In at least one embodiment, different approximation functions can be utilized. In at least one embodiment, the approximation function may assume a dedicated polynomial function for each pixel, for example, potentially resulting in three parameters per pixel. In at least one embodiment, these parameters will vary across the spatial range of the image. In at least one embodiment, this can be used to approximate pixel differences for post-processing effects that operate at the pixel level. In at least one embodiment, some post-processing effects (such as blurring) can be based on sliding pixel window operations, such as actual computations that may involve averaging a 3x3 pixel window. In at least one embodiment, pixel-by-pixel approximation may not provide this because it may not have enough input to approximate the function, so different functions or methods that are not pixel-specific can be utilized. In at least one embodiment, f can allow additional complexity to handle this situation. In at least one embodiment, convolutions or neural networks can be used for this approximation to allow this additional complexity. In at least one embodiment, additional steps can be taken to reduce complexity, for example, performing an approximation only on a single color channel (such as luma), as humans are more sensitive to it. In at least one embodiment, this reduction in complexity can also reduce processing time.
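The single-channel simplification can be sketched by fitting and applying the function on luma only, then rescaling the color channels by the luma ratio. The Rec. 709 luma weights, the ratio-based recoloring, and the names `luma` and `apply_on_luma` are assumptions for illustration; the text only states that a single channel such as luma may be used.

```python
import numpy as np

# Rec. 709 luma weights (an assumed choice of luma definition).
REC709 = np.array([0.2126, 0.7152, 0.0722])

def luma(rgb):
    """Weighted sum of the RGB channels of an (H, W, 3) image."""
    return rgb @ REC709

def apply_on_luma(rgb, alpha, gamma, eps=1e-6):
    """Apply y_out = alpha * y_in + gamma to luma only, then scale all three
    channels by y_out / y_in so that channel ratios are preserved."""
    y_in = luma(rgb)
    y_out = alpha * y_in + gamma
    ratio = y_out / np.maximum(y_in, eps)
    return rgb * ratio[..., None]

rng = np.random.default_rng(0)
rgb = rng.random((4, 4, 3)) * 0.5 + 0.1   # stand-in image, bounded away from zero
out = apply_on_luma(rgb, alpha=0.8, gamma=0.1)
```

Working on one channel means one set of parameters per pixel or window instead of three, which is where the reduction in processing time comes from.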
[0056] In at least one embodiment, at least a portion of the image or video content can be provided or presented locally on the client device 302, as shown in Figure 3. In at least one embodiment, at least a portion of the content may be provided by a content server 320 (e.g., a game server or provider system) via at least one wired or wireless network 340. In at least one embodiment, the content to be presented may include various types of content, such as video games, virtual reality (VR), augmented reality (AR), images, text, audio, haptic, or video content. In at least one embodiment, the client device 302 may include devices such as desktop computers, laptops, game consoles, smartphones, tablets, VR headsets, AR goggles, wearable computers, or smart TVs.
[0057] In at least one embodiment, client device 302 may use components of content application 304 on client device 302 and data locally stored on the client device to generate content for a session, such as a game session or a video viewing session. In at least one embodiment, content application 324 (e.g., a game or streaming application) executing on content server 320 may initiate a session associated with at least client device 302, such as by utilizing a session manager and user data stored in user database 334, and content 332 may be determined by content manager 326 and rendered using rendering engine 328 (if required by this type of content or platform), and transmitted to client device 302 using appropriate transport manager 322 for delivery via download, streaming, or another such transport channel. In at least one embodiment, client device 302 receiving the content may provide the content to a corresponding content application 304, which may also or alternatively include rendering engine 310 for rendering at least some of the content for presentation via client device 302, such as presenting video content via display 306, or presenting audio, such as sound and music, via at least one audio playback device 308 (such as speakers or headphones). In at least one embodiment, at least some of the content may already be stored on, rendered on, or accessible to the client device 302, such that at least this portion of the content does not need to be transmitted over the network 340; for example, the content may have been previously downloaded or stored locally on a hard disk or optical disc. In at least one embodiment, a transmission mechanism such as data streaming may be used to transmit the content from the server 320 or content database 334 to the client device 302. 
In at least one embodiment, at least a portion of the content may be obtained from or streamed from another source, such as a third-party content service 350, which may also include a content application 352 for generating or providing content. In at least one embodiment, portions of the functionality may be executed using multiple computing devices, or multiple processors (such as a combination of CPU and GPU) within one or more computing devices.
[0058] In at least one embodiment, the content application 324 includes a content manager 326 that can determine or analyze the content before it is sent to the client device 302. In at least one embodiment, the content manager 326 may also include, or cooperate with, other components capable of generating, modifying, or enhancing the content to be provided. In at least one embodiment, this may include a rendering engine 328 for rendering the content, such as jagged content at a first resolution. In at least one embodiment, an upsampling or scaling component 330 may generate at least one additional version of the image at a different resolution (higher or lower) and may perform at least some processing, such as anti-aliasing. In at least one embodiment, a post-processing component 332 may perform post-processing on one or more of these images, such as applying visual effects, enhancements, or augmentations discussed herein. In at least one embodiment, the content manager 326 may then select image or video frames of appropriate resolution to send to the client device 302. In at least one embodiment, the content application 304 on client device 302 may further include components such as rendering engine 310, upsampling module 312, and post-processing module 314, so that any or all of these functions may be performed additionally or alternatively on client device 302. In at least one embodiment, the content application 352 on third-party content service system 350 may also include such functionality. In at least one embodiment, the location where at least some of the functions are performed may be configurable or may depend on factors such as the type of client device 302 or the availability of a network connection with appropriate bandwidth. 
In at least one embodiment, upsampling module 330 or post-processing module 332 may include one or more neural networks for performing or assisting the function, wherein these neural networks (or at least the network parameters for these networks) may be provided by content server 320 or third-party system 350. In at least one embodiment, the system for content generation may include any suitable combination of hardware and software at one or more locations. In at least one embodiment, the generated image or video content at one or more resolutions may also be provided to or made available to other client devices 360, for example, for downloading or streaming from a media source storing a copy of the image or video content. In at least one embodiment, this could include transmitting images of game content for multiplayer games, where different client devices can display the content at different resolutions.
[0059] In at least one embodiment, the process 400 shown in Figure 4 can be used to generate image or video content. In at least one embodiment, an image is generated 402 at a first resolution, which may or may not have any anti-aliasing or other processing applied, as may be generated by a game engine. In at least one embodiment, the first resolution may be lower than the target output resolution and will therefore be referred to as "low" or "lower" resolution, but this should not be construed as specifying a particular resolution unless otherwise specified. In at least one embodiment, upsampling 404 may be performed to generate a high-resolution, anti-aliased image, where "high" resolution in this instance refers to a resolution higher than the first resolution. In at least one embodiment, multiple higher-resolution versions may be generated. In at least one embodiment, at least one downsampled, lower-resolution version of the anti-aliased image may also be generated 406, either in parallel or based on one of the generated higher-resolution anti-aliased images. In at least one embodiment, one or more visual effects, augmentations, or enhancements may be applied 408 to the lower-resolution anti-aliased image to generate a low-resolution, post-processed image that includes these effects, augmentations, or enhancements. In at least one embodiment, these effects, augmentations, or enhancements can be approximated 410 using a parameterized function of the coordinates of the lower-resolution image. In at least one embodiment, these parameters can then be used with a function for a higher-resolution anti-aliased image, wherein the function can be applied 412 to the image to determine changes at individual pixel positions or groups of pixel positions corresponding to these effects, augmentations, or enhancements applied to the lower-resolution image.
In at least one embodiment, this can produce at least one higher-resolution image that appears as if the post-processing was applied to this higher-resolution version. In at least one embodiment, the high-resolution anti-aliased image or video frame can then be provided as output 414, for example, for display by a client device configured to display or render at that resolution.
[0060] In at least one embodiment, the process 500 shown in Figure 5 can be performed to apply visual effects to images. In at least one embodiment, one or more visual effects may be applied 502 to one or more images whose resolution is less than a first resolution. In at least one embodiment, information about these visual effects applied to these one or more images may be determined 504, as may relate to the parameters of an enhancement function. In at least one embodiment, this information may be applied 506 to one or more upscaled images at or above the first resolution to generate one or more images having at least the first resolution, which include an approximation of the visual effects applied to the one or more images whose resolution is less than the first resolution.
[0061] In at least one embodiment, after upsampling and anti-aliasing, some post-processing can be directly applied to the higher-resolution image. In at least one embodiment, this may include adding text or user interface elements that will be provided as an overlay on objects in the image, such as on top of game content. In at least one embodiment, adding such augmentation or enhancement does not require a large amount of resource capacity, so such content can be directly added to the higher-resolution image to ensure the sharpness and high quality of that particular content. In at least one embodiment, other visual effects, such as bloom and blur, can be approximated using data from visual effects applied to the lower-resolution version, and this additional content can be applied to the higher-resolution image after this approximation. In at least one embodiment, the image may correspond to an image sequence or video frames, and the approximation function can be applied to each image or frame of the sequence. In at least one embodiment, the application of the function may involve upsampling these parameters of the function, and the "upsampled" function is used to enhance the higher-resolution image.
[0062] In at least one embodiment, different types or versions of post-processing can be applied to the image. In at least one embodiment, the post-processing version operates at the pixel level. In at least one embodiment, the input to the true post-processing function involves color data of a single pixel, and the output is the color of a single pixel. In at least one embodiment, this version can be used for effects such as tone mapping or converting colors to grayscale. In at least one embodiment, a pixel-by-pixel approach can be used because different approaches may cause some information to be corrupted, thus destroying details in higher resolution images. In at least one embodiment, another version can analyze information from a small spatial window to obtain effects such as bloom or blur. In at least one embodiment, rendering a neon light will result in pixels having a high brightness level, which may cause some glow around the neon light, referred to as bloom. In at least one embodiment, by approximating this glow within a small window, applying bloom causes bright pixel values to be blurred in a semi-transparent manner on nearby pixels. In at least one embodiment, another version of post-processing can completely replace a color, for example, for use in a GUI overlay or as an add-on to a heads-up display. In at least one embodiment, a set of convolutions can be utilized, and determining which convolution best approximates a given input / output pair can enable functions such as applying an actual blur kernel.
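Choosing among a set of convolutions can be sketched as scoring each candidate kernel by how well it maps the low-resolution input to the low-resolution post-processed output. The candidate set, the mean-squared-error score, the edge padding, and the names `filter2d` and `best_kernel` are illustrative assumptions. The helper computes a correlation, which equals a convolution for the symmetric kernels used here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter2d(img, kernel):
    """Correlate a 2-D image with an odd-sized kernel, edge-padding the image
    so that the output has the same shape as the input."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    windows = sliding_window_view(padded, (kh, kw))   # shape (H, W, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)

def best_kernel(Du, y, candidates):
    """Return the name of the candidate whose output is closest to y in
    mean squared error, together with all candidate errors."""
    errors = {name: float(np.mean((filter2d(Du, k) - y) ** 2))
              for name, k in candidates.items()}
    return min(errors, key=errors.get), errors

candidates = {
    "identity": np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]),
    "box3": np.full((3, 3), 1.0 / 9.0),
    "box5": np.full((5, 5), 1.0 / 25.0),
}
rng = np.random.default_rng(1)
Du = rng.random((16, 16))
y = filter2d(Du, candidates["box3"])   # hypothetical effect: 3x3 box blur
name, errors = best_kernel(Du, y, candidates)
```

Once a kernel is selected at low resolution, the same kernel (suitably rescaled for the resolution ratio) could be applied to the high-resolution image; that rescaling step is not shown here.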
[0063] In at least one embodiment, there may be an approximation function or equation for each pixel in the downsampled, anti-aliased frame. In at least one embodiment, if there are one million pixels and one million equations, each equation having several (e.g., three) unknown parameters, then there will be more unknowns than equations, which makes the system underdetermined and challenging to solve. In at least one embodiment, it can be assumed that these unknowns (such as α, β, and γ) are shared within a relatively small window, such as a 2x2 or 8x8 pixel region. In at least one embodiment, this can result in fewer unknowns than equations, which can allow for simplified approximations. In at least one embodiment, linear regression can be used to determine the values of these unknowns. In at least one embodiment, these determined values can then be upsampled to obtain different α, β, and γ (or other parameter) values for all pixels in a higher-resolution image, so that the function can be applied at that higher resolution.
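The windowed fit described above can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure: it assumes a single-channel image, a quadratic form f(x) = αx² + βx + γ chosen only so that each window has three unknowns, nearest-neighbor upsampling of the parameter maps, and hypothetical helper names (`fit_window_params`, `upsample_params`, `apply_effect`).

```python
import numpy as np

def fit_window_params(lo_in, lo_out, win=8):
    """Fit (alpha, beta, gamma) per win x win window by linear least squares."""
    h, w = lo_in.shape
    params = np.zeros((h // win, w // win, 3))
    for i in range(h // win):
        for j in range(w // win):
            x = lo_in[i * win:(i + 1) * win, j * win:(j + 1) * win].ravel()
            y = lo_out[i * win:(i + 1) * win, j * win:(j + 1) * win].ravel()
            # win*win equations, 3 shared unknowns: an overdetermined system.
            A = np.stack([x ** 2, x, np.ones_like(x)], axis=1)
            params[i, j] = np.linalg.lstsq(A, y, rcond=None)[0]
    return params

def upsample_params(params, factor):
    """Nearest-neighbor upsampling of the parameter maps (an assumed choice)."""
    return np.repeat(np.repeat(params, factor, axis=0), factor, axis=1)

def apply_effect(hi_in, params_up):
    """Apply the fitted function per pixel at the higher resolution."""
    a, b, c = params_up[..., 0], params_up[..., 1], params_up[..., 2]
    return a * hi_in ** 2 + b * hi_in + c

# Usage: fit on a 16x16 low-resolution pair, apply to a 32x32 image (2x scale).
rng = np.random.default_rng(0)
lo_in = rng.random((16, 16))
lo_out = 0.5 * lo_in ** 2 + 2.0 * lo_in + 0.1   # stand-in for the real effect
params = fit_window_params(lo_in, lo_out, win=8)
hi_in = rng.random((32, 32))
hi_out = apply_effect(hi_in, upsample_params(params, factor=8 * 2))
```

The upsampling factor is the window size multiplied by the resolution scale, so that every higher-resolution pixel receives the parameters of the window it falls in. Smoother interpolation of the parameter maps would be an equally plausible choice.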
[0064] In at least one embodiment, the approximation function f is a linear function that can be solved using a process such as matrix inversion. In at least one embodiment, a matrix can be constructed and its inverse solved in a shader. In at least one embodiment, linear regression can be used for linear functions, while different estimation techniques can be used for nonlinear functions. In at least one embodiment, f can be a function of a neural network, where techniques such as gradient descent can be used to determine the values of the parameters of f. In at least one embodiment, other techniques can also be used, such as Gauss-Newton optimization. In at least one embodiment, the Fast Fourier Transform (FFT) can be used to estimate a phase shift in terms of color or spatial displacement, and thereby estimate a portion of the transform of the approximation f, which can provide improved performance for groups of pixels or windows. In at least one embodiment, historical data can also be used, where pixel values from previous frames can serve as a starting point for estimating the parameters of the current frame. In at least one embodiment, this can result in a smaller search window for these parameters and may lead to a somewhat higher-resolution approximation function. In at least one embodiment, if each of these equations has three unknowns, data from the previous three frames can be used, which provides as many equations as unknowns and allows for different parameter values for each pixel and higher-precision results. In at least one embodiment, a recurrent neural network (RNN) can be used to learn this information over multiple frames. In at least one embodiment, such an RNN can utilize motion vectors to warp historical data in a manner similar to the warping process used in DLSS. 
In at least one embodiment, these motion vectors can be vectors used in upsampling, or they can be generated by another deep learning network, such as an interpolation or optical flow network. In at least one embodiment, Cholesky decomposition can be used to estimate these parameters for overlapping or non-overlapping pixel windows. In at least one embodiment, the function may not be specific to color-to-color transformations, but can receive coordinates and output colors, which may be suitable for text or other such content.
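As one possible reading of the Cholesky-based estimation mentioned above, the parameters for a pixel window can be obtained by forming the normal equations of the least-squares problem and factoring them. The sketch below is an assumption-laden illustration: the quadratic basis, the small ridge term added for numerical safety, and the name `solve_window_cholesky` are not taken from the disclosure.

```python
import numpy as np

def solve_window_cholesky(x, y, ridge=1e-8):
    """Estimate (alpha, beta, gamma) for one window via the normal equations.

    x, y: flattened input and post-processed pixel values of the window.
    ridge: small diagonal term (assumed) that keeps A^T A positive definite.
    """
    A = np.stack([x ** 2, x, np.ones_like(x)], axis=1)
    AtA = A.T @ A + ridge * np.eye(3)
    Aty = A.T @ y
    L = np.linalg.cholesky(AtA)        # AtA = L @ L.T, L lower triangular
    z = np.linalg.solve(L, Aty)        # solve L z = A^T y
    return np.linalg.solve(L.T, z)     # solve L^T p = z

# Usage on a single 8x8 window with a known stand-in effect.
rng = np.random.default_rng(0)
x = rng.random(64)
y = 0.5 * x ** 2 + 2.0 * x + 0.1
alpha, beta, gamma = solve_window_cholesky(x, y)
```

Because the matrix A^T A is only 3x3 for three unknowns, a factorization of this size is small enough that it could plausibly be evaluated per window in a shader, although that mapping is not shown here.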
[0065] In at least one embodiment, noise or other variations may be added to the downsampled image to provide a more accurate approximation. In at least one embodiment, understanding the added noise can help learn information about the blur kernel used more quickly and accurately. In at least one embodiment, since this lower-resolution post-processed image is never shown to the end user, it may be appropriately modified to attempt to produce a better approximation for the corresponding high-resolution output.
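A small experiment can suggest why added noise may help. In the sketch below, which is an illustration under assumptions and not the method of the disclosure, a 3x3 blur kernel is estimated by least squares from an input image and its blurred output. The kernel values, the noise level, and the helper names `patches_3x3`, `apply_kernel`, and `estimate_kernel` are hypothetical.

```python
import numpy as np

def patches_3x3(img):
    """Return an (N, 9) matrix whose rows are the 3x3 neighborhoods of interior pixels."""
    h, w = img.shape
    cols = [img[dy:h - 2 + dy, dx:w - 2 + dx].ravel()
            for dy in range(3) for dx in range(3)]
    return np.stack(cols, axis=1)

def apply_kernel(img, kernel):
    """Apply a 3x3 kernel to interior pixels (stand-in for the real blur effect)."""
    return patches_3x3(img) @ kernel.ravel()

def estimate_kernel(lo_in, lo_out_flat):
    """Least-squares estimate of the 3x3 kernel from an input/output pair."""
    sol = np.linalg.lstsq(patches_3x3(lo_in), lo_out_flat, rcond=None)[0]
    return sol.reshape(3, 3)

kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16.0  # assumed blur kernel
rng = np.random.default_rng(0)

flat = np.full((16, 16), 0.5)                        # featureless region
noisy = flat + 0.05 * rng.standard_normal((16, 16))  # same region with known noise

est_flat = estimate_kernel(flat, apply_kernel(flat, kernel))
est_noisy = estimate_kernel(noisy, apply_kernel(noisy, kernel))
```

On the featureless input every neighborhood is identical, so the system is rank-deficient and the estimate cannot distinguish the true kernel from any other kernel with the same sum. With noise added, the neighborhoods differ and the same solver recovers the kernel in this toy setting.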
[0066] Inference and training logic
[0067] Figure 6A illustrates inference and / or training logic 615 for performing inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided below in conjunction with Figure 6A and / or Figure 6B.
[0068] In at least one embodiment, inference and / or training logic 615 may include, but is not limited to, code and / or data storage 601 for storing forward and / or output weights and / or input / output data, and / or other parameters configuring neurons or layers of a neural network trained for and / or used for inference in one or more embodiments. In at least one embodiment, training logic 615 may include or be coupled to code and / or data storage 601 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic, including integer and / or floating-point units (collectively, arithmetic logic units (ALUs)). In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALU based on the architecture of the neural network to which the code corresponds. In at least one embodiment, code and / or data storage 601 stores weight parameters and / or input / output data of each layer of a neural network trained or used in one or more embodiments during forward propagation of input / output data and / or weight parameters during training and / or inference using one or more embodiments. In at least one embodiment, any portion of the code and / or data storage 601 may be included within other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.
[0069] In at least one embodiment, any portion of the code and / or data storage 601 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 601 may be a cache memory, dynamic randomly addressable memory (“DRAM”), static randomly addressable memory (“SRAM”), non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the code and / or data storage 601 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip and off-chip storage space, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.
[0070] In at least one embodiment, inference and / or training logic 615 may include, but is not limited to, code and / or data storage 605 for storing backpropagation and / or output weights and / or input / output data corresponding to neurons or layers of a neural network trained and / or used for inference in one or more embodiments. In at least one embodiment, during training and / or inference using one or more embodiments, code and / or data storage 605 stores weight parameters and / or input / output data for each layer of a neural network trained or used in one or more embodiments during backpropagation of input / output data and / or weight parameters. In at least one embodiment, training logic 615 may include or be coupled to code and / or data storage 605 for storing graph code or other software to control timing and / or sequence, wherein weight and / or other parameter information is loaded to configure logic including integer and / or floating-point units (collectively, an arithmetic logic unit (ALU)). In at least one embodiment, code (such as graph code) loads weight or other parameter information into the processor ALU based on the architecture of the neural network to which the code corresponds. In at least one embodiment, any portion of the code and / or data storage 605 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of the code and / or data storage 605 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the code and / or data storage 605 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. 
In at least one embodiment, the choice between the code and / or data storage 605 being internal or external to the processor, for example, whether it consists of DRAM, SRAM, flash memory, or some other type of storage, depends on the available on-chip and off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors.
[0071] In at least one embodiment, code and / or data storage 601 and code and / or data storage 605 may be separate storage structures. In at least one embodiment, code and / or data storage 601 and code and / or data storage 605 may be the same storage structure. In at least one embodiment, code and / or data storage 601 and code and / or data storage 605 may be partially the same storage structure and partially separate storage structures. In at least one embodiment, any portion of code and / or data storage 601 and code and / or data storage 605 may be included with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory.
[0072] In at least one embodiment, the inference and / or training logic 615 may include, but is not limited to, one or more arithmetic logic units (“ALUs”) 610 (including integer and / or floating-point units) for performing logical and / or mathematical operations based at least in part on, or as indicated by, training and / or inference code (e.g., graph code), the results of which may produce activations (e.g., output values from layers or neurons within a neural network) stored in activation storage 620, which are functions of input / output and / or weight parameter data stored in code and / or data storage 601 and / or code and / or data storage 605. In at least one embodiment, the activations stored in activation storage 620 are generated according to linear algebra and / or matrix-based mathematics performed by ALU 610 in response to executing instructions or other code, wherein weight values stored in code and / or data storage 605 and / or code and / or data storage 601 are used as operands along with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in code and / or data storage 605 or code and / or data storage 601 or other on-chip or off-chip storage.
[0073] In at least one embodiment, one or more ALUs 610 are included in one or more processors or other hardware logic devices or circuits, while in another embodiment, one or more ALUs 610 may be external to the processor or other hardware logic device or circuit that uses them (e.g., a coprocessor). In at least one embodiment, one or more ALUs 610 may be included within an execution unit of a processor, or otherwise included in an ALU bank accessible by the execution unit of the processor, which may be within the same processor or distributed among different processors of different types (e.g., central processing unit, graphics processing unit, fixed-function unit, etc.). In at least one embodiment, code and / or data storage 601, code and / or data storage 605, and activation storage 620 may be on the same processor or other hardware logic device or circuit, while in another embodiment, they may be in different processors or other hardware logic devices or circuits, or in some combination of the same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 620 may be included together with other on-chip or off-chip data storage, including the processor's L1, L2, or L3 cache or system memory. Furthermore, inference and / or training code may be stored together with other code accessible to the processor or other hardware logic or circuitry, and may be fetched and / or processed using the processor’s fetch, decode, schedule, execute, retire, and / or other logic circuitry.
[0074] In at least one embodiment, the activation storage 620 may be a cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the activation storage 620 may be wholly or partially internal or external to one or more processors or other logic circuits. In at least one embodiment, the choice of whether the activation storage 620 is internal or external to the processor, for example, or composed of DRAM, SRAM, flash memory, or some other storage type, may depend on the available on-chip and off-chip storage, the latency requirements of the training and / or inference functions being performed, the batch size of the data used in the inference and / or training of the neural network, or some combination of these factors. In at least one embodiment, the inference and / or training logic 615 shown in Figure 6A can be used in conjunction with an application-specific integrated circuit (“ASIC”), such as a processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a processor (e.g., "Lake Crest") from Intel. In at least one embodiment, the inference and / or training logic 615 shown in Figure 6A can be used in conjunction with central processing unit (“CPU”) hardware, graphics processing unit (“GPU”) hardware, or other hardware (such as a field-programmable gate array (“FPGA”)).
[0075] Figure 6B illustrates inference and / or training logic 615 according to at least one embodiment. In at least one embodiment, the inference and / or training logic 615 may include, but is not limited to, hardware logic in which computational resources are dedicated or otherwise exclusively used in conjunction with weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, the inference and / or training logic 615 shown in Figure 6B can be used in conjunction with an application-specific integrated circuit (ASIC), such as a processing unit from Google, an inference processing unit (IPU) from Graphcore™, or a processor (e.g., "Lake Crest") from Intel. In at least one embodiment, the inference and / or training logic 615 shown in Figure 6B can be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware (e.g., a field-programmable gate array (FPGA)). In at least one embodiment, the inference and / or training logic 615 includes, but is not limited to, code and / or data storage 601 and code and / or data storage 605, which can be used to store code (e.g., graph code), weight values, and / or other information, including bias values, gradient information, momentum values, and / or other parameter or hyperparameter information. In at least one embodiment shown in Figure 6B, each of code and / or data storage 601 and code and / or data storage 605 is associated with dedicated computing resources (e.g., computing hardware 602 and computing hardware 606), respectively. 
In at least one embodiment, each of computing hardware 602 and computing hardware 606 includes one or more ALUs that perform mathematical functions (e.g., linear algebraic functions) only on the information stored in code and / or data storage 601 and code and / or data storage 605, respectively, and the results of the function execution are stored in activation storage 620.
[0076] In at least one embodiment, each of the code and / or data storage 601 and 605 and the corresponding computing hardware 602 and 606 corresponds to a different layer of the neural network, such that the activation resulting from one “storage / computation pair 601 / 602” of code and / or data storage 601 and computing hardware 602 is provided as input to the next “storage / computation pair 605 / 606” of code and / or data storage 605 and computing hardware 606, in order to mirror the conceptual organization of the neural network. In at least one embodiment, each storage / computation pair 601 / 602 and 605 / 606 may correspond to more than one neural network layer. In at least one embodiment, additional storage / computation pairs (not shown) may be included in the inference and / or training logic 615 after or in parallel with the storage / computation pairs 601 / 602 and 605 / 606.
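The chaining of storage / computation pairs can be pictured with a small software analogy. The sketch below is not a description of the hardware: the class name `StorageComputePair`, the dense layer with ReLU activation, and the example weights are assumptions made only to show one pair's activation feeding the next pair.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StorageComputePair:
    """Holds one layer's parameters (storage) and evaluates that layer (compute)."""
    weights: np.ndarray
    bias: np.ndarray

    def compute(self, x):
        # Linear algebra on the stored weights, followed by a ReLU activation.
        return np.maximum(self.weights @ x + self.bias, 0.0)

def run_pairs(pairs, x):
    """Feed the activation of each pair into the next pair, in order."""
    activation = x
    for pair in pairs:
        activation = pair.compute(activation)
    return activation

# Two pairs standing in for pairs 601 / 602 and 605 / 606.
pair_a = StorageComputePair(weights=np.eye(2), bias=np.zeros(2))
pair_b = StorageComputePair(weights=np.array([[1.0, -1.0]]), bias=np.zeros(1))
result = run_pairs([pair_a, pair_b], np.array([3.0, 1.0]))
```

Additional pairs appended to the list would correspond to the additional pairs placed after the two shown; pairs operating in parallel are not modeled in this sketch.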
[0077] Data Center
[0078] Figure 7 An example data center 700 that can be used with at least one embodiment is shown. In at least one embodiment, the data center 700 includes a data center infrastructure layer 710, a framework layer 720, a software layer 730, and an application layer 740.
[0079] In at least one embodiment, as shown in Figure 7, the data center infrastructure layer 710 may include a resource coordinator 712, grouped computing resources 714, and node computing resources (“node CRs”) 716(1)-716(N), where “N” represents a positive integer. In at least one embodiment, node CRs 716(1)-716(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid-state drives or disk drives), network input / output (“NW I / O”) devices, network switches, virtual machines (“VMs”), power modules, and cooling modules, etc. In at least one embodiment, one or more of node CRs 716(1)-716(N) may be servers having one or more of the aforementioned computing resources.
[0080] In at least one embodiment, the grouped computing resource 714 may include individual groups (not shown) of node CRs housed within one or more racks, or a plurality of racks (also not shown) housed within data centers in various geographical locations. The individual groups of node CRs within the grouped computing resource 714 may include computing, networking, memory, or storage resources that can be configured or allocated to support groups of one or more workloads. In at least one embodiment, several node CRs, including CPUs or processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, the one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.
[0081] In at least one embodiment, resource coordinator 712 may configure or otherwise control one or more node CRs 716(1)-716(N) and / or grouped computing resources 714. In at least one embodiment, resource coordinator 712 may include a Software Design Infrastructure (“SDI”) management entity for data center 700. In at least one embodiment, resource coordinator 712 may include hardware, software, or some combination thereof.
[0082] In at least one embodiment, as shown in Figure 7, framework layer 720 includes job scheduler 722, configuration manager 724, resource manager 726, and distributed file system 728. In at least one embodiment, framework layer 720 may include a framework supporting software 732 of software layer 730 and / or one or more applications 742 of application layer 740. In at least one embodiment, software 732 or application 742 may respectively include web-based service software or applications, such as services or applications provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 720 may be, but is not limited to, a free and open-source software web application framework, such as Apache Spark™ (hereinafter referred to as "Spark"), which can leverage distributed file system 728 for large-scale data processing (e.g., "big data"). In at least one embodiment, job scheduler 722 may include a Spark driver to facilitate the scheduling of workloads supported by the various layers of data center 700. In at least one embodiment, configuration manager 724 may be able to configure different layers, such as software layer 730 and framework layer 720, including Spark and distributed file system 728, for supporting large-scale data processing. In at least one embodiment, resource manager 726 is capable of managing clustered or grouped computing resources mapped to or allocated to support distributed file system 728 and job scheduler 722. In at least one embodiment, clustered or grouped computing resources may include grouped computing resources 714 on data center infrastructure layer 710. In at least one embodiment, resource manager 726 may coordinate with resource coordinator 712 to manage these mapped or allocated computing resources.
[0083] In at least one embodiment, the software 732 included in the software layer 730 may include software used by at least a portion of the nodes CR716(1)-716(N), the grouped computing resources 714, and / or the distributed file system 728 of the framework layer 720. One or more types of software may include, but are not limited to, Internet web page search software, email virus scanning software, database software, and streaming video content software.
[0084] In at least one embodiment, one or more applications 742 included in application layer 740 may include one or more types of applications used by at least a portion of nodes CR716(1)-716(N), grouped computing resources 714, and / or the distributed file system 728 of framework layer 720. One or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.
[0085] In at least one embodiment, any of the configuration manager 724, resource manager 726, and resource coordinator 712 can implement any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, self-modification actions can mitigate potentially poor configuration decisions by data center operators of data center 700 and can prevent underutilization and / or poor performance of the data center.
[0086] In at least one embodiment, data center 700 may include tools, services, software, or other resources to train one or more machine learning models or to use one or more machine learning models to predict or infer information according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model can be trained by calculating weight parameters based on a neural network architecture using the software and computing resources described above with respect to data center 700. In at least one embodiment, information can be inferred or predicted using trained machine learning models corresponding to one or more neural networks using the resources described above with respect to data center 700 by using weight parameters calculated through one or more training techniques described herein.
[0087] In at least one embodiment, the data center may use a CPU, application-specific integrated circuit (ASIC), GPU, FPGA, or other hardware to utilize the aforementioned resources to perform training and / or inference. Furthermore, one or more of the aforementioned software and / or hardware resources may be configured as a service to allow a user to train or perform information inference, such as image recognition, speech recognition, or other artificial intelligence services.
[0088] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in the system of Figure 7 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0089] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0090] Computer System
[0091] Figure 8 is a block diagram illustrating an exemplary computer system according to at least one embodiment, which may be a system of interconnected devices and components, a system-on-a-chip (SoC), or some combination thereof, formed with a processor that may include an execution unit to execute instructions. In at least one embodiment, in accordance with this disclosure, such as the embodiments described herein, computer system 800 may include, but is not limited to, components such as processor 802, whose execution unit includes logic to execute algorithms for processing data. In at least one embodiment, computer system 800 may include a processor, such as a processor of a processor family available from Intel Corporation of Santa Clara, California, including Xeon™, XScale™ and / or StrongARM™, Core™, or Nervana™ microprocessors, although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes, etc.) can also be used. In at least one embodiment, the computer system 800 can execute a version of the Windows operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software, and / or graphical user interfaces can also be used.
[0092] The embodiments can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol (IP) devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor (“DSP”), a system-on-a-chip (SoC), a network computer (“NetPC”), a set-top box, a network hub, a wide area network (“WAN”) switch, or any other system that can execute one or more instructions according to at least one embodiment.
[0093] In at least one embodiment, the computer system 800 may include, but is not limited to, a processor 802, which may include, but is not limited to, one or more execution units 808, to perform machine learning model training and / or inference according to the techniques described herein. In at least one embodiment, the computer system 800 is a single-processor desktop or server system, but in another embodiment, the computer system 800 may be a multiprocessor system. In at least one embodiment, the processor 802 may include, but is not limited to, a Complex Instruction Set Computer (“CISC”) microprocessor, a Reduced Instruction Set Computing (“RISC”) microprocessor, a Very Long Instruction Word (“VLIW”) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. In at least one embodiment, the processor 802 may be coupled to a processor bus 810, which can transmit data signals between the processor 802 and other components in the computer system 800.
[0094] In at least one embodiment, processor 802 may include, but is not limited to, a Level 1 (“L1”) internal cache memory (“cache”) 804. In at least one embodiment, processor 802 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory may reside external to processor 802. Depending on specific implementation and requirements, other embodiments may also include a combination of internal and external caches. In at least one embodiment, register file 806 may store different types of data in various registers, including but not limited to integer registers, floating-point registers, status registers, and instruction pointer registers.
[0095] In at least one embodiment, an execution unit 808, including but not limited to logic for performing integer and floating-point operations, is also located within the processor 802. In at least one embodiment, the processor 802 may further include a microcode (“ucode”) read-only memory (“ROM”) for storing microcode for certain macro instructions. In at least one embodiment, the execution unit 808 may include logic for processing a packed instruction set 809. In at least one embodiment, by including the packed instruction set 809 in the instruction set of a general-purpose processor, along with the associated circuitry for executing the instructions, operations used by many multimedia applications can be performed using packed data in the general-purpose processor 802. In one or more embodiments, many multimedia applications can be executed more quickly and efficiently by using the full width of the processor’s data bus to perform operations on the packed data, which may eliminate the need to transfer smaller data units across the processor’s data bus to perform one or more operations on one data element at a time.
[0096] In at least one embodiment, execution unit 808 may also be used in a microcontroller, embedded processor, graphics device, DSP, and other types of logic circuitry. In at least one embodiment, computer system 800 may include, but is not limited to, memory 820. In at least one embodiment, memory 820 may be implemented as a dynamic random access memory (“DRAM”) device, a static random access memory (“SRAM”) device, a flash memory device, or other memory device. In at least one embodiment, memory 820 may store instructions 819 and / or data 821 represented by data signals that can be executed by processor 802.
[0097] In at least one embodiment, the system logic chip may be coupled to a processor bus 810 and a memory 820. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub (“MCH”) 816, and the processor 802 may communicate with the MCH 816 via the processor bus 810. In at least one embodiment, the MCH 816 may provide a high-bandwidth memory path 818 to the memory 820 for instruction and data storage, as well as for storage of graphics commands, data, and textures. In at least one embodiment, the MCH 816 may direct data signals between the processor 802, the memory 820, and other components in the computer system 800, and bridge data signals between the processor bus 810, the memory 820, and the system I / O interface 822. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, the MCH 816 may be coupled to the memory 820 via the high-bandwidth memory path 818, and the graphics / video card 812 may be coupled to the MCH 816 via an Accelerated Graphics Port (“AGP”) interconnect 814.
[0098] In at least one embodiment, the computer system 800 may use a system I / O interface 822, which is a proprietary hub interface bus that couples the MCH 816 to the I / O controller hub (“ICH”) 830. In at least one embodiment, the ICH 830 may provide direct connectivity to certain I / O devices via a local I / O bus. In at least one embodiment, the local I / O bus may include, but is not limited to, a high-speed I / O bus for connecting peripheral devices to the memory 820, chipset, and processor 802. Examples may include, but are not limited to, an audio controller 829, a firmware hub (“Flash BIOS”) 828, a wireless transceiver 826, a data storage 824, a conventional I / O controller 823 including a user input and keyboard interface 825, a serial expansion port 827 (e.g., Universal Serial Bus (USB)), and a network controller 834. The data storage 824 may include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.
[0099] In at least one embodiment, Figure 8 shows a system including interconnected hardware devices or “chips,” while in other embodiments, Figure 8 may illustrate an exemplary system-on-a-chip (SoC). In at least one embodiment, the devices shown in Figure 8 can be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of the computer system 800 are interconnected using a Compute Express Link (CXL) interconnect.
[0100] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, the inference and/or training logic 615 can be used in the system of Figure 8 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0101] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0102] Figure 9 is a block diagram illustrating an electronic device 900 for utilizing a processor 910 according to at least one embodiment. In at least one embodiment, the electronic device 900 may be, for example, but not limited to, a laptop computer, tower server, rack server, blade server, desktop computer, tablet computer, mobile device, telephone, embedded computer, or any other suitable electronic device.
[0103] In at least one embodiment, system 900 may include, but is not limited to, processor 910 communicatively coupled to any suitable number or type of components, peripherals, modules, or devices. In at least one embodiment, processor 910 is coupled using a bus or interface, such as an I²C bus, System Management Bus (“SMBus”), Low Pin Count (“LPC”) bus, Serial Peripheral Interface (“SPI”), High Definition Audio (“HDA”) bus, Serial Advanced Technology Attachment (“SATA”) bus, Universal Serial Bus (“USB”) (versions 1, 2, 3, etc.), or Universal Asynchronous Receiver/Transmitter (“UART”) bus. In at least one embodiment, Figure 9 shows a system including interconnected hardware devices or “chips,” while in other embodiments, Figure 9 may illustrate an exemplary system-on-a-chip (SoC). In at least one embodiment, the devices shown in Figure 9 can be interconnected with proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of Figure 9 are interconnected using Compute Express Link (CXL) interconnects.
[0104] In at least one embodiment, Figure 9 may include a display 924, a touchscreen 925, a touchpad 930, a near-field communication unit (“NFC”) 945, a sensor hub 940, a thermal sensor 946, an Express Chipset (“EC”) 935, a trusted platform module (“TPM”) 938, a BIOS/firmware/flash memory (“BIOS, FW Flash”) 922, a DSP 960, a drive 920 (e.g., a solid-state drive (“SSD”) or a hard disk drive (“HDD”)), a wireless local area network unit (“WLAN”) 950, a Bluetooth unit 952, a wireless wide area network unit (“WWAN”) 956, a global positioning system (GPS) 955, a camera (“USB 3.0 camera”) 954 (e.g., a USB 3.0 camera), and/or a low-power double data rate (“LPDDR”) memory unit (“LPDDR3”) 915 implemented in, for example, the LPDDR3 standard. These components may each be implemented in any suitable manner.
[0105] In at least one embodiment, other components may be communicatively coupled to processor 910 via the components described herein. In at least one embodiment, accelerometer 941, ambient light sensor (“ALS”) 942, compass 943, and gyroscope 944 may be communicatively coupled to sensor hub 940. In at least one embodiment, thermal sensor 939, fan 937, keyboard 936, and touchpad 930 may be communicatively coupled to EC 935. In at least one embodiment, speaker 963, earphone 964, and microphone (“mic”) 965 may be communicatively coupled to audio unit (“audio codec and Class D amp”) 962, which in turn may be communicatively coupled to DSP 960. In at least one embodiment, audio unit 962 may include, for example, but not limited to, an audio encoder/decoder (“codec”) and a Class D amplifier. In at least one embodiment, SIM card (“SIM”) 957 may be communicatively coupled to WWAN unit 956. In at least one embodiment, components such as WLAN unit 950, Bluetooth unit 952, and WWAN unit 956 can be implemented in a Next Generation Form Factor (“NGFF”).
[0106] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, the inference and/or training logic 615 can be used in the system of Figure 9 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0107] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0108] Figure 10 illustrates a computer system 1000 according to at least one embodiment. In at least one embodiment, the computer system 1000 is configured to implement various processes and methods described throughout this disclosure.
[0109] In at least one embodiment, the computer system 1000 includes, but is not limited to, at least one central processing unit (“CPU”) 1002 connected to a communication bus 1010 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), Peripheral Component Interconnect Express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol. In at least one embodiment, the computer system 1000 includes, but is not limited to, a main memory 1004, and control logic (e.g., implemented in hardware, software, or a combination thereof) and data are stored in the main memory 1004, which may take the form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“network interface”) 1022 provides an interface to other computing devices and networks, through which the computer system 1000 receives data from, and transmits data to, other systems.
[0110] In at least one embodiment, the computer system 1000 includes, but is not limited to, an input device 1008, a parallel processing system 1012, and a display device 1006, which may be implemented using conventional cathode ray tubes (“CRTs”), liquid crystal displays (“LCDs”), light-emitting diodes (“LEDs”), plasma displays, or other suitable display technologies. In at least one embodiment, user input is received from the input device 1008 (such as a keyboard, mouse, touchpad, microphone, etc.). In at least one embodiment, each of the foregoing modules may reside on a single semiconductor platform to form the processing system.
[0111] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, the inference and/or training logic 615 can be used in the system of Figure 10 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0112] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those of one or more lower-resolution versions of these images.
[0113] Figure 11 illustrates a computer system 1100 according to at least one embodiment. In at least one embodiment, the computer system 1100 includes, but is not limited to, a computer 1110 and a USB flash drive 1120. In at least one embodiment, the computer 1110 may include, but is not limited to, any number and type of processors (not shown) and memory (not shown). In at least one embodiment, the computer 1110 includes, but is not limited to, a server, a cloud instance, a laptop computer, and a desktop computer.
[0114] In at least one embodiment, the USB flash drive 1120 includes, but is not limited to, a processing unit 1130, a USB interface 1140, and USB interface logic 1150. In at least one embodiment, the processing unit 1130 can be any instruction execution system, apparatus, or device capable of executing instructions. In at least one embodiment, the processing unit 1130 can include, but is not limited to, any number and type of processing cores (not shown). In at least one embodiment, the processing core 1130 includes an application-specific integrated circuit (“ASIC”) optimized to perform any number and type of operations associated with machine learning. For example, in at least one embodiment, the processing core 1130 is a tensor processing unit (“TPC”) optimized to perform machine learning inference operations. In at least one embodiment, the processing core 1130 is a vision processing unit (“VPU”) optimized to perform machine vision and machine learning inference operations.
[0115] In at least one embodiment, the USB interface 1140 can be any type of USB connector or USB receptacle. For example, in at least one embodiment, the USB interface 1140 is a USB 3.0 Type-C receptacle for data and power. In at least one embodiment, the USB interface 1140 is a USB 3.0 Type-A connector. In at least one embodiment, the USB interface logic 1150 may include any amount and type of logic enabling the processing unit 1130 to connect to a device (e.g., computer 1110) via the USB connector 1140.
[0116] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding the inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, the inference and/or training logic 615 can be used in the system of Figure 11 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0117] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those of one or more lower-resolution versions of these images.
[0118] Figure 12A illustrates an exemplary architecture in which multiple GPUs 1210-1213 are communicatively coupled to multiple multi-core processors 1205-1206 via high-speed links 1240-1243 (e.g., buses, point-to-point interconnects, etc.). In one embodiment, the high-speed links 1240-1243 support communication throughput of 4 GB/s, 30 GB/s, 80 GB/s, or higher. Various interconnect protocols can be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0.
[0119] Furthermore, in one embodiment, two or more GPUs 1210-1213 are interconnected via high-speed links 1229-1230, which may use the same or different protocols/links as those used for high-speed links 1240-1243. Similarly, two or more multi-core processors 1205-1206 may be connected via high-speed link 1228, which may be a symmetric multiprocessor (SMP) bus operating at speeds of 20 GB/s, 30 GB/s, 120 GB/s, or higher. Alternatively, all communication between the various system components shown in Figure 12A may be accomplished using similar protocols/links (e.g., over a common interconnect fabric).
[0120] In one embodiment, each multi-core processor 1205-1206 is communicatively coupled to processor memories 1201-1202 via memory interconnects 1226-1227, and each GPU 1210-1213 is communicatively coupled to GPU memories 1220-1223 via GPU memory interconnects 1250-1253. Memory interconnects 1226-1227 and 1250-1253 may utilize the same or different memory access technologies. By way of example and not limitation, processor memories 1201-1202 and GPU memories 1220-1223 may be volatile memories, such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high bandwidth memory (HBM), and / or may be non-volatile memories, such as 3D XPoint or Nano-RAM. In one embodiment, some portions of the processor memories 1201-1202 may be volatile memory, while other portions may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).
[0121] As described below, although the various processors 1205-1206 and GPUs 1210-1213 can be physically coupled to specific memories 1201-1202 and 1220-1223 respectively, a unified memory architecture can be implemented, in which the same virtual system address space (also known as the “effective address” space) is distributed among the various physical memories. For example, processor memories 1201-1202 can each contain 64GB of system memory address space, and GPU memories 1220-1223 can each contain 32GB of system memory address space (resulting in a total of 256GB of addressable memory in this example).
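The arithmetic of this example can be checked with a minimal sketch. It assumes, purely for illustration, that the physical memories are laid out back to back in a single effective address space; the text does not prescribe any particular layout, and the names `Region`, `build_address_map`, and `locate` are hypothetical.

```python
from dataclasses import dataclass

GiB = 1 << 30


@dataclass(frozen=True)
class Region:
    name: str  # reference numeral of the physical memory
    base: int  # first effective address mapped to this memory
    size: int  # size of the memory in bytes


def build_address_map(sizes):
    """Lay out the memories contiguously (an assumed layout, not the patent's)."""
    regions, base = [], 0
    for name, size in sizes:
        regions.append(Region(name, base, size))
        base += size
    return regions


def locate(regions, addr):
    """Return (memory name, offset within that memory) for an effective address."""
    for region in regions:
        if region.base <= addr < region.base + region.size:
            return region.name, addr - region.base
    raise ValueError(f"address {addr:#x} is outside the unified address space")


# Sizes from the example: two 64 GB processor memories and four 32 GB GPU memories.
SIZES = [("1201", 64 * GiB), ("1202", 64 * GiB)] + [
    (str(numeral), 32 * GiB) for numeral in range(1220, 1224)
]
ADDRESS_MAP = build_address_map(SIZES)
TOTAL_BYTES = sum(region.size for region in ADDRESS_MAP)  # 2*64 GB + 4*32 GB
```

Under this assumed layout, `TOTAL_BYTES` equals 256 GB, and `locate` resolves any effective address to exactly one physical memory.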
[0122] Figure 12B illustrates additional details regarding the interconnection between a multi-core processor 1207 and a graphics acceleration module 1246, according to an exemplary embodiment. The graphics acceleration module 1246 may include one or more GPU chips integrated on a line card coupled to the processor 1207 via a high-speed link 1240. Alternatively, the graphics acceleration module 1246 may be integrated on the same package or chip as the processor 1207.
[0123] In at least one embodiment, the processor 1207 shown includes multiple cores 1260A-1260D, each core having a translation lookaside buffer 1261A-1261D and one or more caches 1262A-1262D. In at least one embodiment, cores 1260A-1260D may include various other components (not shown) for executing instructions and processing data. Caches 1262A-1262D may include level 1 (L1) and level 2 (L2) caches. Furthermore, one or more shared caches 1256 may be included in caches 1262A-1262D and shared by the respective groups of cores 1260A-1260D. For example, one embodiment of the processor 1207 includes 24 cores, each core having its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one or more L2 and L3 caches are shared by two adjacent cores. Processor 1207 and graphics acceleration module 1246 are connected to system memory 1214, which may include the processor memories 1201-1202 of Figure 12A.
[0124] Coherency of data and instructions stored in the various caches 1262A-1262D, 1256 and system memory 1214 is maintained via inter-core communication over the coherence bus 1264. In at least one embodiment, for example, each cache may have associated cache coherency logic/circuitry to communicate over the coherence bus 1264 in response to detected reads or writes to particular cache lines. In one implementation, a cache snooping protocol is implemented over the coherence bus 1264 to snoop cache accesses.
[0125] In at least one embodiment, proxy circuitry 1225 communicatively couples graphics acceleration module 1246 to coherence bus 1264, thereby allowing graphics acceleration module 1246 to participate in cache coherence protocols as a peer of cores 1260A-1260D. Specifically, in at least one embodiment, interface 1235 provides connectivity to proxy circuitry 1225 via high-speed link 1240 (e.g., PCIe bus, NVLink, etc.), and interface 1237 connects graphics acceleration module 1246 to link 1240.
[0126] In one implementation, the accelerator integrated circuit 1236 provides cache management, memory access, context management, and interrupt management services on behalf of multiple graphics processing engines 1231, 1232, N of the graphics acceleration module 1246. In at least one embodiment, the graphics processing engines 1231, 1232, N may each include a separate graphics processing unit (GPU). Alternatively, the graphics processing engines 1231, 1232, N may include different types of graphics processing engines within the GPU, such as graphics execution units, media processing engines (e.g., video encoders/decoders), samplers, and blit engines. In at least one embodiment, the graphics acceleration module 1246 may be a GPU having multiple graphics processing engines 1231-1232, N, or the graphics processing engines 1231-1232, N may be individual GPUs integrated on a common package, line card, or chip.
[0127] In one embodiment, the accelerator integrated circuit 1236 includes a memory management unit (MMU) 1239 for performing various memory management functions, such as virtual-to-physical memory translation (also known as effective-to-real memory translation), and memory access protocols for accessing system memory 1214. The MMU 1239 may also include a translation lookaside buffer (TLB) (not shown) for caching virtual/effective-to-physical/real address translations. In one implementation, cache 1238 stores commands and data for efficient access by graphics processing engines 1231-1232, N. In one implementation, data stored in cache 1238 and graphics memories 1233-1234, M is kept coherent with core caches 1262A-1262D, 1256 and system memory 1214. As previously mentioned, this task can be accomplished via proxy circuitry 1225 acting on behalf of cache 1238 and graphics memories 1233-1234, M (e.g., sending updates to cache 1238 related to modifications/accesses of cache lines on processor caches 1262A-1262D, 1256, and receiving updates from cache 1238).
[0128] In at least one embodiment, a set of registers 1245 stores context data of threads executed by graphics processing engines 1231-1232, N, and context management circuitry 1248 manages the thread context. For example, context management circuitry 1248 can perform save and restore operations to save and restore the context of individual threads during context switching (e.g., saving the first thread and storing the second thread so that the second thread can be executed by the graphics processing engine). For example, during context switching, context management circuitry 1248 can store the current register value to a designated area in memory (e.g., identified by a context pointer). The register value can then be restored when returning to the context. In one embodiment, interrupt management circuitry 1247 receives and processes interrupts received from system devices.
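The save and restore operations described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `ContextManager`, its method names, and the use of a dictionary keyed by context pointer are hypothetical stand-ins for context management circuitry 1248, registers 1245, and the designated memory areas.

```python
class ContextManager:
    """Toy software model of context save/restore (not the actual circuitry)."""

    def __init__(self, num_registers):
        self.registers = [0] * num_registers  # stands in for registers 1245
        self.memory = {}  # context pointer -> saved register values

    def save(self, context_pointer):
        # Copy the current register values to the area identified by the pointer.
        self.memory[context_pointer] = list(self.registers)

    def restore(self, context_pointer):
        # Reload the register values previously saved for this context.
        self.registers = list(self.memory[context_pointer])

    def switch(self, old_pointer, new_pointer):
        # Save the outgoing thread, then load the incoming one.
        self.save(old_pointer)
        if new_pointer in self.memory:
            self.restore(new_pointer)
        else:
            # Assumption: a thread with no saved context starts with cleared registers.
            self.registers = [0] * len(self.registers)
```

Switching away from a thread and later back to it returns the registers to the values they held at the moment of the first switch.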
[0129] In one implementation, MMU 1239 translates virtual / effective addresses from graphics processing engine 1231 into real / physical addresses in system memory 1214. In at least one embodiment, accelerator integrated circuit 1236 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 1246 and / or other accelerator devices. Graphics accelerator modules 1246 may be dedicated to a single application executing on processor 1207, or may be shared among multiple applications. In at least one embodiment, a virtualized graphics execution environment is presented, wherein the resources of graphics processing engines 1231-1232, N are shared with multiple applications or virtual machines (VMs). In at least one embodiment, resources may be subdivided into “slices” based on processing requirements and priorities associated with VMs and / or applications, which are allocated to different VMs and / or applications.
[0130] In at least one embodiment, the accelerator integrated circuit 1236 acts as a bridge to the system for the graphics acceleration module 1246, providing address translation and system memory caching services. Additionally, the accelerator integrated circuit 1236 can provide virtualization facilities for the host processor to manage virtualization of the graphics processing engines 1231-1232, N, interrupts, and memory management.
[0131] Because the hardware resources of graphics processing engines 1231-1232, N are explicitly mapped to the real address space seen by the host processor 1207, any host processor can directly address these resources using effective address values. One function of the accelerator integrated circuit 1236 is to physically separate the graphics processing engines 1231-1232, N, so that they appear to the system as independent units.
[0132] In at least one embodiment, one or more graphics memories 1233-1234, M are coupled to each graphics processing engine 1231-1232, N. The graphics memories 1233-1234, M store instructions and data processed by each graphics processing engine 1231-1232, N. In at least one embodiment, the graphics memories 1233-1234, M may be volatile memory, such as DRAM (including stacked DRAM), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and / or may be non-volatile memory, such as 3D XPoint or Nano-RAM.
[0133] In one embodiment, to reduce data traffic on link 1240, a biasing technique is used to ensure that the data stored in graphics memories 1233-1234, M is the data most frequently used by graphics processing engines 1231-1232, N, and preferably not used (or at least infrequently used) by cores 1260A-1260D. Similarly, the biasing mechanism attempts to keep the data needed by the cores (and preferably not graphics processing engines 1231-1232, N) in the core caches 1262A-1262D, 1256 and system memory 1214.
[0134] Figure 12C illustrates another exemplary embodiment in which the accelerator integrated circuit 1236 is integrated within the processor 1207. In this embodiment, the graphics processing engines 1231-1232, N communicate directly with the accelerator integrated circuit 1236 over the high-speed link 1240 via interfaces 1237 and 1235 (which, again, can utilize any form of bus or interface protocol). The accelerator integrated circuit 1236 can perform the same operations as those described with respect to Figure 12B, but potentially at higher throughput given its close proximity to the coherence bus 1264 and caches 1262A-1262D, 1256. At least one embodiment supports different programming models, including a dedicated-process programming model (without graphics acceleration module virtualization) and shared programming models (with virtualization), which may include programming models controlled by accelerator integrated circuit 1236 and programming models controlled by graphics acceleration module 1246.
[0135] In at least one embodiment, graphics processing engines 1231-1232, N are dedicated to a single application or process within a single operating system. In at least one embodiment, a single application can funnel requests from other applications to graphics processing engines 1231-1232, N, thereby providing virtualization within a VM / partition.
[0136] In at least one embodiment, graphics processing engines 1231-1232, N can be shared by multiple VM / application partitions. In at least one embodiment, the shared model can use a hypervisor to virtualize graphics processing engines 1231-1232, N to allow each operating system to access them. For a single-partition system without a hypervisor, the operating system owns graphics processing engines 1231-1232, N. In at least one embodiment, the operating system can virtualize graphics processing engines 1231-1232, N to provide access to each process or application.
[0137] In at least one embodiment, the graphics acceleration module 1246 or the individual graphics processing engines 1231-1232, N uses a process handle to select a process element. In at least one embodiment, the process element is stored in system memory 1214 and can be addressed using the effective address to real address translation techniques described herein. In at least one embodiment, the process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engines 1231-1232, N (i.e., invoking system software to add the process element to the process element linked list). In at least one embodiment, the lower 16 bits of the process handle may be the offset of the process element in the process element linked list.
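The use of the lower 16 bits of the process handle as an offset can be sketched as a simple bit mask. The meaning of the upper bits is implementation-specific and is not defined in the text; the helper names `process_element_offset` and `make_handle` are hypothetical and exist only for illustration.

```python
OFFSET_BITS = 16
OFFSET_MASK = (1 << OFFSET_BITS) - 1  # 0xFFFF


def process_element_offset(process_handle: int) -> int:
    """Extract the lower 16 bits: the offset within the process element linked list."""
    return process_handle & OFFSET_MASK


def make_handle(upper_bits: int, offset: int) -> int:
    """Pack an offset with opaque, implementation-specific upper bits (assumed layout)."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 16 bits")
    return (upper_bits << OFFSET_BITS) | offset
```

For example, a handle of `0xABCD1234` carries the offset `0x1234`, regardless of what the upper bits encode.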
[0138] Figure 12D illustrates an exemplary accelerator integration slice 1290. As used herein, a “slice” includes a designated portion of the processing resources of the accelerator integrated circuit 1236. The application’s effective address space 1282 in system memory 1214 stores process element 1283. In one embodiment, process element 1283 is stored in response to a GPU call 1281 from an application 1280 executing on processor 1207. Process element 1283 contains the process state of the corresponding application 1280. A work descriptor (WD) 1284 contained in process element 1283 may be a single job requested by the application, or it may contain a pointer to a job queue. In at least one embodiment, WD 1284 is a pointer to a job request queue in the application’s address space 1282.
[0139] The graphics acceleration module 1246 and / or the various graphics processing engines 1231-1232, N, can be shared by all processes or a subset of processes in the system. In at least one embodiment, infrastructure may be included for setting process states and sending WD 1284 to the graphics acceleration module 1246 to initiate operations in a virtualized environment.
[0140] In at least one embodiment, the dedicated-process programming model is implementation-specific. In this model, a single process owns either the graphics acceleration module 1246 or an individual graphics processing engine 1231. Because the graphics acceleration module 1246 is owned by a single process, the hypervisor initializes the accelerator integrated circuit 1236 for the owning partition, and the operating system initializes the accelerator integrated circuit 1236 for the owning process when the graphics acceleration module 1246 is assigned.
[0141] In operation, the WD fetch unit 1291 in the accelerator integration slice 1290 fetches the next WD 1284, which includes an indication of the work to be performed by one or more graphics processing engines of the graphics acceleration module 1246. Data from the WD 1284 can be stored in registers 1245 and used by the MMU 1239, interrupt management circuitry 1247, and/or context management circuitry 1248, as shown. For example, one embodiment of the MMU 1239 includes segment/page walk circuitry for accessing segment/page tables 1286 within the OS virtual address space 1285. The interrupt management circuitry 1247 can handle interrupt events 1292 received from the graphics acceleration module 1246. When performing graphics operations, the effective address 1293 generated by the graphics processing engines 1231-1232, N is translated into a real address by the MMU 1239.
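The effective-to-real address translation performed by the MMU can be illustrated with a simplified software model. This sketch assumes a single-level page table, 4 KiB pages, and a dictionary acting as a translation cache; none of these details come from the text, and `ToyMMU` is a hypothetical name.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages; the text does not specify a page size
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class ToyMMU:
    """Simplified effective-to-real translation with a cache of prior walks."""

    def __init__(self, page_table):
        self.page_table = page_table  # effective page number -> real page number
        self.tlb = {}  # cached translations
        self.walks = 0  # number of page-table walks performed

    def translate(self, effective_address):
        page = effective_address >> PAGE_SHIFT
        offset = effective_address & PAGE_MASK
        if page not in self.tlb:
            if page not in self.page_table:
                raise LookupError(f"translation fault for page {page:#x}")
            self.walks += 1  # cache miss: walk the page table once
            self.tlb[page] = self.page_table[page]
        return (self.tlb[page] << PAGE_SHIFT) | offset
```

Repeated accesses to the same page reuse the cached translation, so only the first access incurs a page-table walk in this model.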
[0142] In one embodiment, the same set of registers 1245 is copied for each graphics processing engine 1231-1232, N, and / or graphics acceleration module 1246, and these registers 1245 can be initialized by a hypervisor or the operating system. Each of these copied registers can be included in the accelerator integration slice 1290. Exemplary registers that can be initialized by the hypervisor are shown in Table 1.
[0143]
[0144]
[0145] Table 2 shows exemplary registers that can be initialized by the operating system.
[0146]
[0147] In at least one embodiment, each WD 1284 is specific to a particular graphics acceleration module 1246 and / or graphics processing engine 1231-1232, N. It contains all the information required for the graphics processing engine 1231-1232, N to complete its work, or it may be a pointer to a memory location where the application has set up a command queue for the work to be completed.
[0148] Figure 12E illustrates additional details of an exemplary embodiment of the shared model. This embodiment includes a hypervisor real address space 1298, in which a list of process elements 1299 is stored. The hypervisor real address space 1298 can be accessed via a hypervisor 1296, which virtualizes the graphics acceleration module engine for the operating system 1295.
[0149] In at least one embodiment, the shared programming model allows all processes or subsets of processes from all partitions or subsets of partitions in the system to use the graphics acceleration module 1246. Two programming models exist where the graphics acceleration module 1246 is shared by multiple processes and partitions: time-sliced sharing and graphics-directed sharing.
[0150] In at least one embodiment, in this model, the hypervisor 1296 owns the graphics acceleration module 1246 and makes its functionality available to all operating systems 1295. For the graphics acceleration module 1246 to support virtualization through the hypervisor 1296, the graphics acceleration module 1246 may comply with the following requirements: 1) the application's job requests must be autonomous (i.e., no state needs to be maintained between jobs), or the graphics acceleration module 1246 must provide a context save and restore mechanism; 2) the graphics acceleration module 1246 guarantees that the application's job requests are completed within a specified amount of time, including any translation faults, or the graphics acceleration module 1246 provides the ability to preempt job processing; and 3) when operating in the directed shared programming model, the graphics acceleration module 1246 must be guaranteed fairness among processes.
[0151] In at least one embodiment, application 1280 is required to make a system call to operating system 1295 with a graphics acceleration module 1246 type, a work descriptor (WD), an authority mask register (AMR) value, and a context save/restore area pointer (CSRP). In at least one embodiment, the graphics acceleration module 1246 type describes the targeted acceleration function for the system call. In at least one embodiment, the graphics acceleration module 1246 type can be a system-specific value. In at least one embodiment, the WD is formatted specifically for graphics acceleration module 1246 and can take the form of a graphics acceleration module 1246 command, an effective address pointer to a user-defined structure, an effective address pointer to a command queue, or any other data structure describing the work to be performed by graphics acceleration module 1246. In one embodiment, the AMR value is the AMR state to use for the current process. In at least one embodiment, the value passed to the operating system is similar to an application setting the AMR. If the implementations of accelerator integrated circuit 1236 and graphics acceleration module 1246 do not support a User Authority Mask Override Register (UAMOR), the operating system can apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. Hypervisor 1296 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into process element 1283. In at least one embodiment, the CSRP is one of registers 1245 and contains the effective address of an area in the application's effective address space 1282 in which the graphics acceleration module 1246 saves and restores context state. This pointer is optional if no state needs to be saved between jobs or when a job is preempted. In at least one embodiment, the context save/restore area may be pinned system memory.
[0152] Upon receiving a system call, the operating system 1295 can verify that the application 1280 has been registered and granted permission to use the graphics acceleration module 1246. Then, the operating system 1295 uses...
[0153] The information shown in Table 3 is used to invoke hypervisor 1296.
[0154]
[0155] Upon receiving a hypervisor call, hypervisor 1296 verifies that operating system 1295 has been registered and granted permission to use graphics acceleration module 1246. Then, hypervisor 1296 adds process element 1283 to the linked list of process elements of the corresponding graphics acceleration module 1246 type. The process element may include the information shown in Table 4.
[0156]
[0157] In at least one embodiment, the hypervisor initializes registers 1245 of multiple accelerator integration slices 1290.
[0158] As shown in Figure 12F, in at least one embodiment, a unified memory is used that is addressable via a common virtual memory address space for accessing physical processor memories 1201-1202 and GPU memories 1220-1223. In this implementation, operations performed on GPUs 1210-1213 utilize the same virtual / effective memory address space to access processor memories 1201-1202 and vice versa, thereby simplifying programmability. In one embodiment, a first portion of the virtual / effective address space is allocated to processor memory 1201, a second portion to second processor memory 1202, a third portion to GPU memory 1220, and so on. In at least one embodiment, the entire virtual / effective memory space (sometimes referred to as the effective address space) is thus distributed across each of processor memories 1201-1202 and GPU memories 1220-1223, thereby allowing any processor or GPU to access any physical memory using a virtual address mapped to that memory.
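The partitioning of one virtual / effective address space across several physical memories can be sketched as follows. This is a minimal illustration only: the back-to-back layout, the region sizes, and the names `Region`, `build_address_map`, and `resolve` are assumptions made for the example and are not taken from the described embodiments.

```python
from bisect import bisect_right
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class Region:
    base: int    # first virtual address covered by the region
    size: int    # region length in bytes
    memory: str  # label of the backing physical memory


def build_address_map(memories):
    """Lay out (label, size) memories back to back in one virtual space."""
    regions, base = [], 0
    for label, size in memories:
        regions.append(Region(base, size, label))
        base += size
    return regions


def resolve(regions, vaddr):
    """Return (memory label, offset) for a virtual address."""
    bases = [r.base for r in regions]
    i = bisect_right(bases, vaddr) - 1
    if i < 0 or vaddr >= regions[i].base + regions[i].size:
        raise ValueError(f"address {vaddr:#x} is unmapped")
    return regions[i].memory, vaddr - regions[i].base


# Hypothetical sizes; the source does not specify any.
address_map = build_address_map([
    ("processor memory 1201", 4 * GIB),
    ("processor memory 1202", 4 * GIB),
    ("GPU memory 1220", 2 * GIB),
    ("GPU memory 1221", 2 * GIB),
])
```

Any processor or GPU holding a virtual address can call `resolve(address_map, vaddr)` to find the physical memory that backs it, which is the property the unified address space is meant to provide.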
[0159] In one embodiment, the bias / coherence management circuitry 1294A-1294E within one or more MMUs 1239A-1239E ensures cache coherence between the caches of one or more host processors (e.g., 1205) and the GPUs 1210-1213, and implements biasing techniques that indicate the physical memory in which certain types of data should be stored. Although several instances of the bias / coherence management circuitry 1294A-1294E are shown in Figure 12F, the bias / coherence circuitry can be implemented within the MMU of one or more host processors 1205 and / or within the accelerator integrated circuit 1236.
[0160] One embodiment allows GPU-attached memories 1220-1223 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology without suffering the performance drawbacks associated with full system cache coherence. In at least one embodiment, the ability to access GPU-attached memories 1220-1223 as system memory without the heavy overhead of cache coherence provides a favorable operating environment for GPU offloading. This arrangement allows software on the host processor 1205 to set up operands and access computation results without the overhead of conventional I / O DMA data copies. Such conventional copies involve driver calls, interrupts, and memory-mapped I / O (MMIO) accesses, all of which are less efficient than simple memory accesses. In at least one embodiment, the ability to access GPU-attached memories 1220-1223 without cache coherence overhead can be critical to the execution time of offloaded computations. For example, in cases with high volumes of streaming write memory traffic, cache coherence overhead can significantly reduce the effective write bandwidth seen by GPUs 1210-1213. In at least one embodiment, the efficiency of operand setup, the efficiency of result access, and the efficiency of GPU computation may all play a role in determining the effectiveness of GPU offloading.
[0161] In at least one embodiment, the selection of GPU bias and host processor bias is driven by a bias tracker data structure. For example, a bias table can be used, which may be a page-granular structure (e.g., controlled at the granularity of a memory page) comprising 1 or 2 bits per GPU-attached memory page. In at least one embodiment, the bias table can be implemented in a stolen memory range of one or more GPU-attached memories 1220-1223, with or without a bias cache in GPUs 1210-1213 (e.g., for caching frequently / recently used entries of the bias table). Alternatively, the entire bias table can be maintained within the GPU.
[0162] In at least one embodiment, the bias table entry associated with each access to GPU-attached memory 1220-1223 is accessed prior to the actual access to GPU memory, causing the following operations. First, local requests from GPUs 1210-1213 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 1220-1223. Local requests from a GPU that find their page in host bias are forwarded to processor 1205 (e.g., via the high-speed link described herein). In one embodiment, requests from processor 1205 that find the requested page in host processor bias complete like a normal memory read. Alternatively, requests directed to a GPU-biased page can be forwarded to GPUs 1210-1213. In at least one embodiment, if the GPU is not currently using the page, the GPU may then migrate the page to host processor bias. In at least one embodiment, the bias state of a page can be changed through a software-based mechanism, a hardware-assisted software mechanism, or, in limited cases, a purely hardware-based mechanism.
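The routing decisions described in this paragraph can be sketched with a page-granular bias table. The sketch is illustrative only: the 4 KiB page size, the one-entry-per-page list, and the names `BiasTable` and `route` are assumptions for the example, not details of the described hardware.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size; the source does not specify one


class Bias(Enum):
    HOST = 0
    GPU = 1


class BiasTable:
    """One bias entry per GPU-attached memory page (page-granular)."""

    def __init__(self, num_pages, default=Bias.HOST):
        self._entries = [default] * num_pages

    def lookup(self, addr):
        return self._entries[addr // PAGE_SIZE]

    def set_bias(self, addr, bias):
        self._entries[addr // PAGE_SIZE] = bias


def route(table, requester, addr):
    """Decide where a request to GPU-attached memory is sent.

    The bias table entry is consulted before the memory access itself.
    """
    bias = table.lookup(addr)
    if requester == "gpu":
        # GPU-biased pages are served locally; host-biased pages go to the host.
        return "local GPU memory" if bias is Bias.GPU else "host processor"
    if requester == "host":
        # Host-biased pages complete like a normal read; others go to the GPU.
        return "normal memory read" if bias is Bias.HOST else "forward to GPU"
    raise ValueError(f"unknown requester: {requester!r}")
```

For example, after `table.set_bias(PAGE_SIZE, Bias.GPU)`, a GPU request to any address in that page is routed to local GPU memory, while a host request to the same page is forwarded to the GPU.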
[0163] One mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU's device driver. The device driver then sends a message (or enqueues a command descriptor) to the GPU, instructing the GPU to change the bias state and, for some migrations, to perform a cache flush operation on the host. In at least one embodiment, the cache flush operation is used for migrations from host processor 1205 bias to GPU bias, but not for the reverse migration.
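The asymmetry in this paragraph, where the host-side cache operation accompanies only host-to-GPU migrations, can be sketched as below. The function name `change_bias`, the string labels, and the `flush_host_cache` callback are hypothetical stand-ins for the driver message and hardware behavior, which the source does not detail.

```python
def change_bias(page_bias, page, new_bias, flush_host_cache):
    """Flip the bias of one page.

    page_bias maps a page number to "host" or "gpu". Returns True when
    the host-side cache operation was issued as part of the migration.
    """
    old_bias = page_bias[page]
    if old_bias == new_bias:
        return False  # nothing to migrate
    flushed = False
    if old_bias == "host" and new_bias == "gpu":
        # Only host-to-GPU migrations trigger the host cache operation.
        flush_host_cache(page)
        flushed = True
    page_bias[page] = new_bias
    return flushed
```

A caller would pass whatever routine performs the host cache operation in its environment; in a test, a function that records the page number is enough.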
[0164] In one embodiment, cache coherence is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 1205. To access these pages, the processor 1205 may request access from the GPU 1210, which may or may not grant access immediately. Therefore, to reduce communication between the processor 1205 and the GPU 1210, it is beneficial to ensure that GPU-biased pages are those required by the GPU but not by the host processor 1205, and vice versa.
[0165] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B.
[0166] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0167] Figure 13 illustrates an exemplary integrated circuit and associated graphics processors that may be manufactured using one or more IP cores, according to various embodiments described herein. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.
[0168] Figure 13 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit 1300 that can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the integrated circuit 1300 includes one or more application processors 1305 (e.g., CPUs), at least one graphics processor 1310, and may additionally include an image processor 1315 and / or a video processor 1320, any of which may be a modular IP core. In at least one embodiment, the integrated circuit 1300 includes peripheral or bus logic including a USB controller 1325, a UART controller 1330, an SPI / SDIO controller 1335, and an I2S / I2C controller 1340. In at least one embodiment, the integrated circuit 1300 may include a display device 1345 coupled to one or more of a High-Definition Multimedia Interface (HDMI) controller 1350 and a Mobile Industry Processor Interface (MIPI) display interface 1355. In at least one embodiment, storage may be provided by a flash memory subsystem 1360, including flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via a memory controller 1365 for accessing SDRAM or SRAM memory devices. In at least one embodiment, some integrated circuits also include an embedded security engine 1370.
[0169] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in integrated circuit 1300 to infer or predict operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0170] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0171] Figures 14A-14B illustrate exemplary integrated circuits and associated graphics processors that can be fabricated using one or more IP cores, according to various embodiments described herein. In addition to what is shown, other logic and circuitry, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores, may be included in at least one embodiment.
[0172] Figures 14A-14B are block diagrams illustrating exemplary graphics processors for use within an SoC, according to embodiments described herein. Figure 14A shows an exemplary graphics processor 1410 that can be fabricated using one or more IP cores, according to at least one embodiment. Figure 14B shows an additional exemplary graphics processor 1440 that can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the graphics processor 1410 of Figure 14A is a low-power graphics processor core. In at least one embodiment, the graphics processor 1440 of Figure 14B is a higher-performance graphics processor core. In at least one embodiment, each of the graphics processors 1410, 1440 may be a variant of the graphics processor 1310 of Figure 13.
[0173] In at least one embodiment, the graphics processor 1410 includes a vertex processor 1405 and one or more fragment processors 1415A-1415N (e.g., 1415A, 1415B, 1415C, 1415D to 1415N-1 and 1415N). In at least one embodiment, the graphics processor 1410 may execute different shader programs via separate logic, such that the vertex processor 1405 is optimized to perform operations for vertex shader programs, while one or more fragment processors 1415A-1415N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, the vertex processor 1405 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. In at least one embodiment, one or more fragment processors 1415A-1415N use the primitive and vertex data generated by the vertex processor 1405 to produce a framebuffer that is displayed on a display device. In at least one embodiment, the fragment processors 1415A-1415N are optimized to execute fragment shader programs as provided in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs provided in the Direct3D API.
[0174] In at least one embodiment, the graphics processor 1410 further includes one or more memory management units (MMUs) 1420A-1420B, one or more caches 1425A-1425B, and one or more circuit interconnects 1430A-1430B. In at least one embodiment, one or more MMUs 1420A-1420B provide virtual-to-physical address mapping for the graphics processor 1410 (including for vertex processor 1405 and / or fragment processors 1415A-1415N), which may reference vertex or image / texture data stored in memory in addition to vertex or image / texture data stored in one or more caches 1425A-1425B. In at least one embodiment, one or more MMUs 1420A-1420B may be synchronized with other MMUs within the system, including one or more MMUs associated with one or more application processors 1305, image processors 1315, and / or video processors 1320 of Figure 13, such that each processor 1305-1320 can participate in a shared or unified virtual memory system. In at least one embodiment, one or more circuit interconnects 1430A-1430B enable the graphics processor 1410 to interface with other IP cores within the SoC via the SoC's internal bus or via a direct connection.
[0175] In at least one embodiment, the graphics processor 1440 includes the one or more MMUs 1420A-1420B, one or more caches 1425A-1425B, and one or more circuit interconnects 1430A-1430B of the graphics processor 1410 of Figure 14A. In at least one embodiment, the graphics processor 1440 includes one or more shader cores 1455A-1455N (e.g., 1455A, 1455B, 1455C, 1455D, 1455E, 1455F to 1455N-1 and 1455N) that provide a unified shader core architecture, in which a single core or type of core can execute all types of programmable shader code, including shader program code for implementing vertex shaders, fragment shaders, and / or compute shaders. In at least one embodiment, the number of shader cores may vary. In at least one embodiment, the graphics processor 1440 includes an inter-core task manager 1445, which acts as a thread dispatcher for dispatching execution threads to one or more shader cores 1455A-1455N, and a tile unit 1458 to accelerate tiling operations for tile-based rendering, in which rendering operations for a scene are subdivided in image space, for example, to take advantage of local spatial coherence within the scene or to optimize the use of internal caches.
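The image-space subdivision used by tile-based rendering can be sketched in software as follows. This is only an illustration of the general idea: the tile size, the use of axis-aligned bounding boxes for primitives, and the names `tiles` and `bin_primitives` are assumptions for the example and do not describe how tile unit 1458 is implemented.

```python
def tiles(width, height, tile):
    """Yield (x0, y0, x1, y1) tiles covering a width x height image.

    Bounds are half-open; edge tiles are clipped to the image.
    """
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            yield (x0, y0, min(x0 + tile, width), min(y0 + tile, height))


def bin_primitives(boxes, width, height, tile):
    """Map each tile to the indices of the bounding boxes overlapping it.

    boxes holds (x0, y0, x1, y1) screen-space bounds, one per primitive.
    """
    bins = {t: [] for t in tiles(width, height, tile)}
    for index, (bx0, by0, bx1, by1) in enumerate(boxes):
        for (x0, y0, x1, y1), members in bins.items():
            if bx0 < x1 and bx1 > x0 and by0 < y1 and by1 > y0:
                members.append(index)
    return bins
```

Each tile can then be rendered using only the primitives in its bin, so the working set for a tile stays small enough to benefit from on-chip caches.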
[0176] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in the integrated circuits of Figure 14A and / or Figure 14B to infer or predict operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0177] Figures 15A-15B illustrate additional exemplary graphics processor logic according to embodiments described herein. In at least one embodiment, Figure 15A illustrates a graphics core 1500 that may be included within the graphics processor 1310 of Figure 13 and, in at least one embodiment, may be a unified shader core 1455A-1455N as shown in Figure 14B. Figure 15B illustrates a highly parallel general-purpose graphics processing unit 1530 suitable for deployment on a multi-chip module, in at least one embodiment.
[0178] In at least one embodiment, the graphics core 1500 includes a shared instruction cache 1502, texture units 1518, and cache / shared memory 1520, which are common to the execution resources within the graphics core 1500. In at least one embodiment, the graphics core 1500 may include multiple slices 1501A-1501N, or partitions for each core, and a graphics processor may include multiple instances of the graphics core 1500. Slices 1501A-1501N may include supporting logic, including local instruction caches 1504A-1504N, thread schedulers 1506A-1506N, thread dispatchers 1508A-1508N, and a set of registers 1510A-1510N. In at least one embodiment, slices 1501A-1501N may include a set of additional functional units (AFU 1512A-1512N), floating-point units (FPU 1514A-1514N), integer arithmetic logic units (ALU 1516A-1516N), address calculation units (ACU 1513A-1513N), double-precision floating-point units (DPFPU 1515A-1515N), and matrix processing units (MPU 1517A-1517N).
[0179] In at least one embodiment, the FPU 1514A-1514N can perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while the DPFPU 1515A-1515N performs double-precision (64-bit) floating-point operations. In at least one embodiment, the ALU 1516A-1516N can perform variable-precision integer operations at 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed-precision operations. In at least one embodiment, the MPU 1517A-1517N can also be configured for mixed-precision matrix operations, including half-precision floating-point operations and 8-bit integer operations. In at least one embodiment, the MPU 1517A-1517N can perform various matrix operations to accelerate machine learning application frameworks, including enabling support for accelerated general matrix-to-matrix multiplication (GEMM). In at least one embodiment, the AFU 1512A-1512N can perform additional logical operations not supported by floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
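A mixed-precision matrix multiplication of the kind mentioned here can be sketched in software. The sketch below rounds the inputs to half precision and accumulates in a wider type; the source does not state what accumulator width the MPU uses, so the double-precision Python float used here is purely an assumption, as are the names `to_half` and `gemm_mixed`.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def gemm_mixed(a, b, c, alpha=1.0, beta=0.0):
    """Return alpha * (a @ b) + beta * c for matrices given as nested lists.

    Elements of a and b are rounded to half precision before multiplying;
    products are summed in a wider accumulator (a Python float here).
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    out = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0  # wide accumulator limits rounding error growth
            for p in range(k):
                acc += to_half(a[i][p]) * to_half(b[p][j])
            row.append(alpha * acc + beta * c[i][j])
        out.append(row)
    return out
```

Keeping the accumulator wider than the inputs is the usual reason for mixing precisions: storage and multiplication use the compact format, while the running sum avoids compounding half-precision rounding error.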
[0180] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, the inference and / or training logic 615 may be used in the graphics core 1500 to infer or predict operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0181] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0182] Figure 15B illustrates a general-purpose graphics processing unit (GPGPU) 1530, which in at least one embodiment can be configured to enable highly parallel compute operations to be performed by an array of graphics processing units. In at least one embodiment, the GPGPU 1530 can be linked directly to other instances of the GPGPU 1530 to create a multi-GPU cluster that improves training speed for deep neural networks. In at least one embodiment, the GPGPU 1530 includes a host interface 1532 for connection to a host processor. In at least one embodiment, the host interface 1532 is a PCI Express interface. In at least one embodiment, the host interface 1532 may be a vendor-specific communication interface or communication fabric. In at least one embodiment, the GPGPU 1530 receives commands from the host processor and uses a global scheduler 1534 to distribute execution threads associated with those commands to a set of compute clusters 1536A-1536H. In at least one embodiment, compute clusters 1536A-1536H share a cache memory 1538. In at least one embodiment, cache memory 1538 can serve as a higher-level cache for cache memories within compute clusters 1536A-1536H.
[0183] In at least one embodiment, the GPGPU 1530 includes memories 1544A-1544B, which are coupled to the compute clusters 1536A-1536H via a set of memory controllers 1542A-1542B. In at least one embodiment, memories 1544A-1544B may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory.
[0184] In at least one embodiment, each of the compute clusters 1536A-1536H includes a set of graphics cores, such as the graphics core 1500 of Figure 15A, which may include various types of integer and floating-point logic units that can perform computational operations at a range of precisions, including precisions suitable for machine learning computations. For example, in at least one embodiment, at least a subset of the floating-point units in each compute cluster 1536A-1536H may be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units may be configured to perform 64-bit floating-point operations.
[0185] In at least one embodiment, multiple instances of the GPGPU 1530 can be configured to operate as a compute cluster. In at least one embodiment, the communication used by the compute clusters 1536A-1536H for synchronization and data exchange varies between embodiments. In at least one embodiment, the multiple instances of the GPGPU 1530 communicate via the host interface 1532. In at least one embodiment, the GPGPU 1530 includes an I / O hub 1539 that couples the GPGPU 1530 to a GPU link 1540, enabling direct connection to other instances of the GPGPU 1530. In at least one embodiment, the GPU link 1540 is coupled to a dedicated GPU-to-GPU bridge, which enables communication and synchronization between the multiple instances of the GPGPU 1530. In at least one embodiment, the GPU link 1540 is coupled to a high-speed interconnect for sending and receiving data to and from other GPGPUs or parallel processors. In at least one embodiment, the multiple instances of the GPGPU 1530 reside in separate data processing systems and communicate via network devices accessible through the host interface 1532. In at least one embodiment, GPU link 1540 may be configured to enable connection to a host processor in addition to, or as an alternative to, host interface 1532.
[0186] In at least one embodiment, the GPGPU 1530 can be configured to train a neural network. In at least one embodiment, the GPGPU 1530 can be used within an inference platform. In at least one embodiment, when the GPGPU 1530 is used for inference, the GPGPU 1530 may include fewer compute clusters 1536A-1536H compared to when the GPGPU 1530 is used to train a neural network. In at least one embodiment, the memory technology associated with the memories 1544A-1544B can differ between inference and training configurations, wherein a higher bandwidth memory technology is dedicated to the training configuration. In at least one embodiment, the inference configuration of the GPGPU 1530 can support inference-specific instructions. For example, in at least one embodiment, the inference configuration can provide support for one or more 8-bit integer dot product instructions, which can be used during the inference operation of the deployed neural network.
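The arithmetic performed by an 8-bit integer dot product instruction can be sketched in software as below. The sketch assumes signed 8-bit operands and a wrapping signed 32-bit accumulator, which is a common arrangement but is not specified by the source; the function name `int8_dot` is hypothetical.

```python
def int8_dot(a, b, acc=0):
    """Dot product of two signed 8-bit vectors added to a 32-bit accumulator.

    The result wraps to the signed 32-bit range, mimicking a fixed-width
    hardware register (an assumption made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    for value in (*a, *b):
        if not -128 <= value <= 127:
            raise ValueError(f"{value} does not fit in a signed 8-bit integer")
    total = acc + sum(x * y for x, y in zip(a, b))
    return (total + 2 ** 31) % 2 ** 32 - 2 ** 31
```

In a deployed network, weights and activations would first be quantized to 8-bit integers, so that each multiply-accumulate step of a layer reduces to calls of this form.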
[0187] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, the inference and / or training logic 615 may be used in the GPGPU 1530 to infer or predict operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0188] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0189] Figure 16 is a block diagram illustrating a computer system 1600 according to at least one embodiment. In at least one embodiment, the computer system 1600 includes a processing subsystem 1601 having one or more processors 1602 and a system memory 1604 communicating via an interconnect path that may include a memory hub 1605. In at least one embodiment, the memory hub 1605 may be a separate component within a chipset component or may be integrated within one or more processors 1602. In at least one embodiment, the memory hub 1605 is coupled to an I / O subsystem 1611 via a communication link 1606. In one embodiment, the I / O subsystem 1611 includes an I / O hub 1607 that enables the computer system 1600 to receive input from one or more input devices 1608. In at least one embodiment, the I / O hub 1607 enables a display controller, which may be included in one or more processors 1602, to provide output to one or more display devices 1610A. In at least one embodiment, one or more display devices 1610A coupled to the I / O hub 1607 may include local, internal, or embedded display devices.
[0190] In at least one embodiment, the processing subsystem 1601 includes one or more parallel processors 1612 coupled to the memory hub 1605 via a bus or other communication link 1613. In at least one embodiment, the communication link 1613 may use any of a number of standards-based communication link technologies or protocols, such as, but not limited to, PCI Express, or may be a vendor-specific communication interface or communication fabric. In at least one embodiment, the one or more parallel processors 1612 form a compute-focused parallel or vector processing system, which may include a large number of processing cores and / or processing clusters, such as a many integrated core (MIC) processor. In at least one embodiment, the one or more parallel processors 1612 form a graphics processing subsystem that can output pixels to one of the one or more display devices 1610A coupled via the I / O hub 1607. In at least one embodiment, the one or more parallel processors 1612 may also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 1610B.
[0191] In at least one embodiment, system storage unit 1614 may be connected to I / O hub 1607 to provide a storage mechanism for computer system 1600. In at least one embodiment, I / O switch 1616 may be used to provide an interface mechanism to enable connectivity between I / O hub 1607 and other components, such as network adapter 1618 and / or wireless network adapter 1619 which may be integrated into the platform, and various other devices that can be added via one or more additional devices 1620. In at least one embodiment, network adapter 1618 may be an Ethernet adapter or another wired network adapter. In at least one embodiment, wireless network adapter 1619 may include one or more of Wi-Fi, Bluetooth, Near Field Communication (NFC), or other network devices including one or more wireless devices.
[0192] In at least one embodiment, the computer system 1600 may include other components not explicitly shown, such as USB or other port connections, optical storage drives, video capture devices, etc., which may also be connected to the I / O hub 1607. In at least one embodiment, the communication paths interconnecting the various components in Figure 16 can be implemented using any suitable protocol, such as a PCI-based protocol (e.g., PCI-Express) or other bus or point-to-point communication interfaces and / or protocols, such as the NV-Link high-speed interconnect or other interconnect protocols.
[0193] In at least one embodiment, one or more parallel processors 1612 include circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). In at least one embodiment, one or more parallel processors 1612 include circuitry optimized for general-purpose processing. In at least one embodiment, components of the computer system 1600 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, the parallel processor 1612, memory hub 1605, processor 1602, and I / O hub 1607 may be integrated into a system-on-a-chip (SoC) integrated circuit. In at least one embodiment, components of the computer system 1600 may be integrated into a single package to form a system-in-package (SIP) configuration. In at least one embodiment, at least a portion of the components of the computer system 1600 may be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computer system.
[0194] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in system 1600 to infer or predict operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0195] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0196] Processors
[0197] Figure 17A illustrates a parallel processor 1700 according to at least one embodiment. In at least one embodiment, various components of the parallel processor 1700 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). In at least one embodiment, the illustrated parallel processor 1700 is a variant of the one or more parallel processors 1612 shown in Figure 16, according to an exemplary embodiment.
[0198] In at least one embodiment, the parallel processor 1700 includes a parallel processing unit 1702. In at least one embodiment, the parallel processing unit 1702 includes an I / O unit 1704 that enables communication with other devices, including other instances of the parallel processing unit 1702. In at least one embodiment, the I / O unit 1704 can be directly connected to other devices. In at least one embodiment, the I / O unit 1704 is connected to other devices using a hub or switch interface (e.g., a memory hub 1605). In at least one embodiment, the connection between the memory hub 1605 and the I / O unit 1704 forms a communication link 1613. In at least one embodiment, the I / O unit 1704 is connected to a host interface 1706 and a memory crossbar switch 1716, wherein the host interface 1706 receives commands for performing processing operations, and the memory crossbar switch 1716 receives commands for performing memory operations.
[0199] In at least one embodiment, when host interface 1706 receives a command buffer via I / O unit 1704, host interface 1706 can direct work operations to execute those commands to front end 1708. In at least one embodiment, front end 1708 is coupled to scheduler 1710, which is configured to assign commands or other work items to processing cluster array 1712. In at least one embodiment, scheduler 1710 ensures that processing cluster array 1712 is correctly configured and in an active state before assigning tasks to processing cluster array 1712. In at least one embodiment, scheduler 1710 is implemented via firmware logic executing on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 1710 can be configured to perform complex scheduling and work assignment operations at both coarse and fine granularity, thereby enabling fast preemption and context switching of threads executing on processing array 1712. In at least one embodiment, host software can provide workloads for scheduling on processing array 1712 via one of multiple graphics processing doorbells. In at least one embodiment, the workloads can then be automatically distributed across the processing array 1712 by scheduler 1710 logic within the microcontroller that includes the scheduler 1710.
[0200] In at least one embodiment, the processing cluster array 1712 may include up to "N" processing clusters (e.g., clusters 1714A, 1714B to 1714N). In at least one embodiment, each cluster 1714A-1714N of the processing cluster array 1712 may execute a large number of concurrent threads. In at least one embodiment, the scheduler 1710 may use various scheduling and / or work allocation algorithms to allocate work to the clusters 1714A-1714N of the processing cluster array 1712, which may vary depending on the workload generated by each type of program or computation. In at least one embodiment, scheduling may be handled dynamically by the scheduler 1710, or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by the processing cluster array 1712. In at least one embodiment, different clusters 1714A-1714N of the processing cluster array 1712 may be assigned to process different types of programs or to perform different types of computations.
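As a non-limiting illustration of one work allocation algorithm of the kind mentioned above, the following sketch assigns each task to the cluster with the smallest accumulated load. The greedy policy, the function name `assign_least_loaded`, and the per-task cost values are assumptions made for this example and are not taken from the embodiments described herein.

```python
import heapq


def assign_least_loaded(task_costs, num_clusters):
    """Greedy allocation: each task goes to the least-loaded cluster so far.

    task_costs and num_clusters are hypothetical inputs for illustration.
    Returns a dict mapping cluster index -> list of task indices.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    # Heap of (accumulated load, cluster index); ties resolve to the lower index.
    heap = [(0, cluster) for cluster in range(num_clusters)]
    heapq.heapify(heap)
    assignment = {cluster: [] for cluster in range(num_clusters)}
    for task_id, cost in enumerate(task_costs):
        load, cluster = heapq.heappop(heap)
        assignment[cluster].append(task_id)
        heapq.heappush(heap, (load + cost, cluster))
    return assignment


assignment = assign_least_loaded([5, 3, 2, 4], num_clusters=2)
```

A different policy (for example, static assignment assisted by a compiler) could be substituted without changing the surrounding structure.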
[0201] In at least one embodiment, the processing cluster array 1712 can be configured to perform various types of parallel processing operations. In at least one embodiment, the processing cluster array 1712 is configured to perform general-purpose parallel computing operations. For example, in at least one embodiment, the processing cluster array 1712 may include logic for performing processing tasks, including filtering video and / or audio data, performing modeling operations, including physics operations, and performing data transformations.
[0202] In at least one embodiment, the processing cluster array 1712 is configured to perform parallel graphics processing operations. In at least one embodiment, the processing cluster array 1712 may include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, the processing cluster array 1712 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 1702 may transfer data from system memory via I / O unit 1704 for processing. In at least one embodiment, during processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 1722) and then written back to system memory.
[0203] In at least one embodiment, when the parallel processing unit 1702 is used to perform graphics processing, the scheduler 1710 may be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations among the multiple clusters 1714A-1714N of the processing cluster array 1712. In at least one embodiment, portions of the processing cluster array 1712 may be configured to perform different types of processing. For example, in at least one embodiment, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to generate a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 1714A-1714N may be stored in a buffer to allow intermediate data to be transferred between the clusters 1714A-1714N for further processing.
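The division of a workload into tasks of approximately equal size can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm used by scheduler 1710: the workload is modeled as a count of work items, and the helper name `split_evenly` is hypothetical.

```python
def split_evenly(num_items: int, num_clusters: int) -> list[range]:
    """Split num_items work items into num_clusters contiguous tasks
    whose sizes differ by at most one item (hypothetical helper)."""
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    base, extra = divmod(num_items, num_clusters)
    tasks = []
    start = 0
    for cluster in range(num_clusters):
        # The first `extra` clusters each take one additional item.
        size = base + (1 if cluster < extra else 0)
        tasks.append(range(start, start + size))
        start += size
    return tasks


sizes = [len(task) for task in split_evenly(10, 4)]
```

With ten work items and four clusters, no cluster receives more than one item beyond any other.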
[0204] In at least one embodiment, the processing cluster array 1712 may receive processing tasks to be executed via a scheduler 1710, which receives commands defining the processing tasks from a front end 1708. In at least one embodiment, the processing task may include an index of data to be processed, such as surface (patch) data, primitive data, vertex data, and / or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). In at least one embodiment, the scheduler 1710 may be configured to fetch an index corresponding to a task, or may receive an index from the front end 1708. In at least one embodiment, the front end 1708 may be configured to ensure that the processing cluster array 1712 is configured to be active before initiating the workload specified by an incoming command buffer (e.g., a batch buffer, push buffer, etc.).
[0205] In at least one embodiment, each of one or more instances of the parallel processing unit 1702 may be coupled to the parallel processor memory 1722. In at least one embodiment, the parallel processor memory 1722 may be accessed via a memory crossbar switch 1716, which may receive memory requests from the processing cluster array 1712 and the I / O unit 1704. In at least one embodiment, the memory crossbar switch 1716 may access the parallel processor memory 1722 via a memory interface 1718. In at least one embodiment, the memory interface 1718 may include a plurality of partition units (e.g., partition units 1720A, 1720B to 1720N), each of which may be coupled to a portion (e.g., a memory unit) of the parallel processor memory 1722. In at least one embodiment, the number of partition units 1720A-1720N is configured to be equal to the number of memory units, such that the first partition unit 1720A has a corresponding first memory unit 1724A, the second partition unit 1720B has a corresponding second memory unit 1724B, and the Nth partition unit 1720N has a corresponding Nth memory unit 1724N. In at least one embodiment, the number of partition units 1720A-1720N may not be equal to the number of memory units.
[0206] In at least one embodiment, memory units 1724A-1724N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In at least one embodiment, memory units 1724A-1724N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). In at least one embodiment, render targets such as frame buffers or texture maps may be stored across memory units 1724A-1724N, allowing partition units 1720A-1720N to write portions of each render target in parallel to efficiently use the available bandwidth of the parallel processor memory 1722. In at least one embodiment, a local instance of the parallel processor memory 1722 may be excluded in favor of a unified memory design that uses system memory in conjunction with local cache memory.
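One way to picture a render target stored across several memory units is a round-robin interleaving of tiles over partition units. The sketch below is an assumption for illustration only: the mapping, the tile granularity, and the function name `partition_for_tile` are hypothetical and are not specified by the embodiments described herein.

```python
def partition_for_tile(tile_x: int, tile_y: int, tiles_per_row: int,
                       num_partitions: int) -> int:
    """Map a render-target tile to a partition unit index by round-robin
    interleaving of the linear tile index (assumed mapping)."""
    linear_index = tile_y * tiles_per_row + tile_x
    return linear_index % num_partitions


# A render target of 4 x 2 tiles striped across 4 partition units.
layout = [[partition_for_tile(x, y, 4, 4) for x in range(4)] for y in range(2)]
```

Because adjacent tiles land on different partition units under this mapping, writes to neighboring tiles can proceed in parallel.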
[0207] In at least one embodiment, any of the clusters 1714A-1714N of the processing cluster array 1712 can process data to be written to any memory unit 1724A-1724N within the parallel processor memory 1722. In at least one embodiment, the memory crossbar switch 1716 can be configured to transfer the output of each cluster 1714A-1714N to any partition unit 1720A-1720N or to another cluster 1714A-1714N, which can perform further processing operations on the output. In at least one embodiment, each cluster 1714A-1714N can communicate with the memory interface 1718 via the memory crossbar switch 1716 to read from or write to various external memory devices. In at least one embodiment, the memory crossbar switch 1716 has a connection to the memory interface 1718 for communication with I / O unit 1704, and a connection to a local instance of parallel processor memory 1722, thereby enabling processing units within different processing clusters 1714A-1714N to communicate with system memory or other memory not local to parallel processing unit 1702. In at least one embodiment, the memory crossbar switch 1716 may use virtual channels to separate traffic flows between clusters 1714A-1714N and partition units 1720A-1720N.
[0208] In at least one embodiment, multiple instances of the parallel processing unit 1702 may be provided on a single add-in card, or multiple add-in cards may be interconnected. In at least one embodiment, different instances of the parallel processing unit 1702 may be configured to interoperate, even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and / or other configuration differences. For example, in at least one embodiment, some instances of the parallel processing unit 1702 may include higher-precision floating-point units relative to other instances. In at least one embodiment, a system incorporating one or more instances of the parallel processing unit 1702 or the parallel processor 1700 may be implemented in various configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and / or embedded systems.
[0209] Figure 17B is a block diagram of a partition unit 1720 according to at least one embodiment. In at least one embodiment, the partition unit 1720 is an instance of one of the partition units 1720A-1720N of Figure 17A. In at least one embodiment, the partition unit 1720 includes an L2 cache 1721, a frame buffer interface 1725, and a raster operations unit (“ROP”) 1726. The L2 cache 1721 is a read / write cache configured to perform load and store operations received from the memory crossbar switch 1716 and the ROP 1726. In at least one embodiment, the L2 cache 1721 outputs read misses and urgent write-back requests to the frame buffer interface 1725 for processing. In at least one embodiment, updates can also be sent to the frame buffer for processing via the frame buffer interface 1725. In at least one embodiment, the frame buffer interface 1725 interfaces with one of the memory units in the parallel processor memory, such as memory units 1724A-1724N of Figure 17A (e.g., within the parallel processor memory 1722).
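The read-miss and write-back behavior of a read / write cache in front of a frame buffer can be sketched as follows. This is a generic software model, not a description of L2 cache 1721: the LRU replacement policy, the class name `WriteBackCache`, and the use of a dictionary as the backing store are all assumptions made for this example.

```python
from collections import OrderedDict


class WriteBackCache:
    """Tiny LRU read/write cache in front of a backing store (a dict here)."""

    def __init__(self, backing: dict, capacity: int):
        self.backing = backing
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> (value, dirty flag)
        self.read_misses = 0
        self.writebacks = 0

    def _insert(self, addr, value, dirty):
        if addr not in self.lines and len(self.lines) >= self.capacity:
            # Evict the least recently used line.
            victim, (old_value, was_dirty) = self.lines.popitem(last=False)
            if was_dirty:
                # A dirty victim is written back to the backing store.
                self.backing[victim] = old_value
                self.writebacks += 1
        self.lines[addr] = (value, dirty)
        self.lines.move_to_end(addr)

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return self.lines[addr][0]
        # Read miss: fetch the value from the backing store.
        self.read_misses += 1
        value = self.backing.get(addr, 0)
        self._insert(addr, value, dirty=False)
        return value

    def store(self, addr, value):
        self._insert(addr, value, dirty=True)


framebuffer = {0: 10, 1: 20}
cache = WriteBackCache(framebuffer, capacity=2)
cache.load(0)        # read miss
cache.store(1, 99)   # creates a dirty line
cache.load(2)        # read miss, evicts clean address 0
cache.load(3)        # read miss, evicts dirty address 1 and writes it back
```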
[0210] In at least one embodiment, ROP 1726 is a processing unit that performs raster operations such as stenciling, z-testing, blending, etc. In at least one embodiment, ROP 1726 then outputs processed graphics data that is stored in graphics memory. In at least one embodiment, ROP 1726 includes compression logic to compress depth or color data written to memory and to decompress depth or color data read from memory. In at least one embodiment, the compression logic may be lossless compression logic utilizing one or more of a variety of compression algorithms. The type of compression performed by ROP 1726 may vary based on the statistical characteristics of the data to be compressed. For example, in at least one embodiment, delta color compression is performed on depth and color data on a per-tile basis.
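The idea behind a lossless, per-tile delta scheme can be illustrated with a generic delta encoder. This sketch is not the compression logic of ROP 1726; it only shows why neighboring values that differ slightly compress well. The function names and the flat-list tile representation are assumptions made for this example.

```python
def delta_encode(tile: list[int]) -> list[int]:
    """Store the first value as an anchor, then successive differences."""
    if not tile:
        return []
    return [tile[0]] + [b - a for a, b in zip(tile, tile[1:])]


def delta_decode(encoded: list[int]) -> list[int]:
    """Invert delta_encode by accumulating the differences."""
    decoded = []
    total = 0
    for index, value in enumerate(encoded):
        total = value if index == 0 else total + value
        decoded.append(total)
    return decoded


tile = [200, 201, 201, 203]
encoded = delta_encode(tile)
restored = delta_decode(encoded)
```

The encoded differences are small integers that need fewer bits than the original values, and decoding restores the tile exactly.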
[0211] In at least one embodiment, ROP 1726 is included within each processing cluster (e.g., clusters 1714A-1714N of Figure 17A) instead of within partition unit 1720. In at least one embodiment, read and write requests for pixel data are transmitted over memory crossbar switch 1716 instead of pixel fragment data. In at least one embodiment, the processed graphics data can be displayed on a display device, such as one of the one or more display devices 1610 of Figure 16, routed for further processing by processor 1602, or routed for further processing by one of the processing entities within the parallel processor 1700 of Figure 17A.
[0212] Figure 17C is a block diagram of a processing cluster 1714 within a parallel processing unit according to at least one embodiment. In at least one embodiment, the processing cluster is an instance of one of the processing clusters 1714A-1714N of Figure 17A. In at least one embodiment, one or more of the processing clusters 1714 can be configured to execute many threads in parallel, where a "thread" refers to an instance of a specific program executing on a particular set of input data. In at least one embodiment, Single Instruction Multiple Data (SIMD) instruction issue techniques are used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In at least one embodiment, Single Instruction Multiple Thread (SIMT) techniques are used to support the parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster.
[0213] In at least one embodiment, the operation of the processing cluster 1714 can be controlled by a pipeline manager 1732 that distributes processing tasks to the SIMT parallel processors. In at least one embodiment, the pipeline manager 1732 receives instructions from the scheduler 1710 of Figure 17A and manages the execution of these instructions via the graphics multiprocessor 1734 and / or texture unit 1736. In at least one embodiment, the graphics multiprocessor 1734 is an exemplary instance of a SIMT parallel processor. However, in at least one embodiment, the processing cluster 1714 may include various types of SIMT parallel processors with different architectures. In at least one embodiment, the processing cluster 1714 may include one or more instances of the graphics multiprocessor 1734. In at least one embodiment, the graphics multiprocessor 1734 can process data, and the data crossbar switch 1740 can be used to distribute the processed data to one of a number of possible destinations (including other shader units). In at least one embodiment, the pipeline manager 1732 can facilitate the distribution of processed data by specifying the destination of the processed data to be distributed via the data crossbar switch 1740.
[0214] In at least one embodiment, each graphics multiprocessor 1734 within the processing cluster 1714 may include the same set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). In at least one embodiment, the functional execution logic may be configured in a pipelined manner, wherein new instructions may be issued before previous instructions complete. In at least one embodiment, the functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, shift operations, and computation of various algebraic functions. In at least one embodiment, the same functional unit hardware may be used to perform different operations, and any combination of functional units may exist.
[0215] In at least one embodiment, instructions sent to the processing cluster 1714 constitute threads. In at least one embodiment, a group of threads executed across a set of parallel processing engines is a thread group. In at least one embodiment, the thread group executes the same program on different input data. In at least one embodiment, each thread within the thread group may be assigned to a different processing engine within the graphics multiprocessor 1734. In at least one embodiment, the thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 1734. In at least one embodiment, when the thread group includes fewer threads than the number of processing engines, one or more processing engines may be idle during the cycles in which that thread group is being processed. In at least one embodiment, the thread group may also include more threads than the number of processing engines within the graphics multiprocessor 1734. In at least one embodiment, when the thread group includes more threads than the number of processing engines within the graphics multiprocessor 1734, processing can be performed over consecutive clock cycles. In at least one embodiment, multiple thread groups can be executed simultaneously on the graphics multiprocessor 1734.
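The relationship between thread group size and the number of processing engines can be made concrete with a short calculation. The sketch assumes one thread per engine per cycle, which is a simplification for illustration; the function name `schedule_thread_group` is hypothetical.

```python
import math


def schedule_thread_group(num_threads: int, num_engines: int) -> tuple[int, int]:
    """Return (cycles, idle_slots) for issuing one instruction to a thread
    group, assuming each engine handles one thread per cycle."""
    if num_threads <= 0 or num_engines <= 0:
        raise ValueError("num_threads and num_engines must be positive")
    cycles = math.ceil(num_threads / num_engines)
    idle_slots = cycles * num_engines - num_threads
    return cycles, idle_slots


small_group = schedule_thread_group(24, 32)   # fewer threads than engines
large_group = schedule_thread_group(64, 32)   # more threads than engines
```

Under this assumption, a 24-thread group on 32 engines takes one cycle with eight engines idle, while a 64-thread group takes two consecutive cycles with no idle engines.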
[0216] In at least one embodiment, the graphics multiprocessor 1734 includes an internal cache memory for performing load and store operations. In at least one embodiment, the graphics multiprocessor 1734 may forgo the internal cache and use a cache memory within the processing cluster 1714 (e.g., L1 cache 1748). In at least one embodiment, each graphics multiprocessor 1734 may also access L2 caches within partition units (e.g., partition units 1720A-1720N of Figure 17A), which are shared among all processing clusters 1714 and can be used to transfer data between threads. In at least one embodiment, the graphics multiprocessor 1734 can also access off-chip global memory, which may include one or more of local parallel processor memory and / or system memory. In at least one embodiment, any memory external to the parallel processing unit 1702 can be used as global memory. In at least one embodiment, the processing cluster 1714 includes multiple instances of the graphics multiprocessor 1734, which can share common instructions and data that can be stored in the L1 cache 1748.
[0217] In at least one embodiment, each processing cluster 1714 may include a memory management unit (“MMU”) 1745 configured to map virtual addresses to physical addresses. In at least one embodiment, one or more instances of the MMU 1745 may reside within the memory interface 1718 of Figure 17A. In at least one embodiment, the MMU 1745 includes a set of page table entries (PTEs) for mapping virtual addresses to physical addresses of tiles and optionally to cache line indices. In at least one embodiment, the MMU 1745 may include an address translation lookaside buffer (TLB) or a cache that may reside within the graphics multiprocessor 1734, the L1 cache, or the processing cluster 1714. In at least one embodiment, physical addresses are processed to distribute surface data access locality so that requests can be interleaved efficiently among partition units. In at least one embodiment, cache line indices may be used to determine whether a request for a cache line is a hit or a miss.
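Virtual-to-physical translation through page table entries with a translation lookaside buffer can be sketched in a few lines. This is a generic model rather than a description of MMU 1745: the 4096-byte page size, the single-level page table, and the class name `SimpleMMU` are assumptions made for this example.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class SimpleMMU:
    """Single-level page table with an unbounded TLB (hypothetical model)."""

    def __init__(self, page_table: dict[int, int]):
        self.page_table = page_table  # virtual page number -> physical frame
        self.tlb: dict[int, int] = {}
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, virtual_address: int) -> int:
        page_number, offset = divmod(virtual_address, PAGE_SIZE)
        if page_number in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[page_number]
        else:
            self.tlb_misses += 1
            # A missing entry raises KeyError, standing in for a page fault.
            frame = self.page_table[page_number]
            self.tlb[page_number] = frame
        return frame * PAGE_SIZE + offset


mmu = SimpleMMU({0: 5, 1: 9})
first = mmu.translate(4100)    # page 1, offset 4: TLB miss
second = mmu.translate(4200)   # page 1, offset 104: TLB hit
```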
[0218] In at least one embodiment, the processing cluster 1714 can be configured such that each graphics multiprocessor 1734 is coupled to a texture unit 1736 to perform texture mapping operations that determine texture sample locations, read texture data, and filter texture data. In at least one embodiment, texture data is read from an internal texture L1 cache (not shown) or from an L1 cache within the graphics multiprocessor 1734, and is fetched from an L2 cache, local parallel processor memory, or system memory as needed. In at least one embodiment, each graphics multiprocessor 1734 outputs a processed task to a data crossbar switch 1740 to provide the processed task to another processing cluster 1714 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via a memory crossbar switch 1716. In at least one embodiment, a preROP 1742 (pre-raster operations unit) is configured to receive data from the graphics multiprocessor 1734 and direct the data to ROP units, which may be located with partition units as described herein (e.g., partition units 1720A-1720N of Figure 17A). In at least one embodiment, the preROP 1742 unit can perform optimizations for color blending, organize pixel color data, and perform address translation.
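The three texture mapping steps named above (determining sample locations, reading texture data, and filtering) can be illustrated with bilinear filtering. Bilinear filtering is chosen here only as a common example; the text does not state which filter texture unit 1736 applies. The function name and the list-of-rows texture layout are assumptions.

```python
def bilinear_sample(texture: list[list[float]], u: float, v: float) -> float:
    """Sample a single-channel texture at normalized coordinates (u, v)
    in [0, 1] using bilinear filtering."""
    height = len(texture)
    width = len(texture[0])
    # Step 1: determine the sample location in texel space.
    x = u * (width - 1)
    y = v * (height - 1)
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Step 2: read the four neighboring texels.
    t00, t01 = texture[y0][x0], texture[y0][x1]
    t10, t11 = texture[y1][x0], texture[y1][x1]
    # Step 3: filter by blending horizontally, then vertically.
    top = t00 * (1 - fx) + t01 * fx
    bottom = t10 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 1.0], [2.0, 3.0]]
center = bilinear_sample(texture, 0.5, 0.5)
```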
[0219] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in graphics processing cluster 1714 for inferring or predicting operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0220] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
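As a deliberately simplified illustration of approximating a visual effect observed at low resolution and applying it at higher resolution, the sketch below fits a linear gain and offset to pixel intensities before and after an effect, then applies the fitted function to higher-resolution pixels. The linear model stands in for the parameterized function; no neural network is modeled, and all names and sample values are assumptions made for this example.

```python
def fit_gain_offset(before: list[float], after: list[float]) -> tuple[float, float]:
    """Least-squares fit of after ~= gain * before + offset."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length lists with at least 2 samples")
    n = len(before)
    mean_before = sum(before) / n
    mean_after = sum(after) / n
    variance = sum((x - mean_before) ** 2 for x in before)
    if variance == 0:
        raise ValueError("input intensities must not all be equal")
    covariance = sum((x - mean_before) * (y - mean_after)
                     for x, y in zip(before, after))
    gain = covariance / variance
    offset = mean_after - gain * mean_before
    return gain, offset


def apply_effect(pixels: list[float], gain: float, offset: float) -> list[float]:
    """Apply the fitted approximation to another set of pixel intensities."""
    return [gain * p + offset for p in pixels]


# Hypothetical low-resolution intensities before and after an effect.
low_res_before = [0.0, 1.0, 2.0]
low_res_after = [1.0, 3.0, 5.0]
gain, offset = fit_gain_offset(low_res_before, low_res_after)

# Apply the approximated effect to hypothetical higher-resolution pixels.
high_res_with_effect = apply_effect([0.5, 1.5], gain, offset)
```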
[0221] Figure 17D illustrates a graphics multiprocessor 1734 according to at least one embodiment. In at least one embodiment, the graphics multiprocessor 1734 is coupled to a pipeline manager 1732 of a processing cluster 1714. In at least one embodiment, the graphics multiprocessor 1734 has an execution pipeline including, but not limited to, an instruction cache 1752, an instruction unit 1754, an address mapping unit 1756, a register file 1758, one or more general-purpose graphics processing unit (GPGPU) cores 1762, and one or more load / store units 1766. The GPGPU cores 1762 and the load / store units 1766 are coupled to a cache memory 1772 and a shared memory 1770 via a memory and cache interconnect 1768.
[0222] In at least one embodiment, instruction cache 1752 receives a stream of instructions to be executed from pipeline manager 1732. In at least one embodiment, instructions are cached in instruction cache 1752 and dispatched by instruction unit 1754 for execution. In at least one embodiment, instruction unit 1754 may dispatch instructions as thread groups (e.g., warps), assigning each thread of a thread group to a different execution unit within GPGPU cores 1762. In at least one embodiment, instructions can access any local, shared, or global address space by specifying an address within a unified address space. In at least one embodiment, address mapping unit 1756 may be used to translate addresses in the unified address space into different memory addresses that can be accessed by load / store unit 1766.
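Translating an address in a unified address space into a specific memory space can be sketched as a range lookup. The region boundaries below are illustrative values chosen for this example and do not reflect any layout used by address mapping unit 1756; the function name is hypothetical.

```python
# Assumed layout of a unified address space (illustrative values only).
REGIONS = [
    ("local", 0x0000, 0x1000),
    ("shared", 0x1000, 0x5000),
    ("global", 0x5000, 0x100000),
]


def map_unified_address(address: int) -> tuple[str, int]:
    """Translate a unified address into (memory space, offset in that space)."""
    for name, start, end in REGIONS:
        if start <= address < end:
            return name, address - start
    raise ValueError(f"address {address:#x} is outside the unified address space")


shared_access = map_unified_address(0x1010)
global_access = map_unified_address(0x5000)
```

An instruction therefore names a single address, and the mapping step decides which memory the load / store operation actually targets.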
[0223] In at least one embodiment, register file 1758 provides a set of registers for functional units of graphics multiprocessor 1734. In at least one embodiment, register file 1758 provides temporary storage for operands connected to data paths of functional units of graphics multiprocessor 1734 (e.g., GPGPU cores 1762, load / store units 1766). In at least one embodiment, register file 1758 is divided among the functional units, such that a dedicated portion of register file 1758 is allocated to each functional unit. In at least one embodiment, register file 1758 is divided among different warps being executed by graphics multiprocessor 1734.
[0224] In at least one embodiment, each of the GPGPU cores 1762 may include a floating-point unit (FPU) and / or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 1734. The GPGPU cores 1762 may be architecturally similar or may differ in architecture. In at least one embodiment, a first portion of the GPGPU cores 1762 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores includes a double-precision FPU. In at least one embodiment, the FPU may implement the IEEE 754-2008 standard for floating-point arithmetic or enable variable-precision floating-point arithmetic. In at least one embodiment, the graphics multiprocessor 1734 may additionally include one or more fixed-function or special-function units to perform specific functions, such as copy rectangle or pixel blending operations. In at least one embodiment, one or more of the GPGPU cores may also include fixed-function or special-function logic.
[0225] In at least one embodiment, the GPGPU cores 1762 include SIMD logic capable of executing a single instruction on multiple sets of data. In at least one embodiment, the GPGPU cores 1762 can physically execute SIMD4, SIMD8, and SIMD16 instructions and logically execute SIMD1, SIMD2, and SIMD32 instructions. In at least one embodiment, the SIMD instructions for the GPGPU cores can be generated by a shader compiler at compile time or automatically generated when executing a program written and compiled for a Single Program Multiple Data (SPMD) or SIMT architecture. In at least one embodiment, multiple threads of a program configured for a SIMT execution model can be executed using a single SIMD instruction. For example, in at least one embodiment, eight SIMT threads performing the same or similar operations can be executed in parallel using a single SIMD8 logic unit.
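One way to think about logically executing a wide SIMD instruction on narrower physical hardware is to issue it as several passes over a fixed number of lanes. The sketch below assumes this multi-pass approach, which the text does not confirm; the function name `execute_simd` and the default width of eight lanes are assumptions.

```python
from typing import Callable, Sequence


def execute_simd(op: Callable[[int, int], int], a: Sequence[int],
                 b: Sequence[int], physical_width: int = 8) -> tuple[list[int], int]:
    """Apply op lane by lane, handling physical_width lanes per pass.

    Returns (results, number_of_passes).
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same number of lanes")
    results: list[int] = []
    passes = 0
    for start in range(0, len(a), physical_width):
        lanes_a = a[start:start + physical_width]
        lanes_b = b[start:start + physical_width]
        # All lanes in a pass perform the same operation on different data.
        results.extend(op(x, y) for x, y in zip(lanes_a, lanes_b))
        passes += 1
    return results, passes


add = lambda x, y: x + y
simd8_result, simd8_passes = execute_simd(add, list(range(8)), [1] * 8)
simd32_result, simd32_passes = execute_simd(add, list(range(32)), [1] * 32)
```

Under this assumption, eight threads fit in a single pass of an eight-lane unit, while a 32-lane logical instruction takes four passes.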
[0226] In at least one embodiment, the memory and cache interconnect 1768 is an interconnect network connecting each functional unit of the graphics multiprocessor 1734 to the register file 1758 and the shared memory 1770. In at least one embodiment, the memory and cache interconnect 1768 is a crossbar interconnect that allows the load / store unit 1766 to perform load and store operations between the shared memory 1770 and the register file 1758. In at least one embodiment, the register file 1758 can operate at the same frequency as the GPGPU cores 1762, resulting in very low latency for data transfer between the GPGPU cores 1762 and the register file 1758. In at least one embodiment, the shared memory 1770 can be used to enable communication between threads executing on functional units within the graphics multiprocessor 1734. In at least one embodiment, the cache memory 1772 can be used, for example, as a data cache to cache texture data communicated between functional units and texture units 1736. In at least one embodiment, the shared memory 1770 can also be used as a program-managed cache. In at least one embodiment, in addition to the data automatically cached in cache memory 1772, threads executing on GPGPU cores 1762 can also programmatically store data in shared memory.
[0227] In at least one embodiment, a parallel processor or GPGPU, as described herein, is communicatively coupled to a host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. In at least one embodiment, the GPU may be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., high-speed interconnects such as PCIe or NVLink). In at least one embodiment, the GPU may be integrated with the core on the same package or chip and communicatively coupled to the core via an internal processor bus / interconnect (i.e., within the package or chip). In at least one embodiment, regardless of how the GPU is connected, the processor core may assign work to the GPU in the form of a sequence of commands / instructions contained in a job descriptor. In at least one embodiment, the GPU then uses dedicated circuitry / logic to efficiently process these commands / instructions.
[0228] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in graphics multiprocessor 1734 for inferring or predicting operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0229] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0230] Figure 18 illustrates a multi-GPU computing system 1800 according to at least one embodiment. In at least one embodiment, the multi-GPU computing system 1800 may include a processor 1802 coupled to a plurality of general-purpose graphics processing units (GPGPUs) 1806A-D via a host interface switch 1804. In at least one embodiment, the host interface switch 1804 is a PCI Express switch device that couples the processor 1802 to a PCI Express bus, through which the processor 1802 can communicate with the GPGPUs 1806A-D. The GPGPUs 1806A-D may be interconnected via a set of high-speed point-to-point (P2P) GPU-to-GPU links 1816. In at least one embodiment, the GPU-to-GPU links 1816 are connected to each of the GPGPUs 1806A-D via dedicated GPU links. In at least one embodiment, the P2P GPU links 1816 enable direct communication between each of the GPGPUs 1806A-D, without requiring communication over the host interface bus 1804 to which the processor 1802 is connected. In at least one embodiment, with GPU-to-GPU traffic directed to the P2P GPU links 1816, the host interface bus 1804 remains available for system memory access or for communication with other instances of the multi-GPU computing system 1800, for example via one or more network devices. While in at least one embodiment the GPGPUs 1806A-D are connected to the processor 1802 via the host interface switch 1804, in at least one embodiment the processor 1802 includes direct support for the P2P GPU links 1816 and can be directly connected to the GPGPUs 1806A-D.
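The separation of GPU-to-GPU traffic from host traffic can be sketched as a simple path choice. The routing policy, the endpoint names, and the function name `route` are assumptions made for this example and do not describe how host interface switch 1804 or links 1816 are actually controlled.

```python
from itertools import combinations


def route(source: str, destination: str, p2p_links: set[frozenset]) -> str:
    """Choose a path for a transfer: use a direct P2P link when one exists,
    otherwise fall back to the host interface (hypothetical policy)."""
    if frozenset((source, destination)) in p2p_links:
        return "p2p_link"
    return "host_interface"


gpus = ["gpu0", "gpu1", "gpu2", "gpu3"]
# Assume every pair of GPUs is joined by a direct point-to-point link.
links = {frozenset(pair) for pair in combinations(gpus, 2)}

gpu_to_gpu = route("gpu0", "gpu3", links)
host_to_gpu = route("cpu", "gpu1", links)
```

Because transfers between GPUs never select the host path in this model, the host interface stays free for system memory access.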
[0231] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, inference and / or training logic 615 may be used in a multi-GPU computing system 1800 for inferring or predicting operations based at least in part on weight parameters computed using neural network training operations, neural network functions and / or architectures, or neural network use cases described herein.
[0232] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0233] Figure 19 is a block diagram of a graphics processor 1900 according to at least one embodiment. In at least one embodiment, the graphics processor 1900 includes a ring interconnect 1902, a pipeline front end 1904, a media engine 1937, and graphics cores 1980A-1980N. In at least one embodiment, the ring interconnect 1902 couples the graphics processor 1900 to other processing units, including other graphics processors or one or more general-purpose processor cores. In at least one embodiment, the graphics processor 1900 is one of many processors integrated within a multi-core processing system.
[0234] In at least one embodiment, the graphics processor 1900 receives multiple batches of commands via a ring interconnect 1902. In at least one embodiment, the input commands are interpreted by a command streamer 1903 in a pipeline front-end 1904. In at least one embodiment, the graphics processor 1900 includes scalable execution logic for performing 3D geometry processing and media processing via graphics cores 1980A-1980N. In at least one embodiment, for 3D geometry processing commands, the command streamer 1903 provides the commands to the geometry pipeline 1936. In at least one embodiment, for at least some media processing commands, the command streamer 1903 provides the commands to a video front-end 1934, which is coupled to a media engine 1937. In at least one embodiment, the media engine 1937 includes a video quality engine (VQE) 1930 for video and image post-processing, and a multi-format encoding / decoding (MFX) engine 1933 for providing hardware-accelerated media data encoding and decoding. In at least one embodiment, the geometry pipeline 1936 and the media engine 1937 each generate an execution thread for thread execution resources provided by at least one graphics core 1980A.
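The routing performed by a command streamer can be sketched as a dispatch table from command kind to destination queue. The command representation as `(kind, payload)` pairs, the kind labels, and the function name `stream_commands` are assumptions made for this example.

```python
from collections import defaultdict


def stream_commands(commands: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Route each (kind, payload) command to the queue of its destination."""
    routes = {
        "3d": "geometry_pipeline",   # 3D geometry processing commands
        "media": "video_front_end",  # media processing commands
    }
    queues: dict[str, list[str]] = defaultdict(list)
    for kind, payload in commands:
        if kind not in routes:
            raise ValueError(f"unknown command kind: {kind}")
        queues[routes[kind]].append(payload)
    return dict(queues)


batch = [("3d", "draw_mesh"), ("media", "decode_frame"), ("3d", "draw_ui")]
queues = stream_commands(batch)
```

Commands keep their relative order within each destination queue.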
[0235] In at least one embodiment, the graphics processor 1900 includes scalable thread execution resources featuring modular cores 1980A-1980N (sometimes referred to as core slices), each graphics core having multiple sub-cores 1950A-1950N, 1960A-1960N (sometimes referred to as core sub-slices). In at least one embodiment, the graphics processor 1900 may have any number of graphics cores 1980A to 1980N. In at least one embodiment, the graphics processor 1900 includes a graphics core 1980A having at least a first sub-core 1950A and a second sub-core 1960A. In at least one embodiment, the graphics processor 1900 is a low-power processor with a single sub-core (e.g., 1950A). In at least one embodiment, the graphics processor 1900 includes multiple graphics cores 1980A-1980N, each graphics core including a set of first sub-cores 1950A-1950N and a set of second sub-cores 1960A-1960N. In at least one embodiment, each of the first sub-cores 1950A-1950N includes at least a first set of execution units 1952A-1952N and media / texture samplers 1954A-1954N. In at least one embodiment, each of the second sub-cores 1960A-1960N includes at least a second set of execution units 1962A-1962N and samplers 1964A-1964N. In at least one embodiment, each of the sub-cores 1950A-1950N and 1960A-1960N shares a set of shared resources 1970A-1970N. In at least one embodiment, the shared resources include shared cache memory and pixel operation logic.
[0236] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, inference and/or training logic 615 may be used in graphics processor 1900 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.
[0237] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects that approximate those applied to one or more lower-resolution versions of these images.
[0238] Figure 20 is a block diagram illustrating the microarchitecture of a processor 2000 according to at least one embodiment, which may include logic circuitry for executing instructions. In at least one embodiment, the processor 2000 can execute instructions, including x86 instructions, ARM instructions, and special-purpose instructions for application-specific integrated circuits (ASICs). In at least one embodiment, the processor 2000 may include registers for storing packed data, such as the 64-bit wide MMX™ registers in microprocessors enabled with MMX technology from Intel Corporation of Santa Clara, California. In at least one embodiment, the MMX registers, available in both integer and floating-point forms, can operate with packed data elements that accompany Single Instruction Multiple Data (“SIMD”) and Streaming SIMD Extensions (“SSE”) instructions. In at least one embodiment, 128-bit wide XMM registers associated with SSE2, SSE3, SSE4, AVX, or later (generally referred to as “SSEx”) technologies can hold such packed data operands. In at least one embodiment, the processor 2000 can execute instructions to accelerate machine learning or deep learning algorithms, training, or inference.
[0239] In at least one embodiment, processor 2000 includes an in-order front end (“front end”) 2001 to fetch instructions to be executed and prepare them for later use in the processor pipeline. In at least one embodiment, front end 2001 may include several units. In at least one embodiment, instruction prefetcher 2026 fetches instructions from memory and provides them to instruction decoder 2028, which in turn decodes or interprets the instructions. For example, in at least one embodiment, instruction decoder 2028 decodes a received instruction into one or more machine-executable operations called “microinstructions” or “micro-operations” (also referred to as “micro-ops” or “uops”). In at least one embodiment, instruction decoder 2028 parses the instruction into an opcode and corresponding data and control fields, which can be used by the microarchitecture to perform operations. In at least one embodiment, trace cache 2030 may assemble decoded microinstructions into program-ordered sequences or traces in microinstruction queue 2034 for execution. In at least one embodiment, when trace cache 2030 encounters a complex instruction, microcode ROM 2032 provides the microinstructions required to complete the operation.
[0240] In at least one embodiment, some instructions may be converted into a single micro-operation, while others require several micro-operations to complete the entire operation. In at least one embodiment, if more than four micro-instructions are required to complete an instruction, the instruction decoder 2028 may access the microcode ROM 2032 to execute the instruction. In at least one embodiment, an instruction may be decoded into a small number of micro-instructions for processing at the instruction decoder 2028. In at least one embodiment, if multiple micro-instructions are required to complete an operation, the instruction may be stored in the microcode ROM 2032. In at least one embodiment, the trace cache 2030 references an entry point programmable logic array (“PLA”) to determine the correct micro-instruction pointer for reading a microcode sequence from the microcode ROM 2032 to complete one or more instructions. In at least one embodiment, after the microcode ROM 2032 finishes sequencing the micro-operations for an instruction, the front end 2001 of the machine may resume fetching micro-operations from the trace cache 2030.
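By way of example only, the four-micro-instruction threshold described above may be sketched in Python as follows. The function name and the returned labels are hypothetical; only the threshold of four comes from the text, and the entry point PLA lookup is not modeled.

```python
MAX_DECODER_UOPS = 4  # threshold stated in the text: more than four goes to microcode ROM


def micro_op_source(uop_count: int) -> str:
    """Return which unit supplies the micro-operations for one instruction."""
    if uop_count < 1:
        raise ValueError("an instruction needs at least one micro-operation")
    if uop_count > MAX_DECODER_UOPS:
        # Complex instruction: the sequence is read from microcode ROM 2032.
        return "microcode_rom_2032"
    # Simple instruction: handled directly at instruction decoder 2028.
    return "instruction_decoder_2028"


sources = [micro_op_source(n) for n in (1, 4, 5)]
```

An instruction that needs exactly four micro-instructions still stays in the decoder under this reading, because the text specifies "more than four".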
[0241] In at least one embodiment, the out-of-order execution engine (“out-of-order engine”) 2003 can prepare instructions for execution. In at least one embodiment, the out-of-order execution logic has multiple buffers to smooth and reorder the instruction flow to optimize performance as instructions go down the pipeline and are scheduled for execution. In at least one embodiment, the out-of-order execution engine 2003 includes, but is not limited to, an allocator/register renamer 2040, a memory microinstruction queue 2042, an integer/floating-point microinstruction queue 2044, a memory scheduler 2046, a fast scheduler 2002, a slow/general-purpose floating-point scheduler (“slow/general-purpose FP scheduler”) 2004, and a simple floating-point scheduler (“simple FP scheduler”) 2006. In at least one embodiment, the fast scheduler 2002, the slow/general-purpose floating-point scheduler 2004, and the simple floating-point scheduler 2006 are also collectively referred to as “microinstruction schedulers 2002, 2004, 2006”. In at least one embodiment, the allocator/register renamer 2040 allocates the machine buffers and resources that each microinstruction needs in order to execute. In at least one embodiment, the allocator/register renamer 2040 renames logical registers to entries in a register file. In at least one embodiment, the allocator/register renamer 2040 also allocates an entry for each microinstruction in one of two microinstruction queues, the memory microinstruction queue 2042 for memory operations and the integer/floating-point microinstruction queue 2044 for non-memory operations, ahead of the memory scheduler 2046 and microinstruction schedulers 2002, 2004, and 2006. In at least one embodiment, the microinstruction schedulers 2002, 2004, and 2006 determine when a microinstruction is ready to execute based on the readiness of its dependent input register operand sources and the availability of the execution resources that the microinstruction needs to complete its operation. In at least one embodiment, the fast scheduler 2002 can schedule on each half of the main clock cycle, while the slow/general-purpose floating-point scheduler 2004 and the simple floating-point scheduler 2006 can schedule once per main processor clock cycle. In at least one embodiment, microinstruction schedulers 2002, 2004, and 2006 arbitrate for dispatch ports to schedule microinstructions for execution.
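As a simplified, non-limiting sketch, the readiness test applied by the microinstruction schedulers may be expressed in Python as follows. The `Uop` class, the register names, and the unit names are hypothetical; the sketch assumes one microinstruction per execution unit per scheduling pass and does not model half-cycle scheduling or port arbitration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Uop:
    name: str
    sources: Tuple[str, ...]  # input register operands this micro-op depends on
    unit: str                 # execution resource the micro-op needs


def is_ready(uop: Uop, ready_regs: Set[str], free_units: Set[str]) -> bool:
    """Ready when every source operand is available and the needed unit is free."""
    return all(src in ready_regs for src in uop.sources) and uop.unit in free_units


def schedule_pass(queue: Iterable[Uop], ready_regs: Set[str], free_units: Set[str]) -> List[str]:
    """Issue ready micro-ops in queue order, claiming each unit at most once."""
    available = set(free_units)
    issued = []
    for uop in queue:
        if is_ready(uop, ready_regs, available):
            issued.append(uop.name)
            available.discard(uop.unit)  # the unit is busy for the rest of this pass
    return issued


queue = [
    Uop("add", ("r1", "r2"), "fast_alu"),
    Uop("mul", ("r3",), "slow_alu"),   # r3 is not ready yet
    Uop("sub", ("r1",), "fast_alu"),   # operands ready, but the unit is taken by "add"
]
issued = schedule_pass(queue, {"r1", "r2"}, {"fast_alu", "slow_alu"})
```

Only `add` issues in this pass: `mul` waits on an operand and `sub` waits on an execution resource, which are the two conditions named in the paragraph above.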
[0242] In at least one embodiment, execution block 2011 includes, but is not limited to, integer register file/bypass network 2008, floating-point register file/bypass network (“FP register file/bypass network”) 2010, address generation units (“AGUs”) 2012 and 2014, fast arithmetic logic units (“fast ALUs”) 2016 and 2018, slow arithmetic logic unit (“slow ALU”) 2020, floating-point ALU (“FP”) 2022, and floating-point move unit (“FP move”) 2024. In at least one embodiment, integer register file/bypass network 2008 and floating-point register file/bypass network 2010 are also referred to herein as “register files 2008, 2010”. In at least one embodiment, AGUs 2012 and 2014, fast ALUs 2016 and 2018, slow ALU 2020, floating-point ALU 2022, and floating-point move unit 2024 are also referred to herein as "execution units 2012, 2014, 2016, 2018, 2020, 2022, and 2024". In at least one embodiment, execution block 2011 may include, but is not limited to, any number (including zero) and type of register files, bypass networks, address generation units, and execution units, in any combination.
[0243] In at least one embodiment, register files 2008 and 2010 may be arranged between microinstruction schedulers 2002, 2004, and 2006 and execution units 2012, 2014, 2016, 2018, 2020, 2022, and 2024. In at least one embodiment, integer register file/bypass network 2008 performs integer operations. In at least one embodiment, floating-point register file/bypass network 2010 performs floating-point operations. In at least one embodiment, each of register files 2008 and 2010 may include, but is not limited to, a bypass network that can bypass or forward just-completed results that have not yet been written to the register file to new dependent micro-operations. In at least one embodiment, register files 2008 and 2010 can communicate data with each other. In at least one embodiment, integer register file/bypass network 2008 may include, but is not limited to, two separate register files, one for the low-order 32 bits of data and a second for the high-order 32 bits of data. In at least one embodiment, the floating-point register file/bypass network 2010 may include, but is not limited to, 128-bit wide entries, since floating-point instructions typically have operands with a width of 64 to 128 bits.
[0244] In at least one embodiment, execution units 2012, 2014, 2016, 2018, 2020, 2022, and 2024 can execute instructions. In at least one embodiment, register files 2008 and 2010 store the integer and floating-point data operand values that the microinstructions need to execute. In at least one embodiment, processor 2000 can include, but is not limited to, any number and combination of execution units 2012, 2014, 2016, 2018, 2020, 2022, and 2024. In at least one embodiment, floating-point ALU 2022 and floating-point move unit 2024 can perform floating-point, MMX, SIMD, AVX, and SSE or other operations, including specialized machine learning instructions. In at least one embodiment, floating-point ALU 2022 can include, but is not limited to, a 64-bit by 64-bit floating-point divider to perform division, square root, and remainder micro-operations. In at least one embodiment, instructions involving floating-point values can be processed with floating-point hardware. In at least one embodiment, ALU operations can be passed to the fast ALUs 2016, 2018. In at least one embodiment, the fast ALUs 2016, 2018 can perform fast operations with an effective latency of half a clock cycle. In at least one embodiment, most complex integer operations are routed to the slow ALU 2020, because the slow ALU 2020 can include, but is not limited to, integer execution hardware for long-latency operations, such as multipliers, shifters, flag logic, and branch processing. In at least one embodiment, memory load/store operations can be performed by AGUs 2012, 2014. In at least one embodiment, the fast ALU 2016, fast ALU 2018, and slow ALU 2020 can perform integer operations on 64-bit data operands. In at least one embodiment, the fast ALU 2016, fast ALU 2018, and slow ALU 2020 can be implemented to support various data bit sizes, including 16, 32, 128, 256, etc. In at least one embodiment, the floating-point ALU 2022 and the floating-point move unit 2024 can be implemented to support a range of operands with various bit widths. In at least one embodiment, the floating-point ALU 2022 and the floating-point move unit 2024 can operate on 128-bit wide packed data operands in conjunction with SIMD and multimedia instructions.
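For illustration only, the notion of a 128-bit wide packed data operand may be sketched in Python as follows. The sketch assumes four unsigned 32-bit lanes with wraparound addition, which is one common layout and is not specified by the text; all function names are hypothetical.

```python
from typing import List

LANE_BITS = 32                      # assumed lane width
LANES = 4                           # 4 lanes x 32 bits = one 128-bit operand
LANE_MASK = (1 << LANE_BITS) - 1


def pack(lanes: List[int]) -> int:
    """Pack four unsigned 32-bit values into one 128-bit integer (lane 0 lowest)."""
    if len(lanes) != LANES:
        raise ValueError(f"expected {LANES} lanes, got {len(lanes)}")
    value = 0
    for index, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (index * LANE_BITS)
    return value


def unpack(value: int) -> List[int]:
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(value >> (index * LANE_BITS)) & LANE_MASK for index in range(LANES)]


def packed_add(a: int, b: int) -> int:
    """Add lane by lane; a carry never crosses from one lane into the next."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


result = unpack(packed_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40])))
```

A single packed operation therefore applies the same arithmetic to every lane, which is the single-instruction, multiple-data behavior referred to above.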
[0245] In at least one embodiment, microinstruction schedulers 2002, 2004, and 2006 dispatch dependent operations before the parent load has finished executing. In at least one embodiment, since microinstructions can be speculatively scheduled and executed within processor 2000, processor 2000 may also include logic for handling memory misses. In at least one embodiment, if a data load misses in the data cache, there may be dependent operations in flight in the pipeline that have left the scheduler with temporarily incorrect data. In at least one embodiment, a replay mechanism tracks and re-executes instructions that use incorrect data. In at least one embodiment, dependent operations may need to be replayed, while independent operations may be allowed to complete. In at least one embodiment, the schedulers and replay mechanism of the processor may also be designed to catch instruction sequences used for text string comparison operations.
[0246] In at least one embodiment, the term "register" may refer to an on-board processor storage location that can be used as part of an instruction to identify operands. In at least one embodiment, registers may be those that are usable from outside the processor (from a programmer's perspective). In at least one embodiment, a register may not be limited to a particular type of circuit. Rather, in at least one embodiment, a register may store data, provide data, and perform the functions described herein. In at least one embodiment, the registers described herein may be implemented by circuitry within the processor using a variety of different techniques, such as dedicated physical registers, dynamically allocated physical registers using register renaming, a combination of dedicated and dynamically allocated physical registers, etc. In at least one embodiment, an integer register stores 32-bit integer data. The register file of at least one embodiment also includes eight multimedia SIMD registers for packed data.
[0247] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, inference and/or training logic 615 may be incorporated into execution block 2011 and other memories or registers, shown or not shown. For example, in at least one embodiment, the training and/or inference techniques described herein may use one or more ALUs shown in execution block 2011. Furthermore, weight parameters may be stored in on-chip or off-chip memory and/or registers (shown or not shown) that configure the ALUs of execution block 2011 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0248] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects that approximate those applied to one or more lower-resolution versions of these images.
[0249] Figure 21 illustrates a deep learning application processor 2100 according to at least one embodiment. In at least one embodiment, the deep learning application processor 2100 uses instructions that, if executed by the deep learning application processor 2100, cause the deep learning application processor 2100 to perform some or all of the processes and techniques described herein. In at least one embodiment, the deep learning application processor 2100 is an application-specific integrated circuit (ASIC). In at least one embodiment, the application processor 2100 performs matrix multiplication operations that are "hard-wired" into hardware, performed as a result of executing one or more instructions, or both. In at least one embodiment, the deep learning application processor 2100 includes, but is not limited to, processing clusters 2110(1)-2110(12), inter-chip links (“ICLs”) 2120(1)-2120(12), inter-chip controllers (“ICCs”) 2130(1)-2130(2), memory controllers (“Mem Ctrlrs”) 2142(1)-2142(4), high-bandwidth memory physical layers (“HBM PHYs”) 2144(1)-2144(4), a management controller central processing unit (“management controller CPU”) 2150, a peripheral component interconnect express controller and direct memory access block (“PCIe controller and DMA”) 2170, and a sixteen-lane peripheral component interconnect express port (“PCI Express x 16”) 2180.
[0250] In at least one embodiment, processing clusters 2110 can perform deep learning operations, including inference or prediction operations based on weight parameters computed using one or more training techniques, including those described herein. In at least one embodiment, each processing cluster 2110 can include, but is not limited to, any number and type of processors. In at least one embodiment, deep learning application processor 2100 can include any number and type of processing clusters 2110. In at least one embodiment, the inter-chip links 2120 are bidirectional. In at least one embodiment, the inter-chip links 2120 and the inter-chip controllers 2130 enable multiple deep learning application processors 2100 to exchange information, including activation information generated from executing one or more machine learning algorithms embodied in one or more neural networks. In at least one embodiment, deep learning application processor 2100 can include any number (including zero) and type of ICLs 2120 and ICCs 2130.
[0251] In at least one embodiment, the HBM2s 2140 provide a total of 32 GB of memory. In at least one embodiment, HBM2 2140(i) is associated with both the memory controller 2142(i) and the HBM PHY 2144(i). In at least one embodiment, any number of HBM2s 2140 can provide any type and total amount of high-bandwidth memory and can be associated with any number (including zero) and type of memory controllers 2142 and HBM PHYs 2144. In at least one embodiment, SPI, I2C, GPIO 3360, PCIe controller 2160, and DMA 2170 and/or PCIe 2180 may be replaced with any number and type of blocks that implement any number and type of communication standards in any technically feasible manner.
[0252] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. Details regarding inference and/or training logic 615 are provided herein in conjunction with Figure 6A and/or Figure 6B. In at least one embodiment, the deep learning application processor 2100 is used to train a machine learning model (e.g., a neural network) to predict or infer information provided to the deep learning application processor 2100. In at least one embodiment, the deep learning application processor 2100 is used to infer or predict information based on a trained machine learning model (e.g., a neural network) that has been trained by another processor or system or by the deep learning application processor 2100. In at least one embodiment, the processor 2100 may be used to perform one or more neural network use cases described herein.
[0253] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects that approximate those applied to one or more lower-resolution versions of these images.
[0254] Figure 22 is a block diagram of a neuromorphic processor 2200 according to at least one embodiment. In at least one embodiment, the neuromorphic processor 2200 may receive one or more inputs from a source external to the neuromorphic processor 2200. In at least one embodiment, these inputs may be transmitted to one or more neurons 2202 within the neuromorphic processor 2200. In at least one embodiment, the neurons 2202 and their components may be implemented using circuitry or logic including one or more arithmetic logic units (ALUs). In at least one embodiment, the neuromorphic processor 2200 may include, but is not limited to, thousands or even millions of instances of neurons 2202, but any suitable number of neurons 2202 may be used. In at least one embodiment, each instance of a neuron 2202 may include a neuron input 2204 and a neuron output 2206. In at least one embodiment, a neuron 2202 may generate an output that can be transmitted to the inputs of other instances of the neuron 2202. In at least one embodiment, the neuron input 2204 and the neuron output 2206 may be interconnected via synapses 2208.
[0255] In at least one embodiment, neurons 2202 and synapses 2208 may be interconnected such that the neuromorphic processor 2200 operates to process or analyze information that it receives. In at least one embodiment, neuron 2202 may send an output pulse (or “fire” or “spike”) when the input received through neuron input 2204 exceeds a threshold. In at least one embodiment, neuron 2202 may sum or integrate the signals received at neuron input 2204. For example, in at least one embodiment, neuron 2202 may be implemented as a leaky integrate-and-fire neuron, wherein if the sum (referred to as the “membrane potential”) exceeds a threshold, neuron 2202 may generate an output (or “fire”) using a transfer function such as a sigmoid or threshold function. In at least one embodiment, the leaky integrate-and-fire neuron may sum the signals received at neuron input 2204 into a membrane potential and may apply a decay factor (or leak) to reduce the membrane potential. In at least one embodiment, a leaky integrate-and-fire neuron may fire if multiple input signals are received at neuron input 2204 quickly enough to exceed the threshold (i.e., before the membrane potential decays too low to fire). In at least one embodiment, neuron 2202 may be implemented using circuitry or logic that receives input, integrates the input into the membrane potential, and decays the membrane potential. In at least one embodiment, inputs may be averaged, or any other suitable transfer function may be used. Furthermore, in at least one embodiment, neuron 2202 may include, but is not limited to, comparator circuitry or logic that generates an output spike at neuron output 2206 when the result of applying the transfer function to neuron input 2204 exceeds a threshold. In at least one embodiment, once neuron 2202 fires, it may disregard previously received input information by, for example, resetting the membrane potential to 0 or another suitable default value. In at least one embodiment, once the membrane potential is reset to 0, neuron 2202 may resume normal operation after a suitable period of time (or refractory period).
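As a minimal, non-limiting sketch, the integrate, leak, threshold, and reset behavior described above may be modeled in discrete time steps in Python as follows. The class name, the parameter values, the multiplicative form of the leak, and the step-based recovery period are assumptions chosen for illustration and are not taken from the text.

```python
class LeakyIntegrateFireNeuron:
    """Discrete-time neuron: decay the potential, add the input, compare, reset."""

    def __init__(self, threshold=1.0, leak=0.5, reset=0.0, recovery_steps=1):
        self.threshold = threshold            # membrane potential needed to fire
        self.leak = leak                      # assumed multiplicative decay per step
        self.reset = reset                    # potential restored after firing
        self.recovery_steps = recovery_steps  # steps during which inputs are ignored
        self.potential = reset
        self.cooldown = 0

    def step(self, input_signal: float) -> bool:
        """Advance one time step; return True if the neuron fires."""
        if self.cooldown > 0:
            self.cooldown -= 1                # recovering: input is disregarded
            return False
        self.potential = self.potential * self.leak + input_signal
        if self.potential > self.threshold:
            self.potential = self.reset       # forget previously received input
            self.cooldown = self.recovery_steps
            return True
        return False


fast = LeakyIntegrateFireNeuron()
fast_spikes = [fast.step(x) for x in (0.6, 0.6, 0.6, 0.6)]  # inputs arrive every step

slow = LeakyIntegrateFireNeuron()
slow_spikes = [slow.step(x) for x in (0.6, 0.0, 0.0, 0.6)]  # same size, spaced apart
```

With closely spaced inputs the potential accumulates past the threshold on the third step, whereas the same input magnitude spaced farther apart decays away before it can accumulate, so the second neuron never fires.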
[0256] In at least one embodiment, neurons 2202 can be interconnected via synapses 2208. In at least one embodiment, synapses 2208 can be operated to transmit signals from the output of a first neuron 2202 to the input of a second neuron 2202. In at least one embodiment, neurons 2202 can transmit information on more than one instance of synapses 2208. In at least one embodiment, one or more instances of neuron outputs 2206 can be connected via instances of synapses 2208 to instances of neuron inputs 2204 within the same neuron 2202. In at least one embodiment, an instance of neuron 2202 that produces an output to be transmitted on the instance of synapse 2208 can be referred to as a "presynaptic neuron". In at least one embodiment, an instance of neuron 2202 that receives input transmitted via an instance of synapse 2208 can be referred to as a "postsynaptic neuron". In at least one embodiment, regarding various instances of synapse 2208, since an instance of neuron 2202 can receive input from one or more instances of synapse 2208 and can also transmit output through one or more instances of synapse 2208, a single instance of neuron 2202 can be both a "presynaptic neuron" and a "postsynaptic neuron".
[0257] In at least one embodiment, neurons 2202 can be organized into one or more layers. In at least one embodiment, each instance of neuron 2202 can have a neuron output 2206, which can fan out to one or more neuron inputs 2204 via one or more synapses 2208. In at least one embodiment, the neuron outputs 2206 of neurons 2202 in the first layer 2210 can be connected to the neuron inputs 2204 of neurons 2202 in the second layer 2212. In at least one embodiment, the first layer 2210 can be referred to as a "feedforward layer". In at least one embodiment, each instance of neuron 2202 in the first layer 2210 can fan out to each instance of neuron 2202 in the second layer 2212. In at least one embodiment, the first layer 2210 can be referred to as a "fully connected feedforward layer". In at least one embodiment, each instance of neuron 2202 in the second layer 2212 fans out to fewer than all instances of neuron 2202 in the third layer 2214. In at least one embodiment, the second layer 2212 can be referred to as a "sparsely connected feedforward layer". In at least one embodiment, neurons 2202 in the second layer 2212 may fan out to neurons 2202 in multiple other layers, including to neurons 2202 in the (same) second layer 2212. In at least one embodiment, the second layer 2212 may be referred to as a “recurrent layer.” In at least one embodiment, the neuromorphic processor 2200 may include any suitable combination of recurrent layers and feedforward layers, including but not limited to sparsely connected feedforward layers and fully connected feedforward layers.
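For illustration only, the difference between full and sparse fan-out between two layers may be sketched with boolean connectivity matrices in Python as follows. The function names and the modular wiring pattern used for the sparse case are assumptions; the text requires only that each neuron in a sparsely connected layer reach fewer than all neurons of the next layer.

```python
from typing import List

Matrix = List[List[bool]]  # matrix[s][d] is True when source neuron s feeds destination d


def fully_connected(n_src: int, n_dst: int) -> Matrix:
    """Every source neuron fans out to every destination neuron."""
    return [[True] * n_dst for _ in range(n_src)]


def sparsely_connected(n_src: int, n_dst: int, fan_out: int) -> Matrix:
    """Every source neuron fans out to exactly fan_out (< n_dst) destinations."""
    if not 0 < fan_out < n_dst:
        raise ValueError("fan_out must be positive and smaller than n_dst")
    # Assumed pattern: source s reaches destinations s, s+1, ... (wrapping around).
    return [[(d - s) % n_dst < fan_out for d in range(n_dst)] for s in range(n_src)]


def fan_out_counts(matrix: Matrix) -> List[int]:
    """Number of destination neurons reached by each source neuron."""
    return [sum(row) for row in matrix]


dense = fan_out_counts(fully_connected(3, 4))
sparse = fan_out_counts(sparsely_connected(3, 4, 2))
```

A recurrent layer could be represented in the same way by a matrix whose source and destination neurons belong to the same layer.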
[0258] In at least one embodiment, the neuromorphic processor 2200 may include, but is not limited to, a reconfigurable interconnect architecture or dedicated hardwired interconnects to connect synapses 2208 to neurons 2202. In at least one embodiment, the neuromorphic processor 2200 may include, but is not limited to, circuitry or logic that allows synapses to be assigned to different neurons 2202 as needed, depending on the neural network topology and neuron fan-in / fan-out. For example, in at least one embodiment, synapses 2208 may be connected to neurons 2202 using interconnect structures (such as on-chip networks) or via dedicated connections. In at least one embodiment, synaptic interconnects and their components may be implemented using circuitry or logic.
[0259] Inference and/or training logic 615 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects that approximate those applied to one or more lower-resolution versions of these images.
[0260] Figure 23 illustrates a processing system according to at least one embodiment. In at least one embodiment, system 2300 includes one or more processors 2302 and one or more graphics processors 2308, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 2302 or processor cores 2307. In at least one embodiment, system 2300 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.
[0261] In at least one embodiment, system 2300 may include or be incorporated within a server-based gaming platform or a game console, including a game and media console, a mobile game console, a handheld game console, or an online game console. In at least one embodiment, system 2300 is a mobile phone, smartphone, tablet computing device, or mobile internet device. In at least one embodiment, processing system 2300 may also include, be coupled to, or be integrated into a wearable device, such as a smartwatch, smart glasses, an augmented reality device, or a virtual reality device. In at least one embodiment, processing system 2300 is a television or set-top box device having one or more processors 2302 and a graphical interface generated by one or more graphics processors 2308.
[0262] In at least one embodiment, each of the one or more processors 2302 includes one or more processor cores 2307 for processing instructions that, when executed, perform operations for system and user software. In at least one embodiment, each of the one or more processor cores 2307 is configured to process a particular instruction set 2309. In at least one embodiment, the instruction set 2309 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). In at least one embodiment, the processor cores 2307 may each process a different instruction set 2309, which may include instructions that facilitate the emulation of other instruction sets. In at least one embodiment, the processor core 2307 may also include other processing devices, such as a digital signal processor (DSP).
[0263] In at least one embodiment, processor 2302 includes cache memory 2304. In at least one embodiment, processor 2302 may have a single internal cache or multiple levels of internal caches. In at least one embodiment, the cache memory is shared among various components of processor 2302. In at least one embodiment, processor 2302 also uses an external cache (e.g., a Level 3 (L3) cache or a last-level cache (LLC)) (not shown), which can be shared among processor cores 2307 using known cache coherence techniques. In at least one embodiment, processor 2302 further includes a register file 2306, which may include different types of registers for storing different types of data (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers). In at least one embodiment, register file 2306 may include general-purpose registers or other registers.
[0264] In at least one embodiment, one or more processors 2302 are coupled to one or more interface buses 2310 to transmit communication signals, such as address, data, or control signals, between the processors 2302 and other components in the system 2300. In at least one embodiment, the interface bus 2310 may be a processor bus, such as a version of the Direct Media Interface (DMI) bus. In at least one embodiment, the interface bus 2310 is not limited to the DMI bus and may include one or more peripheral component interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. In at least one embodiment, the processor 2302 includes an integrated memory controller 2316 and a platform controller hub 2330. In at least one embodiment, the memory controller 2316 facilitates communication between memory devices and other components of the processing system 2300, while the platform controller hub (PCH) 2330 provides connectivity to input / output (I / O) devices via a local I / O bus.
[0265] In at least one embodiment, memory device 2320 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or a device with suitable performance for use as processor memory. In at least one embodiment, memory device 2320 may be used as system memory of processing system 2300 to store data 2322 and instructions 2321 for use when one or more processors 2302 execute an application or process. In at least one embodiment, memory controller 2316 is also coupled to an optional external graphics processor 2312, which may communicate with one or more graphics processors 2308 of processor 2302 to perform graphics and media operations. In at least one embodiment, display device 2311 may be connected to processor 2302. In at least one embodiment, display device 2311 may include one or more internal display devices, such as in mobile electronic devices or laptop devices, or external display devices connected via a display interface (e.g., DisplayPort). In at least one embodiment, the display device 2311 may include a head-mounted display (HMD), such as a stereoscopic display device for virtual reality (VR) or augmented reality (AR) applications.
[0266] In at least one embodiment, the platform controller hub 2330 enables peripheral devices to connect to the memory device 2320 and the processor 2302 via a high-speed I / O bus. In at least one embodiment, the I / O peripheral devices include, but are not limited to, an audio controller 2346, a network controller 2334, a firmware interface 2328, a wireless transceiver 2326, a touch sensor 2325, and a data storage device 2324 (e.g., a hard disk drive, flash memory, etc.). In at least one embodiment, the data storage device 2324 may be connected via a storage interface (e.g., SATA) or via a peripheral bus, such as a peripheral component interconnect bus (e.g., PCI, PCIe). In at least one embodiment, the touch sensor 2325 may include a touchscreen sensor, a pressure sensor, or a fingerprint sensor. In at least one embodiment, the wireless transceiver 2326 may be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver, such as a 3G, 4G, or LTE transceiver. In at least one embodiment, the firmware interface 2328 enables communication with the system firmware and may be, for example, a Unified Extensible Firmware Interface (UEFI). In at least one embodiment, network controller 2334 may enable network connectivity to a wired network. In at least one embodiment, a high-performance network controller (not shown) is coupled to interface bus 2310. In at least one embodiment, audio controller 2346 is a multi-channel high-definition audio controller. In at least one embodiment, processing system 2300 includes an optional legacy I / O controller 2340 for coupling legacy (e.g., Personal System 2 (PS / 2)) devices to the system. In at least one embodiment, platform controller hub 2330 may also be connected to one or more Universal Serial Bus (USB) controllers 2342 that connect input devices, such as a keyboard and mouse combination 2343, a camera 2344, or other USB input devices.
[0267] In at least one embodiment, instances of the memory controller 2316 and platform controller hub 2330 may be integrated into a discrete external graphics processor, such as external graphics processor 2312. In at least one embodiment, the platform controller hub 2330 and / or the memory controller 2316 may be external to one or more processors 2302. For example, in at least one embodiment, system 2300 may include external memory controller 2316 and platform controller hub 2330, which may be configured as a memory controller hub and peripheral controller hub in a system chipset communicating with processor 2302.
[0268] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, some or all of the inference and / or training logic 615 may be incorporated into the graphics processor 2300. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs embodied in the graphics processor 2312. Furthermore, in at least one embodiment, the inference and / or training operations described herein may be performed using logic other than the logic shown in Figure 6A or Figure 6B. In at least one embodiment, the weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown), which configure the ALUs of the graphics processor 2300 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0269] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0270] Figure 24 is a block diagram of a processor 2400 having one or more processor cores 2402A-2402N, an integrated memory controller 2414, and an integrated graphics processor 2408 according to at least one embodiment. In at least one embodiment, the processor 2400 may include additional cores, up to and including additional core 2402N, indicated by dashed boxes. In at least one embodiment, each processor core 2402A-2402N includes one or more internal cache units 2404A-2404N. In at least one embodiment, each processor core may also access one or more shared cache units 2406.
[0271] In at least one embodiment, internal cache units 2404A-2404N and shared cache unit 2406 represent a cache memory hierarchy within processor 2400. In at least one embodiment, cache memory units 2404A-2404N may include at least one level of instruction and data cache within each processor core and one or more levels of cache in a shared intermediate cache, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, wherein the highest level of cache preceding external memory is classified as LLC. In at least one embodiment, cache coherence logic maintains coherence between the various cache units 2406 and 2404A-2404N.
[0272] In at least one embodiment, the processor 2400 may further include a set of one or more bus controller units 2416 and a system agent core 2410. In at least one embodiment, the one or more bus controller units 2416 manage a set of peripheral buses, such as one or more PCI or PCIe buses. In at least one embodiment, the system agent core 2410 provides management functions for various processor components. In at least one embodiment, the system agent core 2410 includes one or more integrated memory controllers 2414 to manage access to various external memory devices (not shown).
[0273] In at least one embodiment, one or more processor cores 2402A-2402N include support for multi-threaded concurrent processing. In at least one embodiment, system agent core 2410 includes components for coordinating and operating cores 2402A-2402N during multi-threaded processing. In at least one embodiment, system agent core 2410 may additionally include a power control unit (PCU) including logic and components for regulating one or more power states of processor cores 2402A-2402N and graphics processor 2408.
[0274] In at least one embodiment, processor 2400 further includes a graphics processor 2408 for performing graphics processing operations. In at least one embodiment, graphics processor 2408 is coupled to the shared cache units 2406 and to the system agent core 2410, which includes one or more integrated memory controllers 2414. In at least one embodiment, system agent core 2410 further includes a display controller 2411 for driving graphics processor output to one or more coupled displays. In at least one embodiment, display controller 2411 may also be a separate module coupled to graphics processor 2408 via at least one interconnect, or it may be integrated within graphics processor 2408.
[0275] In at least one embodiment, ring-based interconnect unit 2412 is used to couple internal components of processor 2400. In at least one embodiment, alternative interconnect units, such as point-to-point interconnects, switched interconnects, or other technologies, may be used. In at least one embodiment, graphics processor 2408 is coupled to ring interconnect 2412 via I / O link 2413.
[0276] In at least one embodiment, I / O link 2413 represents at least one of a variety of I / O interconnects, including packaged I / O interconnects that facilitate communication between various processor components and high-performance embedded memory module 2418 (e.g., eDRAM module). In at least one embodiment, each of processor cores 2402A-2402N and graphics processor 2408 uses embedded memory module 2418 as a shared last-level cache.
[0277] In at least one embodiment, processor cores 2402A-2402N are homogeneous cores executing a common instruction set architecture. In at least one embodiment, processor cores 2402A-2402N are heterogeneous in terms of instruction set architecture (ISA), with one or more processor cores 2402A-2402N executing a common instruction set, while one or more other processor cores 2402A-2402N execute a subset of the common instruction set or a different instruction set. In at least one embodiment, processor cores 2402A-2402N are heterogeneous in terms of microarchitecture, with one or more cores having relatively high power consumption coupled with one or more cores having lower power consumption. In at least one embodiment, processor 2400 may be implemented on one or more chips or implemented as a SoC integrated circuit.
[0278] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, some or all of the inference and / or training logic 615 may be incorporated into the processor 2400. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs embodied in the graphics processor 2312, one or more graphics cores 2402A-2402N, or other components in Figure 24. Furthermore, in at least one embodiment, the inference and / or training operations described herein may be performed using logic other than the logic shown in Figure 6A or Figure 6B. In at least one embodiment, weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown), which configure the ALUs of the graphics processor 2400 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0279] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0280] Figure 25 is a block diagram of the hardware logic of a graphics processor core 2500 according to at least one embodiment described herein. In at least one embodiment, the graphics processor core 2500 is included within a graphics core array. In at least one embodiment, the graphics processor core 2500 (sometimes referred to as a core slice) may be one or more graphics cores within a modular graphics processor. In at least one embodiment, the graphics processor core 2500 is an example of a graphics core slice, and a graphics processor as described herein may include multiple graphics core slices based on target power and performance envelopes. In at least one embodiment, each graphics core 2500 may include a fixed-function block 2530 coupled to a plurality of sub-cores 2501A-2501F (also referred to as sub-slices), said plurality of sub-cores including modular blocks having general and fixed-function logic.
[0281] In at least one embodiment, the fixed-function block 2530 includes a geometry and fixed-function pipeline 2536, which, for example, may be shared by all sub-cores of the graphics processor 2500 in a lower-performance and / or lower-power graphics processor implementation. In at least one embodiment, the geometry and fixed-function pipeline 2536 includes a 3D fixed-function pipeline, a video front-end unit, a thread generator and a thread dispatcher, and a unified return buffer manager that manages a unified return buffer.
[0282] In at least one embodiment, the fixed-function block 2530 further includes a graphics SoC interface 2537, a graphics microcontroller 2538, and a media pipeline 2539. In at least one embodiment, the graphics SoC interface 2537 provides an interface between the graphics core 2500 and other processor cores in a system-on-chip integrated circuit. In at least one embodiment, the graphics microcontroller 2538 is a programmable subprocessor configurable to manage various functions of the graphics processor 2500, including thread dispatch, scheduling, and preemption. In at least one embodiment, the media pipeline 2539 includes logic that facilitates decoding, encoding, preprocessing, and / or post-processing of multimedia data, including image and video data. In at least one embodiment, the media pipeline 2539 implements media operations via requests to compute or sampling logic within subcores 2501A-2501F.
[0283] In at least one embodiment, the SoC interface 2537 enables the graphics core 2500 to communicate with a general-purpose application processor core (e.g., a CPU) and / or other components within the SoC, including memory hierarchy elements such as shared last-level cache, system RAM, and / or embedded on-chip or on-package DRAM. In at least one embodiment, the SoC interface 2537 also enables communication with fixed-function devices within the SoC (e.g., a camera imaging pipeline) and enables the use and / or implementation of global memory atomics that can be shared between the graphics core 2500 and the CPU within the SoC. In at least one embodiment, the SoC interface 2537 also implements power management control for the graphics core 2500 and enables interfacing between the clock domain of the graphics core 2500 and other clock domains within the SoC. In at least one embodiment, the SoC interface 2537 enables the receipt of command buffers from a command streamer and a global thread dispatcher, which are configured to provide commands and instructions to each of one or more graphics cores within the graphics processor. In at least one embodiment, commands and instructions can be dispatched to the media pipeline 2539 when a media operation is to be performed, or to a geometry and fixed-function pipeline (e.g., geometry and fixed-function pipeline 2536, geometry and fixed-function pipeline 2514) when a graphics processing operation is to be performed.
[0284] In at least one embodiment, the graphics microcontroller 2538 can be configured to perform various scheduling and management tasks for the graphics core 2500. In at least one embodiment, the graphics microcontroller 2538 can perform graphics and / or compute workload scheduling on various graphics parallel engines within the execution unit (EU) arrays 2502A-2502F, 2504A-2504F in subcores 2501A-2501F. In at least one embodiment, host software executing on a CPU core of the SoC that includes the graphics core 2500 can submit workloads to one of a plurality of graphics processor doorbells, which invokes a scheduling operation on the appropriate graphics engine. In at least one embodiment, the scheduling operation includes determining which workload should be run next, submitting the workload to a command streamer, preempting existing workloads running on the engine, monitoring the progress of the workload, and notifying the host software when the workload is completed. In at least one embodiment, the graphics microcontroller 2538 may also facilitate a low-power or idle state of the graphics core 2500, thereby providing the graphics core 2500 with the ability to save and restore registers across low-power state transitions within the graphics core 2500, independent of the operating system and / or the graphics driver software on the system.
[0285] In at least one embodiment, the graphics core 2500 may have more or fewer sub-cores than the illustrated sub-cores 2501A-2501F, up to N modular sub-cores. For each group of N sub-cores, in at least one embodiment, the graphics core 2500 may further include shared functional logic 2510, shared and / or cache memory 2512, geometry / fixed-function pipeline 2514, and additional fixed-function logic 2516 to accelerate various graphics and computational processing operations. In at least one embodiment, the shared functional logic 2510 may include logic units (e.g., samplers, mathematical and / or inter-thread communication logic) that can be shared by each of the N sub-cores within the graphics core 2500. In at least one embodiment, the shared and / or cache memory 2512 may be the last-level cache of the N sub-cores 2501A-2501F within the graphics core 2500, and may also be used as shared memory accessible by multiple sub-cores. In at least one embodiment, a geometry / fixed-function pipeline 2514 may be included in place of the geometry / fixed-function pipeline 2536 within the fixed-function block 2530, and may include the same or similar logic units.
[0286] In at least one embodiment, the graphics core 2500 includes additional fixed-function logic 2516, which may include various fixed-function acceleration logic for use by the graphics core 2500. In at least one embodiment, the additional fixed-function logic 2516 includes an additional geometry pipeline for use in position-only shading. In position-only shading, at least two geometry pipelines exist: a full geometry pipeline within the geometry and fixed-function pipelines 2514, 2536, and a culling pipeline, which is an additional geometry pipeline that can be included in the additional fixed-function logic 2516. In at least one embodiment, the culling pipeline is a trimmed-down version of the full geometry pipeline. In at least one embodiment, the full pipeline and the culling pipeline can execute different instances of an application, each instance having a separate context. In at least one embodiment, position-only shading can hide long culling runs of discarded triangles, thereby allowing shading to be completed earlier in some cases. For example, in at least one embodiment, the culling pipeline logic in the additional fixed-function logic 2516 can execute position shaders in parallel with the main application and typically generates critical results faster than the full pipeline, because the culling pipeline fetches and shades only the position attributes of vertices, without performing rasterization or rendering pixels to the framebuffer. In at least one embodiment, the culling pipeline can use the generated critical results to compute visibility information for all triangles, regardless of whether those triangles are culled. In at least one embodiment, the full pipeline (which may be referred to as the replay pipeline in this case) can consume the visibility information to skip culled triangles and shade only the visible triangles that are ultimately passed to the rasterization stage.
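The two-pass flow of position-only shading described above can be sketched in software as follows. This is only an illustrative model, not the described hardware: the names `Triangle`, `cull_pipeline`, and `replay_pipeline` are hypothetical, and the visibility test (a positive signed area, i.e. back-face and degenerate-triangle culling) is an assumption chosen for simplicity, since the description does not fix a particular culling criterion.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Triangle:
    tri_id: int
    positions: Tuple[Point, Point, Point]  # position attributes only


def signed_area(p0: Point, p1: Point, p2: Point) -> float:
    """Signed area of a 2D triangle; positive for counter-clockwise winding."""
    return 0.5 * ((p1[0] - p0[0]) * (p2[1] - p0[1])
                  - (p2[0] - p0[0]) * (p1[1] - p0[1]))


def cull_pipeline(triangles: List[Triangle]) -> Dict[int, bool]:
    """Position-only pass: computes visibility for every triangle.

    No rasterization is performed and no pixels are written here.
    The signed-area test is an assumed stand-in for the real culling logic.
    """
    return {t.tri_id: signed_area(*t.positions) > 0.0 for t in triangles}


def replay_pipeline(triangles: List[Triangle],
                    visibility: Dict[int, bool]) -> List[int]:
    """Full pass: skips culled triangles and shades only the visible ones."""
    shaded = []
    for t in triangles:
        if not visibility[t.tri_id]:
            continue  # culled triangle: no full shading work is spent on it
        shaded.append(t.tri_id)  # stand-in for full shading and rasterization
    return shaded


triangles = [
    Triangle(0, ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))),  # counter-clockwise
    Triangle(1, ((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))),  # clockwise
    Triangle(2, ((0.0, 0.0), (1.0, 1.0), (2.0, 2.0))),  # degenerate
    Triangle(3, ((2.0, 0.0), (3.0, 0.0), (2.0, 1.0))),  # counter-clockwise
]
visibility = cull_pipeline(triangles)
shaded_ids = replay_pipeline(triangles, visibility)
```

In this sketch the visibility map plays the role of the visibility information that the replay pipeline consumes; in hardware the two passes may run in parallel rather than one after the other.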
[0287] In at least one embodiment, the additional fixed-function logic 2516 may also include machine learning acceleration logic, such as fixed-function matrix multiplication logic, for implementing optimizations for machine learning training or inference.
[0288] In at least one embodiment, each graphics subcore 2501A-2501F includes a set of execution resources that can be used to perform graphics, media, and computational operations in response to requests from the graphics pipeline, media pipeline, or shader program. In at least one embodiment, the graphics subcore 2501A-2501F includes multiple EU arrays 2502A-2502F, 2504A-2504F, thread dispatch and inter-thread communication (TD / IC) logic 2503A-2503F, 3D (e.g., texture) samplers 2505A-2505F, media samplers 2506A-2506F, shader processors 2507A-2507F, and shared local memory (SLM) 2508A-2508F. Each of the EU arrays 2502A-2502F and 2504A-2504F contains multiple execution units, which are general-purpose graphics processing units capable of servicing graphics, media, or computational operations, performing floating-point and integer / fixed-point logic operations, including graphics, media, or computational shader programs. In at least one embodiment, the TD / IC logic 2503A-2503F performs local thread dispatch and thread control operations for the execution units within the subcore and facilitates communication between threads executing on the execution units of the subcore. In at least one embodiment, the 3D samplers 2505A-2505F can read data associated with textures or other 3D graphics into memory. In at least one embodiment, the 3D samplers can read texture data differently based on the sampling state and texture format configured and associated with a given texture. In at least one embodiment, the media samplers 2506A-2506F can perform similar read operations based on the type and format associated with the media data. In at least one embodiment, each graphics subcore 2501A-2501F may alternatively include a unified 3D and media sampler. 
In at least one embodiment, threads executing on execution units within each subcore 2501A-2501F may utilize shared local memory 2508A-2508F within each subcore, enabling threads executing within a thread group to utilize a common pool of on-chip memory for execution.
[0289] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, some or all of the inference and / or training logic 615 may be incorporated into graphics processor 2510. For example, in at least one embodiment, the training and / or inference techniques described herein may use one or more ALUs embodied in graphics processor 2312, graphics microcontroller 2538, geometry and fixed-function pipelines 2514 and 2536, or other logic in Figure 25. Furthermore, in at least one embodiment, the inference and / or training operations described herein may be performed using logic other than the logic shown in Figure 6A or Figure 6B. In at least one embodiment, the weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown), which configure the ALUs of the graphics processor 2500 to execute one or more machine learning algorithms, neural network architectures, use cases or training techniques described herein.
[0290] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0291] Figures 26A-26B illustrate thread execution logic 2600 including an array of processing elements of a graphics processor core, according to at least one embodiment. Figure 26A shows at least one embodiment in which thread execution logic 2600 is used. Figure 26B shows exemplary internal details of an execution unit according to at least one embodiment.
[0292] As shown in Figure 26A, in at least one embodiment, thread execution logic 2600 includes a shader processor 2602, a thread dispatcher 2604, an instruction cache 2606, a scalable execution unit array including multiple execution units 2608A-2608N, a sampler 2610, a data cache 2612, and a data port 2614. In at least one embodiment, the scalable execution unit array can be dynamically scaled, for example, based on the computational requirements of the workload, by enabling or disabling one or more execution units (e.g., any one of execution units 2608A, 2608B, 2608C, 2608D to 2608N-1 and 2608N). In at least one embodiment, the scalable execution units are interconnected via an interconnect structure that links to each execution unit. In at least one embodiment, the thread execution logic 2600 includes one or more connections to memory (such as system memory or cache memory) via one or more of the instruction cache 2606, data port 2614, sampler 2610, and execution units 2608A-2608N. In at least one embodiment, each execution unit (e.g., 2608A) is an independent programmable general-purpose computing unit capable of executing multiple concurrent hardware threads while processing multiple data elements in parallel for each thread. In at least one embodiment, the array of execution units 2608A-2608N is scalable to include any number of individual execution units.
[0293] In at least one embodiment, execution units 2608A-2608N are primarily used to execute shader programs. In at least one embodiment, shader processor 2602 can process various shader programs and dispatch execution threads associated with the shader programs via thread dispatcher 2604. In at least one embodiment, thread dispatcher 2604 includes logic for arbitrating thread initiation requests from the graphics and media pipelines and for instantiating requested threads on one or more execution units 2608A-2608N. For example, in at least one embodiment, the geometry pipeline can dispatch vertex, tessellation, or geometry shaders to thread execution logic for processing. In at least one embodiment, thread dispatcher 2604 can also handle runtime thread generation requests from executing shader programs.
[0294] In at least one embodiment, the execution units 2608A-2608N support an instruction set that includes native support for many standard 3D graphics shader instructions, enabling shader programs from graphics libraries (e.g., Direct3D and OpenGL) to execute with minimal translation. In at least one embodiment, the execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders). In at least one embodiment, each execution unit 2608A-2608N includes one or more arithmetic logic units (ALUs) capable of multi-issue single-instruction, multiple-data (SIMD) execution, and multithreaded operation enables an efficient execution environment despite higher-latency memory accesses. In at least one embodiment, each hardware thread within each execution unit has a dedicated high-bandwidth register file and associated independent thread state. In at least one embodiment, execution is multi-issue per clock cycle to pipelines capable of integer, single-precision, and double-precision floating-point operations, SIMD branching, logical operations, transcendental operations, and other operations. In at least one embodiment, while waiting for data from memory or from one of the shared functions, dependency logic within execution units 2608A-2608N causes the waiting thread to sleep until the requested data is returned. In at least one embodiment, while the waiting thread is sleeping, hardware resources can be dedicated to processing other threads. For example, in at least one embodiment, during the latency associated with a vertex shader operation, the execution unit can perform operations for a pixel shader, a fragment shader, or another type of shader program (including a different vertex shader).
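The benefit of putting a waiting thread to sleep while other threads run can be illustrated with a small cycle-count model. This is a simplified sketch under stated assumptions, not the described dependency logic: the function `simulate_issue_cycles` is hypothetical, each operation is assumed to occupy one issue cycle followed by a memory request of fixed latency, and at most one operation issues per cycle.

```python
def simulate_issue_cycles(num_threads: int, ops_per_thread: int,
                          memory_latency: int) -> int:
    """Count cycles until every thread has issued all of its operations.

    Assumed model: an operation takes one issue cycle, after which the
    issuing thread sleeps for memory_latency cycles waiting on its data.
    One operation may issue per cycle, from the first ready thread.
    """
    remaining = [ops_per_thread] * num_threads
    wake_at = [0] * num_threads  # first cycle at which each thread is ready
    cycle = 0
    while any(remaining):
        for t in range(num_threads):
            if remaining[t] and wake_at[t] <= cycle:
                remaining[t] -= 1
                # The thread sleeps until its requested data returns.
                wake_at[t] = cycle + 1 + memory_latency
                break  # only one issue slot per cycle in this model
        cycle += 1
    return cycle


single = simulate_issue_cycles(num_threads=1, ops_per_thread=4, memory_latency=3)
multi = simulate_issue_cycles(num_threads=4, ops_per_thread=4, memory_latency=3)
```

Under these assumptions, one thread needs 13 cycles to issue 4 operations because the issue slot idles during each wait, whereas four threads issue 16 operations in 16 cycles because another ready thread fills each slot.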
[0295] In at least one embodiment, each execution unit in the execution units 2608A-2608N operates on an array of data elements. In at least one embodiment, the number of data elements is the "execution size," or the number of channels for an instruction. In at least one embodiment, an execution channel is a logical unit of execution for data element access, masking, and flow control within an instruction. In at least one embodiment, the number of channels may be independent of the number of physical arithmetic logic units (ALUs) or floating-point units (FPUs) for a particular graphics processor. In at least one embodiment, the execution units 2608A-2608N support integer and floating-point data types.
[0296] In at least one embodiment, the execution unit instruction set includes SIMD instructions. In at least one embodiment, various data elements can be stored in registers as packed data types, and the execution unit will process the various elements based on the data size of those elements. For example, in at least one embodiment, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (quad-word (QW) size data elements), eight separate 32-bit packed data elements (double-word (DW) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, in at least one embodiment, different vector widths and register sizes are possible.
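As an illustration of the element counts given above, the following sketch reinterprets one 256-bit register image as elements of each listed width. The helper `split_lanes` is hypothetical, and the little-endian element order is an assumption made for this example only; the description does not specify a byte order.

```python
REGISTER_BITS = 256  # vector width used in the example above


def split_lanes(register_bytes: bytes, lane_bits: int) -> list:
    """Split a 256-bit register image into unsigned elements of lane_bits."""
    if len(register_bytes) * 8 != REGISTER_BITS:
        raise ValueError("expected exactly 32 bytes (256 bits)")
    if lane_bits not in (8, 16, 32, 64):
        raise ValueError("lane width must be 8, 16, 32, or 64 bits")
    step = lane_bits // 8
    # Little-endian element layout is an assumption of this sketch.
    return [
        int.from_bytes(register_bytes[i:i + step], "little")
        for i in range(0, len(register_bytes), step)
    ]


register = bytes(range(32))  # 32 bytes of sample data = 256 bits

# Number of elements seen at each width: QW, DW, W, and B.
lane_counts = {bits: len(split_lanes(register, bits)) for bits in (64, 32, 16, 8)}
```

Here `lane_counts` evaluates to `{64: 4, 32: 8, 16: 16, 8: 32}`, matching the four, eight, sixteen, and thirty-two elements named in the text.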
[0297] In at least one embodiment, one or more execution units can be combined into a fused execution unit 2609A-2609N having thread control logic (2607A-2607N) common to the fused EUs. In at least one embodiment, multiple EUs can be fused into an EU group. In at least one embodiment, each EU in the fused EU group can be configured to execute a separate SIMD hardware thread. The number of EUs in the fused EU group can vary depending on the embodiment. In at least one embodiment, each EU can execute various SIMD widths, including but not limited to SIMD8, SIMD16, and SIMD32. In at least one embodiment, each fused graphics execution unit 2609A-2609N includes at least two execution units. For example, in at least one embodiment, the fused execution unit 2609A includes a first EU 2608A, a second EU 2608B, and thread control logic 2607A common to the first EU 2608A and the second EU 2608B. In at least one embodiment, thread control logic 2607A controls the threads executing on the fused graphics execution unit 2609A, thereby allowing each EU within the fused execution units 2609A-2609N to execute using a common instruction pointer register.
[0298] In at least one embodiment, one or more internal instruction caches (e.g., 2606) are included in the thread execution logic 2600 to cache thread instructions for the execution unit. In at least one embodiment, one or more data caches (e.g., 2612) are included to cache thread data during thread execution. In at least one embodiment, a sampler 2610 is included to provide texture sampling for 3D operations and media sampling for media operations. In at least one embodiment, the sampler 2610 includes dedicated texture or media sampling functions to process texture or media data during the sampling process before providing sampled data to the execution unit.
[0299] During execution, in at least one embodiment, the graphics and media pipelines send thread initiation requests to thread execution logic 2600 via thread creation and dispatch logic. In at least one embodiment, once a set of geometric objects has been processed and rasterized into pixel data, pixel processor logic (e.g., pixel shader logic, fragment shader logic, etc.) within shader processor 2602 is invoked to further compute output information and cause the results to be written to output surfaces (e.g., color buffer, depth buffer, stencil buffer, etc.). In at least one embodiment, the pixel shader or fragment shader computes the values of various vertex attributes to be interpolated across the rasterized object. In at least one embodiment, the pixel processor logic within shader processor 2602 then executes a pixel or fragment shader program supplied via an application programming interface (API). In at least one embodiment, to execute the shader program, shader processor 2602 dispatches threads to execution units (e.g., 2608A) via thread dispatcher 2604. In at least one embodiment, shader processor 2602 uses texture sampling logic in sampler 2610 to access texture data in a texture map stored in memory. In at least one embodiment, arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.
[0300] In at least one embodiment, data port 2614 provides a memory access mechanism for thread execution logic 2600 to output processed data to memory for further processing on the graphics processor output pipeline. In at least one embodiment, data port 2614 includes or is coupled to one or more cache memories (e.g., data cache 2612) to cache data for memory access via the data port.
[0301] As shown in Figure 26B, in at least one embodiment, the graphics execution unit 2608 may include an instruction fetch unit 2637, a general-purpose register file array (GRF) 2624, an architectural register file array (ARF) 2626, a thread arbiter 2622, a send unit 2630, a branch unit 2632, a set of SIMD floating-point units (FPUs) 2634, and, in at least one embodiment, a set of dedicated integer SIMD ALUs 2635. The GRF 2624 and ARF 2626 include the set of general-purpose register files and architectural register files associated with each concurrent hardware thread that may be active in the graphics execution unit 2608. In at least one embodiment, the architectural state of each thread is maintained in the ARF 2626, while data used during thread execution is stored in the GRF 2624. In at least one embodiment, the execution state of each thread, including the instruction pointer of each thread, may be stored in thread-specific registers in the ARF 2626.
[0302] In at least one embodiment, the graphics execution unit 2608 has an architecture that is a combination of simultaneous multithreading (SMT) and fine-grained interleaved multithreading (IMT). In at least one embodiment, the architecture has a modular configuration that can be fine-tuned at design time based on a target number of simultaneous threads and the number of registers per execution unit, wherein execution unit resources are logically allocated for executing multiple simultaneous threads.
[0303] In at least one embodiment, the graphics execution unit 2608 can co-issue multiple instructions, each of which can be a different instruction. In at least one embodiment, the thread arbiter 2622 of the graphics execution unit 2608 can dispatch instructions to one of the send unit 2630, the branch unit 2632, or the SIMD FPU 2634 for execution. In at least one embodiment, each execution thread can access 128 general-purpose registers in the GRF 2624, where each register can store 32 bytes and can be accessed as a SIMD 8-element vector of 32-bit data elements. In at least one embodiment, each execution unit thread can access 4KB of the GRF 2624, although the embodiments are not limited thereto, and more or fewer register resources may be provided in other embodiments. In at least one embodiment, a maximum of seven threads can execute simultaneously, although the number of threads per execution unit may also vary depending on the embodiment. In at least one embodiment where seven threads can each access 4KB, the GRF 2624 can store a total of 28KB. In at least one embodiment, flexible addressing modes can allow registers to be addressed together to effectively build wider registers or to represent strided rectangular block data structures.
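The register file figures in the preceding paragraph follow from simple arithmetic, which the following Python sketch works through. The constant and function names are hypothetical and chosen only for illustration; the numeric values are the ones stated in the paragraph.

```python
# Values stated in the paragraph; the names themselves are illustrative assumptions.
REGISTERS_PER_THREAD = 128   # general-purpose registers visible to one thread
BYTES_PER_REGISTER = 32      # each register stores 32 bytes
ELEMENT_BITS = 32            # width of one SIMD data element
MAX_THREADS = 7              # threads that can execute simultaneously


def grf_bytes_per_thread(registers=REGISTERS_PER_THREAD,
                         bytes_per_register=BYTES_PER_REGISTER):
    """Register file bytes accessible to a single thread."""
    return registers * bytes_per_register


def simd_elements_per_register(bytes_per_register=BYTES_PER_REGISTER,
                               element_bits=ELEMENT_BITS):
    """Number of data elements one register holds when viewed as a SIMD vector."""
    return (bytes_per_register * 8) // element_bits


def grf_total_bytes(threads=MAX_THREADS):
    """Total register file capacity when every thread has its own allocation."""
    return threads * grf_bytes_per_thread()
```

With the stated values, one thread sees 128 x 32 = 4096 bytes (4KB), one register holds eight 32-bit elements, and seven threads together account for 28672 bytes (28KB).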
[0304] In at least one embodiment, memory operations, sampler operations, and other longer-latency system communications are dispatched via “send” instructions executed by the send unit 2630. In at least one embodiment, branch instructions are dispatched to the dedicated branch unit 2632 to facilitate SIMD divergence and eventual convergence.
[0305] In at least one embodiment, the graphics execution unit 2608 includes one or more SIMD floating-point units (FPUs) 2634 to perform floating-point operations. In at least one embodiment, the one or more FPUs 2634 also support integer computation. In at least one embodiment, the one or more FPUs 2634 can perform up to M 32-bit floating-point (or integer) operations in SIMD, or up to 2M 16-bit integer or 16-bit floating-point operations in SIMD. In at least one embodiment, at least one FPU provides extended mathematical capabilities to support high-throughput transcendental mathematical functions and double-precision 64-bit floating-point operations. In at least one embodiment, a set of 8-bit integer SIMD ALUs 2635 is also present and can be specifically optimized to perform operations related to machine learning computations.
[0306] In at least one embodiment, an array of multiple instances of the graphics execution unit 2608 may be instantiated in a graphics sub-core group (e.g., a sub-slice). In at least one embodiment, the execution unit 2608 may execute instructions across multiple execution channels. In at least one embodiment, each thread executing on the graphics execution unit 2608 executes on a different channel.
[0307] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 615 are provided in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, some or all of the inference and / or training logic 615 may be incorporated into the execution logic 2600. Furthermore, in at least one embodiment, the inference and / or training operations described herein may be performed using logic other than that shown in Figure 6A or Figure 6B. In at least one embodiment, weight parameters may be stored in on-chip or off-chip memory and / or registers (shown or not shown) that configure the ALUs of execution logic 2600 to execute one or more machine learning algorithms, neural network architectures, use cases, or training techniques described herein.
[0308] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0309] Figure 27 illustrates a parallel processing unit (“PPU”) 2700 according to at least one embodiment. In at least one embodiment, the PPU 2700 is configured with machine-readable code that, if executed by the PPU 2700, causes the PPU 2700 to perform some or all of the processes and techniques described herein. In at least one embodiment, the PPU 2700 is a multi-threaded processor that is implemented on one or more integrated circuit devices and that utilizes multi-threading as a latency-hiding technique designed to process computer-readable instructions (also referred to as machine-readable instructions or simply instructions) on multiple threads in parallel. In at least one embodiment, a thread refers to a thread of execution and is an instance of a set of instructions configured to be executed by the PPU 2700. In at least one embodiment, the PPU 2700 is a graphics processing unit (“GPU”) configured to implement a graphics rendering pipeline for processing three-dimensional (“3D”) graphics data to generate two-dimensional (“2D”) image data for display on a display device, such as a liquid crystal display (“LCD”) device. In at least one embodiment, the PPU 2700 is used to perform computations, such as linear algebra operations and machine learning operations. Figure 27 shows an example parallel processor for illustrative purposes only, and it should be interpreted as a non-limiting example of a processor architecture contemplated within the scope of this disclosure, which may be supplemented and / or replaced by any suitable processor.
[0310] In at least one embodiment, one or more PPUs 2700 are configured to accelerate high-performance computing (“HPC”), data center, and machine learning applications. In at least one embodiment, the PPU 2700 is configured to accelerate deep learning systems and applications, including, but not limited to, the following examples: autonomous vehicle platforms, deep learning, high-accuracy speech, image, and text recognition systems, intelligent video analytics, molecular simulation, drug discovery, disease diagnosis, weather forecasting, big data analytics, astronomy, molecular dynamics simulation, financial modeling, robotics, factory automation, real-time language translation, online search optimization, personalized user recommendations, and more.
[0311] In at least one embodiment, the PPU 2700 includes, but is not limited to, an input / output (“I / O”) unit 2706, a front-end unit 2710, a scheduler unit 2712, a work allocation unit 2714, a hub 2716, a crossbar (“Xbar”) 2720, one or more general-purpose processing clusters (“GPCs”) 2718, and one or more partitioning units (“memory partitioning units”) 2722. In at least one embodiment, the PPU 2700 is connected to a host processor or other PPUs 2700 via one or more high-speed GPU interconnects (“GPU interconnects”) 2708. In at least one embodiment, the PPU 2700 is connected to a host processor or other peripheral devices via interconnect 2702. In at least one embodiment, the PPU 2700 is connected to local memory including one or more memory devices (“memory”) 2704. In at least one embodiment, the memory devices 2704 include, but are not limited to, one or more dynamic random access memory (“DRAM”) devices. In at least one embodiment, one or more DRAM devices are configured and / or configurable as a high-bandwidth memory (“HBM”) subsystem, with multiple DRAM dies stacked within each device.
[0312] In at least one embodiment, the high-speed GPU interconnect 2708 may refer to a wire-based multi-lane communication link that is used by systems to scale and that includes one or more PPUs 2700 combined with one or more central processing units (“CPUs”), supporting cache coherence between the PPUs 2700 and the CPUs, as well as CPU mastering. In at least one embodiment, the high-speed GPU interconnect 2708 transmits data and / or commands via a hub 2716 to other units of the PPU 2700, such as one or more copy engines, video encoders, video decoders, power management units, and / or other components that may not be explicitly shown in Figure 27.
[0313] In at least one embodiment, the I / O unit 2706 is configured to send and receive communications (e.g., commands, data) to and from a host processor (not shown in Figure 27) via the system bus 2702. In at least one embodiment, I / O unit 2706 communicates with the host processor directly via system bus 2702 or through one or more intermediate devices (e.g., memory bridges). In at least one embodiment, I / O unit 2706 may communicate with one or more other processors (e.g., one or more PPUs 2700) via system bus 2702. In at least one embodiment, I / O unit 2706 implements a Peripheral Component Interconnect Express (“PCIe”) interface for communication over a PCIe bus. In at least one embodiment, I / O unit 2706 implements interfaces for communicating with external devices.
[0314] In at least one embodiment, I / O unit 2706 decodes packets received via system bus 2702. In at least one embodiment, at least some packets represent commands configured to cause PPU 2700 to perform various operations. In at least one embodiment, I / O unit 2706 sends the decoded commands to various other units of PPU 2700 as specified by the commands. In at least one embodiment, the commands are sent to front-end unit 2710 and / or to hub 2716 or other units of PPU 2700, such as one or more copy engines, video encoders, video decoders, power management units, etc. (not explicitly shown in Figure 27). In at least one embodiment, I / O unit 2706 is configured to route communications between the various logical units of PPU 2700.
[0315] In at least one embodiment, a program executed by the host processor encodes a command stream in a buffer that provides a workload to the PPU 2700 for processing. In at least one embodiment, the workload includes instructions and data to be processed by those instructions. In at least one embodiment, the buffer is a region in memory that is accessible (e.g., read / write) by both the host processor and the PPU 2700. A host interface unit can be configured to access the buffer in system memory connected to the system bus 2702 via memory requests transmitted over the system bus 2702 by the I / O unit 2706. In at least one embodiment, the host processor writes a command stream to the buffer and then sends a pointer to the start of the command stream to the PPU 2700, such that the front-end unit 2710 receives pointers to one or more command streams, manages the one or more command streams, reads commands from the command streams, and forwards the commands to the respective units of the PPU 2700.
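The handoff described in the preceding paragraph (host writes a command stream into a shared buffer, passes a pointer, and the front-end unit reads and forwards commands) can be sketched in Python as follows. This is a software analogy only, not the hardware mechanism; the class names, the unit names, and the end-of-stream marker are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Command:
    target_unit: str  # hypothetical name of the unit that should receive the command
    payload: str      # hypothetical command contents


END_OF_STREAM = None  # assumed terminator; the source does not say how a stream ends


class SharedBuffer:
    """Stand-in for a memory region readable and writable by host and device."""

    def __init__(self) -> None:
        self.cells: List[Optional[Command]] = []

    def write_stream(self, commands: List[Command]) -> int:
        """Host side: append a command stream and return a pointer to its start."""
        pointer = len(self.cells)
        self.cells.extend(commands)
        self.cells.append(END_OF_STREAM)
        return pointer


class FrontEnd:
    """Device side: reads commands from a pointer and forwards them to units."""

    def __init__(self, buffer: SharedBuffer, units: Dict[str, List[str]]) -> None:
        self.buffer = buffer
        self.units = units  # each unit is modeled as a list of received payloads

    def consume(self, pointer: int) -> int:
        """Forward every command in the stream; return how many were forwarded."""
        forwarded = 0
        while self.buffer.cells[pointer] is not END_OF_STREAM:
            command = self.buffer.cells[pointer]
            self.units[command.target_unit].append(command.payload)
            pointer += 1
            forwarded += 1
        return forwarded
```

In this sketch only the pointer crosses from host to device; the commands themselves stay in the shared buffer, which mirrors the division of labor the paragraph describes.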
[0316] In at least one embodiment, front-end unit 2710 is coupled to scheduler unit 2712, which configures various GPCs 2718 to process tasks defined by one or more command streams. In at least one embodiment, scheduler unit 2712 is configured to track status information related to the various tasks managed by scheduler unit 2712, wherein the status information may indicate which GPC 2718 a task is assigned to, whether the task is active or inactive, the priority associated with the task, etc. In at least one embodiment, scheduler unit 2712 manages multiple tasks executed on one or more GPCs 2718.
[0317] In at least one embodiment, scheduler unit 2712 is coupled to work allocation unit 2714, which is configured to dispatch tasks for execution on GPCs 2718. In at least one embodiment, work allocation unit 2714 tracks multiple scheduled tasks received from scheduler unit 2712 and manages a pending task pool and an active task pool for each GPC 2718. In at least one embodiment, the pending task pool includes multiple slots (e.g., 32 slots) containing tasks assigned to a particular GPC 2718 for processing; the active task pool may include multiple slots (e.g., 4 slots) for tasks actively being processed by the GPC 2718, such that when a GPC 2718 completes execution of a task, that task is evicted from the active task pool of the GPC 2718, and one of the other tasks is selected from the pending task pool and scheduled for execution on the GPC 2718. In at least one embodiment, if an active task is idle on the GPC 2718, for example while waiting for a data dependency to be resolved, the active task is evicted from the GPC 2718 and returned to the pending task pool, while another task in the pending task pool is selected and scheduled for execution on the GPC 2718.
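The pending and active task pools described above can be modeled with a minimal Python sketch. The class name `GpcTaskPools` and its method names are hypothetical, and first-in-first-out selection from the pending pool is an assumption; the paragraph does not state how the next task is chosen. The slot counts default to the example values of 32 and 4.

```python
from collections import deque


class GpcTaskPools:
    """Toy model of the per-GPC pending and active task pools."""

    def __init__(self, pending_slots=32, active_slots=4):
        self.pending_slots = pending_slots
        self.active_slots = active_slots
        self.pending = deque()  # tasks waiting for an active slot
        self.active = []        # tasks currently being processed

    def submit(self, task):
        """Place a task in the pending pool; return False if the pool is full."""
        if len(self.pending) >= self.pending_slots:
            return False
        self.pending.append(task)
        return True

    def fill_active(self):
        """Move pending tasks into free active slots (FIFO order is assumed)."""
        while len(self.active) < self.active_slots and self.pending:
            self.active.append(self.pending.popleft())

    def complete(self, task):
        """Evict a finished task and schedule a pending task in its place."""
        self.active.remove(task)
        self.fill_active()

    def stall(self, task):
        """Evict an idle task, return it to the pending pool, schedule another."""
        self.active.remove(task)
        replacement = self.pending.popleft() if self.pending else None
        self.pending.append(task)
        if replacement is not None:
            self.active.append(replacement)
```

Taking the replacement out of the pending pool before returning the stalled task keeps the pending pool within its slot limit even when it was full at the moment of the stall.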
[0318] In at least one embodiment, the work allocation unit 2714 communicates with one or more GPCs 2718 via XBar 2720. In at least one embodiment, XBar 2720 is an interconnect network that couples many of the units of PPU 2700 to other units of PPU 2700, and can be configured to couple the work allocation unit 2714 to a specific GPC 2718. In at least one embodiment, one or more other units of PPU 2700 can also be connected to XBar 2720 via hub 2716.
[0319] In at least one embodiment, tasks are managed by scheduler unit 2712 and assigned to one of GPCs 2718 by work allocation unit 2714. GPCs 2718 are configured to process tasks and produce results. In at least one embodiment, results may be consumed by other tasks in GPCs 2718, routed to different GPCs 2718 via XBar 2720, or stored in memory 2704. In at least one embodiment, results may be written to memory 2704 via partitioning units 2722, which implement a memory interface for writing data to or reading data from memory 2704. In at least one embodiment, results may be transferred to another PPU 2700 or a CPU via high-speed GPU interconnect 2708. In at least one embodiment, PPU 2700 includes, but is not limited to, a number U of partitioning units 2722, where U is equal to the number of separate and distinct memory devices 2704 coupled to PPU 2700. In at least one embodiment, partitioning unit 2722 is described in more detail below in conjunction with Figure 29.
[0320] In at least one embodiment, the host processor executes a driver kernel that implements an application programming interface (API) that enables one or more applications executing on the host processor to schedule operations for execution on the PPU 2700. In at least one embodiment, multiple computing applications are executed concurrently by the PPU 2700, and the PPU 2700 provides isolation, Quality of Service (“QoS”), and independent address spaces for the multiple computing applications. In at least one embodiment, an application generates instructions (e.g., in the form of API calls) that cause the driver kernel to generate one or more tasks for execution by the PPU 2700, and the driver kernel outputs the tasks to one or more streams processed by the PPU 2700. In at least one embodiment, each task includes one or more groups of related threads, each of which may be referred to as a warp. In at least one embodiment, a warp includes multiple related threads (e.g., 32 threads) that can be executed in parallel. In at least one embodiment, cooperating threads may refer to multiple threads that include instructions for performing a task and that exchange data via shared memory. In at least one embodiment, threads and cooperating threads are described in more detail in conjunction with Figure 29.
[0321] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 615 are provided in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, the deep learning application processor is used to train a machine learning model (such as a neural network) to predict or infer information provided to the PPU 2700. In at least one embodiment, the PPU 2700 is used to infer or predict information based on a trained machine learning model (e.g., a neural network) that has been trained by another processor or system or by the PPU 2700. In at least one embodiment, the PPU 2700 can be used to perform one or more neural network use cases described herein.
[0322] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0323] Figure 28 illustrates a general-purpose processing cluster (“GPC”) 2800 according to at least one embodiment. In at least one embodiment, the GPC 2800 is the GPC 2718 of Figure 27. In at least one embodiment, each GPC 2800 includes, but is not limited to, multiple hardware units for processing tasks, and each GPC 2800 includes, but is not limited to, a pipeline manager 2802, a pre-raster operation unit (“preROP”) 2804, a raster engine 2808, a work distribution crossbar (“WDX”) 2816, a memory management unit (“MMU”) 2818, one or more data processing clusters (“DPCs”) 2806, and any suitable combination thereof.
[0324] In at least one embodiment, the operation of GPC 2800 is controlled by pipeline manager 2802. In at least one embodiment, pipeline manager 2802 manages the configuration of one or more DPCs 2806 to handle tasks assigned to GPC 2800. In at least one embodiment, pipeline manager 2802 configures at least one of the one or more DPCs 2806 to implement at least a portion of the graphics rendering pipeline. In at least one embodiment, DPC 2806 is configured to execute vertex shader programs on programmable streaming multiprocessor (“SM”) 2814. In at least one embodiment, pipeline manager 2802 is configured to route packets received from the work allocation unit to appropriate logic units within GPC 2800. In at least one embodiment, some packets may be routed to fixed-function hardware units in preROP 2804 and / or raster engine 2808, while other packets may be routed to DPC 2806 for processing by primitive engine 2812 or SM 2814. In at least one embodiment, pipeline manager 2802 configures at least one of DPCs 2806 to implement a neural network model and / or computation pipeline.
[0325] In at least one embodiment, the preROP unit 2804 is configured to route data generated by the raster engine 2808 and DPC 2806 to the raster operation (“ROP”) unit in the partition unit 2722, described in more detail above in conjunction with Figure 27. In at least one embodiment, the preROP unit 2804 is configured to perform optimizations for color blending, organize pixel data, perform address translation, etc. In at least one embodiment, the raster engine 2808 includes, but is not limited to, multiple fixed-function hardware units configured to perform various raster operations, and in at least one embodiment, the raster engine 2808 includes, but is not limited to, a setup engine, a coarse raster engine, a culling engine, a clipping engine, a fine raster engine, a tile aggregation engine, and any suitable combination thereof. In at least one embodiment, the setup engine receives transformed vertices and generates plane equations associated with the geometric primitives defined by the vertices; the plane equations are passed to the coarse raster engine to generate coverage information for the primitives (e.g., x, y coverage masks for tiles); the output of the coarse raster engine is passed to the culling engine, in which fragments associated with primitives that fail the z-test are culled, and then to the clipping engine, in which fragments located outside the view frustum are clipped. In at least one embodiment, the fragments that survive clipping and culling are passed to the fine raster engine to generate attributes of pixel fragments based on the plane equations generated by the setup engine. In at least one embodiment, the output of the raster engine 2808 includes fragments to be processed by any suitable entity (e.g., by a fragment shader implemented within the DPC 2806).
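To illustrate what a per-tile x, y coverage mask can look like, the following Python sketch evaluates a triangle against the pixel centers of one tile. It uses 2D edge functions, a common textbook formulation that is swapped in here for illustration; the source speaks only of plane equations and does not specify this method. The function names and the choice to count samples lying exactly on an edge as covered are assumptions.

```python
def edge(a, b, p):
    """Signed area test: which side of the directed edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def coverage_mask(triangle, tile_origin, tile_size):
    """Return a tile_size x tile_size grid of booleans, one per pixel center.

    A pixel is marked covered when its center lies inside the triangle or on
    one of its edges, regardless of the triangle's winding order.
    """
    v0, v1, v2 = triangle
    ox, oy = tile_origin
    mask = []
    for y in range(oy, oy + tile_size):
        row = []
        for x in range(ox, ox + tile_size):
            center = (x + 0.5, y + 0.5)
            values = (edge(v0, v1, center),
                      edge(v1, v2, center),
                      edge(v2, v0, center))
            inside = all(v >= 0 for v in values) or all(v <= 0 for v in values)
            row.append(inside)
        mask.append(row)
    return mask
```

For the triangle with vertices (0, 0), (4, 0), and (0, 4) and a 4 x 4 tile at the origin, the mask marks the pixels whose centers satisfy x + y <= 4 and leaves the opposite corner uncovered.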
[0326] In at least one embodiment, each DPC 2806 included in the GPC 2800 includes, but is not limited to, an M-pipeline controller (“MPC”) 2810; a primitive engine 2812; one or more SMs 2814; and any suitable combination thereof. In at least one embodiment, the MPC 2810 controls the operation of the DPC 2806, routing packets received from the pipeline manager 2802 to the appropriate units within the DPC 2806. In at least one embodiment, packets associated with vertices are routed to the primitive engine 2812, which is configured to retrieve vertex attributes associated with vertices from memory; conversely, packets associated with shader programs may be sent to the SM 2814.
[0327] In at least one embodiment, the SM 2814 includes, but is not limited to, a programmable streaming processor configured to process tasks represented by multiple threads. In at least one embodiment, the SM 2814 is multithreaded and configured to execute multiple threads (e.g., 32 threads) from a specific thread group concurrently, and implements a Single Instruction, Multiple Data (“SIMD”) architecture, wherein each thread in a group of threads (e.g., a warp) is configured to process a different set of data based on the same set of instructions. In at least one embodiment, all threads in the thread group execute the same instructions. In at least one embodiment, the SM 2814 implements a Single Instruction, Multiple Thread (“SIMT”) architecture, wherein each thread in a group of threads is configured to process a different set of data based on the same set of instructions, but wherein individual threads in the thread group are allowed to diverge during execution. In at least one embodiment, a program counter, call stack, and execution state are maintained for each warp, thereby enabling concurrency between warps and serial execution within a warp when threads in the warp diverge. In another embodiment, a program counter, call stack, and execution state are maintained for each individual thread, thereby enabling equal concurrency among all threads within and between warps. In at least one embodiment, an execution state is maintained for each individual thread, and threads executing the same instructions can be converged and executed in parallel to improve efficiency. At least one embodiment of the SM 2814 is described in more detail below.
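The serial execution of divergent paths within a warp, described above for the case where state is kept per warp, can be illustrated with a small Python model. The function name `run_warp_branch` and the boolean active mask are assumptions made for illustration; the sketch models a single if/else and is not a description of the actual hardware.

```python
def run_warp_branch(lane_data, predicate, then_op, else_op):
    """Model one if/else executed by a warp whose lanes may diverge.

    Lanes taking the same path run together in one pass; when the lanes
    disagree, the two paths run one after the other with inactive lanes
    masked off. Returns the per-lane results and the number of passes.
    """
    takes_then = [predicate(value) for value in lane_data]
    results = list(lane_data)
    passes = 0

    if any(takes_then):
        # Pass over the "then" path: only lanes whose predicate held are active.
        passes += 1
        results = [then_op(v) if active else v
                   for v, active in zip(results, takes_then)]

    if not all(takes_then):
        # Pass over the "else" path: only the remaining lanes are active.
        passes += 1
        results = [else_op(v) if not active else v
                   for v, active in zip(results, takes_then)]

    return results, passes
```

A warp whose lanes all agree on the predicate needs one pass, while a warp with mixed outcomes needs two, which is the cost of divergence that the per-warp scheme incurs.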
[0328] In at least one embodiment, the MMU 2818 provides an interface between the GPC 2800 and a memory partitioning unit (e.g., partitioning unit 2722 of Figure 27), and the MMU 2818 provides virtual address to physical address translation, memory protection, and memory request arbitration. In at least one embodiment, the MMU 2818 provides one or more translation lookaside buffers (“TLBs”) for performing virtual address to physical address translation in memory.
[0329] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 615 are provided in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, the deep learning application processor is used to train a machine learning model (such as a neural network) to predict or infer information provided to the GPC 2800. In at least one embodiment, the GPC 2800 is used to infer or predict information based on a machine learning model (e.g., a neural network) that has been trained by another processor or system or by the GPC 2800. In at least one embodiment, the GPC 2800 can be used to perform one or more neural network use cases described herein.
[0330] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0331] Figure 29 illustrates a memory partitioning unit 2900 of a parallel processing unit (“PPU”) according to at least one embodiment. In at least one embodiment, the memory partitioning unit 2900 includes, but is not limited to, a raster operation (“ROP”) unit 2902; a level two (“L2”) cache 2904; a memory interface 2906; and any suitable combination thereof. In at least one embodiment, the memory interface 2906 is coupled to memory. In at least one embodiment, the memory interface 2906 may implement a 32-, 64-, 128-, or 1024-bit data bus, or a similar implementation for high-speed data transfer. In at least one embodiment, the PPU includes U memory interfaces 2906, one memory interface 2906 per pair of partitioning units 2900, wherein each pair of partitioning units 2900 is connected to a corresponding memory device. For example, in at least one embodiment, the PPU may be connected to up to Y memory devices, such as high-bandwidth memory stacks or graphics double data rate, version 5, synchronous dynamic random access memory (“GDDR5 SDRAM”).
[0332] In at least one embodiment, the memory interface 2906 implements a high-bandwidth memory second-generation (“HBM2”) memory interface, and Y is equal to half of U. In at least one embodiment, the HBM2 memory stacks are located on the same physical package as the PPU, providing significant power and area savings compared to conventional GDDR5 SDRAM systems. In at least one embodiment, each HBM2 stack includes, but is not limited to, four memory dies, and Y is equal to 4, with each HBM2 stack including two 128-bit channels per die for a total of eight channels and a 1024-bit data bus width. In at least one embodiment, the memory supports Single-Error Correcting Double-Error Detecting (“SECDED”) error correction code (“ECC”) to protect data. In at least one embodiment, ECC provides higher reliability for computing applications that are sensitive to data corruption.
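The channel count and bus width stated above follow directly from the per-die figures, as the short Python sketch below shows. The constant and function names are hypothetical; the numeric values are those given in the paragraph.

```python
# Values stated in the paragraph; the names are illustrative assumptions.
DIES_PER_STACK = 4       # memory dies in one HBM2 stack
CHANNELS_PER_DIE = 2     # channels provided by each die
BITS_PER_CHANNEL = 128   # width of one channel


def channels_per_stack(dies=DIES_PER_STACK, channels_per_die=CHANNELS_PER_DIE):
    """Total number of channels exposed by one stack."""
    return dies * channels_per_die


def bus_width_bits(dies=DIES_PER_STACK, channels_per_die=CHANNELS_PER_DIE,
                   bits_per_channel=BITS_PER_CHANNEL):
    """Aggregate data bus width of one stack, in bits."""
    return channels_per_stack(dies, channels_per_die) * bits_per_channel
```

Four dies with two 128-bit channels each give eight channels and 8 x 128 = 1024 bits of bus width.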
[0333] In at least one embodiment, the PPU implements a multi-level memory hierarchy. In at least one embodiment, the memory partitioning unit 2900 supports unified memory to provide a single unified virtual address space for central processing unit (“CPU”) memory and PPU memory, thereby enabling data sharing between virtual memory systems. In at least one embodiment, the frequency of accesses by a PPU to memory located on other processors is tracked to ensure that memory pages are moved to the physical memory of the PPU that accesses the pages more frequently. In at least one embodiment, the high-speed GPU interconnect 2708 supports address translation services that allow the PPU to directly access the CPU's page tables and that give the PPU full access to CPU memory.
[0334] In at least one embodiment, copy engines transfer data between multiple PPUs or between a PPU and a CPU. In at least one embodiment, a copy engine can generate page faults for addresses that are not mapped into page tables, and memory partitioning unit 2900 then services the page faults, mapping the addresses into the page tables, after which the copy engine performs the transfer. In at least one embodiment, memory is pinned (i.e., non-pageable) for multiple copy engine operations between multiple processors, which substantially reduces available memory. In at least one embodiment, with hardware page faulting, addresses can be passed to the copy engines regardless of whether the memory pages are resident, and the copy process is transparent.
[0335] According to at least one embodiment, data from memory 2704 of Figure 27 or other system memory is retrieved by memory partitioning unit 2900 and stored in L2 cache 2904, which is located on-chip and shared among various GPCs. In at least one embodiment, each memory partitioning unit 2900 includes, but is not limited to, at least a portion of the L2 cache associated with the corresponding memory device. In at least one embodiment, lower-level caches are implemented in various units within a GPC. In at least one embodiment, each SM 2814 may implement a Level 1 (“L1”) cache, wherein the L1 cache is a private memory dedicated to a specific SM 2814, and data is retrieved from L2 cache 2904 and stored in each L1 cache for processing within the functional units of the SM 2814. In at least one embodiment, L2 cache 2904 is coupled to memory interface 2906 and XBar 2720.
[0336] In at least one embodiment, ROP unit 2902 performs graphics raster operations related to pixel color, such as color compression, pixel blending, etc. In at least one embodiment, ROP unit 2902 performs depth testing in conjunction with raster engine 2808, receiving a depth for a sample location associated with a pixel fragment from the culling engine of raster engine 2808. In at least one embodiment, the depth is tested against the corresponding depth in the depth buffer for the sample location associated with the fragment. In at least one embodiment, if the fragment passes the depth test for the sample location, ROP unit 2902 updates the depth buffer and sends the result of the depth test to raster engine 2808. It will be appreciated that the number of partition units 2900 may differ from the number of GPCs; therefore, each ROP unit 2902 may be coupled to each GPC in at least one embodiment. In at least one embodiment, ROP unit 2902 tracks packets received from different GPCs and determines to which GPC a result generated by ROP unit 2902 is routed through XBar 2720.
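The depth test described above can be sketched in Python as follows. The function name `depth_test`, the list-of-rows depth buffer, and the less-than comparison (a smaller depth means closer to the viewer) are assumptions made for illustration; the source does not state which comparison is used.

```python
def depth_test(depth_buffer, sample_location, fragment_depth):
    """Test a fragment's depth against the stored depth at its sample location.

    The buffer is updated only when the fragment passes. The boolean result
    stands in for the outcome that is reported back to the raster engine.
    """
    x, y = sample_location
    if fragment_depth < depth_buffer[y][x]:  # assumed "less than" comparison
        depth_buffer[y][x] = fragment_depth
        return True
    return False
```

A fragment that fails the test leaves the buffer untouched, so later fragments at the same sample location are still compared against the nearest depth seen so far.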
[0337] Figure 30 illustrates a streaming multiprocessor (“SM”) 3000 according to at least one embodiment. In at least one embodiment, the SM 3000 is the SM 2814 of Figure 28. In at least one embodiment, the SM 3000 includes, but is not limited to, an instruction cache 3002; one or more scheduler units 3004; a register file 3008; one or more processing cores (“cores”) 3010; one or more special function units (“SFUs”) 3012; one or more load / store units (“LSUs”) 3014; an interconnect network 3016; a shared memory / Level 1 (“L1”) cache 3018; and any suitable combination thereof. In at least one embodiment, the work allocation unit schedules tasks for execution on general-purpose processing clusters (“GPCs”) of a parallel processing unit (“PPU”), each task is assigned to a specific data processing cluster (“DPC”) within a GPC, and, if the task is associated with a shader program, the task is assigned to one of the SMs 3000. In at least one embodiment, scheduler unit 3004 receives tasks from the work allocation unit and manages instruction scheduling for one or more thread blocks assigned to the SM 3000. In at least one embodiment, scheduler unit 3004 schedules thread blocks for execution as warps of parallel threads, wherein each thread block is assigned at least one warp. In at least one embodiment, each warp executes threads. In at least one embodiment, scheduler unit 3004 manages multiple different thread blocks, assigns warps to the different thread blocks, and then dispatches instructions from multiple different cooperative groups to various functional units (e.g., processing cores 3010, SFUs 3012, and LSUs 3014) in each clock cycle.
[0338] In at least one embodiment, a cooperative group can refer to a programming model for organizing groups of communicating threads, allowing developers to express the granularity at which threads are communicating, thereby enabling richer and more efficient parallel decompositions. In at least one embodiment, a cooperative launch API supports synchronization between thread blocks to execute parallel algorithms. In at least one embodiment, applications using a conventional programming model provide a single, simple construct for synchronizing cooperating threads: a barrier (e.g., the `__syncthreads()` function) across all threads in a thread block. However, in at least one embodiment, programmers can define thread groups at a granularity smaller than that of thread blocks and synchronize within the defined groups to achieve higher performance, design flexibility, and software reuse in the form of a set of group-wide functional interfaces. In at least one embodiment, cooperative groups enable programmers to explicitly define thread groups at sub-block (i.e., as small as a single thread) and multi-block granularity and to perform collective operations, such as synchronizing the threads within a cooperative group. In at least one embodiment, this programming model supports clean composition across software boundaries, allowing library and utility functions to synchronize safely within their local context without having to make assumptions about convergence. In at least one embodiment, the cooperative group primitives enable new patterns of cooperative parallelism, including but not limited to producer-consumer parallelism, opportunistic parallelism, and global synchronization across an entire grid of thread blocks.
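The difference between a single block-wide barrier and synchronization within smaller, programmer-defined groups can be sketched with ordinary host threads. The Python example below is only an analogy: it uses `threading.Barrier` from the standard library, not the GPU cooperative groups interface, and `BLOCK_SIZE`, `GROUP_SIZE`, and `worker` are hypothetical names chosen for illustration.

```python
import threading

BLOCK_SIZE = 8   # hypothetical number of threads in one block
GROUP_SIZE = 4   # hypothetical sub-block group size

num_groups = BLOCK_SIZE // GROUP_SIZE
# One barrier per group instead of a single barrier for the whole block.
group_barriers = [threading.Barrier(GROUP_SIZE) for _ in range(num_groups)]
shared = [0] * BLOCK_SIZE          # stands in for per-block shared memory
group_sums = [None] * BLOCK_SIZE   # one result slot per thread


def worker(tid):
    group = tid // GROUP_SIZE
    shared[tid] = tid + 1                # phase 1: publish a value
    group_barriers[group].wait()         # synchronize only within this group
    start = group * GROUP_SIZE
    # Phase 2: safe to read the values written by the other group members.
    group_sums[tid] = sum(shared[start:start + GROUP_SIZE])


threads = [threading.Thread(target=worker, args=(t,)) for t in range(BLOCK_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each group waits only for its own members, so one group can proceed to its second phase without waiting for the other; this is the granularity benefit the paragraph describes.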
[0339] In at least one embodiment, the scheduling unit 3006 is configured to send instructions to one or more functional units, and the scheduler unit 3004 includes, but is not limited to, two scheduling units 3006, which enable two different instructions from the same thread bundle to be scheduled in each clock cycle. In at least one embodiment, each scheduler unit 3004 includes a single scheduling unit 3006 or additional scheduling units 3006.
[0340] In at least one embodiment, each SM 3000 includes, but is not limited to, a register file 3008 that provides a set of registers for functional units of the SM 3000. In at least one embodiment, the register file 3008 is partitioned between each functional unit, thereby allocating a dedicated portion of the register file 3008 for each functional unit. In at least one embodiment, the register file 3008 is partitioned between different thread bundles executed by the SM 3000, and the register file 3008 provides temporary storage for operands in data paths connected to functional units. In at least one embodiment, each SM 3000 includes, but is not limited to, a plurality of L processing cores 3010. In at least one embodiment, the SM 3000 includes, but is not limited to, a large number (e.g., 128 or more) of different processing cores 3010. In at least one embodiment, each processing core 3010 includes, but is not limited to, a fully pipelined, single-precision, double-precision, and / or mixed-precision processing unit, which includes, but is not limited to, a floating-point arithmetic logic unit and an integer arithmetic logic unit. In at least one embodiment, the floating-point arithmetic logic unit implements the IEEE 754-2008 standard for floating-point arithmetic. In at least one embodiment, the processing core 3010 includes, but is not limited to, 64 single-precision (32-bit) floating-point cores, 64 integer cores, 32 double-precision (64-bit) floating-point cores and 8 tensor cores.
[0341] According to at least one embodiment, tensor cores are configured to perform matrix operations. In at least one embodiment, one or more tensor cores are included in processing core 3010. In at least one embodiment, tensor cores are configured to perform deep learning matrix arithmetic, such as convolution operations for neural network training and inference. In at least one embodiment, each tensor core operates on a 4×4 matrix and performs matrix multiplication and accumulation operations D = A×B + C, where A, B, C, and D are 4×4 matrices.
[0342] In at least one embodiment, matrix multiplication inputs A and B are 16-bit floating-point matrices, and accumulation matrices C and D are either 16-bit or 32-bit floating-point matrices. In at least one embodiment, the tensor core performs 32-bit floating-point accumulation on the 16-bit floating-point input data. In at least one embodiment, the 16-bit floating-point multiplication uses 64 operations to obtain a full-precision product, which is then accumulated with other intermediate products using 32-bit floating-point addition to perform a 4×4×4 matrix multiplication. In at least one embodiment, the tensor core is used to perform matrix operations on larger two-dimensional or higher-dimensional matrices composed of these smaller components. In at least one embodiment, APIs (such as the CUDA 9 C++ API) expose specialized matrix load, matrix multiply-and-accumulate, and matrix store operations to efficiently utilize the tensor core from CUDA-C++ programs. In at least one embodiment, at the CUDA level, the thread-bundle-level (warp-level) interface assumes 16×16 matrices spanning all 32 threads of the thread bundle.
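The mixed-precision operation D = A×B + C described above can be emulated on a CPU to make the arithmetic concrete. The following Python sketch is an illustration only, not the hardware data path: it rounds the inputs to 16-bit floating point with the standard `struct` module, keeps the running sum in 32-bit floating point, and counts the multiplications. The helper names `to_fp16`, `to_fp32`, and `mma_4x4` are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_4x4(a, b, c):
    """Compute D = A x B + C for 4x4 matrices given as nested lists.

    A and B are rounded to FP16 first; the running sum is rounded to
    FP32 after every addition to mimic 32-bit accumulation.
    """
    n = 4
    multiplies = 0
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exactly representable here.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)
                multiplies += 1
            d[i][j] = acc
    return d, multiplies


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]
d, multiplies = mma_4x4(a, identity, c)
```

The counter reaches 4 × 4 × 4 = 64, which matches the 64 multiplication operations mentioned for one 4×4×4 matrix multiplication.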
[0343] In at least one embodiment, each SM 3000 includes, but is not limited to, M SFUs 3012 that perform special functions (e.g., attribute evaluation, inverse square root, etc.). In at least one embodiment, an SFU 3012 includes, but is not limited to, a tree traversal unit configured to traverse a hierarchical tree data structure. In at least one embodiment, an SFU 3012 includes, but is not limited to, a texture unit configured to perform texture map filtering operations. In at least one embodiment, a texture unit is configured to load a texture map (e.g., a 2D array of texels) from memory and sample the texture map to produce sampled texture values for use by a shader program executed by the SM 3000. In at least one embodiment, the texture map is stored in shared memory / L1 cache 3018. In at least one embodiment, the texture unit uses mip-maps (e.g., texture maps with different levels of detail) to implement texture operations (such as filtering operations). In at least one embodiment, each SM 3000 includes, but is not limited to, two texture units.
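To make the idea of texture maps with different levels of detail concrete, the Python sketch below builds a mip chain by repeatedly halving a texture. It assumes a square texture whose side is a power of two and a simple 2×2 box filter; the description does not state which filter the texture unit uses, and `downsample` and `build_mip_chain` are hypothetical names.

```python
def downsample(level):
    """Halve a square texture by averaging each 2x2 block of texels."""
    half = len(level) // 2
    return [
        [
            (level[2 * y][2 * x] + level[2 * y][2 * x + 1]
             + level[2 * y + 1][2 * x] + level[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(half)
        ]
        for y in range(half)
    ]


def build_mip_chain(texture):
    """Return [level 0, level 1, ...] down to a single texel."""
    chain = [texture]
    while len(chain[-1]) > 1:
        chain.append(downsample(chain[-1]))
    return chain


# A 4x4 single-channel texture with a horizontal gradient.
texture = [[0.0, 4.0, 8.0, 12.0] for _ in range(4)]
mips = build_mip_chain(texture)
```

A sampler would then pick the level whose texel size best matches the on-screen footprint of the surface being shaded.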
[0344] In at least one embodiment, each SM 3000 includes, but is not limited to, N LSUs 3014 that implement load and store operations between the shared memory / L1 cache 3018 and the register file 3008. In at least one embodiment, each SM 3000 includes, but is not limited to, an interconnect network 3016 that connects each functional unit to the register file 3008 and connects the LSUs 3014 to both the register file 3008 and the shared memory / L1 cache 3018. In at least one embodiment, the interconnect network 3016 is a crossbar switch that can be configured to connect any functional unit to any register in the register file 3008 and connect the LSUs 3014 to memory locations in both the register file 3008 and the shared memory / L1 cache 3018.
[0345] In at least one embodiment, the shared memory / L1 cache 3018 is an array of on-chip memory that allows data storage and communication between the SM 3000 and the primitive engine, as well as between threads within the SM 3000. In at least one embodiment, the shared memory / L1 cache 3018 includes, but is not limited to, a storage capacity of 128KB and is located on the path from the SM 3000 to the partition unit. In at least one embodiment, the shared memory / L1 cache 3018 is used to cache reads and writes. In at least one embodiment, one or more of the shared memory / L1 cache 3018, the L2 cache, and the memory are backing stores.
[0346] In at least one embodiment, combining data caching and shared memory functionality into a single memory block provides improved performance for both types of memory access. In at least one embodiment, the capacity is used, or is usable, as a cache by programs that do not use shared memory; for example, if shared memory is configured to use half of the capacity, texture and load / store operations can use the remaining capacity. According to at least one embodiment, integration within the shared memory / L1 cache 3018 enables the shared memory / L1 cache 3018 to be used as a high-throughput pipeline for streaming data, while providing high-bandwidth and low-latency access to frequently reused data. In at least one embodiment, when configured for general-purpose parallel computing, a simpler configuration can be used compared to graphics processing. In at least one embodiment, fixed-function graphics processing units are bypassed, creating a simpler programming model. In at least one embodiment, in a general-purpose parallel computing configuration, the work allocation unit directly allocates and distributes blocks of threads to the DPC. In at least one embodiment, threads within a block execute the same program, using unique thread IDs in computation to ensure each thread produces unique results, using an SM 3000 to execute the program and perform computations, using a shared memory / L1 cache 3018 for communication between threads, and using an LSU 3014 to read and write global memory via the shared memory / L1 cache 3018 and memory partitioning units. In at least one embodiment, when configured for general-purpose parallel computing, the SM 3000 writes commands to the scheduler unit 3004 that can be used to start new work on the DPC.
[0347] In at least one embodiment, the PPU is included in or coupled to a desktop computer, laptop computer, tablet computer, server, supercomputer, smartphone (e.g., wireless, handheld device), personal digital assistant (“PDA”), digital camera, vehicle, head-mounted display, handheld electronic device, etc. In at least one embodiment, the PPU is implemented on a single semiconductor substrate. In at least one embodiment, the PPU is included in a system-on-a-chip (“SoC”) along with one or more other devices (e.g., additional PPUs, memory, reduced instruction set computer (“RISC”) CPU, one or more memory management units (“MMU”), digital-to-analog converters (“DAC”), etc.).
[0348] In at least one embodiment, the PPU may be included on a graphics card that includes one or more storage devices. The graphics card may be configured to connect to a PCIe slot on a desktop computer motherboard. In at least one embodiment, the PPU may be an integrated graphics processing unit (“iGPU”) included in the motherboard's chipset.
[0349] The inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. Details regarding the inference and / or training logic 615 are provided herein in conjunction with Figure 6A and / or Figure 6B. In at least one embodiment, the deep learning application processor is used to train a machine learning model (such as a neural network) to predict or infer information provided to the SM 3000. In at least one embodiment, the SM 3000 is used to infer or predict information based on a machine learning model (e.g., a neural network) that has been trained by another processor or system or by the SM 3000. In at least one embodiment, the SM 3000 can be used to perform one or more neural network use cases described herein.
[0350] Inference and / or training logic 615 is used to perform inference and / or training operations associated with one or more embodiments. In at least one embodiment, this logic may be used with components of these figures to generate one or more images having visual effects approximating those applied to one or more lower-resolution versions of these images.
[0351] In at least one embodiment, a single semiconductor platform may refer to a unique, single semiconductor-based integrated circuit or chip. In at least one embodiment, a multi-chip module with increased connectivity can be used, simulating on-chip operations and representing a substantial improvement over implementations utilizing a conventional central processing unit (“CPU”) and bus. In at least one embodiment, the various modules may also be placed separately or in various combinations of semiconductor platforms, depending on the user's needs.
[0352] In at least one embodiment, a computer program in the form of machine-readable executable code or computer control logic algorithms is stored in main memory 1004 and / or secondary storage. According to at least one embodiment, if executed by one or more processors, the computer program enables system 1000 to perform various functions. In at least one embodiment, memory 1004, storage, and / or any other storage are possible examples of computer-readable media. In at least one embodiment, secondary storage can refer to any suitable storage device or system, such as hard disk drives and / or removable storage drives, representing floppy disk drives, magnetic tape drives, optical disk drives, digital versatile disc (“DVD”) drives, recording devices, universal serial bus (“USB”) flash memory, etc. In at least one embodiment, the architecture and / or functionality of the various preceding figures are implemented within the context of CPU 1002; parallel processing system 1012; an integrated circuit capable of at least a portion of the capabilities of both CPU 1002 and parallel processing system 1012; a chipset (e.g., a set of integrated circuits designed to work and be sold as a unit for performing related functions); and any suitable combination of integrated circuits.
[0353] In at least one embodiment, the architecture and / or functionality of the various preceding figures are implemented in an environment of a general-purpose computer system, a circuit board system, a game console system dedicated to entertainment purposes, a special-purpose system, etc. In at least one embodiment, the computer system 1000 may take the form of a desktop computer, a laptop computer, a tablet computer, a server, a supercomputer, a smartphone (e.g., a wireless, handheld device), a personal digital assistant (“PDA”), a digital camera, a vehicle, a head-mounted display, a handheld electronic device, a mobile phone device, a television, a workstation, a game console, an embedded system, and / or any other type of logic.
[0354] In at least one embodiment, the parallel processing system 1012 includes, but is not limited to, a plurality of parallel processing units (“PPUs”) 1014 and associated memory 1016. In at least one embodiment, the PPUs 1014 are connected to a host processor or other peripheral device via interconnects 1018 and switches 1020 or multiplexers. In at least one embodiment, the parallel processing system 1012 distributes parallelizable computational tasks across the PPUs 1014, for example, as part of distributing computational tasks across the thread blocks of multiple graphics processing units (“GPUs”). In at least one embodiment, memory is shared and accessed (e.g., for read and / or write access) among some or all of the PPUs 1014, although such shared memory may incur a performance penalty relative to the use of local memory and registers residing on the PPUs 1014. In at least one embodiment, the operation of the PPUs 1014 is synchronized using commands such as __syncthreads(), wherein all threads in a block (e.g., executing across multiple PPUs 1014) must reach a certain point of code execution before proceeding.
[0355] Virtualized computing platform
[0356] Embodiments of a virtualized computing platform for advanced computing, such as image inference and image processing, are disclosed. Referring to Figure 31, an example data flow diagram of process 3100 for generating and deploying an image processing and inference pipeline is shown, according to at least one embodiment. In at least one embodiment, process 3100 can be deployed for use with imaging equipment, processing equipment, genomics equipment, gene sequencing equipment, radiology equipment, and / or other equipment types at one or more facilities 3102, such as medical facilities, hospitals, medical institutions, clinics, research or diagnostic laboratories, etc. In at least one embodiment, process 3100 can be deployed to perform genomic analysis and inference on sequencing data. Examples of genomic analyses that can be performed using the systems and methods described herein include, but are not limited to, variant calling, mutation detection, and gene expression quantification. Process 3100 can be executed within training system 3104 and / or deployment system 3106. In at least one embodiment, training system 3104 can be used to train, deploy, and implement machine learning models (e.g., neural networks, object detection algorithms, computer vision algorithms, etc.) for use in deployment system 3106. In at least one embodiment, deployment system 3106 can be configured to offload processing and computing resources between distributed computing environments to reduce infrastructure requirements at facility 3102. In at least one embodiment, deployment system 3106 may provide a streamlined platform for selecting, customizing, and implementing virtual instruments at facility 3102 for use with imaging devices (e.g., MRI, CT scans, X-rays, ultrasound, etc.) or sequencing devices.
In at least one embodiment, the virtual instrument may include software-defined applications for performing one or more processing operations on imaging data generated by the imaging device, sequencing device, radiology device, and / or other device types. In at least one embodiment, one or more applications in the pipeline may use or invoke services of deployment system 3106 (e.g., inference, visualization, computation, AI, etc.) during application execution.
[0357] In at least one embodiment, some applications in the advanced processing and inference pipeline may use machine learning models or other AI to perform one or more processing steps. In at least one embodiment, the machine learning model may be trained at facility 3102 using data 3108 (such as imaging data) generated at facility 3102 (and stored on one or more Picture Archiving and Communication System (PACS) servers at facility 3102), or using imaging or sequencing data 3108 from one or more other facilities (e.g., different hospitals, laboratories, clinics, etc.), or a combination thereof. In at least one embodiment, the training system 3104 may be used to provide applications, services, and / or other resources for generating working, deployable machine learning models for deployment system 3106.
[0358] In at least one embodiment, the model registry 3124 may be backed by object storage capable of supporting versioning and object metadata. In at least one embodiment, the object storage may be accessed from within a cloud platform through, for example, a cloud storage-compatible application programming interface (API) (e.g., that of cloud 3226 of Figure 32). In at least one embodiment, machine learning models in model registry 3124 can be uploaded, listed, modified, or deleted by developers or partners of the system interacting with the API. In at least one embodiment, the API can provide access to methods that allow a user with appropriate credentials to associate a model with an application, enabling the model to be executed as part of the containerized instantiation of the application.
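The registry operations named above (upload, list, modify, delete, and associating a model with an application) can be sketched as a small in-memory class. This is a minimal illustration under stated assumptions, not the API of any particular cloud storage service: the class `ModelRegistry`, its method names, and the boolean `authorized` flag standing in for a credential check are all hypothetical.

```python
class ModelRegistry:
    """In-memory stand-in for object storage with versioning and metadata."""

    def __init__(self):
        self._versions = {}   # model name -> list of {"payload", "metadata"}
        self._bindings = {}   # application name -> (model name, version)

    def upload(self, name, payload, **metadata):
        """Store a new version and return its 1-based version number."""
        entries = self._versions.setdefault(name, [])
        entries.append({"payload": payload, "metadata": dict(metadata)})
        return len(entries)

    def list_models(self):
        """Map each model name to its number of stored versions."""
        return {name: len(entries) for name, entries in self._versions.items()}

    def modify(self, name, version, **metadata):
        """Update the metadata of one stored version."""
        self._versions[name][version - 1]["metadata"].update(metadata)

    def metadata(self, name, version):
        return dict(self._versions[name][version - 1]["metadata"])

    def delete(self, name):
        """Remove a model and any application bindings that point to it."""
        del self._versions[name]
        self._bindings = {
            app: b for app, b in self._bindings.items() if b[0] != name
        }

    def associate(self, application, name, version, authorized):
        """Bind a model version to an application if the caller is authorized."""
        if not authorized:
            raise PermissionError("appropriate credentials are required")
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError((name, version))
        self._bindings[application] = (name, version)

    def model_for(self, application):
        return self._bindings.get(application)


registry = ModelRegistry()
v1 = registry.upload("segmenter", b"weights-v1", modality="CT")
v2 = registry.upload("segmenter", b"weights-v2", modality="CT")
registry.modify("segmenter", v2, validated=True)
registry.associate("organ-seg-app", "segmenter", v2, authorized=True)
```

Keeping every uploaded version lets an application stay pinned to a specific model version while newer versions are added alongside it.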
[0359] In at least one embodiment, training pipeline 3204 (Figure 32) may include a scenario where facility 3102 is training its own machine learning model or has an existing machine learning model that needs to be optimized or updated. In at least one embodiment, imaging data 3108 generated by one or more imaging devices, sequencing devices, and / or other device types may be received. In at least one embodiment, once the imaging data 3108 is received, AI-assisted annotation 3110 may be used to help generate annotations corresponding to the imaging data 3108 for use as ground truth data for machine learning models. In at least one embodiment, AI-assisted annotation 3110 may include one or more machine learning models (e.g., convolutional neural networks (CNNs)) that may be trained to generate annotations corresponding to certain types of imaging data 3108 (e.g., from certain devices) and / or certain types of anomalies in the imaging data 3108. In at least one embodiment, AI-assisted annotation 3110 may then be used directly or may be adjusted or fine-tuned using annotation tools (e.g., by researchers, clinicians, doctors, scientists, etc.) to generate ground truth data. In at least one embodiment, labeled clinical data 3112 (e.g., annotations provided by clinicians, doctors, scientists, technicians, etc.) can be used as ground truth data for training a machine learning model. In at least one embodiment, AI-assisted annotations 3110, labeled clinical data 3112, or a combination thereof can be used as ground truth data for training a machine learning model. In at least one embodiment, the trained machine learning model can be referred to as output model 3116 and can be used by deployment system 3106, as described herein.
[0360] In at least one embodiment, training pipeline 3204 (Figure 32) may include a scenario where facility 3102 requires a machine learning model to perform one or more processing tasks for one or more applications in deployment system 3106, but facility 3102 may not currently have such a machine learning model (or may not have an optimized, efficient, or effective model for such a purpose). In at least one embodiment, an existing machine learning model can be selected from model registry 3124. In at least one embodiment, model registry 3124 may include machine learning models trained to perform various inference tasks on imaging data. In at least one embodiment, the machine learning models in model registry 3124 may have already been trained on imaging data from facilities other than facility 3102 (e.g., remote facilities). In at least one embodiment, the machine learning models may have already been trained on imaging data from one location, two locations, or any number of locations. In at least one embodiment, when training on imaging data from a particular location, training may occur at that location, or at least in a manner that protects the confidentiality of the imaging data or restricts the off-site transmission of the imaging data (e.g., in compliance with HIPAA regulations, privacy regulations, etc.). In at least one embodiment, once the model has been trained or partially trained at one location, the machine learning model can be added to the model registry 3124. In at least one embodiment, the machine learning model can then be retrained or updated at any number of other facilities, and the retrained or updated model can be made available in the model registry 3124. In at least one embodiment, the machine learning model can then be selected from the model registry 3124, referred to as output model 3116, and used in deployment system 3106 to perform one or more processing tasks for one or more applications of the deployment system.
[0361] In at least one embodiment, training pipeline 3204 (Figure 32) may include a scenario where facility 3102 requires a machine learning model to perform one or more processing tasks for one or more applications in deployment system 3106, but facility 3102 may not currently have such a machine learning model (or may not have an optimized, efficient, or effective model for such purposes). In at least one embodiment, the machine learning model selected from model registry 3124 may not be fine-tuned or optimized for the imaging data 3108 generated at facility 3102 due to population differences, genetic variations, robustness of the training data used to train the machine learning model, diversity of anomalies in the training data, and / or other problems with the training data. In at least one embodiment, AI-assisted annotation 3110 may be used to help generate annotations corresponding to imaging data 3108, which are used as ground truth data for retraining or updating the machine learning model. In at least one embodiment, labeled clinical data 3112 (e.g., annotations provided by clinicians, doctors, scientists, etc.) may be used as ground truth data for training the machine learning model. In at least one embodiment, retraining or updating the machine learning model may be referred to as model training 3114. In at least one embodiment, model training 3114 (e.g., AI-assisted annotation 3110, labeled clinical data 3112, or a combination thereof) can be used as ground truth data for retraining or updating the machine learning model. In at least one embodiment, the trained machine learning model can be referred to as output model 3116 and can be used by deployment system 3106, as described herein.
[0362] In at least one embodiment, deployment system 3106 may include software 3118, service 3120, hardware 3122, and / or other components, features, and functions. In at least one embodiment, deployment system 3106 may include a software "stack" such that software 3118 can be built on top of service 3120 and can use service 3120 to perform some or all of the processing tasks, and service 3120 and software 3118 can be built on top of hardware 3122 and use hardware 3122 to perform processing, storage, and / or other computational tasks of deployment system 3106. In at least one embodiment, software 3118 may include any number of different containers, each of which can execute an instantiation of an application. In at least one embodiment, each application can perform one or more processing tasks (e.g., inference, object detection, feature detection, segmentation, image enhancement, calibration, etc.) in the advanced processing and inference pipeline. In at least one embodiment, for each type of imaging device (e.g., CT, MRI, X-ray, ultrasound, sonography, echocardiography, etc.), sequencing device, radiology device, genomics device, etc., any number of containers may exist that can perform data processing tasks relative to the imaging data 3108 (or other data types, such as those described herein) generated by the device.
In at least one embodiment, the advanced processing and inference pipeline may be defined based on the selection of different containers desired or required for processing the imaging data 3108, in addition to containers that receive and configure imaging data for use by each container and / or for use by facility 3102 after processing through the pipeline (e.g., converting the output back to a usable data type, such as Digital Imaging and Communications in Medicine (DICOM) data, Radiology Information System (RIS) data, Clinical Information System (CIS) data, Remote Procedure Call (RPC) data, data substantially conforming to a Representational State Transfer (REST) interface, data substantially conforming to a file-based interface, and / or raw data, for storage and display at facility 3102). In at least one embodiment, a combination of containers within software 3118 (e.g., a combination of containers that make up a pipeline) may be referred to as a virtual instrument (as described in more detail herein), and the virtual instrument may utilize service 3120 and hardware 3122 to perform some or all of the processing tasks of the applications instantiated in the containers.
[0363] In at least one embodiment, the data processing pipeline may receive input data (e.g., imaging data 3108) in DICOM, RIS, CIS, REST-compatible, RPC, raw, and / or other formats in response to an inference request (e.g., a request from a user (such as a clinician, physician, radiologist, etc.) of deployment system 3106). In at least one embodiment, the input data may represent one or more images, videos, and / or other data representations generated by one or more imaging devices, sequencing devices, radiology devices, genomics devices, and / or other device types. In at least one embodiment, the data may undergo preprocessing as part of the data processing pipeline to prepare the data for processing by one or more applications. In at least one embodiment, postprocessing may be performed on the output of one or more inference tasks or other processing tasks of the pipeline to prepare output data for the next application and / or to prepare output data for user transmission and / or use (e.g., in response to an inference request). In at least one embodiment, the inference task may be performed by one or more machine learning models, such as trained or deployed neural networks, which may include the output model 3116 of training system 3104.
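The preprocessing, inference, and postprocessing stages described above can be pictured as functions chained in order. The Python sketch below is a minimal illustration: the stage names, the threshold rule standing in for a trained model, and the 8-bit input values are hypothetical and are not taken from the description.

```python
def preprocess(pixels):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def infer(values, threshold=0.5):
    """Placeholder for a trained model: flag values above a threshold."""
    return [v > threshold for v in values]


def postprocess(flags):
    """Package the raw model output into a result for the requesting user."""
    return {
        "flagged_indices": [i for i, flagged in enumerate(flags) if flagged],
        "count": sum(flags),
    }


def run_pipeline(data, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        data = stage(data)
    return data


# An inference request carrying four hypothetical 8-bit intensity values.
result = run_pipeline([0, 64, 200, 255], [preprocess, infer, postprocess])
```

Because each stage only depends on the output of the previous one, a stage can be swapped (for example, replacing `infer` with a call to a deployed model) without changing the others.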
[0364] In at least one embodiment, the tasks of the data processing pipeline can be encapsulated in one or more containers, each representing a discrete, fully functional instantiation of an application capable of referencing a machine learning model and a virtualized computing environment. In at least one embodiment, a container or application can be published to a private (e.g., restricted access) area of a container registry (described in more detail herein), and trained or deployed models can be stored in a model registry 3124 and associated with one or more applications. In at least one embodiment, an image of the application (e.g., a container image) can be available in the container registry, and once selected by a user from the container registry for deployment in the pipeline, the image can be used to generate containers to provide instantiations of the application for use by the user's system.
[0365] In at least one embodiment, a developer (e.g., a software developer, clinician, physician, etc.) may develop, publish, and store an application (e.g., as a container) for performing image processing and / or inference on the provided data. In at least one embodiment, development, publication, and / or storage may be performed using a software development kit (SDK) associated with the system (e.g., to ensure that the developed application and / or container is compatible with the system). In at least one embodiment, the developed application may be tested locally (e.g., at a first facility, on data from the first facility) using an SDK that supports at least some of the services 3120 as a system (e.g., system 3200 of Figure 32). In at least one embodiment, because DICOM objects can contain anywhere from one to hundreds of images or other data types, and due to variations in data, the developer can be responsible for managing (e.g., setting constructs for, incorporating preprocessing into the application, etc.) the extraction and preparation of the input DICOM data. In at least one embodiment, once validated by the system 3200 (e.g., for accuracy, security, patient privacy, etc.), the application can be available in the container registry for users (e.g., hospitals, clinics, laboratories, healthcare providers, etc.) to select and / or implement to perform one or more processing tasks on data at the user's facility (e.g., a second facility).
[0366] In at least one embodiment, the developer can then share the application or container over a network for access and use by users of the system (e.g., system 3200 of Figure 32). In at least one embodiment, completed and validated applications or containers may be stored in a container registry, and associated machine learning models may be stored in a model registry 3124. In at least one embodiment, a requesting entity (e.g., a user at a medical facility) providing an inference or image processing request can browse the container registry and / or model registry 3124 for applications, containers, datasets, machine learning models, etc., select a desired combination of elements to include in the data processing pipeline, and submit an imaging processing request. In at least one embodiment, the request may include input data necessary to perform the request (and, in some examples, associated patient data), and / or may include the selection of one or more applications and / or machine learning models to be executed in processing the request. In at least one embodiment, the request may then be passed to one or more components of the deployment system 3106 (e.g., the cloud) to perform processing in the data processing pipeline. In at least one embodiment, the processing performed by the deployment system 3106 may include referencing the selected elements (e.g., applications, containers, models, etc.) from the container registry and / or model registry 3124. In at least one embodiment, once results are generated by the pipeline, they can be returned to the user for reference (e.g., for viewing in a suite of viewing applications running on a local, on-premises workstation or terminal). In at least one embodiment, radiologists can receive results from a data processing pipeline that includes any number of applications and / or containers, where results may include the detection of abnormalities in X-rays, CT scans, MRIs, etc.
[0367] In at least one embodiment, services 3120 may be utilized to assist in processing or executing applications or containers in the pipeline. In at least one embodiment, services 3120 may include computing services, artificial intelligence (AI) services, visualization services, and / or other service types. In at least one embodiment, services 3120 may provide functionality that is common to one or more applications in software 3118, so that the functionality is abstracted into services that can be invoked or utilized by applications. In at least one embodiment, the functionality provided by services 3120 can run dynamically and more efficiently, while also scaling well by allowing applications to process data in parallel (e.g., using the parallel computing platform 3230 of Figure 32). In at least one embodiment, a service 3120 can be shared between and among different applications, rather than requiring each application that uses the same functionality to have its own instance of the service 3120. In at least one embodiment, as a non-limiting example, the services may include an inference server or engine that can be used to perform detection or segmentation tasks. In at least one embodiment, a model training service may be included, which can provide machine learning model training and / or retraining capabilities. In at least one embodiment, a data augmentation service may be further included, which can provide GPU-accelerated data (e.g., DICOM, RIS, CIS, REST-compatible, RPC, raw, etc.) extraction, resizing, scaling, and / or other augmentation. In at least one embodiment, a visualization service may be used, which can add image rendering effects (such as ray tracing, rasterization, denoising, sharpening, etc.) to add realism to two-dimensional (2D) and / or three-dimensional (3D) models. In at least one embodiment, a virtual instrument service may be included that provides beamforming, segmentation, inference, imaging, and / or support for other applications within the virtual instrument pipeline.
[0368] In at least one embodiment, where services 3120 include an AI service (e.g., an inference service), one or more machine learning models associated with an application for anomaly detection (e.g., of tumors, growth abnormalities, scar formation, etc.) can be executed by invoking the inference service (e.g., as an API call) to execute the machine learning models, or processing thereof, as part of application execution. In at least one embodiment, where another application includes one or more machine learning models for a segmentation task, that application can invoke the inference service to execute the machine learning models for performing one or more processing operations associated with the segmentation task. In at least one embodiment, software 3118 implementing high-level processing and inference pipelines that include the segmentation application and the anomaly detection application can be streamlined because each application can invoke the same inference service to perform one or more inference tasks.
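As a non-limiting illustration, the sharing of one inference service by a segmentation application and an anomaly detection application might be sketched as follows. The class and method names (`InferenceService`, `register`, `infer`, `SegmentationApp`, `AnomalyDetectionApp`) and the toy threshold "models" are assumptions made for this sketch and are not taken from any particular inference server API.

```python
from typing import Callable, Dict, List

Model = Callable[[List[float]], List[float]]


class InferenceService:
    """Hypothetical shared service: applications call infer() instead of
    each embedding its own model runtime."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}
        self.calls = 0  # counts requests served, to show the service is shared

    def register(self, name: str, model: Model) -> None:
        self._models[name] = model

    def infer(self, name: str, inputs: List[float]) -> List[float]:
        self.calls += 1
        return self._models[name](inputs)


class SegmentationApp:
    def __init__(self, service: InferenceService) -> None:
        self.service = service

    def run(self, pixels: List[float]) -> List[float]:
        return self.service.infer("segmentation", pixels)


class AnomalyDetectionApp:
    def __init__(self, service: InferenceService) -> None:
        self.service = service

    def run(self, pixels: List[float]) -> bool:
        score = self.service.infer("anomaly", pixels)[0]
        return score > 0.9


# Toy stand-ins for trained models.
service = InferenceService()
service.register("segmentation", lambda px: [1.0 if p > 0.5 else 0.0 for p in px])
service.register("anomaly", lambda px: [max(px)])

# Both applications hold a reference to the same service instance.
segmentation = SegmentationApp(service)
anomaly = AnomalyDetectionApp(service)

mask = segmentation.run([0.2, 0.7, 0.95])
flagged = anomaly.run([0.2, 0.7, 0.95])
print(mask, flagged, service.calls)
```

Because both applications delegate to one service object, neither needs its own model-execution code, which is the streamlining effect described above.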
[0369] In at least one embodiment, hardware 3122 may include a GPU, CPU, graphics card, AI / deep learning system (e.g., an AI supercomputer such as NVIDIA's DGX), cloud platform, or a combination thereof. In at least one embodiment, different types of hardware 3122 may be used to provide efficient, specially designed support for software 3118 and services 3120 in deployment system 3106. In at least one embodiment, GPU processing may be used for local processing (e.g., at facility 3102), within AI / deep learning systems, in cloud systems, and / or in other processing components of deployment system 3106, to improve the efficiency, accuracy, and efficacy of image processing, image reconstruction, segmentation, MRI examinations, stroke or heart attack detection (e.g., in real time), image quality in rendering, etc. In at least one embodiment, the facility may include imaging devices, genomics devices, sequencing devices, and / or other types of on-site devices that may utilize GPUs to generate imaging data representing the anatomical structures of a subject. In at least one embodiment, as a non-limiting example, software 3118 and / or services 3120 may be optimized for GPU processing with respect to deep learning, machine learning, and / or high-performance computing. In at least one embodiment, at least some of the computing environments of deployment system 3106 and / or training system 3104 can be executed in a data center on one or more supercomputers or high-performance computing systems with GPU-optimized software (e.g., the hardware and software combination of NVIDIA's DGX system). In at least one embodiment, the data center can be HIPAA compliant, such that the reception, processing, and transmission of imaging data and / or other patient data are securely handled with respect to the privacy of the patient data. In at least one embodiment, hardware 3122 can include any number of GPUs that can be invoked to perform data processing in parallel, as described herein. In at least one embodiment, the cloud platform can also include GPU processing for GPU-optimized execution of deep learning tasks, machine learning tasks, or other computing tasks. In at least one embodiment, the cloud platform (e.g., NVIDIA's NGC) can be executed using an AI / deep learning supercomputer and / or GPU-optimized software (e.g., provided on NVIDIA's DGX system) as a hardware abstraction and scaling platform. In at least one embodiment, the cloud platform can integrate application container cluster systems or orchestration systems (e.g., KUBERNETES) across multiple GPUs to achieve seamless scaling and load balancing.
[0370] Figure 32 is a system diagram of an example system 3200 for generating and deploying an imaging deployment pipeline according to at least one embodiment. In at least one embodiment, system 3200 can be used to implement the process 3100 of Figure 31 and / or other processes including advanced processing and inference pipelines. In at least one embodi...
Claims
1. A processor, comprising: one or more circuits configured to: apply one or more visual effects to one or more images whose resolution is less than a first resolution; after the one or more visual effects are applied to the one or more images, determine information about the visual effects, represented by parameters of an enhancement function, based on the visual effects applied to the one or more images; upsample the parameters of the enhancement function; and apply the upsampled enhancement function to one or more upscaled images of the one or more images at or above the first resolution to enhance the one or more upscaled images.
2. The processor of claim 1, wherein the one or more visual effects are approximated using one or more neural networks.
3. The processor of claim 1, wherein the one or more visual effects include bloom, color correction, motion blur, lens flare, sharpening, filtering, chromatic aberration, lens distortion, color vignetting, or interface elements.
4. The processor of claim 1, wherein a first parameterization function is determined for the one or more images having a resolution less than the first resolution, and wherein a second parameterization function is determined based on the first parameterization function, and the second parameterization function is used to approximate one or more visual effects for the one or more images having a resolution greater than or equal to the first resolution.
5. The processor of claim 4, wherein the one or more circuits are further configured to apply one or more enhancements to the one or more images having a resolution greater than or equal to the first resolution after the one or more visual effects have been approximated.
6. The processor of claim 1, wherein the one or more circuits are further configured to downscale the one or more images with a resolution less than the first resolution to the resolution of the initial aliased image generated by the rendering engine.
7. A system comprising: one or more processors configured to: apply one or more visual effects to one or more images whose resolution is less than a first resolution; after the one or more visual effects have been applied to the one or more images, determine information about the visual effects, represented by parameters of an enhancement function, based on the visual effects applied to the one or more images; upsample the parameters of the enhancement function; and apply the upsampled enhancement function to one or more upscaled images of the one or more images at or above the first resolution to enhance the one or more upscaled images.
8. The system of claim 7, wherein the one or more visual effects are approximated using one or more neural networks.
9. The system of claim 7, wherein the one or more visual effects include bloom, color correction, motion blur, lens flare, sharpening, filtering, chromatic aberration, lens distortion, color vignetting, or interface elements.
10. The system of claim 7, wherein a first parameterization function is determined for the one or more images whose resolution is less than the first resolution, and wherein a second parameterization function is determined based on the first parameterization function, and the second parameterization function is used to approximate one or more visual effects for the one or more images whose resolution is greater than or equal to the first resolution.
11. The system of claim 10, wherein the one or more processors are further configured to apply one or more enhancements to the one or more images having a resolution greater than or equal to the first resolution after the one or more visual effects have been approximated.
12. The system of claim 7, wherein the one or more circuits are further configured to downscale the one or more images whose resolution is less than the first resolution to the resolution of the initial aliased image generated by the rendering engine.
13. A method comprising: applying one or more visual effects to one or more images whose resolution is less than a first resolution; after the one or more visual effects are applied to the one or more images, determining information about the visual effects, represented by parameters of an enhancement function, based on the visual effects applied to the one or more images; upsampling the parameters of the enhancement function; and applying the upsampled enhancement function to one or more upscaled images of the one or more images at or above the first resolution to enhance the one or more upscaled images.
14. The method of claim 13, wherein the one or more visual effects are approximated using one or more neural networks.
15. The method of claim 13, wherein the one or more visual effects include bloom, color correction, motion blur, lens flare, sharpening, filtering, chromatic aberration, lens distortion, color vignetting, or interface elements.
16. The method of claim 13, wherein a first parameterization function is determined for the one or more images whose resolution is less than the first resolution, and wherein a second parameterization function is determined based on the first parameterization function, and the second parameterization function is used to approximate one or more visual effects for the one or more images whose resolution is greater than or equal to the first resolution.
17. The method of claim 16, further comprising: after the one or more visual effects are approximated, applying one or more enhancements to the one or more images whose resolution is greater than or equal to the first resolution.
18. The method of claim 13, further comprising: downscaling the one or more images whose resolution is less than the first resolution to the resolution of the initial aliased image generated by the rendering engine.
19. A machine-readable medium having a set of instructions stored thereon, the instructions, when executed by one or more processors, causing the one or more processors to at least: apply one or more visual effects to one or more images whose resolution is less than a first resolution; after the one or more visual effects are applied to the one or more images, determine information about the visual effects, represented by parameters of an enhancement function, based on the visual effects applied to the one or more images; upsample the parameters of the enhancement function; and apply the upsampled enhancement function to one or more upscaled images of the one or more images at or above the first resolution to enhance the one or more upscaled images.
20. The machine-readable medium of claim 19, wherein the one or more visual effects are approximated using one or more neural networks.
21. The machine-readable medium of claim 19, wherein the one or more visual effects include bloom, color correction, motion blur, lens flare, sharpening, filtering, chromatic aberration, lens distortion, color vignetting, or interface elements.
22. The machine-readable medium of claim 19, wherein a first parameterization function is determined for the one or more images having a resolution less than the first resolution, and wherein a second parameterization function is determined based on the first parameterization function, and the second parameterization function is used to approximate one or more visual effects for the one or more images having a resolution greater than or equal to the first resolution.
23. The machine-readable medium of claim 22, wherein the instructions, when executed, further cause the one or more processors to: after the one or more visual effects are approximated, apply one or more enhancements to the one or more images whose resolution is greater than or equal to the first resolution.
24. The machine-readable medium of claim 19, wherein the instructions, when executed, further cause the one or more processors to: downscale the one or more images whose resolution is less than the first resolution to the resolution of the initial aliased image generated by the rendering engine.
25. An image generation system, comprising: one or more processors configured to: apply one or more visual effects to one or more images whose resolution is less than a first resolution; after the one or more visual effects have been applied to the one or more images, determine information about the visual effects, represented by parameters of an enhancement function, based on the visual effects applied to the one or more images; upsample the parameters of the enhancement function; and apply the upsampled enhancement function to one or more upscaled images of the one or more images at or above the first resolution to enhance the one or more upscaled images; and a memory for storing network parameters of the one or more neural networks.
26. The image generation system of claim 25, wherein the one or more visual effects are approximated using one or more neural networks.
27. The image generation system of claim 25, wherein the one or more visual effects include bloom, color correction, motion blur, lens flare, sharpening, filtering, chromatic aberration, lens distortion, color vignetting, or interface elements.
28. The image generation system of claim 25, wherein a first parameterization function is determined for the one or more images having a resolution less than the first resolution, and wherein a second parameterization function is determined based on the first parameterization function, and the second parameterization function is used to approximate one or more visual effects for the one or more images having a resolution greater than or equal to the first resolution.
29. The image generation system of claim 28, wherein the one or more circuits are further configured to apply one or more enhancements to the one or more images having a resolution greater than or equal to the first resolution after the one or more visual effects have been approximated.
30. The image generation system of claim 25, wherein the one or more processors are further configured to downscale the one or more images with a resolution less than the first resolution to the resolution of the initial aliased image generated by the rendering engine.
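As a non-limiting illustration of the four operations recited in the independent claims (applying an effect at low resolution, determining parameters of an enhancement function, upsampling those parameters, and applying the upsampled function to an upscaled image), a minimal sketch follows. It assumes, purely for illustration, that the enhancement function is a per-pixel multiplicative gain, that parameters are upsampled by nearest-neighbor repetition, and that the visual effect is a toy diagonal darkening; the claims do not prescribe any of these choices, and all function names are hypothetical. Nearest-neighbor upscaling also stands in here for a learned upscaler such as deep learning supersampling.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def upsample_nearest(grid: Image, factor: int) -> Image:
    """Repeat every value factor x factor times (nearest-neighbor upsampling)."""
    return [[v for v in row for _ in range(factor)]
            for row in grid for _ in range(factor)]


def fit_gain(before: Image, after: Image, eps: float = 1e-6) -> Image:
    """Assumed enhancement function: per-pixel gain g with after ~= g * before.
    Pixels that are (near) zero before the effect get a neutral gain of 1."""
    return [[a / b if abs(b) > eps else 1.0 for b, a in zip(row_b, row_a)]
            for row_b, row_a in zip(before, after)]


def apply_gain(image: Image, gain: Image) -> Image:
    """Apply the enhancement function (element-wise gain) to an image."""
    return [[p * g for p, g in zip(row_p, row_g)]
            for row_p, row_g in zip(image, gain)]


def enhance(low_res: Image, upscaled: Image,
            effect: Callable[[Image], Image], factor: int) -> Image:
    processed = effect(low_res)               # 1. apply effect at low resolution
    gain = fit_gain(low_res, processed)       # 2. determine function parameters
    gain_hi = upsample_nearest(gain, factor)  # 3. upsample the parameters
    return apply_gain(upscaled, gain_hi)      # 4. apply to the upscaled image


def darken_diagonal(image: Image) -> Image:
    """Toy stand-in for a visual effect: halve intensities on the main diagonal."""
    return [[p * (0.5 if r == c else 1.0) for c, p in enumerate(row)]
            for r, row in enumerate(image)]


low_res = [[0.2, 0.4], [0.6, 0.8]]
upscaled = upsample_nearest(low_res, 2)   # stand-in for a learned upscaler
enhanced = enhance(low_res, upscaled, darken_diagonal, factor=2)
for row in enhanced:
    print(row)
```

The point of the sketch is that the effect itself is only ever evaluated on the low-resolution image; the high-resolution image receives the effect through the upsampled parameters, which is where the reduction in processing cost would come from.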