Dynamic distributed training of machine learning models

By using GPGPU and SIMT architecture in machine learning algorithms and optimizing the graphics pipeline design, the problem of low efficiency in parallel processing is solved, and efficient deep neural network training and inference are achieved.

CN108734642B · Active · Publication Date: 2026-06-30 · INTEL CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: INTEL CORP
Filing Date: 2018-04-23
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies process machine learning algorithms in parallel inefficiently and underutilize computing resources, particularly in the training and inference of deep neural networks on large datasets.

Method used

Parallel processing is achieved using a general-purpose graphics processing unit (GPGPU) with a single-instruction, multiple-thread (SIMT) architecture, in which groups of parallel threads execute synchronously. Combined with an efficient graphics pipeline design, this optimizes graphics and video processing operations and supports parallel implementations of machine learning algorithms to improve processing efficiency.

Benefits of technology

It enables efficient training of deep neural networks on large datasets, improves the utilization of computing resources and processing efficiency, and supports the acceleration of graphics and machine learning operations.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses dynamically distributed training of machine learning models. In one example, an apparatus includes: a plurality of execution units, including at least a first type of execution unit and a second type of execution unit; and logic, which at least partially includes hardware logic, for analyzing workloads and allocating workloads to one of the first type of execution unit or the second type of execution unit. Other embodiments are also disclosed and claimed.

Description

Technical Field

[0001] The embodiments generally relate to data processing, and more specifically to machine learning processing via a general-purpose graphics processing unit.

Background

[0002] Machine learning has successfully solved many types of tasks. The computations generated when training and using machine learning algorithms (e.g., neural networks) are naturally suited to efficient parallel implementations. Therefore, parallel processors such as general-purpose graphics processing units (GPGPUs) play a significant role in the practical implementation of deep neural networks. Parallel graphics processors with a single-instruction, multiple-thread (SIMT) architecture are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions together synchronously as often as possible to improve processing efficiency. The efficiency offered by parallel machine learning algorithm implementations allows for the use of high-capacity networks and enables these networks to be trained on large datasets.

Brief Description of the Drawings

[0003] So that the features of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the accompanying drawings. It should be noted, however, that the accompanying drawings illustrate only typical embodiments and should not be considered as limiting their scope.

[0004] Figure 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein.

[0005] Figures 2A to 2D show parallel processor components according to an embodiment.

[0006] Figures 3A to 3B are block diagrams of a graphics multiprocessor according to an embodiment.

[0007] Figures 4A to 4F show an exemplary architecture in which multiple GPUs are communicatively coupled to multiple multi-core processors.

[0008] Figure 5 is a conceptual diagram of a graphics processing pipeline according to an embodiment.

[0009] Figure 6 and Figures 7A to 7D show exemplary architectures and operations in the technology according to embodiments.

[0010] Figure 8 shows a machine learning software stack according to an embodiment.

[0011] Figure 9 shows a highly parallel general-purpose graphics processing unit according to an embodiment.

[0012] Figure 10 shows a multi-GPU computing system according to an embodiment.

[0013] Figures 11A to 11B show example layers of a deep neural network.

[0014] Figure 12 shows an exemplary recurrent neural network.

[0015] Figure 13 illustrates the training and deployment of a deep neural network.

[0016] Figure 14 is a block diagram illustrating distributed learning.

[0017] Figure 15 shows an exemplary system-on-a-chip (SoC) suitable for performing inference using a trained model.

[0018] Figure 16 is a block diagram of a processing system according to an embodiment.

[0019] Figure 17 is a block diagram of a processor according to an embodiment.

[0020] Figure 18 is a block diagram of a graphics processor according to an embodiment.

[0021] Figure 19 is a block diagram of a graphics processing engine for a graphics processor according to some embodiments.

[0022] Figure 20 is a block diagram of a graphics processor provided by an additional embodiment.

[0023] Figure 21 illustrates thread execution logic that includes an array of processing elements employed in some embodiments.

[0024] Figure 22 is a block diagram illustrating a graphics processor instruction format according to some embodiments.

[0025] Figure 23 is a block diagram of a graphics processor according to another embodiment.

[0026] Figures 24A to 24B illustrate a graphics processor command format and command sequence according to some embodiments.

[0027] Figure 25 shows an exemplary graphics software architecture for a data processing system according to some embodiments.

[0028] Figure 26 is a block diagram illustrating an IP core development system according to an embodiment.

[0029] Figure 27 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit according to an embodiment.

[0030] Figure 28 is a block diagram illustrating an additional exemplary graphics processor.

[0031] Figure 29 is a block diagram illustrating an additional exemplary graphics processor of a system-on-a-chip integrated circuit according to an embodiment.

Detailed Description

[0032] In the following description, numerous specific details are set forth to provide a thorough understanding of the various embodiments. However, the embodiments may be practiced without these specific details. In other instances, well-known methods, processes, components, and circuits have not been described in detail so as not to obscure the particular embodiments. Furthermore, aspects of the embodiments can be implemented using various means, such as integrated semiconductor circuitry (“hardware”), computer-readable instructions organized into one or more programs (“software”), or some combination of hardware and software. For the purposes of this disclosure, references to “logic” should refer to hardware, software, firmware, or some combination thereof.

[0033] Some of the embodiments discussed herein can be applied to any processor (such as a GPGPU, CPU, GPU, etc.), graphics controller, etc. Other embodiments are also disclosed and claimed.

[0034] Furthermore, some embodiments can be applied to computing systems that include one or more processors (e.g., having one or more processor cores), such as those discussed herein, including, for example, mobile computing devices such as smartphones, tablets, UMPCs (Ultra-Mobile PCs), laptops, Ultrabook™ computing devices, wearable devices (such as smartwatches or smart glasses), and so on.

[0035] In some embodiments, a graphics processing unit (GPU) is communicatively coupled to a host/processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU can be communicatively coupled to the host processor/core via a bus or another interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as the core and communicatively coupled to the core via an internal processor bus/interconnect (i.e., inside the package or chip). Regardless of how the GPU is connected, the processor core can assign work to the GPU in the form of a sequence of commands/instructions contained in a job descriptor. The GPU then uses dedicated circuitry/logic to efficiently process these commands/instructions.

[0036] In the following description, numerous specific details are set forth to provide a more comprehensive understanding. However, it will be apparent to those skilled in the art that the embodiments described herein may be practiced without one or more of these specific details. In other instances, well-known features have not been described to avoid obscuring the details of this embodiment.

[0037] System Overview

[0038] Figure 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processors 102 and a system memory 104, which communicate via an interconnect path that includes a memory hub 105. The memory hub 105 may be a separate component within a chipset assembly or may be integrated within the one or more processors 102. The memory hub 105 is coupled to an I/O subsystem 111 via a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that enables the computing system 100 to receive input from one or more input devices 108. Additionally, the I/O hub 107 enables a display controller (which may be included in the one or more processors 102) to provide output to one or more display devices 110A. In one embodiment, the one or more display devices 110A coupled to the I/O hub 107 may include a local display device, an internal display device, or an embedded display device.

[0039] In one embodiment, the processing subsystem 101 includes one or more parallel processors 112 coupled to the memory hub 105 via a bus or other communication link 113. The communication link 113 can be one of any number of standards-based communication link technologies or protocols (such as, but not limited to, PCI Express), or a vendor-specific communication interface or communication architecture. In one embodiment, the one or more parallel processors 112 form a computation-centric parallel or vector processing system including a large number of processing cores and/or processing clusters, such as many integrated core (MIC) processors. In one embodiment, the one or more parallel processors 112 form a graphics processing subsystem that can output pixels to one of the one or more display devices 110A coupled via the I/O hub 107. The one or more parallel processors 112 may also include a display controller and a display interface (not shown) to enable direct connection to one or more display devices 110B.

[0040] Within the I/O subsystem 111, a system storage unit 114 can be connected to the I/O hub 107 to provide storage for the computing system 100. An I/O switch 116 can be used to provide an interface mechanism that enables connections between the I/O hub 107 and other components that can be integrated into the platform, such as a network adapter 118 and/or a wireless network adapter 119, as well as various other devices that can be added via one or more plug-in devices 120. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of Wi-Fi, Bluetooth, Near Field Communication (NFC), or other network devices that include one or more radios.

[0041] The computing system 100 may include other components not explicitly shown, such as USB or other port connectors, optical storage drives, video capture devices, etc., which may also be connected to the I/O hub 107. The communication paths interconnecting the various components in Figure 1 can be implemented using any suitable protocol, such as a PCI (Peripheral Component Interconnect) based protocol (e.g., PCI-Express), or any other bus or point-to-point communication interfaces and/or protocols, such as the NV-Link high-speed interconnect or interconnect protocols known in the art.

[0042] In one embodiment, the one or more parallel processors 112 incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry, and said circuitry constitutes a graphics processing unit (GPU). In another embodiment, the one or more parallel processors 112 incorporate circuitry optimized for general-purpose processing while retaining the underlying computing architecture described in more detail herein. In yet another embodiment, components of the computing system 100 may be integrated with one or more other system elements on a single integrated circuit. For example, the one or more parallel processors 112, the memory hub 105, the processor(s) 102, and the I/O hub 107 may be integrated into a system-on-a-chip (SoC) integrated circuit. Alternatively, components of the computing system 100 may be integrated into a single package to form a system-in-package (SIP) configuration. In other embodiments, at least a portion of the components of the computing system 100 may be integrated into a multi-chip module (MCM) that may interconnect with other MCMs to form a modular computing system.

[0043] It should be understood that the computing system 100 shown herein is exemplary and that variations and modifications are possible. The connection topology can be modified as needed, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112. For example, in some embodiments, the system memory 104 is connected directly to the processor(s) 102 rather than via a bridge, while other devices communicate with the system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In other embodiments, the I/O hub 107 and the memory hub 105 may be integrated into a single chip. Some embodiments may include two or more groups of processor(s) 102 attached via multiple sockets, which may be coupled to two or more instances of the parallel processor(s) 112.

[0044] Some of the specific components shown herein are optional and may not be included in all implementations of the computing system 100. For example, any number of plug-in cards or peripheral devices may be supported, or some components may be omitted. Furthermore, some architectures may use different terminology for components similar to those shown in Figure 1. For example, in some architectures, the memory hub 105 may be referred to as the Northbridge, while the I/O hub 107 may be referred to as the Southbridge.

[0045] Figure 2A illustrates a parallel processor 200 according to an embodiment. Various components of the parallel processor 200 can be implemented using one or more integrated circuit devices such as a programmable processor, an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). According to an embodiment, the illustrated parallel processor 200 is a variant of the one or more parallel processors 112 shown in Figure 1.

[0046] In one embodiment, the parallel processor 200 includes a parallel processing unit 202. The parallel processing unit includes an I/O unit 204 that enables communication with other devices, including other instances of the parallel processing unit 202. The I/O unit 204 may be directly connected to other devices. In one embodiment, the I/O unit 204 is connected to other devices via a hub or switch interface, such as the memory hub 105. The connection between the memory hub 105 and the I/O unit 204 forms the communication link 113. Within the parallel processing unit 202, the I/O unit 204 is connected to a host interface 206 and a memory crossbar switch 216, wherein the host interface 206 receives commands relating to performing processing operations, and the memory crossbar switch 216 receives commands relating to performing memory operations.

[0047] When host interface 206 receives a command buffer via I / O unit 204, host interface 206 can route work operations for executing those commands to front end 208. In one embodiment, front end 208 is coupled to scheduler 210, which is configured to distribute commands or other work items to processing cluster array 212. In one embodiment, scheduler 210 ensures that processing cluster array 212 is correctly configured and active before distributing tasks to processing clusters within processing cluster array 212.

[0048] Processing cluster array 212 may include up to "N" processing clusters (e.g., cluster 214A, cluster 214B, up to cluster 214N). Each cluster 214A through 214N of processing cluster array 212 can execute a large number of concurrent threads. Scheduler 210 may use various scheduling and/or work distribution algorithms to allocate work to clusters 214A through 214N of processing cluster array 212, and these algorithms may vary depending on the workload that arises for each type of program or computation. Scheduling may be handled dynamically by scheduler 210 or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by processing cluster array 212. In one embodiment, different clusters 214A through 214N of processing cluster array 212 may be assigned to process different types of programs or to perform different types of computations.
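The paragraph above leaves the work distribution algorithm open. As one possible illustration only, the following sketch uses a greedy least-loaded policy: each work item goes to the cluster with the smallest accumulated cost so far. The function name `distribute`, the `(name, cost)` item format, and the policy itself are assumptions made for this example and are not taken from the patent text.

```python
import heapq


def distribute(work_items, num_clusters):
    """Assign (name, cost) work items to clusters, least-loaded first.

    Hypothetical policy for illustration; the source only says that
    "various scheduling and/or work distribution algorithms" may be used.
    """
    # Min-heap of (accumulated_cost, cluster_index).
    heap = [(0, c) for c in range(num_clusters)]
    heapq.heapify(heap)
    assignment = {c: [] for c in range(num_clusters)}
    for name, cost in work_items:
        load, cluster = heapq.heappop(heap)  # currently least-loaded cluster
        assignment[cluster].append(name)
        heapq.heappush(heap, (load + cost, cluster))
    return assignment


if __name__ == "__main__":
    items = [("a", 5), ("b", 3), ("c", 2), ("d", 1)]
    print(distribute(items, 2))
```

A static variant of the same idea could be computed at compile time when item costs are known in advance, which loosely corresponds to the compiler-assisted scheduling mentioned above.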

[0049] The processing cluster array 212 can be configured to perform various types of parallel processing operations. In one embodiment, the processing cluster array 212 is configured to perform general-purpose parallel computing operations. For example, the processing cluster array 212 may include logic for performing processing tasks including filtering video and/or audio data, performing modeling operations including physics operations, and performing data transformations.

[0050] In one embodiment, the processing cluster array 212 is configured to perform parallel graphics processing operations. In embodiments where the parallel processor 200 is configured to perform graphics processing operations, the processing cluster array 212 may include additional logic for supporting the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. Additionally, the processing cluster array 212 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. The parallel processing unit 202 may transfer data from system memory via the I/O unit 204 for processing. During processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 222) and then written back to system memory.

[0051] In one embodiment, when the parallel processing unit 202 is used to perform graphics processing, the scheduler 210 can be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations across multiple clusters 214A to 214N of the processing cluster array 212. In some embodiments, portions of the processing cluster array 212 can be configured to perform different types of processing. For example, a first portion can be configured to perform vertex shading and topology generation, a second portion can be configured to perform tessellation and geometry shading, and a third portion can be configured to perform pixel shading or other screen-space operations to produce a rendered image for display. Intermediate data generated by one or more of the clusters 214A to 214N can be stored in a buffer to allow intermediate data to be transferred between the clusters 214A to 214N for further processing.
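Dividing a workload into tasks of approximately equal size can be sketched as follows. The example splits `num_items` work items into `num_tasks` contiguous index ranges whose sizes differ by at most one. The function name and the range representation are hypothetical, and the patent does not state how the scheduler 210 actually performs the split.

```python
def split_evenly(num_items, num_tasks):
    """Return half-open (start, end) ranges whose sizes differ by at most one."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    base, extra = divmod(num_items, num_tasks)
    ranges = []
    start = 0
    for i in range(num_tasks):
        # The first `extra` tasks each take one additional item.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    print(split_evenly(10, 4))
```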

[0052] During operation, the processing cluster array 212 may receive processing tasks to be executed via scheduler 210, which receives commands defining the processing tasks from front-end 208. For graphics processing operations, processing tasks may include data to be processed, such as surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters defining how the data is processed and indices of commands (e.g., which program to execute). Scheduler 210 may be configured to retrieve indices corresponding to tasks or may receive indices from front-end 208. Front-end 208 may be configured to ensure that the processing cluster array 212 is configured to be active before a workload specified by an incoming command buffer (e.g., a batch buffer, a push buffer, etc.) is initiated.

[0053] Each of one or more instances of parallel processing unit 202 may be coupled to parallel processor memory 222. Parallel processor memory 222 may be accessed via memory crossbar switch 216, which receives memory requests from processing cluster array 212 and I/O unit 204. Memory crossbar switch 216 may access parallel processor memory 222 via memory interface 218. Memory interface 218 may include multiple partition units (e.g., partition units 220A, 220B, up to partition unit 220N), each of which may be coupled to a portion (e.g., memory cell) of parallel processor memory 222. In one implementation, the number of partition units 220A to 220N is configured to be equal to the number of memory cells, such that a first partition unit 220A has a corresponding first memory cell 224A, a second partition unit 220B has a corresponding second memory cell 224B, and the Nth partition unit 220N has a corresponding Nth memory cell 224N. In other embodiments, the number of partition units 220A to 220N may not be equal to the number of memory devices.
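For the one-to-one configuration described above, one simple way to picture how a memory request could be steered to a partition unit is address interleaving. The sketch below is an assumption for illustration: the interleave granularity `stride_bytes` and the modulo mapping are not specified in the patent text, and real hardware may use a different hash.

```python
def partition_for_address(addr, num_partitions, stride_bytes=256):
    """Map a byte address to a partition unit index by simple interleaving.

    Hypothetical mapping: consecutive `stride_bytes`-sized blocks rotate
    across partition units 0 .. num_partitions - 1.
    """
    return (addr // stride_bytes) % num_partitions


if __name__ == "__main__":
    # A contiguous 1 KiB region touches all four partition units once.
    touched = [partition_for_address(a, 4) for a in range(0, 1024, 256)]
    print(touched)
```

With such a mapping, a contiguous write is spread across several partition units, so their memory cells can be accessed in parallel.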

[0054] In various embodiments, memory cells 224A to 224N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In one embodiment, memory cells 224A to 224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). Those skilled in the art will understand that the specific implementation of memory cells 224A to 224N can vary and can be selected from one of a variety of conventional designs. Render targets, such as frame buffers or texture maps, may be stored across memory cells 224A to 224N, allowing partition units 220A to 220N to write portions of each render target in parallel to efficiently utilize the available bandwidth of parallel processor memory 222. In some embodiments, a local instance of parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory together with local cache memory.

[0055] In one embodiment, any of clusters 214A to 214N of the processing cluster array 212 can process data to be written to any of the memory cells 224A to 224N within the parallel processor memory 222. The memory crossbar switch 216 can be configured to pass the output of each cluster 214A to 214N to any partition unit 220A to 220N or to another cluster 214A to 214N, which can perform additional processing operations on the output. Each cluster 214A to 214N can communicate with the memory interface 218 via the memory crossbar switch 216 to perform read or write operations for various external memory devices. In one embodiment, the memory crossbar switch 216 can be connected to the memory interface 218 to communicate with the I/O unit 204 and can be connected to a local instance of the parallel processor memory 222, thereby enabling processing units within different processing clusters 214A to 214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. In one embodiment, the memory crossbar switch 216 can use virtual channels to separate traffic flows between clusters 214A to 214N and partition units 220A to 220N.

[0056] While a single instance of the parallel processing unit 202 is shown within the parallel processor 200, any number of instances of the parallel processing unit 202 can be included. For example, multiple instances of the parallel processing unit 202 can be provided on a single plug-in card, or multiple plug-in cards can be interconnected. Different instances of the parallel processing unit 202 can be configured to interoperate even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, and in one embodiment, some instances of the parallel processing unit 202 may include higher precision floating-point units relative to other instances. Systems incorporating one or more instances of the parallel processing unit 202 or the parallel processor 200 can be implemented in various configurations and form factors, including but not limited to desktop computers, laptop or handheld personal computers, servers, workstations, game consoles, and/or embedded systems.

[0057] Figure 2B is a block diagram of a partition unit 220 according to an embodiment. In one embodiment, the partition unit 220 is an instance of one of the partition units 220A to 220N of Figure 2A. As shown, the partition unit 220 includes an L2 cache 221, a frame buffer interface 225, and a ROP 226 (raster operations unit). The L2 cache 221 is a read/write cache configured to perform load and store operations received from the memory crossbar switch 216 and the ROP 226. Read misses and urgent write-back requests are output by the L2 cache 221 to the frame buffer interface 225 for processing. Dirty updates can also be sent to the frame buffer via the frame buffer interface 225 for opportunistic processing. In one embodiment, the frame buffer interface 225 interfaces with one of the memory cells in the parallel processor memory, such as the memory cells 224A to 224N of Figure 2A (e.g., within parallel processor memory 222).
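The read-miss and write-back behavior of a read/write cache such as the L2 cache 221 can be illustrated with a small software model. This is a minimal sketch under stated assumptions: the class name `WriteBackCache`, the least-recently-used eviction policy, the per-address line granularity, and the use of a Python dict as a stand-in for the frame buffer are all hypothetical choices for the example, not details from the patent.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy write-back cache in front of a backing store (assumed LRU eviction)."""

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing      # dict standing in for the frame buffer
        self.lines = OrderedDict()  # addr -> (value, dirty), oldest first
        self.misses = 0
        self.writebacks = 0

    def _evict_if_full(self):
        if len(self.lines) >= self.capacity:
            addr, (value, dirty) = self.lines.popitem(last=False)
            if dirty:
                # Dirty data reaches the backing store only on eviction.
                self.backing[addr] = value
                self.writebacks += 1

    def load(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return self.lines[addr][0]
        self.misses += 1                  # read miss goes to the backing store
        self._evict_if_full()
        value = self.backing.get(addr, 0)
        self.lines[addr] = (value, False)
        return value

    def store(self, addr, value):
        if addr in self.lines:
            self.lines.move_to_end(addr)
        else:
            self._evict_if_full()
        self.lines[addr] = (value, True)  # write stays in cache, marked dirty


if __name__ == "__main__":
    frame_buffer = {0: 10, 1: 11, 2: 12}
    cache = WriteBackCache(capacity=2, backing=frame_buffer)
    cache.load(0)
    cache.store(1, 99)
    cache.load(2)
    cache.load(0)
    print(frame_buffer, cache.misses, cache.writebacks)
```

The model omits the "urgent" and "opportunistic" distinction made in the text; it only shows that stores are absorbed by the cache and reach the backing store later.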

[0058] In graphics applications, ROP 226 is a processing unit that performs raster operations such as stencil, z-test, and blending. ROP 226 then outputs processed graphics data, which is stored in graphics memory. In some embodiments, ROP 226 includes compression logic for compressing z or color data written to memory and decompressing z or color data read from memory. In some embodiments, ROP 226 is included within each processing cluster (e.g., clusters 214A to 214N of Figure 2A) rather than within partition unit 220. In such embodiments, read and write requests for pixel data, rather than pixel fragment data, are transmitted via memory crossbar switch 216. The processed graphics data can be displayed on a display device, such as one of the one or more display devices 110 of Figure 1, routed for further processing by the processor(s) 102, or routed for further processing by one of the processing entities within the parallel processor 200 of Figure 2A.

[0059] Figure 2C is a block diagram of a processing cluster 214 within a parallel processing unit according to an embodiment. In one embodiment, the processing cluster is an instance of one of the processing clusters 214A to 214N of Figure 2A. The processing cluster 214 can be configured to execute multiple threads in parallel, where the term "thread" refers to an instance of a specific program executed on a specific input dataset. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support the parallel execution of a large number of substantially synchronous threads, using a common instruction unit configured to issue instructions to a set of processing engines within each of the processing clusters. Unlike the SIMD execution mechanism, where all processing engines typically execute the same instructions, SIMT execution allows different threads to more easily follow divergent execution paths through a given thread program. Those skilled in the art will understand that the SIMD processing mechanism represents a functional subset of the SIMT processing mechanism.
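The idea of threads following divergent paths while sharing one instruction unit can be illustrated with a toy model. The sketch below assumes a common way of handling divergence, in which each side of a branch is issued once and only the lanes whose mask bit is set take part. The patent text does not describe the divergence mechanism, so the masking scheme, the function name `simt_execute`, and the example kernel are assumptions for illustration.

```python
def simt_execute(inputs):
    """Run `y = x * 2 if x is even else x + 1` across a thread group.

    Returns (outputs, issued), where `issued` counts the instruction
    issue steps shared by the whole group in this toy model.
    """
    n = len(inputs)
    outputs = [None] * n

    # Step 1: every lane evaluates the branch condition together.
    take_then = [x % 2 == 0 for x in inputs]
    issued = 1

    # Step 2: "then" path, issued once if any lane needs it.
    if any(take_then):
        issued += 1
        for lane in range(n):
            if take_then[lane]:  # lanes with the mask bit clear sit idle
                outputs[lane] = inputs[lane] * 2

    # Step 3: "else" path, issued once if any lane needs it.
    if not all(take_then):
        issued += 1
        for lane in range(n):
            if not take_then[lane]:
                outputs[lane] = inputs[lane] + 1

    return outputs, issued


if __name__ == "__main__":
    print(simt_execute([1, 2, 3, 4]))  # divergent group
    print(simt_execute([2, 4, 6, 8]))  # uniform group
```

In this model a uniform group needs fewer issue steps than a divergent one, which is one reason thread groups benefit from executing synchronously as often as possible.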

[0060] The operation of processing cluster 214 can be controlled via pipeline manager 232, which distributes processing tasks to SIMT parallel processors. Pipeline manager 232 receives instructions from scheduler 210 of Figure 2A and manages the execution of those instructions via graphics multiprocessor 234 and/or texture unit 236. The graphics multiprocessor 234 shown is an exemplary instance of a SIMT parallel processor. However, various types of SIMT parallel processors with different architectures can be included within processing cluster 214. One or more instances of graphics multiprocessor 234 can be included within processing cluster 214. Graphics multiprocessor 234 can process data, and data crossbar 240 can be used to distribute processed data to one of several possible destinations, including other shader units. Pipeline manager 232 can facilitate the distribution of processed data by specifying destinations for data to be distributed via data crossbar 240.

[0061] Each graphics multiprocessor 234 within the processing cluster 214 may include the same set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). The functional execution logic can be configured in a pipelined manner, where new instructions can be issued before previous instructions have completed. The functional execution logic supports various operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, bit shifting, and the computation of various algebraic functions. In one embodiment, the same functional unit hardware can be used to perform different operations, and any combination of functional units can exist.

[0062] Instructions transmitted to processing cluster 214 constitute threads. A group of threads executing on a set of parallel processing engines is a thread group. Thread groups execute the same program on different input data. Each thread within a thread group can be assigned to a different processing engine within graphics multiprocessor 234. A thread group may include fewer threads than the number of processing engines within graphics multiprocessor 234. When a thread group includes fewer threads than the number of processing engines, one or more of the processing engines may be idle during the cycle of processing the thread group. A thread group may also include more threads than the number of processing engines within graphics multiprocessor 234. When a thread group includes more threads than the number of processing engines within graphics multiprocessor 234, processing can be performed on consecutive clock cycles. In one embodiment, multiple thread groups can be executed simultaneously on graphics multiprocessor 234.
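The relationship between thread group size and the number of processing engines can be worked through with simple arithmetic. The sketch below assumes that each pass keeps every engine busy with at most one thread and that leftover engines in the final pass are idle. The function name and the pass-based model are illustrative assumptions rather than a description of the actual hardware timing.

```python
def schedule_thread_group(num_threads, num_engines):
    """Return (passes, idle_engines_in_last_pass) for one thread group."""
    if num_threads <= 0 or num_engines <= 0:
        raise ValueError("num_threads and num_engines must be positive")
    # Ceiling division: how many consecutive passes the group needs.
    passes = -(-num_threads // num_engines)
    idle_in_last_pass = passes * num_engines - num_threads
    return passes, idle_in_last_pass


if __name__ == "__main__":
    print(schedule_thread_group(8, 16))   # fewer threads than engines
    print(schedule_thread_group(24, 16))  # more threads than engines
    print(schedule_thread_group(32, 16))  # exact multiple
```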

[0063] In one embodiment, the graphics multiprocessor 234 includes an internal cache memory for performing load and store operations. In one embodiment, the graphics multiprocessor 234 may forgo the internal cache and instead use a cache memory (e.g., L1 cache 308) within the processing cluster 214. Each graphics multiprocessor 234 may also access an L2 cache within a partition unit (e.g., partition units 220A to 220N of FIG. 2) shared across all processing clusters 214, and this cache can be used to transfer data between threads. The graphics multiprocessor 234 may also access off-chip global memory, which may include one or more of local parallel processor memory and / or system memory. Any memory outside of the parallel processing unit 202 may be used as global memory. Embodiments where the processing cluster 214 includes multiple instances of the graphics multiprocessor 234 may share common instructions and data that can be stored in the L1 cache 308.

[0064] Each processing cluster 214 may include an MMU 245 (memory management unit) configured to map virtual addresses to physical addresses. In other embodiments, one or more instances of the MMU 245 may reside within the memory interface 218 of FIG. 2. The MMU 245 includes a set of page table entries (PTEs) used to map a virtual address to a physical address and, optionally, a cache line index. The MMU 245 may include address translation lookaside buffers (TLBs) or caches that may reside within the graphics multiprocessor 234, the L1 cache, or the processing cluster 214. The physical address is processed to distribute surface data access locality so that requests interleave efficiently among partition units. The cache line index can be used to determine whether a request for a cache line is a hit or a miss.
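As a rough illustration of the translation path described above, the following sketch models a page table fronted by a translation cache. The class name `SimpleMMU`, the 4 KiB page size, and the unbounded TLB are assumptions made for the example; the embodiments do not specify them.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (not specified in the text)


class SimpleMMU:
    """Toy MMU: page table entries plus a TLB that caches translations."""

    def __init__(self, page_table):
        # Maps virtual page number -> physical frame number.
        self.page_table = dict(page_table)
        self.tlb = {}
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.tlb:
            self.tlb_hits += 1
            frame = self.tlb[vpn]
        else:
            self.tlb_misses += 1
            if vpn not in self.page_table:
                raise KeyError(f"no page table entry for virtual page {vpn}")
            frame = self.page_table[vpn]
            self.tlb[vpn] = frame  # cache the translation for later accesses
        return frame * PAGE_SIZE + offset
```

A repeated access to the same virtual page is served from the TLB without consulting the page table again.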

[0065] In graphics and computing applications, processing cluster 214 can be configured such that each graphics multiprocessor 234 is coupled to texture unit 236 to perform texture mapping operations, such as determining texture sample locations, reading texture data, and filtering texture data. Texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from an L1 cache within the graphics multiprocessor 234, and is retrieved as needed from an L2 cache, local parallel processor memory, or system memory. Each graphics multiprocessor 234 outputs a processed task to data crossbar 240 to provide the processed task to another processing cluster 214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via memory crossbar 216. PreROP (pre-raster operations unit) 242 is configured to receive data from graphics multiprocessor 234 and direct the data to ROP units, which can be located with partition units (e.g., partition units 220A to 220N of FIG. 2) as described herein. The preROP 242 unit can perform optimizations for color blending, organize pixel color data, and perform address translation.

[0066] It should be understood that the core architecture described herein is exemplary and variations and modifications are possible. For example, any number of processing units such as graphics multiprocessors 234, texture units 236, preROP 242, etc., can be included within processing cluster 214. Furthermore, although only one processing cluster 214 is shown, the parallel processing units as described herein can include any number of instances of processing cluster 214. In one embodiment, each processing cluster 214 can be configured to operate independently of other processing clusters 214 using separate and different processing units, L1 caches, etc.

[0067] Figure 2D illustrates a graphics multiprocessor 234 according to one embodiment. In such an embodiment, the graphics multiprocessor 234 is coupled to a pipeline manager 232 of a processing cluster 214. The graphics multiprocessor 234 has an execution pipeline including, but not limited to, an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general-purpose graphics processing unit (GPGPU) cores 262, and one or more load / store units 266. The GPGPU cores 262 and the load / store units 266 are coupled to a cache memory 272 and a shared memory 270 via a memory and cache interconnect 268.

[0068] In one embodiment, instruction cache 252 receives a stream of instructions to be executed from pipeline manager 232. These instructions are cached in instruction cache 252 and dispatched for execution by instruction unit 254. Instruction unit 254 can dispatch instructions as thread groups (e.g., warps), with each thread in the thread group assigned to a different execution unit within GPGPU core 262. Instructions can access any of the local, shared, or global address spaces by specifying an address within a unified address space. Address mapping unit 256 can be used to translate addresses in the unified address space into different memory addresses accessible by load / store unit 266.
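The role of an address mapping unit can be sketched as a lookup that resolves a unified address into a target address space and an offset within it. The window layout in `WINDOWS` and the function name `map_unified_address` are hypothetical; the text does not define how the unified address space is partitioned.

```python
# Hypothetical window layout of the unified address space: (name, base, size).
# These boundaries are assumptions made for illustration only.
WINDOWS = [
    ("local", 0x0000_0000, 0x0001_0000),
    ("shared", 0x0001_0000, 0x0001_0000),
    ("global", 0x0002_0000, 0x1000_0000),
]


def map_unified_address(addr: int):
    """Return (address_space, offset) for an address in the unified space."""
    for name, base, size in WINDOWS:
        if base <= addr < base + size:
            return name, addr - base
    raise ValueError(f"address {addr:#x} is outside the unified address space")
```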

[0069] Register file 258 provides a set of registers for the functional units of graphics multiprocessor 234. Register file 258 provides temporary storage for operands on data paths connected to functional units of graphics multiprocessor 234 (e.g., GPGPU core 262, load / store unit 266). In one embodiment, register file 258 is partitioned among the functional units such that each functional unit is allocated a dedicated portion of register file 258. In one embodiment, register file 258 is partitioned among the different warps being executed by graphics multiprocessor 234.

[0070] Each GPGPU core 262 may include a floating-point unit (FPU) and / or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 234. Depending on the embodiment, the GPGPU cores 262 may be similar or different in architecture. For example, in one embodiment, a first portion of the GPGPU cores 262 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores includes a double-precision FPU. In one embodiment, the FPU may implement the IEEE 754-2008 floating-point arithmetic standard or enable variable-precision floating-point arithmetic. Additionally, the graphics multiprocessor 234 may also include one or more fixed-function or special-function units for performing specific functions such as copying rectangles or pixel blending operations. In one embodiment, one or more of the GPGPU cores may also contain fixed-function or special-function logic.

[0071] The memory and cache interconnect 268 is an interconnect network that connects each of the functional units of the graphics multiprocessor 234 to the register file 258 and the shared memory 270. In one embodiment, the memory and cache interconnect 268 is a crossbar interconnect that allows the load / store unit 266 to perform load and store operations between the shared memory 270 and the register file 258. The register file 258 can operate at the same frequency as the GPGPU core 262, so data transfer between the GPGPU core 262 and the register file 258 has very low latency. The shared memory 270 can be used to enable communication between threads executing on functional units within the graphics multiprocessor 234. The cache memory 272 can be used, for example, as a data cache to cache texture data communicated between functional units and texture unit 236. The shared memory 270 can also be used as a program-managed cache. In addition to the automatically cached data stored in the cache memory 272, threads executing on the GPGPU core 262 can also programmatically store data in the shared memory.

[0072] Figures 3A to 3B show additional graphics multiprocessors according to embodiments. The graphics multiprocessors 325 and 350 shown are variants of the graphics multiprocessor 234 of Figure 2C. The graphics multiprocessors 325 and 350 shown can be configured as streaming multiprocessors (SMs) capable of executing a large number of execution threads simultaneously.

[0073] Figure 3A shows a graphics multiprocessor 325 according to an additional embodiment. The graphics multiprocessor 325 includes multiple additional instances of execution resource units relative to the graphics multiprocessor 234 of Figure 2D. For example, the graphics multiprocessor 325 may include multiple instances of instruction units 332A to 332B, register files 334A to 334B, and texture unit(s) 344A to 344B. The graphics multiprocessor 325 also includes multiple sets of graphics or compute execution units (e.g., GPGPU cores 336A to 336B, GPGPU cores 337A to 337B, GPGPU cores 338A to 338B) and multiple sets of load / store units 340A to 340B. In one embodiment, the execution resource units have a common instruction cache 330, a texture and / or data cache memory 342, and a shared memory 346. The various components can communicate via interconnect architecture 327. In one embodiment, interconnect architecture 327 includes one or more crossbar switches to enable communication between the components of the graphics multiprocessor 325.

[0074] Figure 3B shows a graphics multiprocessor 350 according to an additional embodiment. As shown in Figure 2D and Figure 3A, the graphics processor includes multiple sets of execution resources 356A to 356D, each set including multiple instruction units, register files, GPGPU cores, and load / store units. Execution resources 356A to 356D can work with texture unit(s) 360A to 360D to perform texture operations, while sharing instruction cache 354 and shared memory 362. In one embodiment, execution resources 356A to 356D can share instruction cache 354, shared memory 362, and multiple instances of texture and / or data cache memories 358A to 358B. The various components can communicate via an interconnect structure 352 similar to the interconnect structure 327 of Figure 3A.

[0075] Those skilled in the art will understand that the architectures described in Figure 1, Figures 2A to 2D, and Figures 3A to 3B are descriptive and do not limit the scope of the embodiments of the invention. Therefore, the techniques described herein can be implemented on any appropriately configured processing unit, including but not limited to: one or more mobile application processors; one or more desktop computer or server central processing units (CPUs), including multi-core CPUs; one or more parallel processing units such as the parallel processing unit 202 of FIG. 2; and one or more graphics processors or dedicated processing units, without departing from the scope of the embodiments described herein.

[0076] In some embodiments, a parallel processor or GPGPU, as described herein, is communicatively coupled to a host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., high-speed interconnects such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as the core and communicatively coupled to the core via an internal processor bus / interconnect (i.e., within the package or chip). Regardless of how the GPU is connected, the processor core can assign work to the GPU in the form of a sequence of commands / instructions contained in a job descriptor. The GPU then uses dedicated circuitry / logic to efficiently process these commands / instructions.

[0077] Technologies for GPU-to-host processor interconnects

[0078] Figure 4A shows an exemplary architecture in which multiple GPUs 410 to 413 are communicatively coupled to multiple multi-core processors 405 to 406 via high-speed links 440 to 443 (e.g., bus, point-to-point interconnect, etc.). In one embodiment, high-speed links 440 to 443 support communication throughput of 4GB / s, 30GB / s, 80GB / s, or higher, depending on the implementation. Various interconnect protocols can be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0. However, the basic principles of the invention are not limited to any particular communication protocol or throughput.

[0079] Furthermore, in one embodiment, two or more of GPUs 410 to 413 are interconnected via high-speed links 444 to 445, which can be implemented using the same or different protocols / links as those used for high-speed links 440 to 443. Similarly, two or more of multi-core processors 405 to 406 can be connected via high-speed link 433, which can be a symmetric multiprocessor (SMP) bus operating at speeds of 20GB / s, 30GB / s, 120GB / s, or higher. Alternatively, all communication between the various system components shown in Figure 4A can be accomplished using the same protocol / link (e.g., via a common interconnect structure). However, as mentioned, the basic principles of the invention are not limited to any particular type of interconnect technology.

[0080] In one embodiment, each multi-core processor 405 to 406 is communicatively coupled to processor memories 401 to 402 via memory interconnects 430 to 431, and each GPU 410 to 413 is communicatively coupled to GPU memories 420 to 423 via GPU memory interconnects 450 to 453. Memory interconnects 430 to 431 and 450 to 453 may utilize the same or different memory access technologies. By way of example and not limitation, processor memories 401 to 402 and GPU memories 420 to 423 may be volatile memories such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high-bandwidth memory (HBM), and / or may be non-volatile memories such as 3D XPoint or Nano-RAM. In one embodiment, one portion of the memory may be volatile memory, while another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).

[0081] As described below, although the various processors 405 to 406 and GPUs 410 to 413 can each be physically coupled to specific memories 401 to 402 and 420 to 423 respectively, a unified memory architecture can be implemented, in which the same virtual system address space (also known as the “effective address” space) is distributed across all the various physical memories. For example, processor memories 401 to 402 can each include 64 GB of system memory address space, and GPU memories 420 to 423 can each include 32 GB of system memory address space (resulting in a total of 256 GB of addressable memory space in the example described).
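The 256 GB figure in the example above follows directly from the stated per-memory sizes, as the short calculation below shows. The dictionary and function names are illustrative; the keys reuse the reference numerals of the memories from the text.

```python
GB = 1 << 30  # bytes per gigabyte (binary)

# Per-memory contributions to the system address space, taken from the example.
processor_memories = {401: 64 * GB, 402: 64 * GB}
gpu_memories = {420: 32 * GB, 421: 32 * GB, 422: 32 * GB, 423: 32 * GB}


def total_addressable(*pools) -> int:
    """Sum the address space contributed by every memory in every pool."""
    return sum(sum(pool.values()) for pool in pools)


total_gb = total_addressable(processor_memories, gpu_memories) // GB
# 2 x 64 GB + 4 x 32 GB = 256 GB
```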

[0082] Figure 4B shows additional details regarding the interconnection between a multi-core processor 407 and a graphics acceleration module 446 according to one embodiment. The graphics acceleration module 446 may include one or more GPU chips integrated on a line card coupled to the processor 407 via a high-speed link 440. Alternatively, the graphics acceleration module 446 may be integrated on the same package or chip as the processor 407.

[0083] The processor 407 shown includes multiple cores 460A to 460D, each having a translation lookaside buffer 461A to 461D and one or more caches 462A to 462D. These cores may include various other components for executing instructions and processing data (e.g., instruction fetch units, branch prediction units, decoders, execution units, reorder buffers, etc.), which are not shown to avoid obscuring the basic principles of the invention. Caches 462A to 462D may include Level 1 (L1) and Level 2 (L2) caches. Furthermore, one or more shared caches 426 may be included in the cache hierarchy and shared by the respective groups of cores 460A to 460D. For example, one embodiment of the processor 407 includes 24 cores, each having its own L1 cache, 12 shared L2 caches, and 12 shared L3 caches. In this embodiment, one of the L2 and L3 caches is shared by two adjacent cores. The processor 407 and graphics accelerator integration module 446 are connected to the system memory 441, which may include processor memories 401 to 402.
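One way to read the 24-core example is that adjacent core pairs map to one shared cache, giving 12 shared caches at a level. The sketch below encodes that reading; the pairing rule (cores 0 and 1 share cache 0, and so on) and the function name are assumptions, since the text does not state which cores are paired.

```python
NUM_CORES = 24
CORES_PER_SHARED_CACHE = 2  # two adjacent cores share one cache


def shared_cache_index(core_id: int) -> int:
    """Return the index of the shared cache used by a core.

    Assumption: cores are numbered so that adjacent pairs (0,1), (2,3), ...
    share a cache.
    """
    if not 0 <= core_id < NUM_CORES:
        raise ValueError(f"core id {core_id} is out of range")
    return core_id // CORES_PER_SHARED_CACHE


num_shared_caches = len({shared_cache_index(c) for c in range(NUM_CORES)})
```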

[0084] Coherency is maintained for data and instructions stored in the various caches 462A to 462D, 456 and system memory 441 via inter-core communication over a coherence bus 464. For example, each cache may have associated cache coherence logic / circuitry to communicate over the coherence bus 464 in response to a detected read or write to a particular cache line. In one implementation, a cache snooping protocol is implemented over the coherence bus 464 to snoop cache accesses. Cache snooping / coherence techniques are well understood by those skilled in the art and, to avoid obscuring the basic principles of the invention, are not described in detail here.

[0085] In one embodiment, proxy circuitry 425 communicatively couples graphics acceleration module 446 to coherence bus 464, thereby allowing graphics acceleration module 446 to participate in cache coherence protocols as a peer of the core. Specifically, interface 435 provides connectivity to proxy circuitry 425 via high-speed link 440 (e.g., PCIe bus, NVLink, etc.), and interface 437 connects graphics acceleration module 446 to link 440.

[0086] In one implementation, the accelerator integrated circuit 436 provides cache management, memory access, context management, and interrupt management services on behalf of multiple graphics processing engines 431, 432, and 43N of the graphics acceleration module 446. The graphics processing engines 431, 432, and 43N may each include a separate graphics processing unit (GPU). Alternatively, the graphics processing engines 431, 432, and 43N may include different types of graphics processing engines within the GPU, such as graphics execution units, media processing engines (e.g., video encoders / decoders), samplers, and block image transfer engines. In other words, the graphics acceleration module may be a GPU with multiple graphics processing engines 431, 432, and 43N, or the graphics processing engines 431 to 432, 43N may be separate GPUs integrated into a common package, line card, or chip.

[0087] In one embodiment, the accelerator integrated circuit 436 includes a memory management unit (MMU) 439 for performing various memory management functions such as virtual-to-physical memory translation (also known as effective-to-real memory translation) and memory access protocols for accessing system memory 441. The MMU 439 may also include a translation lookaside buffer (TLB) (not shown) for caching virtual / effective-to-physical / real address translations. In one implementation, cache 438 stores commands and data for efficient access by graphics processing engines 431 to 432, 43N. In one embodiment, the data stored in cache 438 and graphics memories 433 to 434, 43N is kept coherent with core caches 462A to 462D, 456 and system memory 411. As mentioned, this can be accomplished via proxy circuitry 425, which participates in the cache coherency mechanism on behalf of cache 438 and memories 433 to 434, 43N (e.g., sending updates to cache 438 related to modifications / accesses to cache lines on processor caches 462A to 462D, 456 and receiving updates from cache 438).

[0088] A set of registers 445 stores context data for threads executed by graphics processing engines 431 to 432, 43N, and context management circuitry 448 manages the thread contexts. For example, context management circuitry 448 can perform save and restore operations to save and restore the contexts of various threads during context switches (e.g., where a first thread is saved and a second thread is restored so that the second thread can be executed by the graphics processing engine). For example, on a context switch, context management circuitry 448 can store the current register values to a designated region in memory (e.g., identified by a context pointer). The context management circuitry can then restore the register values upon returning to the context. In one embodiment, interrupt management circuitry 447 receives and processes interrupts received from system devices.
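The save and restore behavior described above can be sketched as follows. The class `ContextManager`, the dictionary standing in for memory, and the integer context pointers are hypothetical simplifications; they are not the circuitry of the embodiments.

```python
class ContextManager:
    """Toy model of register save/restore keyed by a context pointer."""

    def __init__(self, num_registers: int):
        self.registers = [0] * num_registers
        # Stand-in for memory: context pointer -> saved register values.
        self.memory = {}

    def save(self, context_pointer: int) -> None:
        # Copy the current register values to the designated memory region.
        self.memory[context_pointer] = list(self.registers)

    def restore(self, context_pointer: int) -> None:
        # Reload register values previously saved for this context.
        self.registers = list(self.memory[context_pointer])

    def switch(self, save_to: int, restore_from: int) -> None:
        """Save the running thread's context, then restore another thread's."""
        self.save(save_to)
        self.restore(restore_from)
```

A context switch is then a save of the outgoing thread followed by a restore of the incoming one, for example `cm.switch(save_to=0x2000, restore_from=0x1000)`.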

[0089] In one implementation, the MMU 439 translates virtual / effective addresses from the graphics processing engine 431 into physical / real addresses in system memory 411. One embodiment of the accelerator integrated circuit 436 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 446 and / or other accelerator devices. The graphics accelerator module 446 may be dedicated to a single application executing on processor 407, or it may be shared among multiple applications. In one embodiment, a virtualized graphics execution environment is presented, in which the resources of graphics processing engines 431 to 432, 43N are shared with multiple applications or virtual machines (VMs). Resources may be subdivided into “slices” allocated to different VMs and / or applications based on the processing requirements and priorities associated with the VMs and / or applications.

[0090] Therefore, the accelerator integrated circuit acts as a bridge to the system for the graphics acceleration module 446, and provides address translation and system memory caching services. Furthermore, the accelerator integrated circuit 436 can provide virtualization facilities for the host processor to manage virtualization of the graphics processing engines, interrupts, and memory management.

[0091] Because the hardware resources of the graphics processing engines 431 to 432, 43N are explicitly mapped to the real address space seen by the host processor 407, any host processor can directly address these resources using effective address values. In one embodiment, one function of the accelerator integrated circuit 436 is the physical separation of the graphics processing engines 431 to 432, 43N, so that they appear to the system as independent units.

[0092] As mentioned, in the illustrated embodiment, one or more graphics memories 433 to 434, 43M are coupled to each of the graphics processing engines 431 to 432, 43N, respectively. Graphics memories 433 to 434, 43M store instructions and data being processed by each of the graphics processing engines 431 to 432, 43N. Graphics memories 433 to 434, 43M can be volatile memories such as DRAM (including stacked DRAM), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and / or can be non-volatile memories such as 3D XPoint or Nano-RAM.

[0093] In one embodiment, to reduce data traffic on link 440, a biasing technique is used to ensure that the data stored in graphics memories 433 to 434, 43M is the data most frequently used by graphics processing engines 431 to 432, 43N, and preferably not used (or at least infrequently used) by cores 460A to 460D. Similarly, the biasing mechanism attempts to keep the data required by the cores (and preferably not graphics processing engines 431 to 432, 43N) within the caches 462A to 462D, 456 of the cores and system memory 411.

[0094] Figure 4C shows another embodiment in which the accelerator integrated circuit 436 is integrated within the processor 407. In this embodiment, the graphics processing engines 431 to 432, 43N communicate directly with the accelerator integrated circuit 436 over high-speed link 440 via interface 437 and interface 435 (which, again, can utilize any form of bus or interface protocol). The accelerator integrated circuit 436 can perform the same operations as those described with respect to Figure 4B, but potentially at a higher throughput given its close proximity to the coherence bus 462 and caches 462A to 462D, 426.

[0095] One embodiment supports different programming models, including a dedicated process programming model (without graphics acceleration module virtualization) and a shared programming model (with virtualization). The shared programming model may include a programming model controlled by accelerator integrated circuit 436 and a programming model controlled by graphics acceleration module 446.

[0096] In one embodiment of the dedicated process model, graphics processing engines 431 to 432, 43N are dedicated to a single application or process within a single operating system. A single application can centralize requests from other applications to graphics engines 431 to 432, 43N, thereby providing virtualization within a VM / partition.

[0097] In a dedicated process programming model, graphics processing engines 431 to 432, 43N can be shared by multiple VM / application partitions. This shared model requires a hypervisor to virtualize the graphics processing engines 431 to 432, 43N, allowing access by each operating system. For single-partition systems without a hypervisor, the graphics processing engines 431 to 432, 43N are owned by the operating system. In both cases, the operating system can virtualize the graphics processing engines 431 to 432, 43N to provide access to each process or application.

[0098] For a shared programming model, the graphics acceleration module 446 or the individual graphics processing engines 431 to 432, 43N use a process handle to select process elements. In one embodiment, the process elements are stored in system memory 411 and can be addressed using the effective address to physical address translation techniques described herein. The process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engines 431 to 432, 43N (i.e., invoking system software to add a process element to the process element list). The lower 16 bits of the process handle may be the offset of the process element within the process element list.
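Extracting the process element offset from the lower 16 bits of a process handle is a simple mask operation. In the sketch below, the function name and the example handle value are hypothetical; only the use of the lower 16 bits comes from the text.

```python
OFFSET_MASK = 0xFFFF  # lower 16 bits of the process handle


def process_element_offset(process_handle: int) -> int:
    """Return the offset of the process element within the process element list."""
    return process_handle & OFFSET_MASK


# Example with a made-up handle: the upper bits are implementation-specific.
offset = process_element_offset(0xABCD1234)
```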

[0099] Figure 4D shows an exemplary accelerator integration slice 490. As used herein, a “slice” refers to a designated portion of the processing resources of the accelerator integrated circuit 436. The application effective address space 482 within system memory 411 stores process elements 483. In one embodiment, process element 483 is stored in response to a GPU call 481 from an application 480 executing on processor 407. Process element 483 contains the process state of the corresponding application 480. The work descriptor (WD) 484 contained in process element 483 may be a single job requested by the application, or it may contain a pointer to a job queue. In the latter case, WD 484 is a pointer to a job request queue in the application address space 482.

[0100] The graphics acceleration module 446 and / or individual graphics processing engines 431 to 432, 43N can be shared by all or some processes in the system. Embodiments of the invention include infrastructure for establishing a processing state and sending a WD 484 to the graphics acceleration module 446 to begin work in a virtual environment.

[0101] In one implementation, the dedicated process programming model is implementation-specific. In this model, a single process owns either the graphics acceleration module 446 or a separate graphics processing engine 431. Since the graphics acceleration module 446 is owned by a single process, the hypervisor initializes the accelerator integrated circuit 436 to obtain its assigned partition, and the operating system initializes the accelerator integrated circuit 436 to obtain its assigned process when the graphics acceleration module 446 is allocated.

[0102] In operation, the WD fetch unit 491 in the accelerator integration slice 490 fetches the next WD 484, which includes an indication of the work to be performed by one of the graphics processing engines of the graphics acceleration module 446. As shown, data from the WD 484 can be stored in registers 445 and used by the MMU 439, interrupt management circuitry 447, and / or context management circuitry 448. For example, one embodiment of the MMU 439 includes segment / page walk circuitry for accessing segment / page tables 486 within the OS virtual address space 485. The interrupt management circuitry 447 can handle interrupt events 492 received from the graphics acceleration module 446. When performing graphics operations, the effective address 493 generated by the graphics processing engines 431 to 432, 43N is translated into a real address by the MMU 439.

[0103] In one embodiment, the same set of registers 445 is copied for each graphics processing engine 431 to 432, 43N and / or graphics acceleration module 446, and this set of registers can be initialized by a hypervisor or operating system. Each of these copied registers can be included in the accelerator integration slice 490. Exemplary registers that can be initialized by a hypervisor are shown in Table 1.

[0104] Table 1 - Hypervisor Initialized Registers

[0105]
1  Slice Control Register
2  Real Address (RA) Scheduled Processes Area Pointer
3  Authorization Mask Override Register
4  Interrupt Vector Table Entry Offset
5  Interrupt Vector Table Entry Limit
6  Status Register
7  Logical Partition ID
8  Real Address (RA) Hypervisor Accelerator Utilization Record Pointer
9  Storage Description Register

[0106] Table 2 shows exemplary registers that can be initialized by the operating system.

[0107] Table 2 - Operating System Initialized Registers

[0108]
1  Process and Thread Identification
2  Effective Address (EA) Context Save / Restore Pointer
3  Virtual Address (VA) Accelerator Utilization Record Pointer
4  Virtual Address (VA) Storage Segment Table Pointer
5  Authorization Mask
6  Work Descriptor

[0109] In one embodiment, each WD 484 is specific to a particular graphics acceleration module 446 and / or graphics processing engines 431 to 432, 43N. The WD contains all the information required for the graphics processing engines 431 to 432, 43N to complete their work, or the WD may be a pointer to a memory location where the application has established a queue of work commands to be completed.
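The two forms a WD can take, a self-contained job or a pointer to a queue of work commands, can be modeled as below. The `WorkDescriptor` fields, the string commands, and the dictionary standing in for application memory are all hypothetical; the actual WD format is specific to the graphics acceleration module.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WorkDescriptor:
    """Hypothetical WD: either a single command or a pointer to a command queue."""
    command: Optional[str] = None
    queue_pointer: Optional[int] = None


def pending_work(wd: WorkDescriptor, memory: Dict[int, List[str]]) -> List[str]:
    """Return the work commands a WD refers to.

    `memory` stands in for the application address space, mapping an
    address to the command queue stored there.
    """
    if wd.command is not None:
        return [wd.command]
    if wd.queue_pointer is not None:
        return list(memory[wd.queue_pointer])
    return []
```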

[0110] Figure 4E shows additional details of one embodiment of the shared model. This embodiment includes a hypervisor physical address space 498 in which a list of process elements 499 is stored. The hypervisor physical address space 498 is accessible via a hypervisor 496 that virtualizes the graphics acceleration module engines for the operating system 495.

[0111] The shared programming model allows all or some processes from all or some partitions of the system to use the graphics acceleration module 446. There are two programming models in which the graphics acceleration module 446 is shared by multiple processes and partitions: time-sliced sharing and direct graphics sharing.

[0112] In this model, the hypervisor 496 owns the graphics acceleration module 446 and makes its functionality available to all operating systems 495. For the graphics acceleration module 446 to support virtualization by the hypervisor 496, the graphics acceleration module 446 must meet the following requirements: 1) Application job requests must be autonomous (i.e., no state needs to be maintained between jobs), or the graphics acceleration module 446 must provide a context save and restore mechanism. 2) The graphics acceleration module 446 guarantees that an application job request completes within a specified amount of time, including any translation faults, or the graphics acceleration module 446 provides the ability to preempt job processing. 3) When operating in the direct shared programming model, the graphics acceleration module 446 must guarantee fairness between processes.

[0113] In one embodiment, for the shared model, application 480 is required to make an operating system 495 system call with a graphics acceleration module 446 type, a work descriptor (WD), an authorization mask register (AMR) value, and a context save / restore region pointer (CSRP). The graphics acceleration module 446 type describes the target acceleration function of the system call and can be a system-specific value. The WD is formatted specifically for the graphics acceleration module 446 and can take the form of a graphics acceleration module 446 command, an effective address pointer to a user-defined structure, an effective address pointer to a command queue, or any other data structure describing the work to be performed by the graphics acceleration module 446. In one embodiment, the AMR value is the AMR state to use for the current process. The value passed to the operating system is similar to an application setting the AMR. If the implementations of the accelerator integrated circuit 436 and the graphics acceleration module 446 do not support a User Authorization Mask Override Register (UAMOR), the operating system can apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. Before placing the AMR in process element 483, hypervisor 496 may optionally apply the current Authorization Mask Override Register (AMOR) value. In one embodiment, the CSRP is one of the registers 445 containing the effective address of a region in application address space 482 for the graphics acceleration module 446 to save and restore context state. This pointer is optional if no state needs to be saved between jobs or when a job is preempted. The context save / restore region may be pinned system memory.

[0114] Upon receiving a system call, the operating system 495 can verify that the application 480 has been registered and authorized to use the graphics acceleration module 446. The operating system 495 then uses the information shown in Table 3 to invoke the hypervisor 496.

[0115] Table 3 - Operating System Call Parameters for the Hypervisor

[0116]
1 Work descriptor (WD)
2 Authorization mask register (AMR) value (may be masked)
3 Effective address (EA) context save / restore region pointer (CSRP)
4 Process ID (PID) and optional thread ID (TID)
5 Virtual address (VA) accelerator utilization record pointer (AURP)
6 Virtual address of the storage segment table pointer (SSTP)
7 Logical interrupt service number (LISN)

[0117] Upon receiving the hypervisor call, the hypervisor 496 can verify that the operating system 495 has been registered and authorized to use the graphics acceleration module 446. The hypervisor 496 then places the process element 483 into a process element linked list corresponding to the graphics acceleration module 446 type. The process element may contain the information shown in Table 4.

[0118] Table 4 - Process Element Information

[0119]
1 Work descriptor (WD)
2 Authorization mask register (AMR) value (may be masked)
3 Effective address (EA) context save / restore region pointer (CSRP)
4 Process ID (PID) and optional thread ID (TID)
5 Virtual address (VA) accelerator utilization record pointer (AURP)
6 Virtual address of the storage segment table pointer (SSTP)
7 Logical interrupt service number (LISN)
8 Interrupt vector table, derived from the hypervisor call parameters
9 Status register (SR) value
10 Logical partition ID (LPID)
11 Real address (RA) hypervisor accelerator utilization record pointer
12 Storage descriptor register (SDR)
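As a rough illustration of the bookkeeping described in paragraphs [0117] to [0119], the following Python sketch models a process element and the per-module-type list into which the hypervisor places it. The names `ProcessElement`, `process_lists`, and `enqueue_process_element`, the field types, and the use of a deque in place of a linked list are all assumptions made for illustration; the description does not specify a software representation, and only a subset of the Table 4 fields is shown.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Optional


@dataclass
class ProcessElement:
    # Hypothetical subset of the Table 4 fields; names and types are assumed.
    wd: int                    # work descriptor (WD)
    amr: int                   # authorization mask register value (may be masked)
    csrp: Optional[int]        # context save / restore region pointer (optional)
    pid: int                   # process ID
    tid: Optional[int] = None  # optional thread ID
    lpid: int = 0              # logical partition ID


# One list of process elements per graphics acceleration module type.
process_lists: Dict[str, Deque[ProcessElement]] = {}


def enqueue_process_element(module_type: str, element: ProcessElement,
                            registered_os: bool) -> bool:
    """Append the element to the list for module_type if the OS is authorized."""
    if not registered_os:
        return False  # calls from an unregistered operating system are rejected
    process_lists.setdefault(module_type, deque()).append(element)
    return True
```

A real implementation would keep these structures in hypervisor-owned memory and link them in the format the accelerator hardware expects; the sketch only shows the verify-then-enqueue order of operations.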

[0120] In one embodiment, the hypervisor initializes a plurality of registers 445 of the accelerator integration slice 490.

[0121] As shown in Figure 4F, one embodiment of the invention employs a unified memory addressable via a common virtual memory address space used to access the physical processor memories 401-402 and GPU memories 420-423. In this implementation, operations performed on GPUs 410-413 utilize the same virtual / effective memory address space to access processor memories 401-402 and vice versa, thereby simplifying programmability. In one embodiment, a first portion of the virtual / effective address space is allocated to processor memory 401, a second portion to second processor memory 402, a third portion to GPU memory 420, and so on. The entire virtual / effective memory space (sometimes referred to as the effective address space) is thus distributed across each of processor memories 401-402 and GPU memories 420-423, allowing any processor or GPU to access any physical memory using a virtual address mapped to that memory.
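The partitioning of one address space across several physical memories can be pictured with a small lookup sketch. The region boundaries and sizes below are invented for illustration, and `REGIONS` and `backing_memory` are hypothetical names; the description only states that successive portions of the space are allocated to successive memories.

```python
from typing import List, Tuple

# Hypothetical layout: (start, end_exclusive, backing memory). Sizes are illustrative.
REGIONS: List[Tuple[int, int, str]] = [
    (0x0000_0000, 0x4000_0000, "processor memory 401"),
    (0x4000_0000, 0x8000_0000, "processor memory 402"),
    (0x8000_0000, 0xA000_0000, "GPU memory 420"),
    (0xA000_0000, 0xC000_0000, "GPU memory 421"),
]


def backing_memory(virtual_address: int) -> str:
    """Return the physical memory that a virtual / effective address maps to."""
    for start, end, memory in REGIONS:
        if start <= virtual_address < end:
            return memory
    raise ValueError(f"address {virtual_address:#x} is not mapped")
```

Because every processor and GPU consults the same mapping, a pointer produced on one device remains meaningful on another, which is the programmability benefit the paragraph describes.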

[0122] In one embodiment, the bias / coherence management circuitry 494A to 494E within one or more of the MMUs 439A to 439E ensures cache coherence between the caches of the host processors (e.g., 405) and the GPUs 410 to 413, and implements biasing techniques that indicate the physical memories in which certain types of data should be stored. Although several instances of bias / coherence management circuitry 494A to 494E are shown in Figure 4F, the bias / coherence circuitry can also be implemented within the MMU of one or more host processors 405 and / or within the accelerator integrated circuit 436.

[0123] One embodiment allows GPU-attached memories 420-423 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology without suffering the typical performance drawbacks associated with system-wide cache coherence. The ability to access GPU-attached memories 420-423 as system memory avoids heavy cache coherence overhead, providing a favorable operating environment for GPU offloading. This arrangement allows host processor 405 software to set operands and access computation results without the overhead of traditional I / O DMA data copies. These traditional copies involve driver calls, interrupts, and memory-mapped I / O (MMIO) accesses, all of which are inefficient compared to simple memory accesses. Furthermore, the ability to access GPU-attached memories 420-423 without cache coherence overhead can be critical to the execution time of offloading computations. For example, in scenarios with heavy streaming write memory traffic, cache coherence overhead can significantly reduce the effective write bandwidth seen by GPUs 410-413. The efficiency of operand setting, the efficiency of result access, and the efficiency of GPU computation all play a significant role in determining the effectiveness of GPU offloading.

[0124] In one implementation, the choice between GPU bias and host processor bias is driven by a bias tracker data structure. For example, a bias table can be used, which may be a page-granular structure comprising 1 or 2 bits per GPU-attached memory page (i.e., controlled at the memory page level). The bias table can be implemented within the stolen memory range of one or more GPU-attached memories 420-423, with or without a bias cache in GPUs 410-413 (e.g., caching frequently / recently used entries of the bias table). Alternatively, the entire bias table can be maintained within the GPU.

[0125] In one implementation, the bias table entry associated with each access to the GPU-attached memories 420-423 is accessed prior to the actual access to GPU memory, causing the following operations. First, local requests from the GPUs 410-413 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 420-423. Local requests from a GPU that find their page in host bias are forwarded to the processor 405 (e.g., over a high-speed link as described above). In one embodiment, requests from the processor 405 that find the requested page in host processor bias complete the request like a normal memory read. Alternatively, requests directed to a GPU-biased page can be forwarded to the GPUs 410-413. The GPU can then transition the page to host processor bias if it is not currently using the page.
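The routing rules of paragraphs [0124] and [0125] can be sketched as a page-granular table lookup followed by a four-way decision. This is a minimal software model, not the hardware mechanism: the 4 KiB page size, the host-bias default, the class and function names, and the returned strings are all assumptions chosen for illustration.

```python
from enum import Enum


class Bias(Enum):
    HOST = 0
    GPU = 1


PAGE_SIZE = 4096  # assumed page size


class BiasTable:
    """One bias entry per GPU-attached memory page (host bias assumed by default)."""

    def __init__(self, num_pages: int) -> None:
        self.entries = [Bias.HOST] * num_pages

    def lookup(self, address: int) -> Bias:
        return self.entries[address // PAGE_SIZE]

    def set_bias(self, address: int, bias: Bias) -> None:
        self.entries[address // PAGE_SIZE] = bias


def route_request(table: BiasTable, address: int, from_gpu: bool) -> str:
    """Decide where a memory request is sent, based on the page's bias state."""
    bias = table.lookup(address)
    if from_gpu:
        # GPU-biased pages go straight to GPU memory; host-biased pages go to the host.
        return "gpu_memory" if bias is Bias.GPU else "forward_to_host"
    # Host requests complete normally on host-biased pages, else are forwarded to the GPU.
    return "normal_read" if bias is Bias.HOST else "forward_to_gpu"
```

For example, after `table.set_bias(PAGE_SIZE, Bias.GPU)`, a GPU request to any address in the second page is served from GPU memory, while a host request to the same page is forwarded to the GPU.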

[0126] The page bias state can be changed through software-based mechanisms, hardware-assisted software mechanisms, or, for a finite set of cases, hardware-only mechanisms.

[0127] One mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU's device driver, which in turn sends a message to the GPU (or enqueues a command descriptor) directing it to change the bias state and, for some transitions, to perform a cache flush operation in the host. The cache flush operation is required for a transition from host processor 405 bias to GPU bias, but is not required for the reverse transition.

[0128] In one embodiment, cache coherence is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 405. To access these pages, the processor 405 can request access from the GPU 410, which may or may not grant access right away, depending on the implementation. Thus, to reduce communication between the processor 405 and the GPU 410, it is advantageous to ensure that GPU-biased pages are those required by the GPU but not by the host processor 405, and vice versa.

[0129] Graphics processing pipeline

[0130] Figure 5 illustrates a graphics processing pipeline 500 according to an embodiment. In one embodiment, a graphics processor may implement the illustrated graphics processing pipeline 500. The graphics processor may be included within a parallel processing subsystem as described herein, such as the parallel processor 200 of Figure 2, which, in one embodiment, is a variant of the parallel processor(s) 112 of Figure 1. As described herein, various parallel processing systems can implement the graphics processing pipeline 500 via one or more instances of a parallel processing unit (e.g., parallel processing unit 202 of Figure 2). For example, a shader unit (e.g., graphics multiprocessor 234 of Figure 3) can be configured to perform the functions of one or more of the vertex processing unit 504, tessellation control processing unit 508, tessellation evaluation processing unit 512, geometry processing unit 516, and fragment / pixel processing unit 524. The functions of the data assembler 502, primitive assemblers 506, 514, 518, tessellation unit 510, rasterizer 522, and raster operation unit 526 can also be performed by other processing engines within a processing cluster (e.g., processing cluster 214 of Figure 3) and a corresponding partitioning unit (e.g., partitioning units 220A to 220N of Figure 2). The graphics processing pipeline 500 can also be implemented using dedicated processing units for one or more functions. In one embodiment, one or more portions of the graphics processing pipeline 500 may be executed by parallel processing logic within a general-purpose processor (e.g., a CPU). In one embodiment, one or more portions of the graphics processing pipeline 500 may access on-chip memory (e.g., parallel processor memory 222 as shown in Figure 2) via a memory interface 528, which may be an instance of the memory interface 218 of Figure 2.

[0131] In one embodiment, the data assembler 502 is a processing unit that collects vertex data for surfaces and primitives. The data assembler 502 then outputs the vertex data, including the vertex attributes, to the vertex processing unit 504. The vertex processing unit 504 is a programmable execution unit that executes vertex shader programs, lighting and transforming vertex data as specified by the vertex shader programs. The vertex processing unit 504 reads data stored in cache, local, or system memory for use in processing the vertex data and can be programmed to transform the vertex data from an object-based coordinate representation to a world-space coordinate space or a normalized device coordinate space.

[0132] The first instance of the primitive assembler 506 receives vertex attributes from the vertex processing unit 504. The primitive assembler 506 reads the stored vertex attributes as needed and constructs graphic primitives for processing by the tessellation control processing unit 508. Graphic primitives include triangles, line segments, points, patches, etc., as supported by various graphics processing application programming interfaces (APIs).

[0133] The tessellation control processing unit 508 treats input vertices as control points for a geometric patch. These control points are transformed from an input representation of the patch (e.g., the patch's basis) into a representation suitable for surface evaluation by the tessellation evaluation processing unit 512. The tessellation control processing unit 508 can also calculate tessellation factors for the edges of the geometric patch. The tessellation factor is applied to a single edge and quantifies the view-dependent level of detail associated with the edge. The tessellation unit 510 is configured to receive the tessellation factors for the edges of the patch and subdivide the patch into multiple geometric primitives, such as lines, triangles, or quadrilaterals, which are then transmitted to the tessellation evaluation processing unit 512. The tessellation evaluation processing unit 512 operates on the parameterized coordinates of the subdivided patch to generate a surface representation and vertex attributes associated with each vertex of the geometric primitive.

[0134] A second instance of the primitive assembler 514 receives vertex attributes from the tessellation evaluation processing unit 512, reads stored vertex attributes as needed, and constructs graphic primitives for processing by the geometry processing unit 516. The geometry processing unit 516 is a programmable execution unit that executes a geometry shader program to transform the graphic primitives received from the primitive assembler 514 as specified by the geometry shader program. In one embodiment, the geometry processing unit 516 is programmed to subdivide the graphic primitives into one or more new graphic primitives and calculate parameters for rasterizing the new graphic primitives.

[0135] In some embodiments, the geometry processing unit 516 may add or delete elements in the geometry stream. The geometry processing unit 516 outputs parameters and vertices specifying new graphic primitives to the primitive assembler 518. The primitive assembler 518 receives the parameters and vertices from the geometry processing unit 516 and constructs graphic primitives for processing by the viewport scale, cull, and clip unit 520. The geometry processing unit 516 reads data stored in the parallel processor memory or system memory for use in processing the geometry data. The viewport scale, cull, and clip unit 520 performs clipping, culling, and viewport scaling, and outputs the processed graphic primitives to the rasterizer 522.

[0136] The rasterizer 522 can perform depth culling and other depth-based optimizations. The rasterizer 522 also performs scan conversion on the new graphic primitives to generate fragments and outputs those fragments and associated coverage data to the fragment / pixel processing unit 524. The fragment / pixel processing unit 524 is a programmable execution unit configured to execute fragment shader programs or pixel shader programs. The fragment / pixel processing unit 524 transforms fragments or pixels received from the rasterizer 522 as specified by the fragment or pixel shader programs. For example, the fragment / pixel processing unit 524 can be programmed to perform operations including but not limited to texture mapping, shading, blending, texture correction, and perspective correction to produce shaded fragments or pixels that are output to the raster operation unit 526. The fragment / pixel processing unit 524 can read data stored in the parallel processor memory or system memory for use when processing the fragment data. Fragment or pixel shader programs can be configured to shade at sample, pixel, tile, or other granularities, depending on the sampling rate configured for the processing unit.

[0137] The raster operation unit 526 is a processing unit that performs raster operations including but not limited to stencil, z-test, and blending, and outputs pixel data as processed graphics data to be stored in graphics memory (e.g., parallel processor memory 222 of Figure 2 and / or system memory 104 of Figure 1), to be displayed on one or more display devices 110, or for further processing by one of the one or more processors 102 or parallel processors 112. In some embodiments, the raster operation unit 526 is configured to compress z or color data that is written to memory and decompress z or color data that is read from memory.

[0138] The foregoing description and accompanying drawings should be considered illustrative rather than restrictive. Those skilled in the art will understand that various modifications and changes can be made to the embodiments described herein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

[0139] Referring to Figure 6, deep learning neural networks (DNNs) can currently be retrained at different times of the day. Training convolutional neural networks (CNNs) requires manipulating the same patterns so that a sufficient level of variability can be introduced to produce robust results. When a 3D pipeline is used, a rendering process requiring software intervention can be implemented to manipulate images. With a purely hardware-based pipeline, the training results can be examined to understand sensitivity to tone and shadow levels. Using this technique, a trained network can be used to re-infer information on the 3D pipeline.

[0140] Furthermore, in some examples, distributed training weights can be centralized, for example, by using a parameter server. The processing associated with this can be expensive, and all nodes need to receive new weights, resulting in lower scaling efficiency.

[0141] In some examples, these and other issues can be addressed by adding a hardware engine that accelerates weight updates. The hardware engine could include a fast operation, a fast weight averaging operation, or a voting operation to average weights originating from all nodes.
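For reference, the weight averaging operation that such an engine would accelerate can be written in a few lines of software. The sketch below is only a functional model under assumed conventions: each node reports its weights as a flat list of floats, and the function name `average_weights` is hypothetical. It says nothing about how the hardware engine itself is built.

```python
from typing import List


def average_weights(node_weights: List[List[float]]) -> List[float]:
    """Element-wise mean of the weight vectors reported by all nodes."""
    if not node_weights:
        raise ValueError("at least one node is required")
    length = len(node_weights[0])
    if any(len(w) != length for w in node_weights):
        raise ValueError("all nodes must report the same number of weights")
    count = len(node_weights)
    return [sum(w[i] for w in node_weights) / count for i in range(length)]
```

With two nodes reporting `[1.0, 2.0]` and `[3.0, 4.0]`, the averaged weights are `[2.0, 3.0]`; every node would then continue training from that common vector.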

[0142] In some examples, deep learning neural networks (DNNs) can be built using a bottom-up approach from building blocks referred to herein as layerlets (similar to the codelets used in high-performance libraries). The network can be identified based on library encoding. Furthermore, in some examples, complex DNNs can be built by grouping smaller, more specialized DNNs together, in a manner similar to a complex software program composed of many functions.

[0143] For example, referring to Figure 7A, if we assume that DNN 710 recognizes mammals, then smaller DNNs can be trained to recognize dogs, cats, humans, etc. Each of these DNNs can in turn be built from smaller DNNs that recognize subcategories. For example, the sub-DNN for cats could be an ensemble of sub-DNNs for lions, tigers, leopards, etc.

[0144] In bottom-up DNN construction, sub-DNNs are trained separately. Then, the parent DNN is constructed from all its sub-DNNs, and further training is performed on the entire set to fine-tune the network. Shared weights can be removed, and sub-DNNs can be merged / compressed to some extent.

[0145] This bottom-up, modular approach, built upon basic DNN building blocks, enables flexible and easy customization of DNNs. For example, if a DNN needs to be deployed in an environment where only human and dog recognition is required, the cat sub-DNN is not included in the ensemble, and the second-level training is applied only to the ensemble of dogs and humans. This customized DNN architecture reduces the size of the DNNs that need to be placed / deployed in different environments.
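The modular composition described in paragraphs [0144] and [0145] can be pictured with the following toy sketch. It is a deliberately simplified stand-in: the sub-DNNs are replaced by hypothetical scoring functions, the parent simply picks the highest-scoring sub-model, and the second-level fine-tuning step is omitted entirely. All names (`Scorer`, `build_parent`, `sub_dnns`) are assumptions for illustration.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


def build_parent(sub_models: Dict[str, Scorer]) -> Callable[[Sequence[float]], str]:
    """Combine separately trained sub-models into one parent classifier."""
    def classify(features: Sequence[float]) -> str:
        # The parent reports the label of the highest-scoring sub-model.
        return max(sub_models, key=lambda label: sub_models[label](features))
    return classify


# Stand-ins for trained sub-DNNs: each scores one feature (purely illustrative).
sub_dnns: Dict[str, Scorer] = {
    "dog": lambda f: f[0],
    "cat": lambda f: f[1],
    "human": lambda f: f[2],
}

full = build_parent(sub_dnns)
# Customized deployment: leave the cat sub-model out of the ensemble.
dog_human = build_parent({k: v for k, v in sub_dnns.items() if k != "cat"})
```

The customized `dog_human` classifier carries only two of the three sub-models, which mirrors the size reduction the paragraph attributes to deploying a trimmed ensemble.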

[0146] Furthermore, the semi-decomposition architecture of the DNN constructed in this way can prioritize different layers and / or sub-DNNs / regions of the DNN. For example, in the dog / human example, if faster human / child recognition is required, the code generation and resource allocation of the host hardware can be managed so that the output generated by the human / child sub-DNN is generated first.

[0147] Referring to Figure 7B, background segmentation in dynamic environments is extremely computationally intensive. In static environments, when the camera is stationary, subtracting two consecutive images to assess changes in the environment is relatively easy because the background remains the same. However, in dynamic environments, the camera itself is moving, and therefore the distinction between foreground and background disappears.
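The static-camera case is simple enough to show directly. The sketch below subtracts two consecutive grayscale frames and thresholds the absolute difference; the frame representation (lists of integer intensities), the function name `foreground_mask`, and the threshold value of 20 are assumptions for illustration.

```python
from typing import List

Frame = List[List[int]]  # grayscale image as rows of pixel intensities


def foreground_mask(previous: Frame, current: Frame,
                    threshold: int = 20) -> List[List[bool]]:
    """Mark pixels whose intensity changed by more than threshold between frames."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]
```

With a moving camera nearly every pixel changes between frames, so this mask would mark almost the whole image as foreground, which is why the dynamic case needs the heavier methods discussed next.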

[0148] Computer vision algorithms that attempt to estimate the motion and distance trajectories of objects in dynamic environments, such as those with moving vehicles, must rely on features detected based on assumed static objects such as trees, buildings, traffic signs, etc. This requires sophisticated methods such as classification, localization, and possibly labeling to understand these objects. Once these objects are identified, their features are used for motion estimation and ultimately for distance estimation and the trajectory of the motion. These algorithms are often used in conjunction with dense optical flow (DOF) for end-to-end environmental cognition systems, where static objects are distinguished from moving objects. Depending on the velocity of the moving object (from the DOF) and the trajectory of the motion, they are also distinguished based on their risk probability.

[0149] For comprehensive surround cognition, this operation requires significant GPU resources because feature detection and DOF are pixel-by-pixel operations and require complete and atomic operations that are not GPU-friendly. Hardware acceleration for these functions is necessary for real-time operation, especially to alleviate these GPU- and / or CPU-intensive and computationally intensive operations.

[0150] Referring to Figures 7C and 7D, in some examples the decision routine runs on two different GPUs (i.e., GPU 0 and GPU 1) to detect errors. Furthermore, in some examples, the machines can be separated, and the same computation can be run multiple times in parallel. The GPUs may include logic for comparing the results of the decision routine; if the results match, processing can continue. Conversely, if the results do not match, an error may be generated.
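The compare-and-continue logic can be modeled in software as follows. In this sketch the replicas run one after another on the CPU as a stand-in for GPU 0 and GPU 1, and the function name `run_redundantly` and the use of an exception to signal the error are assumptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_redundantly(decision: Callable[[], T], replicas: int = 2) -> T:
    """Run the same decision routine several times and compare the results."""
    results = [decision() for _ in range(replicas)]
    if any(r != results[0] for r in results[1:]):
        raise RuntimeError(f"redundant results disagree: {results}")
    return results[0]
```

Processing continues with the returned value only when all replicas agree; any disagreement surfaces as an error instead of silently propagating a corrupted result.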

[0151] In some examples, the result of the Cyclic Redundancy Check (CRC) can be stored in memory. The GPU can verify that the CRC has not changed in order to continue processing. If the CRC has indeed changed, an error may be generated.
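A software analogue of this check, using the CRC-32 routine from Python's standard `zlib` module, might look like the sketch below. The choice of CRC-32 and the helper names `store_crc` and `verify_crc` are assumptions; the description does not say which CRC polynomial or width is used.

```python
import zlib


def store_crc(data: bytes) -> int:
    """Compute the CRC-32 that would be kept in memory alongside the data."""
    return zlib.crc32(data)


def verify_crc(data: bytes, stored_crc: int) -> None:
    """Raise an error if the data no longer matches its stored CRC."""
    if zlib.crc32(data) != stored_crc:
        raise RuntimeError("CRC mismatch: data changed since the checksum was stored")
```

A caller would compute the checksum once when the data is written and call `verify_crc` before each use, continuing only if no exception is raised.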

[0152] Machine Learning Overview

[0153] Machine learning algorithms are algorithms that can learn from a set of data. Implementations of machine learning algorithms can be designed to model high-level abstractions within a dataset. For example, image recognition algorithms can be used to determine which of several categories a given input belongs to; regression algorithms can output numerical values given input; and pattern recognition algorithms can be used to generate translated text or perform text-to-speech and / or speech recognition.

[0154] One example type of machine learning algorithm is a neural network. Many types of neural networks exist; a simple type is the feedforward network. A feedforward network can be implemented as an acyclic graph, where nodes are arranged in layers. Typically, a feedforward network topology consists of an input layer and an output layer, separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating the output in the output layer. Network nodes are fully connected to nodes in adjacent layers via edges, but there are no edges between nodes within a single layer. Data received at the nodes in the input layer of the feedforward network is propagated (i.e., “feedforward”) to the nodes in the output layer via activation functions that compute the state of nodes in each consecutive layer of the network based on coefficients (“weights”) associated with each of the edges connecting these layers. Depending on the specific model represented by the algorithm being executed, the output from a neural network algorithm can take various forms.
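The forward propagation just described can be sketched in plain Python. The sigmoid activation and the per-node bias terms are assumptions added to make the example concrete (the text mentions only weights and activation functions), and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases) for one layer


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs: List[float], weights: List[List[float]],
                  biases: List[float]) -> List[float]:
    """Compute one fully connected layer: activation(weights . inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feedforward(inputs: List[float], layers: Sequence[Layer]) -> List[float]:
    """Propagate inputs through each (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations
```

Each row of `weights` holds the edge weights feeding one node, so every node in a layer depends on all nodes of the previous layer and on none in its own layer, matching the fully connected, acyclic topology described above.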

[0155] Before a machine learning algorithm can be used to model a specific problem, it is trained using a training dataset. Training a neural network involves: selecting a network topology; using a set of training data representing the problem being modeled by the network; and adjusting the weights until the network model exhibits minimum error for all instances in the training dataset. For example, during supervised learning training for a neural network, the output generated by the network in response to inputs representing instances in the training dataset is compared to the “correct” labeled output of those instances; an error signal representing the difference between the output and the labeled output is calculated; and the weights associated with the connections are adjusted to minimize the error as the error signal is backpropagated through the layers of the network. The network is considered “trained” when the error of each output generated from the instances in the training dataset is minimized.
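To make the weight adjustment loop concrete, the following sketch trains the smallest possible two-layer network, one weight per layer and no activation function, by backpropagating the output error with full-batch gradient descent. The model shape, starting weights, learning rate, and epoch count are assumptions chosen so the example stays short; real networks have many weights per layer and nonlinear activations.

```python
from typing import List, Tuple


def train(xs: List[float], ys: List[float], lr: float = 0.05,
          epochs: int = 100) -> Tuple[float, float]:
    """Fit y ~ w2 * (w1 * x) by full-batch gradient descent with backpropagation."""
    w1, w2 = 0.5, 0.5  # arbitrary starting weights
    n = len(xs)
    for _ in range(epochs):
        grad_w1 = grad_w2 = 0.0
        for x, y in zip(xs, ys):
            hidden = w1 * x            # forward pass, layer 1
            output = w2 * hidden       # forward pass, layer 2
            error = output - y         # error signal at the output
            grad_w2 += error * hidden  # gradient for the output-layer weight
            grad_w1 += error * w2 * x  # error propagated back through layer 2
        w1 -= lr * grad_w1 / n
        w2 -= lr * grad_w2 / n
    return w1, w2
```

Trained on inputs `[1.0, 2.0, 3.0]` with labels `[2.0, 4.0, 6.0]`, the product `w1 * w2` approaches 2, the slope that minimizes the error on every training instance.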

[0156] The accuracy of machine learning algorithms is greatly affected by the quality of the dataset used to train the algorithm. The training process can be computationally intensive and may require a significant amount of time on a conventional general-purpose processor. Therefore, parallel processing hardware is used to train many types of machine learning algorithms. This is particularly useful for optimizing the training of neural networks, as the computations performed when adjusting the coefficients in a neural network are naturally suited to parallel implementation. Specifically, many machine learning algorithms and software applications have been adapted to use parallel processing hardware within general-purpose graphics processing devices.

[0157] Figure 8 is a generalized diagram of a machine learning software stack 800. A machine learning application 802 can be configured to train a neural network using a training dataset or to use a trained deep neural network to implement machine intelligence. The machine learning application 802 may include training and inference functionality for a neural network and / or specialized software that can be used to train a neural network before deployment. The machine learning application 802 can implement any type of machine intelligence, including but not limited to image recognition, mapping and localization, autonomous navigation, speech synthesis, medical imaging, or language translation.

[0158] Hardware acceleration for machine learning applications 802 can be achieved via machine learning framework 804. Machine learning framework 804 provides a library of machine learning primitives. Machine learning primitives are the fundamental operations typically performed by machine learning algorithms. Without machine learning framework 804, developers of machine learning algorithms would need to create and optimize the main computational logic associated with their algorithms, and then re-optimize that computational logic when a new parallel processor is developed. Instead, machine learning applications can be configured to use primitives provided by machine learning framework 804 to perform the necessary computations. Exemplary primitives include tensor convolution, activation functions, and pooling, which are computational operations performed when training convolutional neural networks (CNNs). Machine learning framework 804 can also provide primitives for implementing basic linear algebra subroutines performed by many machine learning algorithms, such as matrix and vector operations.

[0159] The machine learning framework 804 can process input data received from the machine learning application 802 and generate the appropriate input for the computing framework 806. The computing framework 806 can abstract the low-level instructions provided to the GPGPU driver 808, enabling the machine learning framework 804 to take advantage of hardware acceleration via the GPGPU hardware 810 without requiring the machine learning framework 804 to have detailed knowledge of the architecture of the GPGPU hardware 810. Furthermore, the computing framework 806 can enable hardware acceleration for the machine learning framework 804 across various types and generations of GPGPU hardware 810.

[0160] GPGPU Machine Learning Acceleration

[0161] Figure 9 illustrates a highly parallel general-purpose graphics processing unit 900 according to an embodiment. In one embodiment, the general-purpose graphics processing unit (GPGPU) 900 can be configured to be particularly efficient at processing the type of computational workloads associated with training deep neural networks. Additionally, the GPGPU 900 can be linked directly to other instances of the GPGPU to create a multi-GPU cluster, improving training speed for particularly deep neural networks.

[0162] The GPGPU 900 includes a host interface 902 for connecting to a host processor. In one embodiment, the host interface 902 is a PCI Express interface. However, the host interface can also be a vendor-specific communication interface or communication fabric. The GPGPU 900 receives commands from the host processor and uses a global scheduler 904 to distribute the execution threads associated with those commands to a set of compute clusters 906A to 906H. The compute clusters 906A to 906H share a cache memory 908, which can serve as a higher-level cache for the cache memories within the compute clusters 906A to 906H.

[0163] The GPGPU 900 includes memories 914A to 914B, which are coupled to computing clusters 906A to 906H via a set of memory controllers 912A to 912B. In various embodiments, memories 914A to 914B may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory (e.g., synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory). In one embodiment, memory cells 224A to 224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM).

[0164] In one embodiment, each of the compute clusters 906A to 906H includes a set of graphics multiprocessors, such as the graphics multiprocessor 400 of Figure 4A. The graphics multiprocessors of the compute cluster include multiple types of integer and floating-point logic units that can perform computational operations at a range of precisions, including precisions suited to machine learning computations. For example, in one embodiment, at least a subset of the floating-point units in each of the compute clusters 906A to 906H can be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units can be configured to perform 64-bit floating-point operations.

[0165] Multiple instances of the GPGPU 900 can be configured to operate as a computing cluster. The communication mechanism used by the computing cluster for synchronization and data exchange varies across embodiments. In one embodiment, multiple instances of the GPGPU 900 communicate via the host interface 902. In one embodiment, the GPGPU 900 includes an I / O hub 908 that couples the GPGPU 900 to a GPU link 910, which enables direct connections to other instances of the GPGPU. In one embodiment, the GPU link 910 is coupled to a dedicated GPU-GPU bridge that enables communication and synchronization between multiple instances of the GPGPU 900. In one embodiment, the GPU link 910 is coupled to a high-speed interconnect for transmitting and receiving data to and from other GPGPUs or parallel processors. In one embodiment, multiple instances of the GPGPU 900 reside in separate data processing systems and communicate via a network device accessible via the host interface 902. In one embodiment, the GPU link 910 may also be configured to connect to a host processor, in addition to or as an alternative to the host interface 902.

[0166] While the illustrated configuration of the GPGPU 900 can be configured to train neural networks, one embodiment provides an alternative configuration of the GPGPU 900 that can be configured for deployment within a high-performance or low-power inference platform. In the inference configuration, the GPGPU 900 includes fewer compute clusters 906A to 906H compared to the training configuration. Additionally, the memory technology associated with memories 914A to 914B may differ between the inference and training configurations. In one embodiment, the inference configuration of the GPGPU 900 may support inference-specific instructions. For example, the inference configuration may provide support for one or more 8-bit integer dot product instructions, which are typically used during inference operations for deployed neural networks.
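The arithmetic behind an 8-bit integer dot product can be modeled as below. This is a functional sketch only: it assumes signed 8-bit operands and a wider accumulator, which is a common arrangement but is not stated in the text, and `int8_dot` is a hypothetical name rather than an instruction mnemonic.

```python
from typing import Sequence

INT8_MIN, INT8_MAX = -128, 127


def int8_dot(a: Sequence[int], b: Sequence[int]) -> int:
    """Dot product of two signed 8-bit vectors, accumulated at wider precision."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    for value in (*a, *b):
        if not INT8_MIN <= value <= INT8_MAX:
            raise ValueError(f"{value} does not fit in a signed 8-bit integer")
    return sum(x * y for x, y in zip(a, b))
```

Even four maximal operands per side yield 4 × 127 × 127 = 64516, which overflows 8 bits by a wide margin; that is why the products are accumulated at a wider precision than the inputs.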

[0167] Figure 10 illustrates a multi-GPU computing system 1000 according to an embodiment. The multi-GPU computing system 1000 may include a processor 1002 coupled to a plurality of GPGPUs 1006A to 1006D via a host interface switch 1004. In one embodiment, the host interface switch 1004 is a PCI Express switch device that couples the processor 1002 to a PCI Express bus through which the processor 1002 can communicate with the set of GPGPUs 1006A to 1006D. Each of the plurality of GPGPUs 1006A to 1006D may be an instance of the GPGPU 900 of Figure 9. The GPGPUs 1006A to 1006D can be interconnected via a set of high-speed point-to-point (P2P) GPU-GPU links 1016. The high-speed GPU-GPU links can connect to each of the GPGPUs 1006A to 1006D via a dedicated GPU link (e.g., the GPU link 910 of Figure 9). The P2P GPU links 1016 enable direct communication between each of the GPGPUs 1006A to 1006D without requiring communication over the host interface bus to which the processor 1002 is connected. With GPU-GPU traffic directed to the P2P GPU links, the host interface bus remains available for system memory access or for communication with other instances of the multi-GPU computing system 1000 (e.g., via one or more network devices). While in the illustrated embodiment the GPGPUs 1006A to 1006D are connected to the processor 1002 via the host interface switch 1004, in one embodiment the processor 1002 includes direct support for the P2P GPU links 1016 and can connect directly to the GPGPUs 1006A to 1006D.

[0168] Machine learning neural network implementation methods

[0169] The computational architectures provided by the embodiments described herein can be configured to perform these types of parallel processing, which are particularly well-suited for training and deploying neural networks for machine learning. Neural networks can be generalized as networks of functions with graph relationships. As is well known in the art, there are various types of neural network implementations used in machine learning. One exemplary type of neural network is the feedforward network as previously described.

[0170] The second exemplary type of neural network is the Convolutional Neural Network (CNN). A CNN is a specialized feedforward neural network designed for processing data with a known, grid-like topology (e.g., image data). Therefore, CNNs are commonly used in computer vision and image recognition applications, but they can also be used in other types of pattern recognition, such as speech and language processing. Nodes in the input layer of a CNN are organized as a set of “filters” (feature detectors inspired by receptive fields found in the retina), and the output of each set of filters is propagated to nodes in successive layers of the network. The computation for a CNN involves applying convolutional mathematics to each filter to produce the output of that filter. Convolution is a specialized mathematical operation performed by two functions to produce a third function, which is a modified version of one of the two original functions. In convolutional network terminology, the first function related to convolution can be referred to as the input, and the second function can be referred to as the convolution kernel. The output can be referred to as a feature map. For example, the input to a convolutional layer can be a multidimensional array of data that defines various color components of the input image. The convolution kernel can be a multidimensional array of parameters, which are adapted through a training process for the neural network.
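The relationship between input, convolution kernel, and feature map described above can be sketched in a few lines. The example below is a minimal illustration only: it assumes a single-channel input, no padding, and a stride of one, and it computes the sliding dot product without flipping the kernel (cross-correlation), which is how convolutional layers are commonly implemented in practice. The function name `conv2d_valid` and the sample values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the kernel and the local region of the input.
            acc = 0.0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Input: a 3x4 single-channel image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Kernel: responds where intensity increases from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

In a trained CNN the kernel values would not be hand-written as they are here; they are the parameters adapted during training.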

[0171] Recurrent Neural Networks (RNNs) are a family of feedforward neural networks that include feedback connections between layers. RNNs enable the modeling of sequential data by sharing parameter data across different parts of the neural network. The architecture of an RNN includes loops. These loops represent the effect of the current value of a variable on its own value at a future time, because at least a portion of the output data from the RNN is used as feedback to process subsequent inputs in the sequence. This feature makes RNNs particularly useful for language processing, because language data can be composed in highly variable ways.

[0172] The diagrams described below illustrate exemplary feedforward, CNN, and RNN networks, and describe the general process for training and deploying each of those types of networks respectively. It will be understood that these descriptions are exemplary and non-limiting with respect to any particular embodiment described herein, and that the concepts shown can generally be applied to deep neural networks and machine learning techniques.

[0173] The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning performed using deep neural networks. In contrast to shallow neural networks that contain only a single hidden layer, the deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers. Deeper neural networks are generally more computationally intensive to train. However, the additional hidden layers of the network enable multi-step pattern recognition, which results in reduced output error compared to shallow machine learning techniques.

[0174] Deep neural networks used in deep learning typically include a front-end network for performing feature recognition coupled to a back-end network representing a mathematical model, which can then perform operations (e.g., object classification, speech recognition, etc.) based on the feature representations provided to the model. Deep learning enables machine learning to be performed without requiring manual feature engineering on the model. Instead, deep neural networks can learn features based on statistical structure or correlations within the input data. The learned features can be provided to a mathematical model, which can then map the detected features to the output. The mathematical model used by the network is typically specialized for a specific task to be performed, and different models will be used to perform different tasks.

[0175] Once a neural network is structured, a learning model can be applied to it to train it to perform a specific task. The learning model describes how weights are adjusted within the model to reduce the network's output error. Backpropagation of error is a common method used to train neural networks. An input vector is presented to the network for processing. The network's output is compared to the expected output using a loss function, and an error value is calculated for each neuron in the output layer. These error values are then backpropagated until each neuron has an associated error value that roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm (e.g., stochastic gradient descent) to update the neural network's weights.
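The forward pass, loss computation, backpropagation of error, and weight update described above can be sketched for a deliberately tiny network. The example below assumes a network with one hidden unit and one output unit, y = w2 * tanh(w1 * x), a squared-error loss, and a fixed learning rate; these choices, and the name `train_step`, are illustrative assumptions rather than the method of any particular embodiment.

```python
import math


def train_step(w1, w2, x, target, lr=0.1):
    """One forward pass, one backward pass, and one gradient-descent update."""
    # Forward pass.
    h = math.tanh(w1 * x)          # hidden activation
    y = w2 * h                     # network output
    loss = 0.5 * (y - target) ** 2

    # Backward pass: propagate the output error toward the input.
    d_y = y - target               # dLoss/dy
    d_w2 = d_y * h                 # dLoss/dw2
    d_h = d_y * w2                 # error attributed to the hidden unit
    d_w1 = d_h * (1.0 - h * h) * x # chain rule through tanh

    # Gradient-descent update of both weights.
    return w1 - lr * d_w1, w2 - lr * d_w2, loss


w1, w2 = 0.5, -0.3                 # arbitrary initial weights
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, target=0.8)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.2e}")
```

Because a single training example is used, each update here is a plain gradient-descent step; stochastic gradient descent would apply the same update to randomly sampled examples or mini-batches.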

[0176] Figures 11A-11B illustrate an exemplary convolutional neural network. Figure 11A shows the individual layers within a CNN. As shown in Figure 11A, an exemplary CNN used to model image processing can receive input 1102, which describes the red, green, and blue (RGB) components of an input image. Input 1102 can be processed by multiple convolutional layers (e.g., convolutional layer 1104, convolutional layer 1106). Optionally, the output from the multiple convolutional layers can be processed by a set of fully connected layers 1108. Neurons in a fully connected layer have full connections to all activations in the previous layer, as previously described for feedforward networks. The output from the fully connected layers 1108 can be used to generate an output from the network. The activations within the fully connected layers 1108 can be computed using matrix multiplication instead of convolution. Not all CNN implementations use fully connected layers 1108. For example, in some implementations, convolutional layer 1106 can generate the CNN output.

[0177] Convolutional layers are sparsely connected, unlike the traditional neural network configuration found in fully connected layers 1108. Traditional neural network layers are fully connected, such that each output unit interacts with each input unit. However, convolutional layers are sparsely connected because the output of the convolution of the receptive field (rather than the corresponding state value of each node in the receptive field) is fed to nodes in subsequent layers, as illustrated. The kernels associated with the convolutional layers perform convolution operations, the output of which is sent to the next layer. Dimensionality reduction performed within convolutional layers is one aspect that enables CNNs to scale to handle large images.

[0178] Figure 11B illustrates exemplary computation stages within a convolutional layer of a CNN. The input 1112 to the convolutional layer of the CNN can be processed in three stages of convolutional layer 1114. These three stages may include a convolution stage 1116, a detector stage 1118, and a pooling stage 1120. Convolutional layer 1114 can then output the data to successive convolutional layers. The last convolutional layer of the network can generate output feature map data or provide input to fully connected layers, for example, to generate classification values for the input to the CNN.

[0179] Several convolutions are performed in parallel in convolution stage 1116 to produce a set of linear activations. Convolution stage 1116 may include affine transformations, which are any transformations that can be specified as a linear transformation plus a translation. Affine transformations include rotation, translation, scaling, and combinations of these transformations. The convolution stage computes the outputs of functions (e.g., neurons) that are connected to specific regions in the input, each of which can be determined as the local region associated with the neuron. Each neuron computes the dot product between its weights and the region of the local input to which it is connected. The output from convolution stage 1116 defines a set of linear activations that are processed by the successive stages of convolutional layer 1114.

[0180] The linear activations can be processed by detector stage 1118. In detector stage 1118, each linear activation is processed by a nonlinear activation function. The nonlinear activation function increases the nonlinear properties of the overall network without affecting the receptive fields of the convolutional layer. Several types of nonlinear activation functions can be used. One specific type is the Rectified Linear Unit (ReLU), which uses an activation function defined as f(x) = max(0, x), such that the activation is thresholded at zero.

[0181] Pooling stage 1120 uses a pooling function that replaces the output of convolutional layer 1106 with a summary statistic of the nearby outputs. The pooling function can be used to introduce translation invariance into the neural network, such that slight translations to the input do not change the pooled outputs. Invariance to local translation can be useful when the presence of a feature in the input data is more important than the precise location of the feature. Various types of pooling functions can be used during pooling stage 1120, including max pooling, average pooling, and L2-norm pooling. Additionally, some CNN implementations do not include a pooling stage. Instead, such implementations substitute an additional convolution stage with a larger stride relative to the previous convolution stages.
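The detector and pooling stages can be illustrated with a short sketch. The example below assumes the linear activations are already available as a small two-dimensional list, applies ReLU element by element, and then applies non-overlapping 2x2 max pooling. The function names and sample values are hypothetical, and the invariance shown holds only for shifts that stay inside one pooling window.

```python
def relu(activations):
    """Detector stage: threshold each linear activation at zero."""
    return [[max(0.0, v) for v in row] for row in activations]


def max_pool(feature_map, size=2):
    """Pooling stage: non-overlapping size x size max pooling (stride equals size)."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Linear activations as they might leave a convolution stage.
activations = [
    [-1.0, 3.0, 0.0, 0.0],
    [0.5, -4.0, 0.0, 1.0],
]
print(max_pool(relu(activations)))

# The same feature shifted by one position within a pooling window.
peak = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
peak_shifted = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(max_pool(relu(peak)) == max_pool(relu(peak_shifted)))
```

Replacing `max(window)` with the mean of the window would give average pooling, one of the other pooling functions named above.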

[0182] The output from convolutional layer 1114 can then be processed by the next layer 1122. The next layer 1122 can be either an additional convolutional layer or one of the fully connected layers 1108. For example, the first convolutional layer 1104 of Figure 11A can output to the second convolutional layer 1106, and the second convolutional layer can output to a first layer of the fully connected layers 1108.

[0183] Figure 12 illustrates an exemplary recurrent neural network 1200. In a recurrent neural network (RNN), the previous state of the network influences the output of the current state. RNNs can be constructed in a variety of ways using a wide variety of functions. The use of RNNs typically revolves around using mathematical models to predict the future based on previous input sequences. For example, an RNN can be used to perform statistical language modeling to predict an upcoming word given a previous sequence of words. The illustrated RNN 1200 can be described as having the following components: an input layer 1202 that receives an input vector; a hidden layer 1204 for implementing the recurrent function; a feedback mechanism 1205 for implementing a 'memory' of previous states; and an output layer 1206 for outputting the result. The RNN 1200 operates based on time steps. The state of the RNN at a given time step is influenced by the feedback mechanism 1205 based on previous time steps. For a given time step, the state of the hidden layer 1204 is defined by the previous state and the input at the current time step. The initial input (x_1) at the first time step can be processed by the hidden layer 1204. The second input (x_2) can be processed by hidden layer 1204 using the state information determined during the processing of the initial input (x_1). A given state can be computed as s_t = f(U x_t + W s_{t-1}), where U and W are parameter matrices. The function f is typically nonlinear, such as the hyperbolic tangent function (tanh) or a variant of the rectifier function f(x) = max(0, x). However, the specific mathematical function used in hidden layer 1204 can vary depending on the specific implementation details of the RNN 1200.
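The hidden-state recurrence can be sketched directly from the formula s_t = f(U x_t + W s_{t-1}). The example below assumes f is tanh, a two-element input vector, a two-element state, and an initial state of zero; the matrix values and helper names (`matvec`, `rnn_step`) are arbitrary choices for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(U, W, x_t, s_prev):
    """Compute s_t = tanh(U x_t + W s_{t-1}) for one time step."""
    input_term = matvec(U, x_t)
    feedback_term = matvec(W, s_prev)   # the 'memory' of the previous state
    return [math.tanh(a + b) for a, b in zip(input_term, feedback_term)]


U = [[0.5, -0.2], [0.1, 0.3]]           # input-to-hidden parameters (assumed values)
W = [[0.4, 0.0], [0.0, 0.4]]            # hidden-to-hidden parameters (assumed values)
inputs = [[1.0, 0.0], [0.0, 1.0]]       # x_1, x_2

state = [0.0, 0.0]                      # initial state before x_1
states = []
for x_t in inputs:
    state = rnn_step(U, W, x_t, state)
    states.append(state)

for t, s in enumerate(states, start=1):
    print(f"s_{t} = {s}")
```

The second state depends on the first through the `W` term, which is what lets the network carry information from earlier inputs in the sequence.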

[0184] In addition to the basic CNN and RNN networks described, variations of those networks can be implemented. An example RNN variant is the Long Short-Term Memory (LSTM) RNN. LSTM RNNs are capable of learning long-term dependencies necessary for processing longer language sequences. A CNN variant is the Convolutional Deep Belief Network, which has a similar structure to a CNN and is trained in a similar manner to a deep belief network. A deep belief network (DBN) is a generative neural network consisting of multiple layers of stochastic (random) variables. A DBN can be trained layer by layer using greedy unsupervised learning. The learned weights of the DBN can then be used to provide a pre-trained neural network by determining an optimal set of initial weights for the neural network.

[0185] Figure 13 illustrates the training and deployment of a deep neural network. Once a given network has been structured for a task, it is trained using the training dataset 1302. Various training frameworks have been developed to enable hardware acceleration of the training process. For example, the machine learning framework 804 of Figure 8 can be configured as a training framework 1304. The training framework 1304 can hook into an untrained neural network 1306 and enable the untrained neural network to be trained using the parallel processing resources described herein to generate a trained neural network 1308.

[0186] To begin the training process, initial weights can be selected randomly or by pre-training using a deep belief network. Training loops are then performed in a supervised or unsupervised manner.

[0187] Supervised learning is a learning method in which training is performed as a mediated operation, such as when the training dataset 1302 includes inputs paired with the expected outputs for those inputs, or when the training dataset includes inputs with known outputs and the outputs of the neural network are manually graded. The network processes the inputs and compares the resulting outputs with a set of expected or desired outputs. The error is then backpropagated through the system. The training framework 1304 can adjust the weights that control the untrained neural network 1306. The training framework 1304 can provide tools for monitoring how well the untrained neural network 1306 is converging toward a model suitable for generating correct answers based on known input data. The training process occurs repeatedly as the network weights are adjusted to refine the outputs generated by the neural network. The training process can continue until the neural network reaches a statistically desired accuracy associated with the trained neural network 1308. The trained neural network 1308 can then be deployed to implement any number of machine learning operations.
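The supervised cycle of processing inputs, comparing against expected outputs, adjusting weights, and monitoring accuracy until a target is reached can be sketched as follows. For brevity the example assumes a single-layer perceptron trained with the perceptron update rule rather than backpropagation, and a four-row dataset for the logical AND function; the names `train_supervised` and `target_accuracy` are hypothetical.

```python
def predict(weights, bias, x):
    """Threshold unit: outputs 1 when the weighted sum is positive."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_supervised(dataset, target_accuracy=1.0, lr=0.1, max_epochs=100):
    """Repeat weight adjustment until the monitored accuracy reaches the target."""
    weights, bias = [0.0, 0.0], 0.0
    history = []                                   # accuracy after each epoch
    for _ in range(max_epochs):
        for x, expected in dataset:
            error = expected - predict(weights, bias, x)   # compare with expected output
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
        correct = sum(predict(weights, bias, x) == y for x, y in dataset)
        accuracy = correct / len(dataset)
        history.append(accuracy)                   # monitoring hook
        if accuracy >= target_accuracy:
            break
    return weights, bias, history


# Inputs paired with expected outputs (logical AND).
dataset = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias, history = train_supervised(dataset)
print("accuracy per epoch:", history)
```

The `history` list plays the role of the monitoring tools mentioned above: it records how the accuracy evolves so that training can stop once the desired level is reached.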

[0188] Unsupervised learning is a learning method in which a network attempts to train itself using unlabeled data. Therefore, for unsupervised learning, the training dataset 1302 will include input data without any associated output data. The untrained neural network 1306 can learn groupings within the unlabeled inputs and can determine how individual inputs relate to the overall dataset. Unsupervised training can be used to generate self-organizing maps, which are a type of trained neural network 1307 capable of performing operations useful in data dimensionality reduction. Unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in the input dataset that deviate from the normal pattern of the data.

[0189] Variations in supervised and unsupervised training can also be employed. Semi-supervised learning is a technique in which the training dataset 1302 comprises a mixture of labeled and unlabeled data with the same distribution. Incremental learning is a variant of supervised learning in which input data is continuously used for further training of the model. Incremental learning enables the trained neural network 1308 to adapt to new data 1312 without forgetting the knowledge embedded within the network during the initial training.

[0190] Whether supervised or unsupervised, training very deep neural networks can be too computationally intensive for a single computing node. A distributed network of computing nodes can be used instead of a single node to accelerate the training process.

[0191] Figure 14 is a block diagram illustrating distributed learning. Distributed learning is a training model that uses multiple distributed computing nodes to perform supervised or unsupervised training of a neural network. Each of these distributed computing nodes may include one or more host processors and one or more general-purpose processing nodes, such as the highly parallel general-purpose graphics processing unit 900 of Figure 9. As shown, distributed learning can be performed with model parallelism 1402, data parallelism 1404, or a combination of model and data parallelism 1404.

[0192] In model parallelism 1402, different computing nodes in a distributed system can perform training computations on different parts of a single network. For example, each layer of a neural network can be trained by a different processing node in the distributed system. Benefits of model parallelism include the ability to scale to particularly large models. Splitting the computations associated with different layers of a neural network makes it possible to train very large neural networks in which the weights of all layers would not fit into the memory of a single computing node. In some instances, model parallelism can be particularly useful in performing unsupervised training of large neural networks.
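The layer-per-node split can be sketched by giving each simulated node the weights of one layer only and passing activations from node to node. The example below runs in a single process, shows only the forward pass, and uses a hypothetical `Node` class; in a real distributed system each hop would be a transfer over a network or GPU link, and gradients would travel in the reverse direction during training.

```python
class Node:
    """Hypothetical compute node that stores the weights of a single layer."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights          # one row of weights per output unit

    def forward(self, activations):
        linear = [sum(w * a for w, a in zip(row, activations))
                  for row in self.weights]
        return [max(0.0, v) for v in linear]   # ReLU


def forward(pipeline, x):
    """Run the input through each node in turn, recording what each node produced."""
    trace = []
    for node in pipeline:
        x = node.forward(x)             # stands in for a node-to-node transfer
        trace.append((node.name, x))
    return x, trace


# Two layers of one network, each held by a different node.
pipeline = [
    Node("node0", [[1.0, -1.0], [0.5, 0.5]]),
    Node("node1", [[2.0, 1.0]]),
]

output, trace = forward(pipeline, [3.0, 1.0])
for name, activations in trace:
    print(name, activations)
```

No node ever holds more than its own layer's weights, which is the property that allows the combined model to exceed the memory of any single node.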

[0193] In data parallelism 1404, different nodes in a distributed network have complete instances of the model, and each node receives a different portion of the data. The results from the different nodes are then combined. While different approaches to data parallelism are possible, all data-parallel training methods require a technique for combining the results and synchronizing the model parameters between the nodes. Exemplary methods for combining data include parameter averaging and update-based data parallelism. Parameter averaging trains each node on a subset of the training data and sets the global parameters (e.g., weights, biases) to the average of the parameters from each node. Parameter averaging uses a central parameter server that maintains the parameter data. Update-based data parallelism is similar to parameter averaging, except that the updates to the model, rather than the parameters themselves, are transferred from the nodes to the parameter server. Additionally, update-based data parallelism can be performed in a decentralized manner, where the updates are compressed and transferred between nodes.
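The two combining methods can be contrasted with a short sketch. The example below assumes a one-parameter model y = w * x, two simulated nodes that each train on their own data shard with plain SGD, and a single synchronization round; the function names are hypothetical, and no compression or network transfer is modeled.

```python
def local_sgd(w, shard, lr=0.1, epochs=20):
    """Train a full copy of the model (a single weight) on one node's data shard."""
    for _ in range(epochs):
        for x, y in shard:
            grad = (w * x - y) * x      # gradient of 0.5 * (w*x - y)^2
            w -= lr * grad
    return w


def parameter_averaging(global_w, shards):
    """Each node returns its parameters; the server averages them."""
    local_params = [local_sgd(global_w, shard) for shard in shards]
    return sum(local_params) / len(local_params)


def update_based(global_w, shards):
    """Each node returns only its update (delta); the server applies the mean update."""
    updates = [local_sgd(global_w, shard) - global_w for shard in shards]
    return global_w + sum(updates) / len(updates)


# Two shards drawn from the same relationship y = 2x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (0.5, 1.0)],
]

w_avg = parameter_averaging(0.0, shards)
w_upd = update_based(0.0, shards)
print(f"parameter averaging: {w_avg:.6f}, update-based: {w_upd:.6f}")
```

With equal weighting and a single round, the two methods arrive at the same global weight; the practical difference lies in what is transferred, since updates can be compressed or exchanged between nodes without a central server.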

[0194] For example, combined model and data parallelism can be implemented in a distributed system, where each computing node includes multiple GPUs. Each node can have a complete instance of the model, with individual GPUs within each node used to train different parts of the model.

[0195] Distributed training incurs increased overhead compared to training on a single machine. However, the parallel processors and GPGPUs described herein can each implement techniques to reduce the overhead of distributed training, including techniques for enabling high-bandwidth GPU-to-GPU data transfer and accelerating remote data synchronization.

[0196] Exemplary machine learning applications

[0197] Machine learning can be applied to solve a wide range of technical problems, including but not limited to computer vision, autonomous driving and navigation, speech recognition, and language processing. Computer vision has traditionally been one of the most active research areas for machine learning applications. Applications of computer vision range from reproducing human visual abilities (e.g., recognizing faces) to creating new categories of visual abilities. For example, a computer vision application can be configured to identify sound waves from vibrations induced in objects visible in a video. Parallel processor-accelerated machine learning enables the training of computer vision applications using training datasets significantly larger than previously feasible ones, and allows the deployment of inference systems using low-power parallel processors.

[0198] Parallel processor-accelerated machine learning has applications in autonomous driving, including lane and road sign recognition, obstacle avoidance, navigation, and driving control. Accelerated machine learning techniques can be used to train driving models based on datasets that define appropriate responses to specific training inputs. The parallel processors described herein enable the rapid training of increasingly sophisticated neural networks for autonomous driving solutions and allow the deployment of low-power inference processors in mobile platforms suitable for integration into autonomous vehicles.

[0199] Parallel processor-accelerated deep neural networks have been implemented as machine learning methods for Automatic Speech Recognition (ASR). ASR involves creating functions that compute the most probable language sequence given an input speech sequence. Accelerated machine learning using deep neural networks has replaced previous Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) used for ASR.

[0200] Parallel processor-accelerated machine learning can also be used to accelerate natural language processing. Automated learning programs can use statistical inference algorithms to generate models that are robust to errors or unfamiliar inputs. Exemplary natural language processor applications include automated machine translation between human languages.

[0201] Parallel processing platforms for machine learning can be divided into training platforms and deployment platforms. Training platforms are typically highly parallel and include optimizations to accelerate multi-GPU single-node training and multi-node multi-GPU training. Exemplary parallel processors suitable for training include the highly parallel general-purpose graphics processing unit 900 of Figure 9 and the multi-GPU computing system 1000 of Figure 10. In contrast, deployed machine learning platforms typically include low-power parallel processors suitable for use in products such as cameras, autonomous robots, and autonomous vehicles.

[0202] Figure 15 illustrates an exemplary inference System-on-Chip (SOC) 1500 suitable for performing inference using a trained model. The SOC 1500 may integrate multiple processing units, including a media processor 1502, a vision processor 1504, a GPGPU 1506, and a multi-core processor 1508. The SOC 1500 may additionally include on-chip memory 1505, which can implement a shared on-chip data pool accessible by each of the processing units. The processing units may be optimized for low-power operation to enable deployment to a wide variety of machine learning platforms, including autonomous vehicles and autonomous robots. For example, one implementation of the SOC 1500 can be used as part of a main control system for an autonomous vehicle. When the SOC 1500 is configured for use in an autonomous vehicle, the SOC is designed and configured to comply with the relevant functional safety standards of the deployment jurisdiction.

[0203] During operation, the media processor 1502 and the vision processor 1504 can work in concert to accelerate computer vision operations. The media processor 1502 enables low-latency decoding of multiple high-resolution (e.g., 4K, 8K) video streams. The decoded video streams can be written to a buffer in on-chip memory 1505. The vision processor 1504 can then parse the decoded video and perform preliminary processing operations on the frames of the decoded video to prepare them for processing using a trained image recognition model. For example, the vision processor 1504 can accelerate the convolution operations of a CNN used to perform image recognition on high-resolution video data, while back-end model computation is performed by the GPGPU 1506.

[0204] The multi-core processor 1508 may include control logic to assist with the sequencing and synchronization of data transfers and shared memory operations performed by the media processor 1502 and the vision processor 1504. The multi-core processor 1508 may also act as an application processor for executing software applications that can utilize the inference computing capabilities of the GPGPU 1506. For example, at least a portion of navigation and driving logic may be implemented in software executing on the multi-core processor 1508. Such software may issue computational workloads directly to the GPGPU 1506, or the computational workloads may be issued to the multi-core processor 1508, which may offload at least a portion of those operations to the GPGPU 1506.

[0205] The GPGPU 1506 may include compute clusters, such as low-power configurations of compute clusters 906A to 906H within the highly parallel general-purpose graphics processing unit 900. The compute clusters within the GPGPU 1506 can support instructions that are specifically optimized for performing inference computations on trained neural networks. For example, the GPGPU 1506 can support instructions for performing low-precision computations (e.g., 8-bit and 4-bit integer vector operations).

[0206] Additional exemplary graphics processing system

[0207] Details of the embodiments described above may be included in the graphics processing systems and apparatuses described below. The graphics processing systems and apparatuses of Figures 16 to 29 illustrate alternative systems and graphics processing hardware that can implement any and all of the techniques described above.

[0208] Additional Exemplary Graphics Processing System Overview

[0209] Figure 16 is a block diagram of a processing system 1600 according to an embodiment. In various embodiments, system 1600 includes one or more processors 1602 and one or more graphics processors 1608, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 1602 or processor cores 1607. In one embodiment, system 1600 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile devices, handheld devices, or embedded devices.

[0210] An embodiment of system 1600 may include, or be incorporated within, a server-based gaming platform, a game console (including a game and media console), a mobile gaming console, a handheld game console, or an online game console. In some embodiments, system 1600 is a mobile phone, smartphone, tablet computing device, or mobile internet device. Data processing system 1600 may also include, be coupled to, or be integrated within a wearable device, such as a smartwatch, smart glasses, an augmented reality device, or a virtual reality device. In some embodiments, data processing system 1600 is a television or set-top box device having one or more processors 1602 and a graphical interface generated by one or more graphics processors 1608.

[0211] In some embodiments, one or more processors 1602 each include one or more processor cores 1607 for processing instructions that, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 1607 is configured to process a particular instruction set 1609. In some embodiments, the instruction set 1609 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). Multiple processor cores 1607 may each process a different instruction set 1609, which may include instructions for facilitating emulation of other instruction sets. Processor cores 1607 may also include other processing devices, such as digital signal processors (DSPs).

[0212] In some embodiments, processor 1602 includes cache memory 1604. Depending on the architecture, processor 1602 may have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of processor 1602. In some embodiments, processor 1602 also uses an external cache (e.g., a Level 3 (L3) cache or a Last Level Cache (LLC)) (not shown), which can be shared among processor cores 1607 using known cache coherence techniques. Additionally, a register file 1606 is included in processor 1602, which may include different types of registers for storing different types of data (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers). Some registers may be general-purpose registers, while others may be specific to the design of processor 1602.

[0213] In some embodiments, processor 1602 is coupled to processor bus 1610, which is used to transmit communication signals, such as address, data, or control signals, between processor 1602 and other components within system 1600. In one embodiment, system 1600 uses an exemplary 'hub' system architecture including a memory controller hub 1616 and an input/output (I/O) controller hub 1630. Memory controller hub 1616 facilitates communication between memory devices and other components of system 1600, while I/O controller hub (ICH) 1630 provides connectivity to I/O devices via a local I/O bus. In one embodiment, the logic of memory controller hub 1616 is integrated within the processor.

[0214] Memory device 1620 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device with suitable performance for use as processing memory. In one embodiment, memory device 1620 may operate as system memory of system 1600 to store data 1622 and instructions 1621 for use when one or more processors 1602 execute an application or process. Memory controller hub 1616 is also coupled to an optional external graphics processor 1612, which may communicate with one or more graphics processors 1608 in processor 1602 to perform graphics and media operations.

[0215] In some embodiments, ICH 1630 enables peripheral components to be connected to memory device 1620 and processor 1602 via a high-speed I/O bus. I/O peripherals include, but are not limited to: an audio controller 1646, a firmware interface 1628, a wireless transceiver 1626 (e.g., Wi-Fi, Bluetooth), a data storage device 1624 (e.g., hard disk drive, flash memory, etc.), and a legacy I/O controller 1640 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. One or more Universal Serial Bus (USB) controllers 1642 connect input devices, such as a keyboard and mouse combination 1644. A network controller 1634 may also be coupled to ICH 1630. In some embodiments, a high-performance network controller (not shown) is coupled to processor bus 1610. It should be understood that the illustrated system 1600 is exemplary and not limiting, as other types of data processing systems configured differently may also be used. For example, the I/O controller hub 1630 may be integrated within one or more processors 1602, or the memory controller hub 1616 and the I/O controller hub 1630 may be integrated within a discrete external graphics processor (such as external graphics processor 1612).

[0216] Figure 17 is a block diagram of an embodiment of processor 1700, which has one or more processor cores 1702A to 1702N, an integrated memory controller 1714, and an integrated graphics processor 1708. Those elements of Figure 17 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such. Processor 1700 may include additional cores up to and including additional core 1702N, indicated by a dashed box. Each processor core 1702A to 1702N includes one or more internal cache units 1704A to 1704N. In some embodiments, each processor core may also access one or more shared cache units 1706.

[0217] Internal cache units 1704A to 1704N and shared cache unit 1706 represent the cache memory hierarchy within processor 1700. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core and one or more levels of shared mid-level cache, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, where the highest level of cache before external memory is classified as the last-level cache (LLC). In some embodiments, cache coherence logic maintains coherence between cache units 1706 and 1704A to 1704N.

[0218] In some embodiments, the processor 1700 may further include a set of one or more bus controller units 1716 and a system agent core 1710. The one or more bus controller units 1716 manage a set of peripheral buses, such as one or more peripheral component interconnect buses (e.g., PCI, PCI Express). The system agent core 1710 provides management functions for each processor unit. In some embodiments, the system agent core 1710 includes one or more integrated memory controllers 1714 for managing access to various external memory devices (not shown).

[0219] In some embodiments, one or more of processor cores 1702A to 1702N include support for simultaneous multithreading. In such embodiments, system agent core 1710 includes components for coordinating and operating processor cores 1702A to 1702N during multithreaded processing. Additionally, system agent core 1710 may include a power control unit (PCU), which includes logic and components for regulating the power states of processor cores 1702A to 1702N and graphics processor 1708.

[0220] In some embodiments, processor 1700 further includes a graphics processor 1708 for performing graphics processing operations. In some embodiments, graphics processor 1708 is coupled to the set of shared cache units 1706 and to system agent core 1710, the system agent core including one or more integrated memory controllers 1714. In some embodiments, display controller 1711 is coupled to graphics processor 1708 to drive graphics processor output to one or more coupled displays. In some embodiments, display controller 1711 may be a separate module coupled to the graphics processor via at least one interconnect, or it may be integrated within graphics processor 1708 or system agent core 1710.

[0221] In some embodiments, a ring-based interconnect unit 1712 is used to couple the internal components of processor 1700. However, alternative interconnect units, such as point-to-point interconnects, switched interconnects, or other technologies, including those well known in the art, may be used. In some embodiments, graphics processor 1708 is coupled to ring interconnect 1712 via I/O link 1713.

[0222] Exemplary I/O link 1713 represents at least one of a variety of I/O interconnects, including an on-package I/O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 1718 (such as an eDRAM module). In some embodiments, each of the processor cores 1702A to 1702N and the graphics processor 1708 uses the embedded memory module 1718 as a shared last-level cache.

[0223] In some embodiments, processor cores 1702A to 1702N are homogeneous cores executing the same instruction set architecture. In another embodiment, processor cores 1702A to 1702N are heterogeneous in terms of instruction set architecture (ISA), wherein one or more of processor cores 1702A to 1702N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, processor cores 1702A to 1702N are heterogeneous in terms of microarchitecture, wherein one or more cores with relatively high power consumption are coupled to one or more cores with lower power consumption. Additionally, processor 1700 can be implemented on one or more chips or implemented as a SoC integrated circuit having, among other components, the components shown.

[0224] Figure 18 is a block diagram of a graphics processor 1800, which may be a discrete graphics processing unit or a graphics processor integrated with multiple processing cores. In some embodiments, the graphics processor communicates via a memory-mapped I/O interface to registers on the graphics processor and via commands placed in processor memory. In some embodiments, the graphics processor 1800 includes a memory interface 1814 for accessing memory. The memory interface 1814 may be an interface to local memory, one or more internal caches, one or more shared external caches, and/or to system memory.

[0225] In some embodiments, the graphics processor 1800 further includes a display controller 1802 for driving display output data to a display device 1820. The display controller 1802 includes hardware for one or more overlay planes for the display and for the composition of multiple layers of video or user interface elements. In some embodiments, the graphics processor 1800 includes a video codec engine 1806 for encoding, decoding, or transcoding media to, from, or between one or more media encoding formats, including but not limited to: Moving Picture Experts Group (MPEG) formats (such as MPEG-2), Advanced Video Coding (AVC) formats (such as H.264/MPEG-4 AVC), Society of Motion Picture & Television Engineers (SMPTE) 421M/VC-1, and Joint Photographic Experts Group (JPEG) formats (such as JPEG and Motion JPEG (MJPEG)).

[0226] In some embodiments, the graphics processor 1800 includes a block image transfer (BLIT) engine 1804 for performing two-dimensional (2D) rasterizer operations, including, for example, bit boundary block transfer. However, in one embodiment, 2D graphics operations are performed using one or more components of a graphics processing engine (GPE) 1810. In some embodiments, the GPE 1810 is a computational engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.

[0227] In some embodiments, GPE 1810 includes a 3D pipeline 1812 for performing 3D operations, such as rendering 3D images and scenes using processing functions acting on 3D primitive shapes (e.g., rectangles, triangles, etc.). The 3D pipeline 1812 includes programmable and fixed-function elements that perform various tasks within the element and/or spawn execution threads to the 3D/media subsystem 1815. While the 3D pipeline 1812 can be used to perform media operations, embodiments of GPE 1810 also include a media pipeline 1816 specifically for performing media operations such as video post-processing and image enhancement.

[0228] In some embodiments, the media pipeline 1816 includes fixed-function or programmable logic units to perform one or more specialized media operations, such as video decoding acceleration, video de-interlacing, and video encoding acceleration, in place of or on behalf of the video codec engine 1806. In some embodiments, the media pipeline 1816 further includes a thread generation unit to generate threads for execution on the 3D/media subsystem 1815. The generated threads perform computations for the media operations on one or more graphics execution units included in the 3D/media subsystem 1815.

[0229] In some embodiments, the 3D / media subsystem 1815 includes logic for executing threads generated by the 3D pipeline 1812 and the media pipeline 1816. In one embodiment, the pipelines send thread execution requests to the 3D / media subsystem 1815, the 3D / media subsystem including thread dispatch logic for arbitrating and dispatching requests to available thread execution resources. Execution resources include an array of graphics execution units for processing 3D and media threads. In some embodiments, the 3D / media subsystem 1815 includes one or more internal caches for thread instructions and data. In some embodiments, the subsystem also includes shared memory (including registers and addressable memory) for sharing data between threads and for storing output data.

[0230] Graphics processing engine

[0231] Figure 19 is a block diagram of a graphics processing engine 1910 of a graphics processor according to some embodiments. In one embodiment, the graphics processing engine (GPE) 1910 is a version of the GPE 1810 shown in Figure 18. Those elements of Figure 19 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such. For example, the 3D pipeline 1812 and media pipeline 1816 of Figure 18 are shown. The media pipeline 1816 is optional in some embodiments of the GPE 1910 and may not be explicitly included within the GPE 1910. For example, and in at least one embodiment, a separate media and/or image processor is coupled to the GPE 1910.

[0232] In some embodiments, GPE 1910 is coupled to or includes command stream converter 1903, which provides command streams to 3D pipeline 1812 and/or media pipeline 1816. In some embodiments, command stream converter 1903 is coupled to memory, which may be system memory or one or more of internal cache memory and shared cache memory. In some embodiments, command stream converter 1903 receives commands from memory and sends these commands to 3D pipeline 1812 and/or media pipeline 1816. The commands are instructions fetched from a ring buffer, which stores commands for 3D pipeline 1812 and media pipeline 1816. In one embodiment, the ring buffer may also include batch command buffers storing batches of multiple commands. Commands for 3D pipeline 1812 may also include references to data stored in memory, such as, but not limited to, vertex and geometry data for 3D pipeline 1812 and/or image data and memory objects for media pipeline 1816. The 3D pipeline 1812 and the media pipeline 1816 process the commands by performing operations via logic within their respective pipelines or by dispatching one or more execution threads to the graphics core array 1914.
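
The ring-buffer command flow described above can be sketched as follows. This is a minimal illustration only, not the patented hardware design: the class name `CommandRing`, the function `stream_commands`, and the `"3d"` / `"media"` target labels are hypothetical names introduced for this example.

```python
class CommandRing:
    """Fixed-size ring buffer of (target, payload) commands (illustrative names)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # next slot the command stream converter reads
        self.tail = 0   # next slot the producer writes
        self.count = 0  # number of commands currently queued

    def submit(self, target, payload):
        if self.count == len(self.slots):
            raise BufferError("ring buffer full")
        self.slots[self.tail] = (target, payload)
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around
        self.count += 1

    def fetch(self):
        if self.count == 0:
            return None
        cmd = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)  # wrap around
        self.count -= 1
        return cmd


def stream_commands(ring):
    """Drain the ring and route each command to the pipeline it names."""
    routed = {"3d": [], "media": []}
    cmd = ring.fetch()
    while cmd is not None:
        target, payload = cmd
        routed[target].append(payload)
        cmd = ring.fetch()
    return routed
```

Because the head and tail indices wrap modulo the buffer size, the same storage is reused indefinitely as long as the consumer keeps pace with the producer.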

[0233] In various embodiments, the 3D pipeline 1812 can execute one or more shader programs, such as vertex shaders, geometry shaders, pixel shaders, fragment shaders, compute shaders, or other shader programs, by processing instructions and dispatching execution threads to the graphics core array 1914. The graphics core array 1914 provides a unified block of execution resources. The multipurpose execution logic (e.g., execution units) within the graphics core array 1914 includes support for various 3D API shader languages and can execute multiple concurrent threads associated with multiple shaders.

[0234] In some embodiments, the graphics core array 1914 further includes execution logic for performing media functions such as video and/or image processing. In one embodiment, in addition to graphics processing operations, the execution units also include general-purpose logic that is programmable to perform parallel general-purpose computing operations. The general-purpose logic can perform processing operations in parallel or in conjunction with general-purpose logic within the processor core(s) 1607 of Figure 16 or the processor cores 1702A to 1702N of Figure 17.

[0235] Output data generated by threads executing on the graphics core array 1914 can be output to memory in a unified return buffer (URB) 1918. The URB 1918 can store data from multiple threads. In some embodiments, the URB 1918 can be used to send data between different threads executing on the graphics core array 1914. In some embodiments, the URB 1918 can also be used for synchronization between threads on the graphics core array and fixed-function logic within shared-function logic 1920.

[0236] In some embodiments, the graphics core array 1914 is scalable, such that the array includes a variable number of graphics cores, each having a variable number of execution units based on the target power and performance level of the GPE 1910. In one embodiment, the execution resources are dynamically scalable, allowing them to be enabled or disabled as needed.

[0237] The graphics core array 1914 is coupled to shared function logic 1920, which includes multiple resources shared among the graphics cores in the graphics core array. The shared functions within the shared function logic 1920 are hardware logic units that provide specialized supplementary functions to the graphics core array 1914. In various embodiments, the shared function logic 1920 includes, but is not limited to, sampler 1921, math 1922, and inter-thread communication (ITC) 1923 logic. Additionally, some embodiments implement one or more caches 1925 within the shared function logic 1920. A shared function is implemented where the demand for a given specialized function is insufficient to justify including it within the graphics core array 1914. Instead, a single instance of the specialized function is implemented as a separate entity within the shared function logic 1920 and shared among the execution resources within the graphics core array 1914. The exact set of functions that are shared, as opposed to included within the graphics core array 1914, varies between embodiments.

[0238] Figure 20 is a block diagram of another embodiment of the graphics processor 2000. Those elements of Figure 20 having the same reference numerals (or names) as those in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

[0239] In some embodiments, the graphics processor 2000 includes a ring interconnect 2002, a pipeline front-end 2004, a media engine 2037, and graphics cores 2080A to 2080N. In some embodiments, the ring interconnect 2002 couples the graphics processor to other processing units, including other graphics processors or one or more general-purpose processor cores. In some embodiments, the graphics processor is one of a plurality of processors integrated within a multi-core processing system.

[0240] In some embodiments, the graphics processor 2000 receives batches of commands via a ring interconnect 2002. The incoming commands are interpreted by a command stream converter 2003 in a pipeline front-end 2004. In some embodiments, the graphics processor 2000 includes scalable execution logic for performing 3D geometry processing and media processing via the graphics core(s) 2080A to 2080N. For 3D geometry processing commands, the command stream converter 2003 supplies commands to a geometry pipeline 2036. For at least some media processing commands, the command stream converter 2003 supplies commands to a video front-end 2034, which is coupled to a media engine 2037. In some embodiments, the media engine 2037 includes a video quality engine (VQE) 2030 for video and image post-processing and a multi-format encode/decode (MFX) engine 2033 for providing hardware-accelerated media data encoding and decoding. In some embodiments, the geometry pipeline 2036 and the media engine 2037 each generate execution threads for the thread execution resources provided by at least one graphics core 2080A.

[0241] In some embodiments, the graphics processor 2000 includes scalable thread execution resources featuring modular cores 2080A to 2080N (sometimes referred to as core slices), each having multiple sub-cores 2050A to 2050N, 2060A to 2060N (sometimes referred to as core sub-slices). In some embodiments, the graphics processor 2000 may have any number of graphics cores 2080A to 2080N. In some embodiments, the graphics processor 2000 includes a graphics core 2080A, which has at least a first sub-core 2050A and a second sub-core 2060A. In other embodiments, the graphics processor is a low-power processor having a single sub-core (e.g., 2050A). In some embodiments, the graphics processor 2000 includes multiple graphics cores 2080A to 2080N, each graphics core including a set of first sub-cores 2050A to 2050N and a set of second sub-cores 2060A to 2060N. Each sub-core in the set of first sub-cores 2050A to 2050N includes at least a first set of execution units 2052A to 2052N and media/texture samplers 2054A to 2054N. Each sub-core in the set of second sub-cores 2060A to 2060N includes at least a second set of execution units 2062A to 2062N and samplers 2064A to 2064N. In some embodiments, each sub-core 2050A to 2050N and 2060A to 2060N shares a set of shared resources 2070A to 2070N. In some embodiments, the shared resources include shared cache memory and pixel operation logic. Other shared resources may also be included in various embodiments of the graphics processor.

[0242] Execution unit

[0243] Figure 21 illustrates thread execution logic 2100, which includes an array of processing elements employed in some embodiments of a GPE. Those elements of Figure 21 having the same reference numerals (or names) as those in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

[0244] In some embodiments, thread execution logic 2100 includes a shader processor 2102, a thread dispatcher 2104, an instruction cache 2106, a scalable execution unit array including multiple execution units 2108A to 2108N, a sampler 2110, a data cache 2112, and a data port 2114. In one embodiment, the scalable execution unit array can be dynamically scaled by enabling or disabling one or more execution units (e.g., any one of execution units 2108A, 2108B, 2108C, 2108D, up to 2108N-1 and 2108N) based on the computational requirements of a workload. In one embodiment, the included components are interconnected via an interconnect fabric that links to each of the components. In some embodiments, thread execution logic 2100 includes one or more connections to memory (such as system memory or cache memory) via one or more of the instruction cache 2106, data port 2114, sampler 2110, and execution units 2108A to 2108N. In some embodiments, each execution unit (e.g., 2108A) is a stand-alone programmable general-purpose computing unit capable of executing multiple simultaneous hardware threads while processing multiple data elements in parallel for each thread. In various embodiments, the array of execution units 2108A to 2108N is scalable to include any number of individual execution units.

[0245] In some embodiments, execution units 2108A to 2108N are primarily used to execute shader programs. Shader processor 2102 can process various shader programs and dispatch execution threads associated with the shader programs via thread dispatcher 2104. In one embodiment, the thread dispatcher includes logic for arbitrating thread initiation requests from the graphics and media pipelines and instantiating the requested threads on one or more of execution units 2108A to 2108N. For example, a geometry pipeline (e.g., 2036 of Figure 20) can dispatch vertex processing, tessellation, or geometry processing threads to thread execution logic 2100 (Figure 21). In some embodiments, the thread dispatcher 2104 can also handle runtime thread generation requests from executing shader programs.

[0246] In some embodiments, execution units 2108A to 2108N support an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct3D and OpenGL) are executed with minimal translation. These execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders). Each of execution units 2108A to 2108N is capable of multi-issue single-instruction multiple-data (SIMD) execution, and multithreaded operation enables an efficient execution environment in the face of high-latency memory accesses. Each hardware thread within each execution unit has a dedicated high-bandwidth register file and associated independent thread state. Execution is multi-issue per clock cycle to pipelines capable of integer, single-precision and double-precision floating-point operations, SIMD branching, logical operations, transcendental operations, and other miscellaneous operations. While waiting for data from memory or from one of the shared functions, the dependency logic within execution units 2108A to 2108N causes the waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, the execution unit may perform operations for a pixel shader, a fragment shader, or another type of shader program, including a different vertex shader.
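
The latency-hiding behavior described above can be illustrated with a toy cycle model. This is a simplified sketch under stated assumptions rather than the execution unit's actual scheduling policy: it assumes one instruction issue per cycle, a fixed `load_latency`, and a fixed-priority pick among ready threads. The function name `run` and the `"alu"` / `"load"` operation labels are hypothetical.

```python
def run(threads, load_latency=4):
    """Return the cycle count needed to finish all threads.

    threads: list of programs, each a list of "alu" or "load" operations.
    A thread that issues a load sleeps until its data returns; meanwhile
    the issue slot is given to any other thread that is ready.
    """
    pc = [0] * len(threads)    # next operation index for each thread
    wake = [0] * len(threads)  # cycle at which each thread may issue again
    cycle = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        for i, t in enumerate(threads):
            if pc[i] < len(t) and wake[i] <= cycle:
                if t[pc[i]] == "load":
                    # Thread sleeps until the requested data has returned.
                    wake[i] = cycle + 1 + load_latency
                pc[i] += 1
                break  # at most one issue per cycle in this toy model
        cycle += 1
    return cycle
```

In this model a single `["load", "alu"]` thread takes 6 cycles, while two such threads finish in 7 cycles rather than 12, because the second thread's load is issued while the first thread sleeps.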

[0247] Each execution unit in execution units 2108A to 2108N operates on arrays of data elements. The number of data elements is the "execution size," or the number of channels for the instruction. An execution channel is a logical unit of execution for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical arithmetic logic units (ALUs) or floating-point units (FPUs) for a particular graphics processor. In some embodiments, execution units 2108A to 2108N support both integer and floating-point data types.

[0248] The execution unit instruction set includes SIMD instructions. The various data elements can be stored in a register as a packed data type, and the execution unit processes the elements based on their data size. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (quad-word (QW) size), eight separate 32-bit packed data elements (double-word (DW) size), sixteen separate 16-bit packed data elements (word (W) size), or thirty-two separate 8-bit data elements (byte (B) size). However, different vector widths and register sizes are possible.
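
As a brief illustration of the packed-data interpretation above, the following sketch splits the same 256-bit register contents into lanes of different widths. The function name `unpack_lanes` is hypothetical, and little-endian byte order within each lane is an assumption made for the example; the text does not specify byte order.

```python
def unpack_lanes(vector_bytes, lane_bits):
    """Split raw register bytes into unsigned integer lanes of lane_bits each."""
    if lane_bits % 8 != 0:
        raise ValueError("lane width must be a whole number of bytes")
    lane_bytes = lane_bits // 8
    if len(vector_bytes) % lane_bytes != 0:
        raise ValueError("vector length is not a multiple of the lane width")
    return [
        # Assumption: little-endian byte order within each lane.
        int.from_bytes(vector_bytes[i:i + lane_bytes], "little")
        for i in range(0, len(vector_bytes), lane_bytes)
    ]


register = bytes(range(32))            # one 256-bit register (32 bytes)
qw = unpack_lanes(register, 64)        # 4 quad-word lanes
dw = unpack_lanes(register, 32)        # 8 double-word lanes
w = unpack_lanes(register, 16)         # 16 word lanes
b = unpack_lanes(register, 8)          # 32 byte lanes
```

The lane count is always the register width divided by the element width, which is why the same 256 bits yield 4, 8, 16, or 32 elements.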

[0249] One or more internal instruction caches (e.g., 2106) are included in the thread execution logic 2100 to cache thread instructions of the execution unit. In some embodiments, one or more data caches (e.g., 2112) are included for caching thread data during thread execution. In some embodiments, sampler 2110 is included for providing texture sampling for 3D operations and media sampling for media operations. In some embodiments, sampler 2110 includes dedicated texture or media sampling functions to process texture or media data during the sampling process before providing sampled data to the execution unit.

[0250] During execution, the graphics and media pipelines send thread initiation requests to thread execution logic 2100 via thread generation and dispatch logic. Once a group of geometric objects has been processed and rasterized into pixel data, pixel processor logic (e.g., pixel shader logic, fragment shader logic, etc.) within shader processor 2102 is invoked to further compute output information and cause the results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In some embodiments, the pixel shader or fragment shader calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. In some embodiments, the pixel processor logic within shader processor 2102 then executes a pixel or fragment shader program supplied via an application programming interface (API). To execute the shader program, shader processor 2102 dispatches threads to an execution unit (e.g., 2108A) via thread dispatcher 2104. In some embodiments, shader processor 2102 uses texture sampling logic in sampler 2110 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.

[0251] In some embodiments, data port 2114 provides a memory access mechanism for thread execution logic 2100 to output processed data to memory for processing on the graphics processor output pipeline. In some embodiments, data port 2114 includes or is coupled to one or more cache memories (e.g., data cache 2112) to cache data via the data port for memory access.

[0252] Figure 22 is a block diagram illustrating a graphics processor instruction format 2200 according to some embodiments. In one or more embodiments, the graphics processor execution units support an instruction set having instructions in multiple formats. Solid-lined boxes represent components generally included in an execution unit instruction, while dashed-lined boxes represent components that are optional or included only in a subset of the instructions. In some embodiments, the instructions in the described and illustrated instruction format 2200 are macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decoding once the instruction is processed.

[0253] In some embodiments, the graphics processor execution unit natively supports instructions in the 128-bit instruction format 2210. A 64-bit compact instruction format 2230 is available for some instructions based on the selected instruction, the instruction options, and the number of operands. The native 128-bit instruction format 2210 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 2230. The native instructions available in the 64-bit format 2230 vary depending on the embodiment. In some embodiments, the instruction is compacted in part using a set of index values in the index field 2213. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit instruction format 2210.
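
The table-based reconstruction of a native instruction from a compact one can be sketched roughly as below. Everything specific here is hypothetical and chosen only for illustration: the bit positions, the 2-bit index width, the table contents in `CONTROL_TABLE`, and the function name `expand_compact`. The text does not disclose the actual field layout or table values.

```python
# Hypothetical compaction table: each index selects a wider control-bit pattern.
CONTROL_TABLE = [0x00000, 0x1A2B3, 0x0FF00, 0x7C001]


def expand_compact(compact):
    """Rebuild a wider native encoding from a compact encoding.

    Assumed compact layout (illustrative only):
      bits 0-7   opcode
      bits 8-9   index into CONTROL_TABLE
      bits 10+   operand bits, copied through unchanged
    """
    opcode = compact & 0xFF
    index = (compact >> 8) & 0x3
    operands = compact >> 10
    control = CONTROL_TABLE[index]  # table output replaces the short index
    # Assumed native layout: opcode, then 20 control bits, then operands at bit 40.
    return opcode | (control << 8) | (operands << 40)
```

The design point the sketch tries to capture is that a short index stands in for a longer, commonly used bit pattern, so only instructions whose options appear in the tables can use the compact form.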

[0254] For each format, instruction opcode 2212 defines the operation to be performed by the execution unit. The execution unit executes each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit executes each instruction across all data channels of the operands. In some embodiments, instruction control field 2214 enables control over certain execution options, such as channel selection (e.g., predication) and data channel ordering (e.g., swizzle). For instructions in the 128-bit instruction format 2210, execution size field 2216 limits the number of data channels to be executed in parallel. In some embodiments, execution size field 2216 is not available for the 64-bit compact instruction format 2230.
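
The per-channel execution options described above (execution size, channel selection, and channel ordering) can be illustrated with a small software model of an add instruction. This is a sketch only: the function name `simd_add` and its parameters are hypothetical, the predicate is modeled as a bit mask with bit i enabling channel i, and the reordering is applied to the second source operand purely as an assumption for the example.

```python
def simd_add(dst, src0, src1, exec_size=None, pred_mask=None, swizzle=None):
    """Model a channel-parallel add; disabled channels keep their old dst value."""
    n = len(dst) if exec_size is None else exec_size
    if swizzle is not None:
        # Reorder source channels before the operation (assumed on src1 only).
        src1 = [src1[j] for j in swizzle]
    out = list(dst)
    for ch in range(n):  # only the first exec_size channels execute
        if pred_mask is None or (pred_mask >> ch) & 1:
            out[ch] = src0[ch] + src1[ch]
    return out


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
all_channels = simd_add([0] * 4, a, b)                     # every channel adds
masked = simd_add([0] * 4, a, b, pred_mask=0b0101)         # channels 0 and 2 only
limited = simd_add([0] * 4, a, b, exec_size=2)             # first two channels only
reordered = simd_add([0] * 4, a, b, swizzle=[3, 2, 1, 0])  # src1 channels reversed
```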

[0255] Some execution unit instructions have up to three operands, including two source operands (src0 2220, src1 2222) and a destination 2218. In some embodiments, the execution unit supports dual-destination instructions, where one of the destinations is implied. Data manipulation instructions may have a third source operand (e.g., src2 2224), where the instruction opcode 2212 determines the number of source operands. The last source operand of an instruction may be an immediate (e.g., hard-coded) value passed with the instruction.

[0256] In some embodiments, the 128-bit instruction format 2210 includes an access / address mode field 2226, which specifies, for example, whether direct register addressing mode or indirect register addressing mode is used. When direct register addressing mode is used, the register addresses of one or more operands are provided directly by bits in the instruction.

[0257] In some embodiments, the 128-bit instruction format 2210 includes an access / address mode field 2226 that specifies the address mode and / or access mode of the instruction. In one embodiment, the access mode is used to define the data access alignment for the instruction. Some embodiments support access modes including a 16-byte aligned access mode and a 1-byte aligned access mode, wherein the byte alignment of the access mode determines the access alignment of the instruction operands. For example, in a first mode, the instruction can use byte-aligned addressing for both source and destination operands, and in a second mode, the instruction can use 16-byte aligned addressing for both source and destination operands.

[0258] In one embodiment, the address mode portion of the access/address mode field 2226 determines whether the instruction uses direct or indirect addressing. When direct register addressing mode is used, bits in the instruction directly provide the register addresses of one or more operands. When indirect register addressing mode is used, the register addresses of one or more operands can be computed based on an address register value and an address immediate field in the instruction.
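
The two addressing modes and the two access-alignment modes described above can be sketched as follows. The function names, the mode labels (`"direct"`, `"indirect"`, `"align1"`, `"align16"`), and the use of a simple sum of address register value and immediate for the indirect case are all assumptions made for illustration; the text says only that the address is computed based on those two values.

```python
def operand_register(address_mode, reg_field=0, addr_reg_value=0, addr_imm=0):
    """Resolve the register address of an operand."""
    if address_mode == "direct":
        # Bits in the instruction directly provide the register address.
        return reg_field
    if address_mode == "indirect":
        # Assumption: address register value plus the immediate field.
        return addr_reg_value + addr_imm
    raise ValueError(f"unknown address mode: {address_mode}")


def is_access_aligned(byte_offset, access_mode):
    """Check an operand offset against the alignment the access mode requires."""
    alignment = {"align1": 1, "align16": 16}[access_mode]
    return byte_offset % alignment == 0
```

For example, `operand_register("indirect", addr_reg_value=8, addr_imm=3)` resolves to register 11 under the additive assumption, and an offset of 20 bytes satisfies the 1-byte mode but not the 16-byte mode.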

[0259] In some embodiments, instructions are grouped based on bit fields of opcode 2212 to simplify opcode decoding 2240. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely exemplary. In some embodiments, the move and logic opcode group 2242 includes data movement and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, the move and logic opcode group 2242 shares the five most significant bits (MSB), where move (mov) instructions are in the form of 0000xxxxb, and logic instructions are in the form of 0001xxxxb. The flow control instruction group 2244 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). The miscellaneous instruction group 2246 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). The parallel math instruction group 2248 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 2248 performs arithmetic operations in parallel across data channels. The vector math group 2250 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic operations, such as dot product calculations, on vector operands.
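
The exemplary grouping above can be expressed as a small decoder that inspects bits 4, 5, and 6 of an 8-bit opcode. This is a sketch of the grouping as listed in the text, not a definitive decoder: the function name `opcode_group`, the group label strings, and the `"unassigned"` fallback for bit patterns the text does not list are choices made for this example.

```python
# Bits 6..4 of the opcode mapped to the groups listed in the text.
OPCODE_GROUPS = {
    0b000: "move",            # 0000xxxxb
    0b001: "logic",           # 0001xxxxb
    0b010: "flow control",    # 0010xxxxb, e.g. 0x20
    0b011: "miscellaneous",   # 0011xxxxb, e.g. 0x30
    0b100: "parallel math",   # 0100xxxxb, e.g. 0x40
    0b101: "vector math",     # 0101xxxxb, e.g. 0x50
}


def opcode_group(opcode):
    """Classify an 8-bit opcode by bits 4, 5, and 6."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    group_bits = (opcode >> 4) & 0b111
    # Patterns not listed in the text are reported as unassigned.
    return OPCODE_GROUPS.get(group_bits, "unassigned")
```

Decoding on three bits means the execution unit can select the right decode path without examining the low four bits, which only distinguish instructions within a group.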

[0260] Graphics Pipeline

[0261] Figure 23 is a block diagram of another embodiment of the graphics processor 2300. Those elements of Figure 23 having the same reference numerals (or names) as those in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

[0262] In some embodiments, the graphics processor 2300 includes a graphics pipeline 2320, a media pipeline 2330, a display engine 2340, thread execution logic 2350, and a rendering output pipeline 2370. In some embodiments, the graphics processor 2300 is a graphics processor within a multi-core processing system including one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or by commands issued to the graphics processor 2300 via a ring interconnect 2302. In some embodiments, the ring interconnect 2302 couples the graphics processor 2300 to other processing components, such as other graphics processors or general-purpose processors. Commands from the ring interconnect 2302 are interpreted by a command stream converter 2303, which supplies instructions to individual components of the graphics pipeline 2320 or the media pipeline 2330.

[0263] In some embodiments, command stream converter 2303 guides the operation of vertex acquirer 2305, which reads vertex data from memory and executes vertex processing commands provided by command stream converter 2303. In some embodiments, vertex acquirer 2305 provides vertex data to vertex shader 2307, which performs coordinate space transformation and lighting operations on each vertex. In some embodiments, vertex acquirer 2305 and vertex shader 2307 execute vertex processing instructions by dispatching execution threads to execution units 2352A to 2352B via thread dispatcher 2331.

[0264] In some embodiments, execution units 2352A to 2352B are vector processor arrays having an instruction set for performing graphics and media operations. In some embodiments, execution units 2352A to 2352B have an attached L1 cache 2351, which is dedicated to each array or shared between arrays. The cache may be configured as a data cache, an instruction cache, or a single cache partitioned to contain data and instructions in different partitions.

[0265] In some embodiments, the graphics pipeline 2320 includes tessellation components for performing hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 2311 configures the tessellation operations. A programmable domain shader 2317 provides back-end evaluation of the tessellation output. A tessellation unit 2313 operates at the direction of the hull shader 2311 and contains dedicated logic for generating a set of detailed geometric objects based on a coarse geometric model that is provided as input to the graphics pipeline 2320. In some embodiments, if tessellation is not used, the tessellation components (e.g., hull shader 2311, tessellation unit 2313, domain shader 2317) can be bypassed.

[0266] In some embodiments, the complete geometry object may be processed by the geometry shader 2319 via one or more threads dispatched to the execution units 2352A to 2352B, or it may proceed directly to the clipper 2329. In some embodiments, the geometry shader operates on the entire geometry object (rather than vertices or vertex patches in previous stages of the graphics pipeline). If tessellation is disabled, the geometry shader 2319 receives input from the vertex shader 2307. In some embodiments, the geometry shader 2319 may be programmed by a geometry shader program to perform geometric tessellation when the tessellation unit is disabled.

[0267] Prior to rasterization, clipper 2329 processes vertex data. Clipper 2329 may be a fixed-function clipper or a programmable clipper with clipping and geometry shader capabilities. In some embodiments, raster and depth testing unit 2373 in rendering output pipeline 2370 dispatches pixel shaders to convert geometry objects into their per-pixel representations. In some embodiments, pixel shader logic is included in thread execution logic 2350. In some embodiments, an application can bypass raster and depth testing unit 2373 and access unrasterized vertex data via a stream-out unit 2323.

[0268] The graphics processor 2300 has an interconnect bus, interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the main components of the graphics processor. In some embodiments, execution units 2352A to 2352B and the associated cache(s) 2351, texture and media samplers 2354, and texture / sampler cache 2358 are interconnected via a data port 2356 to perform memory accesses and to communicate with the rendering output pipeline components of the processor. In some embodiments, samplers 2354, caches 2351 and 2358, and execution units 2352A to 2352B each have a separate memory access path.

[0269] In some embodiments, the rendering output pipeline 2370 includes a raster and depth testing unit 2373 that converts vertex-based objects into associated pixel-based representations. In some embodiments, the rasterizer logic includes a windower / mask unit for performing fixed-function triangle and line rasterization. An associated rendering cache 2378 and depth cache 2379 are also available in some embodiments. Pixel manipulation unit 2377 performs pixel-based operations on the data; however, in some instances, pixel operations associated with 2D operations (e.g., bit-block image transfers with blending) are performed by the 2D engine 2341, or substituted at display time by the display controller 2343 using overlay display planes. In some embodiments, a shared L3 cache 2375 is available to all graphics components, allowing data to be shared without using main system memory.

[0270] In some embodiments, the graphics processor media pipeline 2330 includes a media engine 2337 and a video front-end 2334. In some embodiments, the video front-end 2334 receives pipeline commands from a command stream converter 2303. In some embodiments, the media pipeline 2330 includes a separate command stream converter. In some embodiments, the video front-end 2334 processes media commands before sending them to the media engine 2337. In some embodiments, the media engine 2337 includes a thread generation function for generating threads for dispatching to thread execution logic 2350 via a thread dispatcher 2331.

[0271] In some embodiments, the graphics processor 2300 includes a display engine 2340. In some embodiments, the display engine 2340 is external to the processor 2300 and coupled to the graphics processor via a ring interconnect 2302, or some other interconnect bus or mechanism. In some embodiments, the display engine 2340 includes a 2D engine 2341 and a display controller 2343. In some embodiments, the display engine 2340 includes dedicated logic capable of operating independently of the 3D pipeline. In some embodiments, the display controller 2343 is coupled to a display device (not shown), which may be a system-integrated display device (such as in a laptop computer) or an external display device attached via a display device connector.

[0272] In some embodiments, the graphics pipeline 2320 and media pipeline 2330 are configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In some embodiments, driver software for the graphics processor translates API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for the Open Graphics Library (OpenGL), Open Computing Language (OpenCL), and / or Vulkan graphics and compute APIs, all from the Khronos Group. In some embodiments, support may also be provided for the Direct3D library from Microsoft Corporation. In some embodiments, a combination of these libraries may be supported. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.

[0273] Graphics Pipeline Programming

[0274] Figure 24A is a block diagram illustrating a graphics processor command format 2400 according to some embodiments. Figure 24B is a block diagram illustrating a graphics processor command sequence 2410 according to an embodiment. Solid lines in Figure 24A represent components that are generally included in a graphics command, while dashed lines represent components that are optional or included only in a subset of graphics commands. The exemplary graphics processor command format 2400 of Figure 24A includes data fields for identifying the target client 2402 of the command, a command operation code (opcode) 2404, and related data 2406 for the command. Some commands also include a sub-opcode 2405 and a command size 2408.

[0275] In some embodiments, client 2402 specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to condition further processing of the command and to route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a rendering unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline for processing commands. Once a command is received by a client unit, the client unit reads opcode 2404 and sub-opcode 2405 (if present) to determine the operation to be performed. The client unit uses information within data field 2406 to execute the command. For some commands, an explicit command size 2408 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments, commands are aligned to multiples of a double word.
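
As a minimal, non-limiting sketch of the command handling described above, the following Python models a command with the fields 2402 to 2408, routes commands by their client field, and resolves the command size either from an explicit size field or from the opcode. The class and function names, the contents of `IMPLICIT_SIZE_DWORDS`, and the 32-bit double word are assumptions made for illustration; the text does not specify field widths or opcode values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DWORD_BYTES = 4  # assumption: a double word is 32 bits

# Client units named in the text above.
CLIENT_UNITS = ("memory interface", "render", "2D", "3D", "media")

# Hypothetical opcodes whose size the parser can infer without a size field.
IMPLICIT_SIZE_DWORDS: Dict[int, int] = {0x01: 1, 0x02: 4}


@dataclass
class Command:
    client: str                        # target client 2402
    opcode: int                        # command opcode 2404
    data: bytes = b""                  # related data 2406
    sub_opcode: Optional[int] = None   # sub-opcode 2405, if present
    size_dwords: Optional[int] = None  # explicit command size 2408, if present


def command_size_dwords(cmd: Command) -> int:
    """Use the explicit size when given, otherwise infer it from the opcode."""
    if cmd.size_dwords is not None:
        return cmd.size_dwords
    if cmd.opcode in IMPLICIT_SIZE_DWORDS:
        return IMPLICIT_SIZE_DWORDS[cmd.opcode]
    raise ValueError(f"opcode {cmd.opcode:#x} needs an explicit command size")


def padded_length_bytes(n_bytes: int) -> int:
    """Round a byte count up to the next multiple of a double word."""
    return -(-n_bytes // DWORD_BYTES) * DWORD_BYTES


def route(commands: List[Command]) -> Dict[str, List[Command]]:
    """Examine each command's client field and queue it for that client unit."""
    queues: Dict[str, List[Command]] = {name: [] for name in CLIENT_UNITS}
    for cmd in commands:
        if cmd.client not in queues:
            raise ValueError(f"unknown client unit: {cmd.client!r}")
        queues[cmd.client].append(cmd)
    return queues
```

In this sketch, a command without an explicit size and without an entry in the hypothetical size table is rejected rather than guessed, which mirrors the statement that some commands are expected to carry an explicit size.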

[0276] The flow diagram in Figure 24B illustrates an exemplary graphics processor command sequence 2410. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the illustrated command sequence to initiate, execute, and terminate a set of graphics operations. A sample command sequence is shown and described for illustrative purposes only, and embodiments are not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor processes the sequence of commands at least partially concurrently.

[0277] In some embodiments, the graphics processor command sequence 2410 may begin with a pipeline flush command 2412 to cause any active graphics pipeline to complete its currently pending commands. In some embodiments, the 3D pipeline 2422 and the media pipeline 2424 do not operate concurrently. The pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the command parser for the graphics processor pauses command processing until the active rendering engine completes its pending operations and the relevant read caches are invalidated. Optionally, any data in the render cache that is marked 'dirty' can be flushed to memory. In some embodiments, pipeline flush command 2412 can be used for pipeline synchronization or before placing the graphics processor into a low-power state.

[0278] In some embodiments, pipeline selection command 2413 is used when a command sequence requires the graphics processor to switch explicitly between pipelines. In some embodiments, pipeline selection command 2413 is required only once within an execution context before issuing pipeline commands, unless the context is to issue commands for both pipelines. In some embodiments, a pipeline flush command 2412 is required immediately before a pipeline switch via pipeline selection command 2413.

[0279] In some embodiments, pipeline control command 2414 configures a graphics pipeline for operation and programs the 3D pipeline 2422 and the media pipeline 2424. In some embodiments, pipeline control command 2414 configures the pipeline state of an active pipeline. In one embodiment, pipeline control command 2414 is used for pipeline synchronization and for clearing data from one or more cache memories within an active pipeline before processing a batch of commands.

[0280] In some embodiments, the return buffer state command 2416 is used to configure a set of return buffers to which the respective pipelines write data. Some pipeline operations require the allocation, selection, or configuration of one or more return buffers into which the operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and to perform cross-thread communication. In some embodiments, the return buffer state 2416 includes selecting the size and number of return buffers to use for a set of pipeline operations.

[0281] The remaining commands in the command sequence differ based on the active pipeline used for the operation. Based on pipeline determination 2420, the command sequence is tailored for either the 3D pipeline 2422 starting at 3D pipeline state 2430, or the media pipeline 2424 starting at media pipeline state 2440.

[0282] Commands for 3D pipeline state 2430 include 3D state setting commands for vertex buffer state, vertex element state, constant color state, depth buffer state, and other state variables to be configured before processing 3D primitive commands. The values of these commands are determined at least in part based on the specific 3D API in use. In some embodiments, 3D pipeline state 2430 commands can also selectively disable or bypass specific pipeline components (if those components will not be used).

[0283] In some embodiments, the 3D primitive 2432 command is used to submit 3D primitives to be processed by the 3D pipeline. The command and associated parameters passed to the graphics processor via the 3D primitive 2432 command are forwarded to the vertex acquisition function in the graphics pipeline. The vertex acquisition function uses the 3D primitive 2432 command data to generate multiple vertex data structures. These vertex data structures are stored in one or more return buffers. In some embodiments, the 3D primitive 2432 command is used to perform vertex operations on the 3D primitives via a vertex shader. To process the vertex shader, the 3D pipeline 2422 dispatches shader execution threads to the graphics processor execution unit.

[0284] In some embodiments, the 3D pipeline 2422 is triggered via an execute 2434 command or event. In some embodiments, a register write triggers command execution. In some embodiments, execution is triggered via a 'go' or 'kick' command in the command sequence. In one embodiment, command execution is triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline performs geometry processing for the 3D primitives. Once the operations are complete, the resulting geometric objects are rasterized, and the pixel engine colors the resulting pixels. Additional commands to control pixel shading and pixel back-end operations may also be included for those operations.

[0285] In some embodiments, the graphics processor command sequence 2410 follows the media pipeline 2424 path when performing media operations. Generally, the specific use and manner of programming for the media pipeline 2424 depends on the media or compute operations to be performed. During media decoding, specific media decoding operations can be offloaded to the media pipeline. In some embodiments, the media pipeline can also be bypassed, and media decoding can be performed wholly or partially using resources provided by one or more general-purpose processing cores. In one embodiment, the media pipeline also includes elements for general-purpose graphics processing unit (GPGPU) operations, wherein the graphics processor is used to perform SIMD vector operations using compute shader programs that are not explicitly related to the rendering of graphics primitives.

[0286] In some embodiments, the media pipeline 2424 is configured in a manner similar to that of the 3D pipeline 2422. A set of commands for configuring media pipeline state 2440 is dispatched or placed in a command queue before media object commands 2442. In some embodiments, the commands for media pipeline state 2440 include data for configuring media pipeline elements that will be used to process media objects. This includes data for configuring video decoding and video encoding logic within the media pipeline, such as encoding or decoding formats. In some embodiments, the command implementation for media pipeline state 2440 supports using one or more pointers to "indirect" state elements that contain a batch of state settings.

[0287] In some embodiments, media object command 2442 supplies pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing video data to be processed. In some embodiments, all media pipeline states must be valid before issuing media object command 2442. Once the pipeline state is configured and media object command 2442 is queued, media pipeline 2424 is triggered via an execute command 2444 or an equivalent execution event (e.g., a register write). The output from media pipeline 2424 can then be post-processed by operations provided by 3D pipeline 2422 or media pipeline 2424. In some embodiments, GPGPU operations are configured and executed in a manner similar to media operations.
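
For illustration only, the ordering of command sequence 2410 described above can be sketched in Python as a function that emits symbolic command names for either pipeline path. The names are placeholders keyed to the reference numerals (command 2412 is the initial command that makes the active pipeline complete its pending commands); the function `build_command_sequence` and its parameters are assumptions of this sketch, and real command encodings are not shown.

```python
from typing import List


def build_command_sequence(pipeline: str, switch_pipeline: bool = False) -> List[str]:
    """Return a symbolic command sequence for the "3d" or "media" path.

    Hypothetical sketch: the strings stand in for encoded commands.
    """
    if pipeline not in ("3d", "media"):
        raise ValueError("pipeline must be '3d' or 'media'")

    # Command 2412: let the active pipeline complete its pending commands.
    seq = ["PIPELINE_FLUSH_2412"]
    # Command 2413 is needed only when switching pipelines explicitly,
    # and it directly follows command 2412.
    if switch_pipeline:
        seq.append("PIPELINE_SELECT_2413")
    seq += ["PIPELINE_CONTROL_2414", "RETURN_BUFFER_STATE_2416"]

    # Pipeline determination 2420: the remaining commands depend on the path.
    if pipeline == "3d":
        seq += ["3D_PIPELINE_STATE_2430", "3D_PRIMITIVE_2432", "EXECUTE_2434"]
    else:
        seq += ["MEDIA_PIPELINE_STATE_2440", "MEDIA_OBJECT_2442", "EXECUTE_2444"]
    return seq
```

For example, `build_command_sequence("media", switch_pipeline=True)` places the pipeline selection placeholder immediately after the initial command 2412 and ends with the media execute placeholder, so that state is configured before the media object is submitted.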

[0288] Graphics Software Architecture

[0289] Figure 25 illustrates an exemplary graphics software architecture for a data processing system 2500 according to some embodiments. In some embodiments, the software architecture includes a 3D graphics application 2510, an operating system 2520, and at least one processor 2530. In some embodiments, the processor 2530 includes a graphics processor 2532 and one or more general-purpose processor cores 2534. The graphics application 2510 and the operating system 2520 each execute in the system memory 2550 of the data processing system.

[0290] In some embodiments, the 3D graphics application 2510 includes one or more shader programs, which include shader instructions 2512. The shader language instructions may employ a high-level shader language, such as High-Level Shading Language (HLSL) or OpenGL Shading Language (GLSL). The application also includes executable instructions 2514, which employ a machine language suitable for execution by a general-purpose processor core 2534. The application also includes graphics objects 2516 defined by vertex data.

[0291] In some embodiments, the operating system 2520 is an operating system from Microsoft Corporation, a proprietary UNIX-like operating system, or an open-source UNIX-like operating system using a variant of the Linux kernel. The operating system 2520 may support a graphics API 2522, such as the Direct3D API, the OpenGL API, or the Vulkan API. When the Direct3D API is in use, the operating system 2520 uses a front-end shader compiler 2524 to compile any shader instructions 2512 in HLSL into a lower-level shader language. The compilation may be a just-in-time (JIT) compilation, or the application can perform shader pre-compilation. In some embodiments, high-level shaders are compiled into low-level shaders during the compilation of the 3D graphics application 2510. In some embodiments, the shader instructions 2512 are provided in an intermediate form, such as a version of the Standard Portable Intermediate Representation (SPIR) used by the Vulkan API.

[0292] In some embodiments, the user-mode graphics driver 2526 includes a back-end shader compiler 2527 for translating shader instructions 2512 into a hardware-specific representation. When using the OpenGL API, shader instructions 2512 in the GLSL high-level language are passed to the user-mode graphics driver 2526 for compilation. In some embodiments, the user-mode graphics driver 2526 uses an operating system kernel-mode feature 2528 to communicate with a kernel-mode graphics driver 2529. In some embodiments, the kernel-mode graphics driver 2529 communicates with a graphics processor 2532 to dispatch commands and instructions.
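
The compilation paths described above can be summarized, purely as a sketch, by a Python function that lists the compiler stages a shader passes through for a given API. The function `compilation_stages` is hypothetical. Two points are assumptions rather than statements from the text: that GLSL handed to the user-mode graphics driver is compiled by its back-end shader compiler 2527, and that an intermediate form such as SPIR goes directly to that back-end compiler.

```python
from typing import List

FRONT_END = "front-end shader compiler 2524 (operating system)"
BACK_END = "back-end shader compiler 2527 (user-mode graphics driver)"


def compilation_stages(api: str, shader_form: str) -> List[str]:
    """List the compiler stages for a shader, by graphics API and source form."""
    key = (api.lower(), shader_form.lower())
    if key == ("direct3d", "hlsl"):
        # HLSL is first lowered by the operating system's front-end compiler,
        # then translated to a hardware-specific representation.
        return [FRONT_END, BACK_END]
    if key == ("opengl", "glsl"):
        # GLSL is passed to the user-mode driver for compilation
        # (assumed here to use the back-end compiler 2527).
        return [BACK_END]
    if key == ("vulkan", "spir"):
        # Assumption: an intermediate form skips the front-end stage.
        return [BACK_END]
    raise ValueError(f"unsupported combination: {api} / {shader_form}")
```

Only the Direct3D path in this sketch involves two stages, which reflects the description of a separate front-end compilation of HLSL into a lower-level shader language.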

[0293] IP core implementation

[0294] One or more aspects of at least one embodiment can be implemented by representative code stored on a machine-readable medium that represents and / or defines logic within an integrated circuit, such as a processor. For example, the machine-readable medium may include instructions representing various logic within a processor. When read by a machine, these instructions can cause the machine to manufacture logic for performing the techniques described herein. Such representations (referred to as “IP cores”) are reusable units of logic for an integrated circuit, which can be stored on a tangible, machine-readable medium as a hardware model describing the structure of the integrated circuit. The hardware model can be supplied to various consumers or manufacturing facilities that load the hardware model onto manufacturing machines that manufacture integrated circuits. Integrated circuits can be manufactured such that the circuits perform the operations described in association with any of the embodiments described herein.

[0295] Figure 26 is a block diagram illustrating an IP core development system 2600, which can be used to manufacture integrated circuits to perform operations according to an embodiment. The IP core development system 2600 can be used to generate modular, reusable designs that can be incorporated into larger designs or used to build entire integrated circuits (e.g., SOC integrated circuits). Design facility 2630 can generate software simulations 2610 of the IP core designs using a high-level programming language (e.g., C / C++). Software simulation 2610 can be used to design, test, and verify the behavior of the IP cores using simulation model 2612. Simulation model 2612 can include functional, behavioral, and / or timing simulations. Register transfer level (RTL) designs 2615 can then be created or synthesized from simulation model 2612. RTL design 2615 is an abstraction of the behavior of an integrated circuit (including associated logic performed using the modeled digital signals) that models the flow of digital signals between hardware registers. In addition to RTL design 2615, lower-level designs at logic or transistor levels can also be created, designed, or synthesized. Thus, the specific details of the initial design and simulation can vary.

[0296] The RTL design 2615 or an equivalent can be further synthesized into a hardware model 2620 by the design facility. This hardware model may employ a hardware description language (HDL) or some other representation of the physical design data. The HDL can be further simulated or tested to validate the IP core design. The IP core design can be stored in non-volatile memory 2640 (e.g., hard disk, flash memory, or any non-volatile storage medium) for delivery to a third-party manufacturing facility 2665. Alternatively, the IP core design can be transmitted (e.g., via the Internet) through a wired connection 2650 or a wireless connection 2660. The manufacturing facility 2665 can then manufacture an integrated circuit at least partially based on the IP core design. The manufactured integrated circuit can be configured to perform operations according to at least one embodiment described herein.

[0297] Exemplary System-on-Chip Integrated Circuit

[0298] Figures 27 to 29 illustrate exemplary integrated circuits and associated graphics processors that can be fabricated using one or more IP cores according to various embodiments described herein. In addition to what is shown, other logic and circuitry may be included, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.

[0299] Figure 27 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit 2700 that can be fabricated using one or more IP cores according to an embodiment. The exemplary integrated circuit 2700 includes one or more application processors 2705 (e.g., CPUs), at least one graphics processor 2710, and may additionally include an image processor 2715 and / or a video processor 2720, any of which can be a modular IP core from the same or multiple different design facilities. The integrated circuit 2700 includes peripheral or bus logic, including a USB controller 2725, a UART controller 2730, an SPI / SDIO controller 2735, and an I2S / I2C controller 2740. Additionally, the integrated circuit may include a display device 2745, which is coupled to one or more of a High Definition Multimedia Interface (HDMI) controller 2750 and a Mobile Industry Processor Interface (MIPI) display interface 2755. Storage may be provided by a flash memory subsystem 2760 (including flash memory and a flash memory controller). A memory interface may be provided via a memory controller 2765 to access SDRAM or SRAM memory devices. Furthermore, some integrated circuits also include an embedded security engine 2770.

[0300] Figure 28 is a block diagram illustrating an exemplary graphics processor 2810 of a system-on-a-chip integrated circuit that can be fabricated using one or more IP cores according to an embodiment. The graphics processor 2810 can be a variant of the graphics processor 2710 of Figure 27. The graphics processor 2810 includes a vertex processor 2805 and one or more fragment processors 2815A to 2815N (e.g., 2815A, 2815B, 2815C, 2815D, up to 2815N-1 and 2815N). The graphics processor 2810 can execute different shader programs via separate logic, such that the vertex processor 2805 is optimized to execute operations for vertex shader programs, while the one or more fragment processors 2815A to 2815N execute fragment (e.g., pixel) shading operations for fragment or pixel shader programs. The vertex processor 2805 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. The fragment processor(s) 2815A to 2815N use the primitive and vertex data generated by the vertex processor 2805 to produce a frame buffer that is displayed on a display device. In one embodiment, the fragment processor(s) 2815A to 2815N are optimized to execute fragment shader programs as provided for in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs as provided for in the Direct3D API.

[0301] Additionally, the graphics processor 2810 includes one or more memory management units (MMUs) 2820A to 2820B, cache(s) 2825A to 2825B, and circuit interconnect(s) 2830A to 2830B. The one or more MMUs 2820A to 2820B provide virtual-to-physical address mappings for the graphics processor 2810, including for the vertex processor 2805 and / or the fragment processor(s) 2815A to 2815N. These mappings may reference vertex or image / texture data stored in memory, in addition to vertex or image / texture data stored in the one or more caches 2825A to 2825B. In one embodiment, the one or more MMUs 2820A to 2820B may be synchronized with other MMUs within the system, including one or more MMUs associated with the one or more application processors 2705, image processor 2715, and / or video processor 2720 of Figure 27, such that each processor 2705 to 2720 can participate in a shared or unified virtual memory system. According to an embodiment, the one or more circuit interconnects 2830A to 2830B enable the graphics processor 2810 to interface with other IP cores within the SoC, either via an internal bus of the SoC or via a direct connection.

[0302] Figure 29 is a block diagram illustrating an additional exemplary graphics processor 2910 of a system-on-a-chip integrated circuit that can be fabricated using one or more IP cores according to an embodiment. The graphics processor 2910 can be a variant of the graphics processor 2710 of Figure 27. The graphics processor 2910 includes the one or more MMUs 2820A to 2820B, caches 2825A to 2825B, and circuit interconnects 2830A to 2830B of the integrated circuit 2800 of Figure 28.

[0303] The graphics processor 2910 includes one or more shader cores 2915A to 2915N (e.g., 2915A, 2915B, 2915C, 2915D, 2915E, 2915F, up to 2915N-1 and 2915N), which provide a unified shader core architecture in which a single core or type of core can execute all types of programmable shader code, including shader program code that implements vertex shaders, fragment shaders, and / or compute shaders. The exact number of shader cores present can vary among embodiments and implementations. Additionally, the graphics processor 2910 includes an inter-core task manager 2905, which acts as a thread dispatcher to distribute execution threads to the one or more shader cores 2915A to 2915N, and a tiling unit 2918 to accelerate tiling operations for tile-based rendering, in which rendering operations for a scene are subdivided in image space, for example to exploit local spatial coherence within the scene or to optimize the use of internal caches.
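
As a non-limiting sketch of the image-space subdivision used in tile-based rendering, the following Python splits a render target into tiles and assigns each primitive to the tiles that its bounding box overlaps. The helper names, the rectangle convention, and the use of bounding boxes for binning are assumptions of this sketch; the text does not describe how unit 2918 performs these operations.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive.
Rect = Tuple[int, int, int, int]


def make_tiles(width: int, height: int, tile: int) -> List[Rect]:
    """Subdivide a width x height image into tiles, row by row."""
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height, and tile must be positive")
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]


def bin_primitives(bboxes: List[Rect], tiles: List[Rect]) -> Dict[int, List[int]]:
    """Map each tile index to the indices of primitives overlapping that tile."""
    bins: Dict[int, List[int]] = {i: [] for i in range(len(tiles))}
    for p, (px0, py0, px1, py1) in enumerate(bboxes):
        for t, (tx0, ty0, tx1, ty1) in enumerate(tiles):
            # Two half-open rectangles overlap when they intersect on both axes.
            if px0 < tx1 and tx0 < px1 and py0 < ty1 and ty0 < py1:
                bins[t].append(p)
    return bins
```

Once primitives are binned in this way, each tile can be rendered using only the primitives in its bin, which keeps the working set for a tile small enough to benefit from internal caches.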

[0304] Further examples are provided below.

[0305] Example 1 may optionally include an apparatus comprising: a plurality of computing engines, including logic that at least partially includes hardware logic, for training a neural network; and a hardware engine for accelerating a weight update process for training the neural network.

[0306] Example 2 may optionally include the subject of Example 1, where the hardware engine implements a fast operation for averaging weights from multiple nodes in a neural network.

[0307] Example 3 may optionally include the subject of any of Examples 1-2, wherein the neural network comprises multiple sub-neural networks; and each sub-neural network is trained separately.

[0308] Example 4 may optionally include the subject of any of Examples 1-3, where multiple sub-neural networks operate according to priority.

[0309] Example 5 may optionally include the subject of any of Examples 1-4, wherein the output of a first sub-neural network can be provided as the input of a second sub-neural network.

[0310] Example 6 may optionally include the subject of any of Examples 1-5, wherein the decision routine of the neural network is executed on at least two different computing engines.

[0311] Example 7 may optionally include the subject of any of Examples 1-6, wherein the results of decision routines executed on at least two different computing engines are compared.

[0312] Example 8 may optionally include the subject of any of Examples 1-7, and further include a driver that includes logic, which at least partially includes hardware logic, for: continuing processing if the results of decision routines executed on at least two different computing engines match.

[0313] Example 9 may optionally include the subject of any of Examples 1-8, and further include logic, which at least partially includes hardware logic, for: generating a cyclic redundancy check (CRC) using the results of decision routines executed on at least two different computing engines.

[0314] Example 10 may optionally include the subject of any of Examples 1-9, where multiple computing engines reside on a single integrated circuit.

[0315] Example 11 may optionally include an electronic device comprising: a processor having a plurality of computing engines, including logic that at least partially includes hardware logic, for training a neural network; and a hardware engine for accelerating a weight update process for training the neural network.

[0316] Example 12 may optionally include the subject of Example 11, where a hardware engine implements a fast operation for averaging weights from multiple nodes in a neural network.

[0317] Example 13 may optionally include the subject of any of Examples 11-12, wherein the neural network comprises multiple sub-neural networks; and each sub-neural network is trained separately.

[0318] Example 14 may optionally include the subject of any of Examples 11-13, where multiple sub-neural networks operate according to priority.

[0319] Example 15 may optionally include the subject of any of Examples 11-14, wherein the output of a first sub-neural network may be provided as the input of a second sub-neural network.

[0320] Example 16 may optionally include the subject of any of Examples 11-15, wherein the decision routine of the neural network is executed on at least two different computing engines.

[0321] Example 17 may optionally include the subject of any of Examples 11-16, wherein the results of decision routines executed on at least two different computing engines are compared.

[0322] Example 18 may optionally include the subject of any of Examples 11-17, and further include a driver that includes logic, which at least partially includes hardware logic, for: continuing processing if the results of decision routines executed on at least two different computing engines match.

[0323] Example 19 may optionally include the subject of any of Examples 11-18, and further include logic, which at least partially includes hardware logic, for: generating a cyclic redundancy check (CRC) using the results of decision routines executed on at least two different computing engines.

[0324] Example 20 may optionally include the subject of any of Examples 11-19, wherein the plurality of computing engines reside on a single integrated circuit.

[0325] In various embodiments, the operations discussed herein may be implemented as hardware (e.g., logic circuitry), software, firmware, or a combination thereof, and may be provided as a computer program product, for example, including a tangible (e.g., non-transitory) machine-readable or computer-readable medium having instructions (or software procedures) stored thereon for programming a computer to perform the processes discussed herein. The machine-readable medium may include an information storage device.

[0326] Furthermore, such a computer-readable medium may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) via a communication link (e.g., a bus, a modem, or a network connection) as data signals provided in a carrier wave or other propagation medium.

[0327] In this specification, references to "one embodiment" or "an embodiment" mean that a particular feature, structure, and/or characteristic described in connection with that embodiment may be included in at least one implementation. The appearances of the phrase "in one embodiment" in various places in this specification may or may not all refer to the same embodiment.

[0328] Furthermore, the terms "coupled" and "connected," as well as their derivatives, may be used in the specification and claims. In some embodiments, "connected" may be used to indicate that two or more elements are in direct physical and/or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but still cooperate and/or interact with each other.

[0329] Thus, although the embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the claimed subject matter is not limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.

Claims

1. An apparatus, comprising: multiple computing engines including logic, the logic at least partially including hardware logic, for training a neural network, the multiple computing engines including a first set of computing engines and a second set of computing engines, and the neural network including at least a first layer and a second layer; and a hardware engine for accelerating a weight update process for training the neural network; wherein the first layer of the neural network is assigned to the first set of computing engines and the second layer of the neural network is assigned to the second set of computing engines, such that execution of the first layer takes precedence over execution of the second layer; wherein the neural network is trained via the execution of the first layer on the first set of computing engines and the execution of the second layer on the second set of computing engines; and wherein the hardware engine accelerates the weight update process by applying a weight averaging operation that averages the weights of multiple nodes from the first layer for the first set of computing engines and of multiple nodes from the second layer for the second set of computing engines.

2. The apparatus as claimed in claim 1, characterized in that the hardware engine implements a fast operation for averaging the weights from multiple nodes in the neural network.

3. The apparatus as claimed in claim 2, characterized in that the neural network includes multiple sub-neural networks, and each sub-neural network is trained separately.

4. The apparatus as claimed in claim 3, characterized in that the multiple sub-neural networks operate according to priority.

5. The apparatus as claimed in claim 4, characterized in that the output of a first sub-neural network can be provided as the input of a second sub-neural network.

6. The apparatus as claimed in claim 1, characterized in that a decision routine of the neural network is executed on at least two different computing engines.

7. The apparatus as claimed in claim 6, characterized in that the apparatus further comprises logic, at least partially including hardware logic, for comparing the results of the decision routine executed on the at least two different computing engines.

8. The apparatus as claimed in claim 7, characterized in that the apparatus further comprises a driver including logic, at least partially including hardware logic, for continuing processing if the results of the decision routine executed on the at least two different computing engines match.

9. The apparatus as claimed in claim 8, characterized in that the apparatus further comprises logic, at least partially including hardware logic, for generating a cyclic redundancy check (CRC) using the results of the decision routine executed on the at least two different computing engines.

10. The apparatus as claimed in claim 1, characterized in that the multiple computing engines reside on a single integrated circuit.

11. An electronic device, comprising: a processor having multiple computing engines, the multiple computing engines including logic, the logic at least partially including hardware logic, for training a neural network, the multiple computing engines including a first set of computing engines and a second set of computing engines, and the neural network including at least a first layer and a second layer; and a hardware engine for accelerating a weight update process for training the neural network; wherein the first layer of the neural network is assigned to the first set of computing engines and the second layer of the neural network is assigned to the second set of computing engines, such that execution of the first layer takes precedence over execution of the second layer; wherein the neural network is trained via the execution of the first layer on the first set of computing engines and the execution of the second layer on the second set of computing engines; and wherein the hardware engine accelerates the weight update process by applying a weight averaging operation that averages the weights of multiple nodes from the first layer for the first set of computing engines and of multiple nodes from the second layer for the second set of computing engines.

12. The electronic device as claimed in claim 11, characterized in that the hardware engine implements a fast operation for averaging the weights from multiple nodes in the neural network.

13. The electronic device as claimed in claim 12, characterized in that the neural network includes multiple sub-neural networks, and each sub-neural network is trained separately.

14. The electronic device as claimed in claim 13, characterized in that the electronic device further includes logic, wherein the multiple sub-neural networks operate according to priority.

15. The electronic device as claimed in claim 14, characterized in that the output of a first sub-neural network can be provided as the input of a second sub-neural network.

16. The electronic device as claimed in claim 11, characterized in that a decision routine of the neural network is executed on at least two different computing engines.

17. The electronic device as claimed in claim 16, characterized in that the electronic device further comprises logic, at least partially including hardware logic, for comparing the results of the decision routine executed on the at least two different computing engines.

18. The electronic device as claimed in claim 17, characterized in that the electronic device further comprises a driver including logic, at least partially including hardware logic, for continuing processing if the results of the decision routine executed on the at least two different computing engines match.

19. The electronic device as claimed in claim 18, characterized in that the electronic device further comprises a thread scheduler, the thread scheduler including logic, at least partially including hardware logic, for generating a cyclic redundancy check (CRC) using the results of the decision routine executed on the at least two different computing engines.

20. The electronic device as claimed in claim 11, characterized in that the multiple computing engines reside on a single integrated circuit.