Memory addressing
By centrally calculating the memory addresses of parallel threads through the CUDA execution engine, the problem of inefficient memory access by parallel threads on the processor is solved, achieving efficient memory addressing and a simplified programming model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NVIDIA CORP
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, parallel threads on processors lack efficient memory addressing methods when accessing memory locations, resulting in low computational efficiency.
The CUDA execution engine is used to centrally compute the memory addresses of parallel threads. Instructions corresponding to the software kernel are generated through APIs, reducing the computational overhead of individual threads and improving efficiency.
Centralized memory address calculation reduces memory overhead and improves computational efficiency, simplifies the programming process for developers, and reduces the possibility of programming errors.
Smart Images

Figure CN122195871A_ABST
Abstract
Description
Technical Field
[0001] Apparatus, systems, and methods for performing thread memory addressing. At least one embodiment relates to execution instructions for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel. Background Technology
[0002] Parallel threads on processors such as graphics processing units (GPUs) access different memory locations. The methods used to determine these memory locations can be improved. Attached Figure Description
[0003] Figure 1 It is a block diagram illustrating a system for calculating the memory address of a thread according to at least one embodiment;
[0004] Figure 2 It is a pseudocode illustration of a kernel according to at least one embodiment and how to invoke the kernel so that the API translates a pointer to memory into a reference to a different memory address in said memory;
[0005] Figure 3 It is a block diagram illustrating a thread initiation operation (operation) according to at least one embodiment;
[0006] Figure 4 It is a block diagram illustrating a process for starting a thread by the CUDA execution engine according to at least one embodiment;
[0007] Figure 5 It is a block diagram illustrating a driver and / or runtime comprising one or more libraries for providing one or more application programming interfaces (APIs) according to at least one embodiment;
[0008] Figure 6 The illustration shows an example of a system according to at least one embodiment that may include software and hardware for executing an API to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described herein;
[0009] Figure 7 An example data center according to at least one embodiment is illustrated;
[0010] Figure 8 The illustration depicts a processor as a system-on-a-chip (SOC) (which may be referred to as a system-on-a-chip, super chip, or another name) according to at least one embodiment;
[0011] Figure 9A The illustration depicts a parallel processor according to at least one embodiment;
[0012] Figure 9B A block diagram including a processing cluster within a parallel processing unit according to at least one embodiment;
[0013] Figure 9C A graphics multiprocessor according to at least one embodiment is shown;
[0014] Figure 10 A processor according to at least one embodiment is shown;
[0015] Figure 11A The illustration shows a processor according to at least one embodiment;
[0016] Figure 11B The diagram illustrates components within the core according to at least one embodiment;
[0017] Figure 12 The illustration shows an AI accelerator according to at least one embodiment;
[0018] Figure 13 This is a simplified block diagram illustrating an example of at least a portion of such a neuromorphic computing device according to at least one embodiment;
[0019] Figure 14 This is a block diagram of an embodiment of a multi-node network that can implement remote in-memory computing, according to any of the embodiments;
[0020] Figure 15 The figure illustrates an acceleration processing unit according to at least one embodiment;
[0021] Figure 16 The illustration shows a processor 1600, such as, but not limited to, a processor based on the Zen architecture from AMD Inc. of Santa Clara, California, or another processor that shares at least some of the components described herein;
[0022] Figure 17 The illustration shows an example of a processing core that can implement an Arm architecture (e.g., v9.0-A) or another processor that shares at least some of the components described herein;
[0023] Figure 18 The illustration shows one or more chips including one or more tensor processing units (TPUs) according to at least one embodiment;
[0024] Figure 19 The figure illustrates a vector processor according to at least one embodiment;
[0025] Figure 20A The illustration shows a schematic diagram of an example multi-core sharded processor microarchitecture according to at least one embodiment;
[0026] Figure 20B An arrow in the image illustrates, for example, the instruction flow within a processor architecture according to some embodiments;
[0027] Figure 21 The illustration shows a software stack of a programming platform according to at least one embodiment;
[0028] Figure 22 The illustration shows a method for use in the above-described embodiment according to at least one embodiment. Figure 21 Compiled code executed on one of the programming platforms;
[0029] Figure 23 The illustration depicts a system configured to compile and execute CUDA source code using different types of processing units according to at least one embodiment;
[0030] Figure 24 An example of an autonomous vehicle according to at least one embodiment is illustrated;
[0031] Figure 25A and Figure 25B The illustration depicts logic, according to at least one embodiment, for one or more devices to perform operations such as, but not limited to, those discussed herein, as described elsewhere.
[0032] Figure 25C The illustration shows the training and deployment of a deep neural network according to at least one embodiment. Detailed Implementation
[0033] Numerous specific details are set forth in the following description to provide a more thorough understanding of at least one embodiment. However, it will be apparent to those skilled in the art that the inventive concept can be practiced without one or more of these specific details.
[0034] In at least one embodiment, the API (e.g., cudaLaunchKernel()) is used to cause a thread to be executed to launch one or more kernels in multiple parallel threads. In at least one embodiment, the processor includes one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel. In at least one embodiment, the API is used to cause threads to be executed, wherein the API determines the memory address of a corresponding portion of information to be used by each thread and provides each thread with the memory address to be used. In at least one embodiment, the input to the API includes an identifier of the thread (e.g., a kernel identifier), an indication of the information to be operated on by the thread, and information about the information that enables the API to determine the memory address (e.g., a stride length parameter). In at least one embodiment, the API can be executed by a GPU driver (e.g., when the kernel is launched) to compute the memory address to be used by each thread, so that the programmer does not need to write source code to enable each thread to compute the memory address itself.
[0035] In at least one embodiment, the parameter is defined as a pointer specifying the spacing between memory accesses by different threads. In at least one embodiment, the parameter is defined by a StrideLength (e.g., stride dimension) that instructs a thread to traverse to access subsequent data elements. In at least one embodiment, the parameter is defined as part of a function invoked by starting the kernel. In at least one embodiment, the pointer is updated using different memory addresses based on the spacing between memory accesses by different threads.
[0036] In at least one embodiment, as used herein, a pointer is a variable that stores the base memory address or starting memory location of a memory block that stores data that will be accessed by multiple concurrent threads. In at least one embodiment, as used herein, a reference refers to a specific memory location in memory. In at least one embodiment, a pointer is a reference pointing to the beginning memory location of a memory block.
[0037] In at least one embodiment, the size of a memory location within a memory block is determined by the data type stored at that memory location. In at least one embodiment, for example, if a pointer is declared as "int*", then the pointer points to a memory location of size int (e.g., 4 bytes) (the beginning memory location of the memory block). In at least one embodiment, the size of the entire memory block pointed to by the pointer at the beginning location is determined by the data type of the elements in the block and the number of elements in the block.
[0038] Figure 1This is a block diagram illustrating a system 100 for memory addresses of computation threads according to at least one embodiment. In at least one embodiment, system 100 is a parallel computing system that supports multiple GPU programming frameworks to enable efficient parallel computation. Examples of such frameworks in at least one embodiment include CUDA (Computational Unified Device Architecture), OpenCL (Open Computing Language), HIP (Portable Heterogeneous Computing Interface), SYCL (Single-Source C++ Heterogeneous Computing Language), and Vulkan Computing. In at least one embodiment, the framework allows developers to write programs in languages such as C, C++, and Python and perform GPU-accelerated computations on various hardware architectures, thereby ensuring broad compatibility and flexibility for different use cases. In at least one embodiment, system 100 enables developers to write programs in languages such as C, C++, and Fortran, where certain functions are designated as GPU "kernels" and executed by parallel GPU threads, thereby significantly accelerating computation compared to traditional CPU processing. In at least one embodiment, system 100 may include a software stack, such as those combined with... Figure 22 The software stack 2200 is described.
[0039] In at least one embodiment, API 103 is used as an interface for initiating operations on a processor (e.g., CPU). In at least one embodiment, API 103 is used as an interface for initiating CUDA operations. In at least one embodiment, API 103 is used in conjunction with "cudaLaunchKernel()" (as in...). Figure 22 This corresponds to the runtime API in the described software stack 2200. In at least one embodiment, API 103 utilizes GPU driver 105 to communicate with GPU resources, thereby enabling efficient execution of parallel tasks. In at least one embodiment, API 103 invokes a kernel on the host (e.g., CPU) for execution on the device (e.g., GPU). In at least one embodiment, the CUDA execution engine pre-optimizes and organizes these workloads and provides optimized kernel configuration and data layout to ensure high hardware utilization of the API invocation.
[0040] In at least one embodiment, GPU driver 105 manages hardware interactions. In at least one embodiment, API 103 supports image processing and real-time video analytics. In at least one embodiment, API 103 supports neural network training by managing data inputs and outputs.
[0041] In at least one embodiment, API 103 enables the execution of instructions based on API parameters, including, for example, configuring kernel startup with grid, block, and stride dimensions and / or operand sizes, thereby facilitating the management of data input and output. In at least one embodiment, operand size refers to the size of a variable or value being processed or manipulated during kernel execution in terms of memory or data type. In at least one embodiment, operand size defines how much data is involved in a particular operation, thus impacting performance optimization and memory management when using GPUs for parallel computing. In at least one embodiment, operand size is expressed in various data types, including, for example, integers, single-precision floating-point numbers, or double-precision floating-point numbers.
[0042] In at least one embodiment, system 100 includes a processor 102 with a CUDA execution engine, memory 122, and a streaming processor (SM) 124. In at least one embodiment, processor 102 is combined with... Figure 9A The described processor 906 is similar to a GPU. In at least one embodiment, processor 102 represents a GPU that performs parallel computing. In at least one embodiment, processor 102 utilizes multiple cores to perform high-speed data processing.
[0043] In at least one embodiment, the CUDA execution engine 104 is a component having a processor 102 configured to manage the execution of CUDA programs (e.g., kernels). In at least one embodiment, the CUDA execution engine 104 acts as an orchestration layer within the processor 102 to facilitate parallel processing and ensure efficient execution of CUDA programs. In at least one embodiment, the CUDA execution engine 104 coordinates the scheduling and launching of threads to perform or otherwise implement specified operations. In at least one embodiment, the CUDA execution engine schedules threads in a grid / block structure, where each thread is identified by a unique thread and block index. In at least one embodiment, the CUDA execution engine 104 manages the use of various types of memory (such as global memory and shared memory) on the GPU 102. In at least one embodiment, global memory 122 is as follows: Figures 9A-9C The memory described. In at least one embodiment, shared memory is combined with, as in, the memory described above. Figure 7 The shared memory 754 described is similar to memory. In at least one embodiment, the shared memory 754 is on-chip memory located within the streaming multiprocessor (SM) 124.
[0044] In at least one embodiment, SM 124 is a core processing unit within processor 102 that executes in parallel or otherwise implements a large number of threads. In at least one embodiment, SM 124 includes multiple execution units, such as in combination Figure 17 The tensor kernel 1706 is described.
[0045] In at least one embodiment, the CUDA execution engine 104 is an optimization layer in the CUDA ecosystem configured to efficiently schedule and execute structured workloads, such as workloads for tensor kernel operations (e.g., matrix multiplication). In at least one embodiment, the CUDA execution engine 104 manages operand reuse, memory partitioning, and prefetching to minimize data movement overhead and maximize computational throughput.
[0046] In at least one embodiment, the CUDA execution engine 104 calculates the memory addresses of all parallel threads it initiates. In at least one embodiment, when the processor 102 is accessed from the host (e.g., in conjunction with...), Figure 9A When a CPU (similar to a CPU) in the described CPU 902 receives a request 106 to start a thread, in at least one embodiment, the CUDA execution engine 104 may execute a memory address calculation function 108 to determine the memory addresses 110, 112, and 114 of each corresponding thread 116, 118, and 120.
[0047] In at least one embodiment, the CUDA execution engine 104 can use thread-specific information (e.g., grid dimension and block dimension) along with other parameters provided in the startup configuration (e.g., stride dimension and pointers to memory storing the data these threads will access) to determine the correct memory location in memory 122 that each thread will access. In at least one embodiment, the CUDA execution engine 104 performs this computation once for all threads to generate the necessary memory addresses for each thread, thereby reducing the computational overhead that could occur if each thread computed its own address.
[0048] In at least one embodiment, these calculated memory addresses are then stored in a designated location (e.g., in combination with...). Figure 8 The described L1 cache or as combined Figure 7 The memory address is stored in the L2 cache (described) for each thread to access. In at least one embodiment, when each thread is executed by or otherwise implemented by the streaming multiprocessor 124, the thread reads or otherwise retrieves its corresponding pre-computed memory address. In at least one embodiment, this centralized approach to memory address computation helps improve efficiency by offloading the task from individual threads to the CUDA execution engine 104.
[0049] In at least one embodiment, the CUDA execution engine 104 performs memory address calculations for the threads to be launched, and then uses the calculated memory addresses to launch those threads, without storing the addresses in memory or a cache. In at least one embodiment, this approach reduces memory overhead while still offloading memory address calculations from individual threads to the CUDA execution engine 104.
[0050] In at least one embodiment, as used herein, a pointer is a variable that stores the base memory address or starting memory location of a memory block containing data to be accessed by multiple concurrent threads. In at least one embodiment, as used herein, a reference refers to a specific memory location. In at least one embodiment, a pointer is a reference pointing to the beginning memory location of a memory block.
[0051] In at least one embodiment, the size of a memory location within a memory block is determined by the data type stored at that memory location. In at least one embodiment, for example, if a pointer is declared as "int*", then the pointer points to a memory location of size int (e.g., 4 bytes) (the beginning memory location of the memory block). In at least one embodiment, the size of the entire memory block pointed to by the pointer at the beginning location is determined by the data type of the elements in the block and the number of elements in the block.
[0052] In at least one embodiment, system 100 can therefore perform the conversion of pointers to memory 122 into different references to threads 116, 118, 120 by using CUDA execution engine 104 or APIs (such as cudaLaunchKernel()) to compute the memory addresses 110, 112, 118 of these threads in a centralized location.
[0053] In at least one embodiment, combined Figure 1 At least one of the described features uses Figure 2-25C The description and / or combination Figure 2-25C Implemented by at least one feature of one or more of any systems and / or one or more processes. In at least one embodiment, combined with Figure 1 At least one of the described features is used to achieve Figure 2-25C The description and / or about Figure 2-25C The description includes at least one feature of one or more systems and / or one or more processes.
[0054] Figure 2This is pseudocode 200 illustrating a kernel according to at least one embodiment and how said kernel is invoked so that an API translates a pointer to memory into a reference to a different memory address in said memory. In at least one embodiment, the kernel "Image Composition()" 202 is written to perform an image composition task to combine two images into one image. In at least one embodiment, kernel 202 is used by each parallel thread's SM (e.g., combined) Figure 1 The described SM 124) is executed. In at least one embodiment, kernel 202 executes in each parallel thread running on said SM but on different data stored in different memory locations.
[0055] In at least one embodiment, kernel 202 takes three references, "&out", "&inA", and "&inB", as input parameters, which refer to memory locations at three different memory addresses. In at least one embodiment, "&inA" is a memory location storing a portion (e.g., one or more pixels) of a source / input image. In at least one embodiment, "&inB" is a memory location storing a portion (e.g., one or more pixels) of another source / input image. In at least one embodiment, "&out" is a memory location storing the result of combining the pixels stored in "&inA" and "&inB".
[0056] In at least one embodiment, the ampersand "&" indicates that the parameter is of type reference, which can point to one of the following memory locations: an initial memory location (i.e., a pointer) or another memory location. In at least one embodiment, the CUDA software stack (e.g., as combined with...) Figure 22 The described software stack (2200) can be modified to allow the CUDA execution engine (e.g., as combined with...) Figure 1 The described CUDA execution engine 104 is able to interpret the ampersand "&" as a reference to a memory location of a memory block rather than the beginning memory location of the memory block.
[0057] In at least one embodiment, kernel 202 is in the host (e.g., in conjunction with...) Figure 9A The CPU described in CPU 902 (a CPU similar to a CPU) is invoked (e.g., requested to be started), and in a device (e.g., as combined) Figure 1 It is started and executed on the processor 102 described.
[0058] In at least one embodiment, as shown in line 206, kernel 202 is invoked using pointers as input parameters “d_out”, “d_inA”, and “d_inB”, as well as block-level and thread-level parameters. In at least one embodiment, “out”, “d_inA”, and “d_inB” are pointers to the beginning memory locations of memory blocks storing the composite image, the first source image, and the second source image, respectively, as defined at line 204. In at least one embodiment, line 206 does not include all the parameters used to start kernel 202.
[0059] In at least one embodiment, the kernel declaration 202 notifies the CUDA execution engine (e.g., as in conjunction with...) Figure 1 The described CUDA execution engine 104) converts a pointer to the beginning memory location of a memory block into a reference to different memory locations within the memory, so that the CUDA execution engine can launch parallel threads to access these memory locations.
[0060] In at least one embodiment, combined Figure 2 At least one of the described features is the use of Figure 1 and Figure 3-25C The description and / or combination thereof Figure 1 and Figure 3-25C This is achieved by at least one feature of one or more of any systems and / or one or more processes. In at least one embodiment, in combination with Figure 2 At least one of the described features is used to achieve Figure 1 and Figure 3-25C The description and / or information in Figure 1 and Figure 3-25C The description includes at least one feature of one or more systems and / or one or more processes.
[0061] Figure 3 This is a block diagram illustrating a thread initiation operation 300 (“Operation 300”) according to at least one embodiment. In at least one embodiment, Operation 300 is one or more computational operations, which, if executed, cause the processor (e.g., in conjunction with...) to... Figure 1 The processor 102 described is configured and the CUDA kernel is started for parallel execution on the GPU.
[0062] In at least one embodiment, operation 300 includes a cudaLaunchKernel API call 302 (“Call 302”) and a cudaLaunchKernel API response 322 (“Response 304”). In at least one embodiment, call 302 is a function call to be executed by one or more software programs (such as CUDA kernels) to be executed by or otherwise implemented by a parallel processing unit (e.g., a GPU). In at least one embodiment, call 302 is executed by a kernel (such as a CUDA kernel) Figure 2 The kernel 202 described is executed. In at least one embodiment, call 302 is a call to instructions for causing one or more processors to perform one or more computational operations. In at least one embodiment, call 902 is a call to an API for causing one or more processors to perform one or more computational operations.
[0063] In at least one embodiment, call 302 receives parameters 304-316 as input. In at least one embodiment, func 304 refers to the name of the kernel to be booted. In at least one embodiment, func 304 refers to a symbol of the kernel in the device-side code generated by NVCC. In at least one embodiment, func 304 points to the compiled binary instructions of the kernel. In at least one embodiment, func 304 calls a kernel function on the block grid gridDim 306 (gridDim.xgridDim.ygridDim.z). In at least one embodiment, each block contains blockDim 308 (blockDim.xblockDim.yblockDim.z). In at least one embodiment, gridDim 306 defines the number of threaded blocks in the grid, thereby allowing multi-dimensional grid configurations (1D, 2D, or 3D). In at least one embodiment, a grid dimension of (10, 10) configures the grid to contain 100 blocks, each block potentially containing multiple threads as specified by blockDim 308. In at least one embodiment, `blockDim` specifies the number of threads within each block. In at least one embodiment, `blockDim` 308 is provided in a 1D, 2D, or 3D configuration format. For example, in at least one embodiment, a block dimension of (16, 16) configures each block to contain 256 threads arranged in a 2D grid. In at least one embodiment, `gridDim` 306 and `blockDim` 308 can be used to compute thread identifiers (also referred to as global indexes of threads) across the entire thread grid to be launched on one or more SMs.
[0064] In at least one embodiment, args 310 refers to the parameters of the kernel to be launched. In at least one embodiment, args 310 explains how the arguments (parameters) of the kernel function are passed from the host (e.g., CPU) to the device (e.g., GPU) when using the cudaLaunchKernel API. In at least one embodiment, if the kernel has N parameters, args 310 should point to an array of N pointers. In at least one embodiment, each pointer from args[0] to args[N-1] points to a memory region from which the actual parameters are copied. In at least one embodiment, sharedmem 312 sets the amount of dynamically shared memory that will be available for each thread block. In at least one embodiment, stream 314 specifies the stream to be invoked.
[0065] In at least one embodiment, strideDim 316 indicates the spacing between memory accesses by different threads. In at least one embodiment, strideDim 316 represents the number of memory elements each thread must skip to reach its assigned data. In at least one embodiment, for example, in a 2D matrix, stride can represent the number of elements in a row. In at least one embodiment, the exact memory reference that each thread should access can be determined by multiplying the stride by a global index to calculate the thread's memory location. In at least one embodiment, for a 1D array, the stride is 1 because each consecutive element in the array is stored next to the previous element. In at least one embodiment, for a 2D array (matrix), the stride is the number of columns in the matrix, meaning that in order to access an element in the next row, the thread will skip an number of elements equal to the stride (the width of the matrix). In at least one embodiment, for example, if the matrix has 10 columns, the stride will be 10, meaning that a thread accessing an element in row 0, column 0 would need to skip 10 elements to access the corresponding element in row 1, column 0. In at least one embodiment, for a 3D array (volume or tensor), each dimension has a stride (e.g., row stride, depth stride). In at least one embodiment, the row stride can be the number of elements in a row, while the depth stride will be the number of elements in a slice of the 3D array.
[0066] In at least one embodiment, dataType 318 indicates the size and type of the data element being processed, such as an integer, a floating-point number, or a more complex structure. In at least one embodiment, dataType 318 can be used to calculate memory addresses because the size of the data type determines how much memory each element occupies. In at least one embodiment, for example, if the data type is a single-precision floating-point number (e.g., 4 bytes), the memory location for each thread must be adjusted accordingly to take the element size into account.
[0067] In at least one embodiment, memPtr 320 represents the starting memory address within the memory block where data is stored. In at least one embodiment, memPtr 320 serves as the base address for all memory accesses performed by the thread. In at least one embodiment, memPtr 320 can be used to calculate the memory location for each thread when combined with the thread's global index, stride, and data type. In at least one embodiment, memPtr 320 can point to a memory location to be accessed by one of these threads (e.g., the thread with thread identifier 0, which identifies the first thread in the first block of the thread grid).
[0068] In at least one embodiment, response 322 includes one or more of a plurality of statuses 323-344. In at least one embodiment, each status is data that includes any additional information presented by operation 300. In at least one embodiment, if a function call is used to execute invocation 302, each status is data that will be output in response to the function call. In at least one embodiment, if an API is used to cause invocation 302 to be executed, each status is data that will be output by the API. In at least one embodiment, each status indicates that operation 300 was executed successfully. In at least one embodiment, each status indicates that operation 300 was not executed successfully or otherwise failed.
[0069] In at least one embodiment, response 322 includes a cudaSuccess 323 indicating that the API call was returned without error. In at least one embodiment, response 322 includes a cudaErrorInvalidDeviceFunction 324 indicating that the requested device function does not exist or is not compiled for the appropriate device architecture. In at least one embodiment, response 322 includes a cudaErrorInvalidConfiguration 326 indicating that the kernel startup is requesting a resource that the current device can never satisfy. In at least one embodiment, requesting more shared memory per block than the device supports will trigger this error, as will requesting too many threads or blocks. In at least one embodiment, response 322 includes a cudaErrorLaunchFailure 328 indicating that an exception occurred on the device while the kernel was being executed or otherwise implemented. In at least one embodiment, the cause includes dereferencing an invalid device pointer and accessing shared memory out of bounds. In at least one embodiment, response 322 includes a cudaErrorLaunchTimeout 330 indicating that the device kernel execution or otherwise implementation took too long. In at least one embodiment, this state only occurs if timeout is enabled. In at least one embodiment, this state leaves the process in an inconsistent state, where any other CUDA job will return the same error. In at least one embodiment, the process must be terminated and restarted to continue using CUDA. In at least one embodiment, response 322 includes a cudaErrorLaunchOutofResource 332 indicating that the kernel launch did not occur because it did not have the appropriate resources. In at least one embodiment, response 322 includes a cudaErrorSharedObjectFailed 334 indicating that the initialization of a shared object failed. In at least one embodiment, response 322 includes a cudaErrorInvalidPtx 336 indicating that the PTX compilation failed. In at least one embodiment, when this state occurs, the runtime can fall back to compiling the PTX if the application does not contain binaries suitable for the current device. In at least one embodiment, response 322 includes a cudaErrorUnsupportedPtxVersion 338 indicating that the provided PTX was compiled using an unsupported toolchain. In at least one embodiment, the most common reason for this situation is that the PTX was generated by a compiler newer than the compiler supported by the CUDA driver and the PTX JIT compiler.In at least one embodiment, response 322 includes a cudaErrorNoKernelImageForDevice state 340 indicating that no kernel image suitable for the device is available. In at least one embodiment, this state occurs when the user specifies code generation options for a specific CUDA source file that does not include the corresponding device configuration. In at least one embodiment, response 322 includes a cudaErrorJitCompilerNotFound state 342 indicating that the PTX JIT compiler library was not found and that the JIT compiler library was used for PTX compilation. In at least one embodiment, if the application does not contain binaries suitable for the current device, the runtime can fall back to compiling PTX. In at least one embodiment, response 322 includes a cudaErrorJitCompilationDisabled state 344 indicating that JIT compilation is disabled. In at least one embodiment, JIT compilation compiles PTX. In at least one embodiment, if the application does not contain binaries suitable for the current device, the runtime can fall back to compiling PTX.
[0070] In at least one embodiment, combined Figure 3 At least one of the described features is the use of Figure 1-2 and Figure 4-25C The description and / or combination thereof Figure 1-2 and Figure 4-25C This is achieved by at least one feature of one or more of any systems and / or one or more processes. In at least one embodiment, in combination with Figure 3 At least one of the described features is used to achieve Figure 1-2 and Figure 4-25C The description and / or information in Figure 1-2 and Figure 4-25C The description includes at least one feature of one or more systems and / or one or more processes.
[0071] Figure 4 This is a block diagram illustrating a process 400 for launching a thread via a CUDA execution engine according to at least one embodiment. In at least one embodiment, these steps can be performed by a processor (such as...) Figure 1 The processor described is used for execution.
[0072] In at least one embodiment, at step 402, the processor receives a request from the host (e.g., the CPU) to boot a kernel on the processor. In at least one embodiment, the request may be in conjunction with, for example... Figure 3 Parameters similar to those described are used as input. In at least one embodiment, the request is combined with... Figure 3The described call is similar to a 302 API call.
[0073] In at least one embodiment, at step 404, the processor identifies parameters of the request. In at least one embodiment, these parameters may include grid dimensions, block dimensions, stride dimensions, data types, and pointer information necessary for configuring the kernel startup. In at least one embodiment, these parameters include, as in combination with... Figure 3 All parameters described are 304-320.
[0074] In at least one embodiment, at step 406, the processor calculates the memory location of each thread to be launched, including the kernel. In at least one embodiment, calculating the memory location of the threads to be launched involves using identified parameters including grid dimensions, block dimensions, stride dimensions, data types, and pointers. In at least one embodiment, the process begins by calculating a global index, which serves as a unique identifier for each thread across the entire thread grid to be launched. In at least one embodiment, the grid dimension specifies the number of blocks in each dimension (such as x, y, or z), and the block dimension specifies the number of threads within each block along these same dimensions. In at least one embodiment, each thread has a local index within its block, and each block has a block index within the grid. In at least one embodiment, the global index is determined by combining these indices and dimensions to identify the location of each thread configured across the entire grid.
[0075] In at least one embodiment, for example, in a one-dimensional mesh configuration, the global index is calculated by multiplying the block index along the x-axis by the block dimension along the same axis, and then adding it to the local thread index within the block. In a two-dimensional mesh, the calculation is extended by considering both the mesh dimension and the block dimension along both the x-axis and y-axis, and in a three-dimensional mesh, the global index incorporates all three axes to determine a unique thread identifier.
[0076] In at least one embodiment, a stride dimension is applied to determine the appropriate spacing between memory elements accessed by different threads. In at least one embodiment, the stride represents the number of memory elements a thread must skip to reach its assigned data. In at least one embodiment, for example, in a two-dimensional matrix, the stride may correspond to the number of elements in a row, thereby ensuring that threads access data in different rows without overlap. In at least one embodiment, the exact memory location to be accessed by each thread is determined by calculating the memory offset for each thread by multiplying the global index by the stride.
[0077] In at least one embodiment, the data type is used to adjust the memory offset based on the size of each data element being processed. In at least one embodiment, the size of the data type (such as four bytes for a single-precision floating-point number or eight bytes for a double-precision floating-point number) is used to scale the memory offset accordingly. In at least one embodiment, this adjustment ensures that the memory location reflects the actual size of each data element, wherein each thread accesses a different portion of memory as determined by the adjusted memory offset.
[0078] In at least one embodiment, the pointer serves as the base address for all memory accesses. In at least one embodiment, the pointer represents the starting memory address of the memory block in which data to be accessed by the thread to be started is stored. In at least one embodiment, to determine the memory location of each thread, an adjusted memory offset is added to the base address provided by the pointer.
[0079] In at least one embodiment, for example, a 1D mesh configuration has four blocks, each containing eight threads. In at least one embodiment, the mesh dimension and block dimension are used to calculate the global index of the thread (thread 3) in block 1, resulting in a global index of 11. In at least one embodiment, the stride dimension (which can be set to 1 for a simple one-dimensional array) is then applied to calculate a memory offset of 11. In at least one embodiment, given a single-precision floating-point data type occupying four bytes, the memory offset is adjusted to take the element size into account, resulting in an adjusted memory offset of 44. In at least one embodiment, the base address, then specified as pointer value 0x1000, is combined with the adjusted memory offset to determine a final memory location of 0x102C.
[0080] In at least one embodiment, at step 408, the processor stores the computed memory locations of the thread to be launched in a storage device (such as shared memory, L1 cache, or L2 cache). In at least one embodiment, this step is optional. In at least one embodiment, the CUDA execution engine can directly launch the parallel thread to access these memory addresses without storing these memory addresses in storage locations.
[0081] In at least one embodiment, at step 410, the processor starts the thread to execute the kernel using the different memory addresses calculated by the CUDA execution engine.
[0082] In at least one embodiment, the process 400 uses the CUDA execution engine or API to perform memory location calculations for all threads in the entire grid that are to be launched once at the central location, rather than allowing each thread to perform its own memory location calculation, thereby reducing the computational overhead for launching these threads. In at least one embodiment, the process also simplifies the code that developers need to write. In at least one embodiment, developers no longer need to manually implement memory address calculations within the kernel code repeated for each thread. In at least one embodiment, the process 400 abstracts these calculations, thereby providing a simplified programming model that makes CUDA programming more accessible and reduces the likelihood of programming errors related to memory address calculations.
[0083] In at least one embodiment, process 400 allows code used to compute memory allocations in the kernel to be moved to the CUDA execution engine, thereby producing functional programming functions. In at least one embodiment, the output of a functional programming function is determined solely by its input parameters, without depending on or modifying any external variables or state. In at least one embodiment, for a functional programming function, given the same input values, it produces the same output, making the function predictable and consistent. In at least one embodiment, support for functional programming functions in the CUDA programming model enables high-level languages such as Python to directly access these functional programming functions without using an intermediate library stack.
[0084] In at least one embodiment, combined Figure 4 At least one of the described features uses Figure 1-3 and Figure 5-25C The description and / or combination thereof Figure 1-3 and Figure 5-25C This is implemented by at least one feature of one or more of any systems and / or one or more processes. In at least one embodiment, in combination with Figure 4 At least one of the described features is used to achieve Figure 4 and Figure 5-25C The description and / or information in Figure 4 and Figure 5-25C The description includes at least one feature of one or more systems and / or one or more processes.
[0085] Figure 5 This is a block diagram 500 illustrating a driver and / or runtime comprising one or more libraries for providing one or more application programming interfaces (APIs) according to at least one embodiment. In at least one embodiment, the software program 502 is stored on a processor (such as... Figure 1The software modules described herein. In at least one embodiment, software program 502 includes one or more software modules. In at least one embodiment, software program 502 is a collection of software code, commands, instructions, or other text sequences that instruct a computing device to perform one or more computational operations and / or invoke one or more other instruction sets (such as one or more APIs 510 or one or more API functions 512) to be executed. In at least one embodiment, the software program is a compiler.
[0086] In at least one embodiment, one or more APIs 510 are software instruction sets that, when executed, cause one or more processors to perform one or more computational operations. In at least one embodiment, one or more APIs 510 are distributed or otherwise provided as part of one or more libraries 506, runtimes 504, drivers, and / or any other software and / or executable code groups further described herein. In at least one embodiment, one or more APIs 510 perform one or more computational operations in response to a call to software program 502. In at least one embodiment, the functionality provided by one or more APIs 510 includes one or more software functions 512, such as those capable of accelerating one or more portions of software program 502 using one or more parallel processing units (PPUs) (such as graphics processing units (GPUs)).
[0087] In at least one embodiment, one or more APIs 510 are hardware interfaces for one or more circuits to perform one or more computational operations. In at least one embodiment, one or more APIs 510 described herein are implemented for performing combinations Figure 1-20B One or more circuits representing one or more of the techniques described. In at least one embodiment, the software program 502 includes instructions that, if executed, cause one or more hardware devices and / or circuits to perform the above-described combination. Figure 1-4 One or more technologies described.
[0088] In at least one embodiment, a software program 502, such as a user-implemented software program, utilizes one or more APIs 510 to perform various computational operations, such as the memory location of a computation thread, starting a thread, or any computational operation performed by a parallel processing unit (PPU) (such as a graphics processing unit (GPU)), as further described herein. In at least one embodiment, one or more APIs 510 provide a set of callable functions 512 (referred to herein as APIs, API functions, and / or functions), which each perform one or more computational operations, such as computational operations related to parallel computing.
[0089] In at least one embodiment, software program 502 interacts with or otherwise communicates with one or more APIs 510 to perform one or more computational operations using one or more PPUs (such as GPUs). In at least one embodiment, the computational operations include at least a group of one or more computational operations that will be accelerated by being executed at least partially by the PPUs. In at least one embodiment, software program 502 interacts with one or more APIs 510 to facilitate parallel computing using remote or local interfaces.
[0090] In at least one embodiment, the interface is software instructions that, when executed, provide access to one or more functions 512 provided by one or more APIs 510. In at least one embodiment, when a software developer compiles a software program 502 in conjunction with one or more libraries 506, the software program uses a native interface, the one or more libraries 506 including one or more APIs 510 or otherwise providing access to one or more APIs 510. In at least one embodiment, the software program 502 is statically compiled in conjunction with pre-compiled libraries 506 or uncompiled source code including instructions for executing one or more APIs 510. In at least one embodiment, the software program 502 is dynamically compiled, and the one or more software programs are linked to one or more pre-compiled libraries 506 including one or more APIs 510 using a linker.
[0091] In at least one embodiment, when a software developer executes a software program that utilizes a library 506 including one or more APIs 510 or otherwise communicates with the library 506 via a network or other remote communication medium, the software program 502 uses a remote interface. In at least one embodiment, the one or more libraries 506 including one or more APIs 510 will be executed by a remote computing service such as a computing resource service provider. In another embodiment, the one or more libraries 506 including one or more APIs 510 will be executed by any other computing host that provides the one or more APIs 510 to the software program 502.
[0092] In at least one embodiment, software program 502 utilizes one or more APIs 510 to allocate and otherwise manage memory to be used by software program 502. In at least one embodiment, software program 502 utilizes one or more APIs 510 to allocate and manage memory to be used by one or more portions of software program 502, which will be accelerated using one or more PPUs (such as GPUs or any other accelerators or processors further described herein). In at least one embodiment, software program 502 selects one or more portions to be deactivated during the training of one or more neural networks based at least in part on whether one or more portions of the one or more neural networks will be used after training the one or more neural networks.
[0093] In at least one embodiment, each of the one or more APIs 510 is an API for facilitating parallel computing. In at least one embodiment, each of the one or more APIs 510 is any other API further described herein. In at least one embodiment, each of the one or more APIs 510 is provided by a driver and / or runtime 504. In at least one embodiment, each of the one or more APIs 510 is provided by a CUDA user-mode driver. In at least one embodiment, each of the one or more APIs 510 is provided by a CUDA runtime. In at least one embodiment, the driver is data values and software instructions that, if executed, perform or otherwise facilitate the operation of one or more functions 512 of one or more APIs 510 during the loading and execution of one or more portions of the software program 502. In at least one embodiment, the runtime 504 is data values and software instructions that, if executed, perform or otherwise facilitate the operation of one or more functions 512 of one or more APIs 510 during the execution of the software program 502. In at least one embodiment, software program 502 utilizes one or more APIs 510 implemented by or otherwise provided by a driver and / or runtime 504 to perform combinatorial arithmetic operations (thread-specific memory access computations performed by software program 502 during execution by one or more PPUs (such as GPUs).
[0094] In at least one embodiment, software program 502 utilizes one or more APIs 510 provided by the driver and / or runtime 504 to perform combined arithmetic operations on one or more PPUs (such as GPUs). In at least one embodiment, one or more APIs 510 provide combined arithmetic operations via the driver and / or runtime 504, as described above. In at least one embodiment, software program 502 utilizes one or more APIs 510 provided by the driver and / or runtime 504 to allocate or otherwise reserve one or more blocks of memory 514 for one or more PPUs (such as GPUs). In at least one embodiment, software program 502 utilizes one or more APIs 510 provided by the driver and / or runtime 504 to allocate or otherwise reserve blocks of memory. In at least one embodiment, one or more APIs 510 are used to perform thread memory address calculations and thread initiation, as described herein. Figure 1-4 As described. Figure 1 This is a block diagram illustrating a system for calculating the memory address of a thread according to at least one embodiment.
[0095] In at least one embodiment, block diagram 500 depicts a processor including one or more circuits for executing one or more software programs to combine two or more application programming interfaces (APIs) into a single API. In at least one embodiment, block diagram 500 depicts a system including one or more processors for executing one or more software programs to combine two or more application programming interfaces (APIs) into a single API. In at least one embodiment, the API is used to identify one or more expected software outputs, which will be used to compare with one or more other software outputs to be generated by the software.
[0096] Figure 6An example of a system 600, according to at least one embodiment, may include software and hardware for executing an API to generate one or more addresses corresponding to one or more instructions corresponding to the same software kernel or otherwise performing any of the operations described herein. System 600 may include a storage device 602 and one or more processors 608. Storage device 602 may include, for example, memory, a cache, or other storage devices further described herein. Storage device 602 may be separate from one or more processors 608, or storage device 602 may be included in one or more processors 608 (e.g., in storage device 612). In at least one embodiment, software program 604 and / or software library (or instructions) 606 may be stored in memory, a cache, or other storage device and provided to one or more processors 608 to cause one or more circuits of one or more processors 608 to perform the operations described herein. In at least one embodiment, software program 604 and / or software library (or instructions) 606 may be integrated into one or more circuits of one or more processors 608. A software program 604 that can be used to perform any of the operations described herein can be stored on a storage device 602.
[0097] In at least one embodiment, the software program 604 may include one or more software modules. In at least one embodiment, the one or more software modules include, as in combination with... Figure 1 The CUDA execution engine 104 is described.
[0098] In at least one embodiment, as used in any implementation described herein, unless explicitly stated otherwise from the context or explicitly to the contrary, a module refers to any combination of software logic, firmware logic, hardware logic, and / or circuitry configured to provide the functionality described herein. In at least one embodiment, software is embodied as a software package, code, and / or instruction set or instructions, and "hardware" as used in any implementation described herein includes, for example, individually or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry, fixed-function circuitry, execution unit circuitry, and / or firmware storing instructions executed by the programmable circuitry. In at least one embodiment, modules are embodied collectively or individually as circuitry forming part of a larger system, such as integrated circuits (ICs), system-on-a-chip (SoCs), etc. In at least one embodiment, a module performs one or more processes associated with any suitable processing unit and / or combination of processing units such as one or more CPUs, GPUs, GPGPUs, PPUs, and / or variations thereof (including variations further described herein).
[0099] In at least one embodiment, software program 604 may include a collection of software code, commands, instructions, or other sequences of text for instructing a computing device to perform one or more computational operations and / or to invoke one or more other instruction sets (such as one or more APIs or one or more API functions or instruction set architecture (ISA) level instructions) to be executed or otherwise implemented. In at least one embodiment, the API includes an API for executing the API to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described herein.
[0100] Instructions (e.g., hardware instructions) or microcode may involve ISA-level instructions, which may include native ISA instructions or non-native ISA commands. Software program 604 and / or software library (or instruction) 606 (e.g., one or more modules) may be distributed across multiple processors communicating via a bus, network, by writing to shared memory, and / or any suitable communication procedure (such as the communication procedure described herein).
[0101] In at least one embodiment, system 600 may include one or more software libraries 606, which may, for example, provide one or more API and / or ISA instructions. In at least one embodiment, one or more API and / or ISA instructions may be used to execute the API to generate one or more addresses of one or more instructions corresponding to the same software kernel. In at least one embodiment, one or more software libraries 606 may be included in a driver and / or runtime. In at least one embodiment, software library 606 (e.g., including one or more API and / or ISA instructions) may include software instruction sets that, if executed or otherwise implemented, cause one or more processors 608 to perform one or more computational operations, such as any of the operations described herein. In at least one embodiment, one or more API and / or ISA instructions may be distributed or otherwise provided as part of one or more software libraries 606, runtimes, drivers, and / or any other software and / or executable code groups further described herein. In at least one embodiment, one or more API and / or ISA instructions may perform one or more computational operations in response to a call to software program 604.
[0102] One or more processors 608 may include any number of processors and any suitable processing units and / or combinations of processing units, such as, but not limited to, a central processing unit (“CPU”), a graphics processing unit (“GPU”), or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, parallel processors, GPGPUs, DPUs, and / or variations thereof, including those further described herein), such other processors including any processors described herein, such as, but not limited to, those described herein. Figure 8-20B The processors in the memory. In at least one embodiment, one or more processors 608 may retrieve or fetch instructions (e.g., one or more API and / or ISA instructions) from memory 602 using, for example, instruction fetch 616 (e.g., for an instruction fetch phase). Instructions may include instructions for executing APIs to generate one or more addresses of one or more instructions corresponding to the same software kernel. In at least one embodiment, one or more processors 608 may include memory 612 and instruction queue 610 for storing and queuing instructions fetched from memory 602. In at least one embodiment, the fetched instructions may be decoded by decoder 618 to determine what operation one or more processors 608 should perform (e.g., in an instruction decoding phase). In at least one embodiment, one or more processors 608 may fetch additional operands (data) that can be used for instructions, and operands may be stored, for example, in registers or memory 612. In at least one embodiment, micro-operation 620 may perform operations on the data stored in one or more registers or memory 612. For example, each step of an instruction fetched by one or more processors 608 can be broken down during execution, so that one or more processors 608 can execute the instruction step by step through a series of micro-operations 620. In at least one embodiment, the program counter (PC) 614 can hold the address of the next instruction and can be updated to point to the next instruction to be executed by one or more processors 608.
[0103] In at least one embodiment, one or more processors 608 may (e.g., during the execution phase) execute instructions. For example, one or more processors 608 may execute operations specified by the instructions, such as arithmetic operations, logical operations, or data transfers. In at least one embodiment, one or more computing units 622 may execute instructions for performing any of the operations described herein. In at least one embodiment, one or more computing units may include one or more ALUs 624 (Arithmetic Logic Units) that can be used to perform arithmetic and logical operations. In at least one embodiment, one or more computing units may include one or more FPUs (Floating Point Units) 626 that can be used to perform floating-point calculations. In at least one embodiment, other circuitry 628 may be used to perform other operations, such as vector operations and / or scalar operations. In at least one embodiment, one or more accelerators 630 may include one or more matrix multiplication accelerators, one or more parallel processing units (PPUs) (such as GPUs), or any other accelerators or processors further described herein. In at least one embodiment, software program 604 may utilize one or more API and / or ISA instructions to perform various computational operations, such as matrix multiplication, arithmetic operations, or any other computational operations further described herein, using one or more accelerators 630. In at least one embodiment, one or more computational operations using one or more accelerators 630 may at least include a group of one or more computational operations that will be accelerated by execution at least partially by one or more accelerators 630, the group of one or more computational operations including: executing APIs to generate one or more addresses of one or more instructions corresponding to the same software kernel.
[0104] In at least one embodiment, system 600 can be used to execute one or more instructions including functions or operations (such as those described in conjunction with 1-4). In at least one embodiment, system 600, including one or more processors, causes one or more circuits to execute an API to generate one or more addresses of one or more instructions corresponding to the same software kernel and / or otherwise perform the operations described herein. In at least one embodiment, system 600 is included Figure 1-5 The system illustrated herein and / or otherwise includes Figure 1-5 The system illustrated herein is configured to cause one or more circuits to execute an API to generate one or more addresses of one or more instructions corresponding to the same software kernel and / or otherwise perform the operations described herein. In at least one embodiment, system 600 includes Figure 6-25CThe illustrations depict one or more hardware components, such as those used to execute APIs to generate one or more addresses of one or more instructions corresponding to the same software kernel and / or otherwise perform the operations described herein.
[0105] In at least one embodiment, some or all of the processes described herein (or any other processes described, or variations and / or combinations thereof) may be executed under the control of one or more computer systems configured with executable instructions and / or other data, and may be implemented as executable instructions that execute jointly on a combination of processor 102 or multiple processors 100. In at least one embodiment, the executable instructions and / or other data may be stored on a non-transitory computer-readable storage medium (e.g., a computer program persistently stored on a magnetic, optical, or flash memory medium).
[0106] Data Center
[0107] Figure 7 An example data center 700 according to at least one embodiment is illustrated. The data center 700 may include one or more rooms having racks 702 and auxiliary equipment for housing one or more racks 702 and one or more substrates 704. Racks 702 may include one or more substrates 704. Racks 702 may include housings for housing and supporting individual substrates 704. Operational aspects of racks 702 may be adjustable at the rack level (corresponding to a group of substrates 704) or at the substrate level (corresponding to individual substrates 704), among other options. Racks 702 or substrates 704 may have specific selected maximum operating parameters, such as, but not limited to, power consumption, operating frequency, etc. The data center 700 may be supported by various cooling systems, such as, but not limited to, cooling towers, cooling loops, pumps, and other support systems. The cooling system may include sensors and controllers for monitoring and managing the cooling characteristics of racks 702. The substrates 704 within racks 702 may draw operating power from one or more power distribution units (PDUs; not shown). PDUs can be arranged within racks 702, for example, between racks 702 that include substrates 704, or within racks 702 that also house substrates 704.
[0108] The rack 702 and substrate 704 may include subsystems, modules, add-in cards, and other semiconductor components. Substrate 704 may include one or more computing units 706, each computing unit 706 including one or more processors 708, one or more memories 710, and an interface controller 712. The computing unit 706 may include any number of processors, such as, but not limited to, a central processing unit (“CPU”), a graphics processing unit (“GPU”), or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, etc.), including any processor described herein, such as, but not limited to, those described herein. Figure 8-20B The processor in the computing unit 706 may include one or more memory storage devices 710 (e.g., dynamic read-only memory, solid-state storage devices, or disk drives), as well as network input / output (“NW I / O”) devices, network switches, virtual machines (“VMs”), power supply modules, and cooling modules, etc. One or more computing units 706 may be a server having one or more of the aforementioned computing resources.
[0109] Computing unit 706 may include individual computing unit groups housed in one or more racks (not shown), or in numerous racks within data centers in different geographical locations (also not shown). Individual computing unit groups may include grouped computing, networking, memory, or storage resources that can be configured or allocated to support one or more workloads. Several computing units (e.g., including CPUs and / or other processors) may be grouped within one or more racks to provide computing resources to support one or more workloads. Resource coordinator 714 may configure or otherwise control one or more computing units 706 or groups of computing units. Resource coordinator 714 may include a Software Design Infrastructure (“SDI”) management entity for data center 700. Resource coordinator 714 may include hardware, software, or some combination thereof.
[0110] Data center 700 may include any one or any combination of the framework layer 720, software layer 730, and application layer 740. For example... Figure 7As shown, framework layer 720 includes a job scheduler 722, a configuration manager 724, a resource manager 726, and a distributed file system 728. Framework layer 720 may include a framework for supporting software 732 of software layer 730 and / or one or more applications 742 of application layer 740. Software 732 or application 742 may respectively include web-based service software or applications, such as, but not limited to, software or applications provided by Amazon Web Services, Google Cloud, and Microsoft Azure. Framework layer 720 may be a type of free and open-source software web application framework, such as, but not limited to, Apache Spark. TM (Hereinafter referred to as "Spark"), which can utilize the distributed file system 728 for large-scale data processing (e.g., "big data"). The job scheduler 722 may include Spark drivers, which facilitate the scheduling of workloads supported by various layers of the data center 700. The configuration manager 724 may be able to configure different layers, such as, but not limited to, the software layer 730 and the framework layer 720 (which includes Spark and the distributed file system 728 for supporting large-scale data processing). The resource manager 726 may be able to manage clustered or grouped compute units 706 mapped to or allocated to support the distributed file system 728 and the job scheduler 722. The resource manager 726 may coordinate with the resource coordinator 714 to manage these mapped or allocated compute resources.
[0111] Software 732 may be included in software layer 730, and may include software used by at least a portion of computing units 706, one or more computing units 706, groups of computing units 706, and / or the distributed file system 728 of framework layer 720. One or more types of software may include, but are not limited to, internet web search software, email virus scanning software, database software, and streaming video content software.
[0112] Application 742 may be included in application layer 740 and may include one or more types of applications used by at least portions of computing unit 706, one or more computing units 706, groups of computing units 706, and / or the distributed file system 728 of framework layer 720. One or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.
[0113] Any of the Configuration Manager 724, Resource Manager 726, and Resource Coordinator 714 can implement any number and type of self-modification actions based on any amount and type of data obtained in any technically feasible manner. Self-modification actions can alleviate the burden on data center operators of Data Center 700 to make potentially erroneous configuration decisions and may prevent underutilized and / or poorly performing portions of the data center.
[0114] Data center 700 may include tools, services, software, or other resources for training one or more machine learning models according to one or more embodiments described herein, or for using one or more machine learning models to predict or infer information. For example, a machine learning model can be trained by calculating weight parameters based on a neural network architecture using the software and computing resources described above regarding data center 700. The trained machine learning model corresponding to one or more neural networks can be used with the resources described above regarding data center 700 to infer or predict information using weight parameters calculated through one or more training techniques described herein.
[0115] Data Center 700 can use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, or other hardware (e.g., Figure 8-20B The embodiments described herein can be used to perform some or all of the processes and techniques described elsewhere, such as, but not limited to, training and / or inference using the resources described above. Furthermore, one or more of the software and / or hardware resources described above can be configured as a service to allow a user to train or perform information inference, such as, but not limited to, image recognition, speech recognition, or other artificial intelligence services.
[0116] In at least one embodiment, processor 708 may include one of the processors described below and / or include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. In at least one embodiment, processor 708 is configured by software 732 to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. Data center 700 may use logic, CPU, application-specific integrated circuit (ASIC), GPU, FPGA or other hardware (e.g., Figure 8-20B The embodiments described above or elsewhere in this document may perform any of the operations described in the examples.
[0117] processor
[0118] The following figures illustrate, but are not limited to, example processors and processing systems that can be used to execute application programming interfaces (APIs) to generate one or more addresses of one or more instructions corresponding to the same software kernel, or otherwise perform some or all of the processes, operations, and / or techniques described elsewhere herein. The example processors and processing systems can be software-configured to execute application programming interfaces (APIs) to generate one or more addresses of one or more instructions corresponding to the same software kernel, or otherwise perform any of the operations described above or elsewhere herein. Processors and processing systems may include logic, central processing units (CPUs), application-specific integrated circuits (ASICs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), XPUs (i.e., any computing architecture best suited to the needs of the application), or other hardware (e.g., Figure 8-20B The embodiments described herein are used to perform any of the operations described above, below, or elsewhere herein. The processor and / or processing system described herein may include one or more circuits that can be used to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein. As used herein, one or more circuits may be software-configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein. Figure 25A and Figure 25B The illustration depicts logic 2515 according to at least one embodiment, which, as described elsewhere herein, can be used in one or more devices to perform operations such as, but not limited to, those discussed herein. For example, logic can refer to any combination of software logic, hardware logic, and / or firmware logic that provides the functionality and / or operations described herein, wherein the logic can be collectively or individually embodied as part of a circuit system forming a larger system, such as an integrated circuit (IC), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), system-on-a-chip (SoC), or one or more processors (e.g., CPU, GPU).
[0119] Figure 8A processor according to at least one embodiment is illustrated, which is a system-on-a-chip (SOC) 800 (which may be referred to as a system-on-a-chip, superchip, or other names). The SOC 800 may include processor complexes 810 and 840. The SOC 800 may include any number of processor complexes 810 and / or processor complexes 840, which may include any number of processors described herein in any combination, such as, but not limited to, processors described herein in any combination. Figure 8-20B The processor in the system. For example, processor 810 may include a central processing unit (CPU), and processor 840 may include a graphics processor. Alternatively, processor 810 may include a graphics processor, and processor 840 may include a graphics processor. The SOC 800 may include any number of display controllers 892, any number of multimedia engines 894, any number of I / O interfaces 870, any number of memory controllers 880, and any number of fabrics 860 in any combination. For ease of explanation, this document uses reference numbers identifying objects and bracket numbers identifying instances (if needed) to denote multiple instances of similar objects. The SOC 800 may include a processor from Broadcom Corporation, Palo Alto, California.
[0120] Processor complex 810 may include a CPU, processor complex 840 may include a GPU, and SOC 800 may include a processing unit integrating processor complex 810 and processor complex 840 onto a single chip. Certain tasks may be assigned to processor complex 810, while other tasks may be assigned to processor complex 840. Processor complex 810 may be configured to execute main control software associated with SOC 800, such as, but not limited to, an operating system. Processor complex 810 may be the main processor of SOC 800, controlling and coordinating the operation of other processors. Processor complex 810 may issue commands that control the operation of processor complex 840 to perform some or all of the operations described herein. Processor complex 810 may be configured to execute host-executable code derived from CUDA or other source code (e.g., HIP source code), while processor complex 840 may be configured to execute device-executable code derived from CUDA or other source code to perform any of the operations described herein.
[0121] The processor complex 810 may include cores 820(1)-820(4) and a cache (e.g., L3 cache) 830 for storing information for performing the operations described herein. The processor complex 810 may include any number of cores 820 in any combination and any number and type of cache. The cores 820 may be configured to execute instructions of a specific instruction set architecture (“ISA”) to perform some or all of the operations described herein. Each core 820 may include a CPU core. Cores 820(1)-820(4) may be referred to as compute units or arithmetic units. The SOC 800 may include any number of processor complexes 810, architecture 860, I / O interface 870, and memory controller 880.
[0122] Each core 820 may include a fetch / decode unit 822, an integer execution engine 824, a floating-point execution engine 826, and an L2 cache 828. The fetch / decode unit 822 may fetch instructions to perform some or all of the operations described herein (e.g., but not limited to APIs compiled into instructions) and decode those instructions, generate micro-operations, and dispatch individual micro-instructions to the integer execution engine 824 and / or the floating-point execution engine 826. The fetch / decode unit 822 may concurrently dispatch one micro-instruction to the integer execution engine 824 and another micro-instruction to the floating-point execution engine 826. The integer execution engine 824 may perform integer and memory operations. The floating-point engine 826 may perform floating-point and vector operations. The fetch / decode unit 822 may dispatch micro-instructions to one or more execution engines, which may replace both the integer execution engine 824 and the floating-point execution engine 826.
[0123] Each core 820(i) (where i is an integer representing a specific instance of core 820) can access the L2 cache 828(i) included in core 820(i). Each core 820 included in core complex 810(j) (where j is an integer representing a specific instance of core complex 810) can be connected to other cores 820 included in core complex 810(j) via the L3 cache 830(j) included in core complex 810(j). The cores 820 included in core complex 810(j) (where j is an integer representing a specific instance of core complex 810) can access all L3 caches 830(j) included in core complex 810(j). The L3 cache 830 can include any number of slices.
[0124] Processor complex 840 may be a graphics complex that can be configured to perform computational operations (e.g., the computational operations described herein) in a highly parallel manner. Processor complex 840 may be configured to perform graphics pipeline operations, such as, but not limited to, drawing commands, pixel operations, geometric calculations, and other operations associated with rendering an image to a display. Processor complex 840 may be configured to perform graphics-independent operations, such as, but not limited to, neural network training and / or simulation. Processor complex 840 may be configured to perform both graphics-related and graphics-independent operations.
[0125] The processor complex 840 may include any number of compute units 850(1)-850(N) (where N is any integer greater than 1) and an L2 cache 842. The compute units 850 may share the L2 cache 842, which may store information that will be used to perform some or all of the operations described herein. The L2 cache 842 may be partitioned. The processor complex 840 may include any number of compute units 850 and any number (including zero) and type of cache. The processor complex 840 may include any number of dedicated graphics hardware.
[0126] Each compute unit 850 may include any number of SIMD units 852(1)-852(N) (where N is any integer greater than 1) and shared memory 854. Each SIMD unit 852 may implement a SIMD architecture and may be configured to perform some or all of the operations described herein in parallel. Each compute unit 850 may execute any number of thread blocks, but each thread block may execute on a single compute unit 850, although in some embodiments, the thread block may execute on multiple compute units. A thread block may include any number of execution threads. A workgroup may be a thread block. Each SIMD unit 852 may execute a set of threads. A set of threads (e.g., 16 threads), also referred to as a warp, subgroup, or wavefront (e.g., used by AMD and Intel), may belong to a single thread block and be configured to process different datasets based on a single instruction set. Prediction may be used to disable one or more threads in a warp, subgroup, or wavefront. A lane may be a thread. A work item can be a thread, such as (but not limited to) an OpenCL thread. Different thread bundles, subgroups, or wavefronts within a thread block can be synchronized together and communicate via shared memory 854. Each compute unit 850 can include one or more thread block clusters, where thread block clusters can implement programmable control over locality at a larger granularity than a single thread block in a single streaming multiprocessor (SM). Thread block clusters (also referred to as “clusters”) can support multiple thread blocks running concurrently across streaming multiprocessors, thereby synchronously and cooperatively acquiring, exchanging, or otherwise using data. In at least one embodiment, a streaming multiprocessor (“SM”) can refer to a streaming microprocessor, a streaming processor (“SP”), a streaming processing unit (“SPU”), a compute unit (“CU”), an execution unit (“EU”), and / or a slice, where a slice in this context can refer to a portion of the processing resources within a processing unit (e.g., 16 cores, a ray tracing unit, a thread bootstrap, or a scheduler).
[0127] Structure 860 may be a system interconnect that facilitates data and control transfers across processor complex 810, processor complex 840, I / O interface 870, memory controller 880, display controller 892, and multimedia engine 894, for example, to perform some or all of the operations described herein. SOC 800 may include any number and type of system interconnects other than or replacing structure 860, facilitating data and control transfers across any number and type of directly or indirectly linked components within or outside SOC 800. I / O interface 870 may represent any number and type of I / O interfaces (e.g., PCI, PCI extensions (“PCI-X”), PCIe, Gigabit Ethernet (“GBE”), USB, etc.). Various types of peripheral devices may be coupled to I / O interface 870. Peripherals that may be coupled to I / O interface 870 may include keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, etc.
[0128] Display controller 892 can display images on one or more display devices, such as, but not limited to, liquid crystal displays (“LCD”) devices. Multimedia engine 894 can include any number and type of circuitry related to multimedia, such as, but not limited to, video decoders, video encoders, image signal processors, etc. Memory controller 880 can facilitate data transfer between SOC 800 and unified system memory 890. Processor complex 810 and processor complex 840 can share unified system memory 890. Unified system memory 890 can include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as, but not limited to, synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. Unified system memory 890 can include 3D stacked memory, including but not limited to high bandwidth memory (HBM), HBM2e, or HDM3.
[0129] The SOC 800 can implement a memory subsystem including any number and type of memory controllers 880 and memory devices (e.g., shared memory 854), which may be dedicated to a single component or shared among multiple components to perform any of the operations described herein. The SOC 800 can implement a cache subsystem including one or more cache memories (e.g., L2 cache 828, L3 cache 830, and L2 cache 842), each cache memory may be dedicated to any number of components (e.g., core 820, core complex 810, SIMD unit 852, compute unit 850, and processor complex 840), or may be shared among any number of components (e.g., core 820, core complex 810, SIMD unit 852, compute unit 850, and processor complex 840).
[0130] In at least one embodiment, the SOC 800 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0131] Figure 9A A parallel processor 900 according to at least one embodiment is illustrated. The parallel processor 900 may be implemented using one or more circuits and may be referred to as a programmable processor (e.g., CPU and / or GPU), logic, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other hardware (e.g., Figure 8-20B (The embodiments in this document) are used to perform any of the operations described above or elsewhere herein.
[0132] Parallel processor 900 may include parallel processing unit 902 for performing any of the operations described above or elsewhere herein. Parallel processing unit 902 may include I / O unit 904 that enables communication with other devices, including other instances of parallel processing unit 902. I / O unit 904 may be directly connected to other devices. I / O unit 904 may be connected to other devices via the use of a hub or switch interface, such as, but not limited to, memory hub 905. The connection between memory hub 905 and I / O unit 904 may form a communication link 913. I / O unit 904 may be connected to host interface 906 and memory crossbar switch 916, wherein host interface 906 receives commands directed to perform processing operations, and memory crossbar switch 916 receives commands directed to perform memory operations.
[0133] When host interface 906 receives a command buffer via I / O unit 904, host interface 906 can route the work operations that execute these commands to front-end 908. Front-end 908 can be coupled to scheduler 910 (which may be called sequencer), which is configured to distribute commands or other work items to processing cluster array 912. Scheduler 910 can ensure that processing cluster array 912 is correctly configured and in an active state before tasks are distributed to the cluster of processing cluster array 912. Scheduler 910 can be implemented via firmware logic executed on a microcontroller. The microcontroller-implemented scheduler 910 can be configured to perform complex scheduling and work distribution operations at both coarse and fine granular levels, thereby enabling fast preemption and context switching of threads executing on processing array 912. Host software can validate workloads scheduled on processing cluster array 912 via one of multiple graphics processing paths. The workload can then be automatically distributed to the processing array cluster 912 by the scheduler 910 logic within the microcontroller, which includes the scheduler 910.
[0134] Processing cluster array 912 can perform any of the operations described above or elsewhere herein, and may include up to “N” processing clusters (e.g., clusters 914A, 914B through 914N), where “N” represents a positive integer (which may be a different integer “N” than used in other diagrams). Each cluster 914A-914N in processing cluster array 912 can execute a large number of concurrent threads. Scheduler 910 may use various scheduling and / or work distribution algorithms to distribute work to clusters 914A-914N in processing cluster array 912, which may vary depending on the workload generated by each type of program or computation. Scheduling may be dynamically handled by scheduler 910 or may be assisted by compiler logic during the compilation of program logic configured to be executed by processing cluster array 912. Different clusters 914A-914N of processing cluster array 912 may be assigned to process different types of programs or perform different types of computations.
[0135] The processing cluster array 912 can be configured to perform various types of parallel processing operations, such as, but not limited to, any of the operations described above or elsewhere herein. The processing cluster array 912 can be configured to perform general-purpose parallel computing operations. For example, the processing cluster array 912 may include logic for performing processing tasks, including filtering video and / or audio data, performing modeling operations (including physical operations), and performing data transformations.
[0136] Processing cluster array 912 can be configured to perform parallel graphics processing operations. Processing cluster array 912 may include additional logic for supporting the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. Processing cluster array 912 can be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. Parallel processing unit 902 can transfer data from system memory via I / O unit 904 for processing. During processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 922) during processing and then written back to system memory.
[0137] When the parallel processing unit 902 is used to perform graphics processing, the scheduler 910 can be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations to multiple clusters 914A-914N of the processing cluster array 912. Each part of the processing cluster array 912 can be configured to perform different types of processing. For example, a first part can be configured to perform vertex shading and topology generation, a second part can be configured to perform tessellation and geometry shading, and a third part can be configured to perform pixel shading or other screen-space operations to produce a rendered image for display. Intermediate data generated by one or more clusters 914A-914N can be stored in a buffer to allow intermediate data to be transferred between clusters 914A-914N for further processing.
[0138] The processing cluster array 912 can receive processing tasks to be executed via a scheduler 910, which receives commands defining the processing tasks from a front end 908. A processing task may include an index of data to be processed, such as surface (patch) data, primitive data, vertex data, and / or pixel data, as well as state parameters and commands defining how to process the data (e.g., which program to execute). The scheduler 910 can be configured to retrieve the index corresponding to the task, or may receive the index from the front end 908. The front end 908 can be configured to ensure that the processing cluster array 912 is configured to be active before the workload specified by the incoming command buffer (e.g., batch buffer, push buffer, etc.) is initiated.
[0139] Each instance of one or more instances of parallel processing unit 902 may be coupled to parallel processor memory 922 to perform any of the operations described above or elsewhere herein. Parallel processor memory 922 may be accessed via memory crossbar switch 916, which may receive memory requests from processing cluster array 912 and I / O unit 904. Memory crossbar switch 916 may access parallel processor memory 922 via memory interface 918. Memory interface 918 may include multiple partition units (e.g., partition units 920A, 920B through 920N), each partition unit may be coupled to a portion (e.g., memory cell) of parallel processor memory 922. The number of partition units 920A-920N may be configured to be equal to the number of memory cells, such that a first partition unit 920A has a corresponding first memory cell 924A, a second partition unit 920B has a corresponding memory cell 924B, and an Nth partition unit 920N has a corresponding Nth memory cell 924N. The number of partition units 920A-920N may not be equal to the number of memory units.
[0140] Memory cells 924A-924N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as, but not limited to, synchronous graphics random access memory (SGRAM), which includes graphics double data rate (GDDR) memory. Memory cells 924A-924N may also include 3D stacked memory, including but not limited to high-bandwidth memory (HBM), HBM2e, or HDM3. Render targets (e.g., but not limited to framebuffers or texture maps) may be stored in memory cells 924A-924N, allowing partitioning cells 920A-920N to write portions of each render target in parallel to efficiently utilize the available bandwidth of the parallel processor memory 922. A local instance of the parallel processor memory 922 may not be included to support a unified memory design that combines system memory with local cache memory.
[0141] Any cluster 914A-914N in the processing cluster array 912 can process data to be written to any memory cell 924A-924N within the parallel processor memory 922. The memory crossbar switch 916 can be configured to transfer the output of each cluster 914A-914N to any partition cell 920A-920N, or to another cluster 914A-914N on which additional processing operations can be performed. Each cluster 914A-914N can communicate with the memory interface 918 via the memory crossbar switch 916 to read from or write to various external memory devices. The memory crossbar switch 916 can be connected to the memory interface 918 to communicate with the I / O unit 904, or to a local instance of the parallel processor memory 922, enabling processing units within different processing clusters 914A-914N to communicate with system memory or other local memory of the non-parallel processing unit 902. The memory crossbar switch 916 can use virtual channels to separate traffic flows between clusters 914A-914N and partition units 920A-920N.
[0142] Multiple instances of the parallel processing unit 902 can be mounted on a single add-in card, or multiple add-in cards can be interconnected. Even if different instances of the parallel processing unit 902 have different numbers of processing cores, different amounts of local parallel processor memory, and / or other configuration differences, these different instances can be configured to interoperate. For example, some instances of the parallel processing unit 902 may include higher-precision floating-point units relative to other instances. Systems including one or more instances of the parallel processing unit 902 or the parallel processor 900 can be implemented in a variety of configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and / or embedded systems.
[0143] Figure 9A It also includes a block diagram of a partitioning unit 920 according to at least one embodiment. The partitioning unit 920 is... Figure 9A An example of one of the partition units 920A-920N in the parallel processor memory. Partition unit 920 may include an L2 cache 921, a frame buffer interface 925, and a ROP 926 (raster operation unit). The L2 cache 921 may be a read / write cache configured to perform load and store operations received from the memory crossbar switch 916 and the ROP 926. Read misses and urgent write-back requests may be output from the L2 cache 921 to the frame buffer interface 925 for processing. Updates may also be sent to the frame buffer via the frame buffer interface 925 for processing. The frame buffer interface 925 may interface with one of the memory cells in the parallel processor memory, such as, but not limited to, Figure 9A The memory cells 924A-924N (shown as 924) are located, for example, within the parallel processor memory 922.
[0144] The ROP 926 can be a processing unit that performs raster operations, such as, but not limited to, stenciling, z-testing, blending, etc. The ROP 926 can then output processed graphics data stored in graphics memory. The ROP 926 may include compression logic for compressing depth or color data written to memory and decompressing depth or color data read from memory. The compression logic can be lossless compression logic that utilizes one or more compression algorithms. The type of compression performed by the ROP 926 can vary based on the statistical characteristics of the data to be compressed. For example, incremental color compression is performed on depth and color data on a per-tile basis.
[0145] ROP 926 can be included in each processing cluster (e.g., Figure 9A The data is stored within clusters 914A-914N, not within partition units 920. Read and write requests for pixel data (not pixel fragment data) can be transferred via memory crossbar switch 916. Processed graphics data can be displayed on the monitor and routed for further processing by the processor, or routed to... Figure 9A One of the processing entities within the parallel processor 900 in the system is further processed.
[0146] In at least one embodiment, the parallel processor 900 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0147] Figure 9B A block diagram including a processing cluster 914 within a parallel processing unit according to at least one embodiment. The processing cluster may be... Figure 9A An instance of one of the processing clusters 914A-914N is provided, which can be used to perform any of the operations described above or elsewhere herein. The processing cluster 914 can be configured to execute many threads in parallel, where a “thread” refers to an instance of a specific program executed on a specific input dataset. Single Instruction Multiple Data (SIMD) instruction issuing techniques can be used to support the parallel execution of a large number of threads without providing multiple independent instruction units. Single Instruction Multiple Thread (SIMT) techniques can be used to support the parallel execution of a large number of typically synchronous threads using a common instruction unit configured to issue instructions to a set of processing engines within each processing cluster.
[0148] The operation of cluster 914 can be controlled via pipeline manager 932, which distributes processing tasks to the SIMT parallel processors. Pipeline manager 932 can... Figure 9A The scheduler 910 receives instructions and manages the execution of these instructions via the graphics multiprocessor 934 and / or texture unit 936. The graphics multiprocessor 934 may be an example instance of a SIMT parallel processor. However, the processing cluster 914 may include various types of SIMT parallel processors with different architectures. The processing cluster 914 may include one or more instances of the graphics multiprocessor 934. The graphics multiprocessor 934 can process data and can use a data cross switch 940 to distribute the processed data to one of several possible destinations, including other shader units. The pipeline manager 932 can facilitate the distribution of processed data by specifying the destination of the processed data to be distributed via the data cross switch 940.
[0149] Each graphics multiprocessor 934 within the processing cluster 914 may include a set of identical functional execution logic (e.g., arithmetic logic units, load-memory units, etc.) for performing computations for any of the operations described above or elsewhere herein. The functional execution logic can be configured in a pipelined manner, where new instructions can be issued before previous instructions complete. The functional execution logic can support a wide range of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, bit shifting, and computation of various algebraic functions. Different operations can be performed using the same functional unit hardware, and arbitrary combinations of functional units are possible.
[0150] Instructions transmitted to the processing cluster 914 can form threads, which may also be called thread bundles, subgroups, waves, or wavefronts. A group of threads executing across a set of parallel processing engines can be called a thread group. Thread groups can execute a common program on different input data. Each thread within a thread group can be assigned to a different processing engine within the graphics multiprocessor 934. The number of threads in a thread group can be less than the number of processing engines within the graphics multiprocessor 934. When the number of threads in a thread group is less than the number of processing engines, one or more processing engines may be idle during the processing cycle of that thread group. The number of threads in a thread group can also be more than the number of processing engines within the graphics multiprocessor 934. When the number of threads in a thread group is more than the number of processing engines within the graphics multiprocessor 934, processing can be performed in consecutive clock cycles. Multiple thread groups can execute concurrently on the graphics multiprocessor 934.
[0151] The graphics multiprocessor 934 includes an internal cache memory for performing load and store operations, such as, but not limited to, any of the operations described above or elsewhere herein. The graphics multiprocessor 934 may forgo the internal cache and instead use a cache memory within the processing cluster 914 (e.g., L1 cache 948). Each graphics multiprocessor 934 may also access partition units that can be shared across all processing clusters 914 (e.g., ...). Figure 9A The L2 cache within partition units 920A-920N is used for transferring data between threads. The graphics multiprocessor 934 can also access off-chip global memory, which may include one or more of the local parallel processor memory and / or system memory. Any memory outside the parallel processing unit 902 can be used as global memory. The processing cluster 914 may include multiple instances of the graphics multiprocessor 934 and can share common instructions and data, which may be stored in the L1 cache 948.
[0152] Each processing cluster 914 may include an MMU 945 (Memory Management Unit), which can be configured to map virtual addresses to physical addresses. One or more instances of the MMU 945 may reside on... Figure 9A The MMU 945 is located within the memory interface 918. It may include a set of page table entries (PTEs) for mapping virtual addresses to physical addresses on tiles, and optional cache line indexes. The MMU 945 may include address translation backstop (TLB) buffers or caches that may reside within the graphics multiprocessor 934 or L1 948 cache, or within the processing cluster 914. Physical addresses can be processed to distribute surface data access locally, allowing for efficient request interleaving between partition units. The cache line indexes can be used to determine whether a request for a cache line is a hit or a miss.
[0153] Processing cluster 914 can be configured such that each graphics multiprocessor 934 is coupled to a texture unit 936 for performing texture mapping operations, such as determining texture sample locations, reading texture data, and filtering texture data. Texture data can be read from an internal texture L1 cache (not shown) or an L1 cache within the graphics multiprocessor 934, and can be retrieved as needed from an L2 cache, local parallel processor memory, or system memory. Each graphics multiprocessor 934 can output a processed task to a data crossbar switch 940 to provide the processed task to another processing cluster 914 for further processing, or store the processed task in an L2 cache, local parallel processor memory, or system memory via a memory crossbar switch 916. A preROP (pre-raster operation unit) 942 can be configured to receive data from the graphics multiprocessor 934 and direct the data to ROP units, which can be associated with partitioning units (e.g., ...) described herein. Figure 9A The PreROP 942 unit is located together with partition units 920A-920N. The PreROP 942 unit can perform color blending optimization, organize pixel color data, and perform address translation.
[0154] In at least one embodiment, the processing cluster 914 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0155] Figure 9CA graphics multiprocessor 934 according to at least one embodiment is illustrated, for example, to perform any of the operations described above or elsewhere herein. The graphics multiprocessor 934 may be coupled to a pipeline manager 932 of a processing cluster 914. The graphics multiprocessor 934 may include an execution pipeline including, but not limited to, an instruction cache 952 (e.g., which may store instructions, such as, but not limited to, compiled API instructions), an instruction unit 954, an address mapping unit 956, a register file 958, one or more general-purpose graphics processing unit (GPGPU) cores 962, and one or more load / store units 966, one or more of which may perform load / store operations to load / store instructions corresponding to the execution operation. The GPGPU cores 962 and the load / store units 966 may be coupled to a cache memory 972 and a shared memory 970 via a memory and cache interconnect 968. The GPGPU cores 962 may be part of a SoC, such as, but not limited to, [other components]. Figure 8 It is part of the integrated circuit 800.
[0156] Instruction cache 952 can receive a stream of instructions to be executed (e.g., perform any operation described above or elsewhere herein) from pipeline manager 932. Instructions can be cached in instruction cache 952 and dispatched for execution by instruction unit 954. Instruction unit 954 can dispatch instructions into thread groups (e.g., thread bundles, subgroups, wavefronts, or waves), with each thread in the thread group assigned to a different execution unit within GPGPU core 962. Instructions can access any of the local, shared, or global address spaces by specifying an address within a unified address space. Address mapping unit 956 can be used to translate addresses in the unified address space into different memory addresses accessible by load / store unit 966.
[0157] Register file 958 provides a set of registers for the functional units of graphics multiprocessor 934. Register file 958 provides temporary storage for operands on data paths connected to functional units of graphics multiprocessor 934 (e.g., GPGPU core 962, load / store unit 966). Register file 958 can be partitioned among functional units such that each functional unit is allocated a dedicated portion of register file 958. Register file 958 can be partitioned among different thread bundles (which may be referred to as wavefronts, subgroups, and / or waves or threads) executed by graphics multiprocessor 934.
[0158] Each GPGPU core 962 may include a floating-point unit (FPU) and / or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 934. The architectures of the GPGPU cores 962 may be similar or different. A first part of the GPGPU core 962 may include a single-precision FPU and an integer ALU, while a second part of the GPGPU core may include a double-precision FPU. The FPU may implement IEEE 754-2008 standard floating-point arithmetic or enable variable-precision floating-point arithmetic. The graphics multiprocessor 934 may also include one or more fixed-function or special-function units for performing specific functions, such as, but not limited to, copying rectangles or pixel blending operations. One or more of the GPGPU cores 962 may also include fixed-function or special-function logic.
[0159] The GPGPU Core 962 may include SIMD logic capable of executing a single instruction on multiple sets of data. The GPGPU Core 962 can physically execute SIMD4, SIMD8, and SIMD16 instructions, and logically execute SIMD1, SIMD2, and SIMD32 instructions. The SIMD instructions for the GPGPU Core can be generated by the shader compiler at compile time, or automatically generated when executing programs written and compiled for Single Program Multiple Data (SPMD) or SIMT architectures. Multiple threads of a program can be configured for a SIMT execution model that can be executed via a single SIMD instruction. For example, eight SIMT threads performing the same or similar operations can be executed in parallel via a single SIMD8 logic unit.
[0160] The memory and cache interconnect 968 may include an interconnect network that connects each functional unit of the graphics multiprocessor 934 to the register file 958 and shared memory 970. The memory and cache interconnect 968 may be a cross-switch interconnect that allows the load / store unit 966 to perform load and store operations between the shared memory 970 and the register file 958. The register file 958 may operate at the same frequency as the GPGPU core 962, thus data transfer between the GPGPU core 962 and the register file 958 can have very low latency. The shared memory 970 can be used to implement communication between threads executing on functional units within the graphics multiprocessor 934. The cache memory 972 can be used as a data cache, for example, for caching texture data transferred between functional units and texture units 936. The shared memory 970 can also be used as a program-managed cache. Threads executing on the GPGPU core 962 can automatically cache data stored in the cache memory 972, or programmatically store data in the shared memory.
[0161] The parallel processor or GPGPU described herein can be communicatively coupled to a host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU can be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., high-speed interconnects, such as, but not limited to, PCIe or NVLink). A System-on-a-Chip (SoC) may include the parallel processor or GPGPU described herein, which executes on the SoC. The GPU may be integrated as a core on a package or chip and communicatively coupled to the core via an internal processor bus / interconnect within the package or chip. Regardless of the GPU's connection method, the processor core can assign work to the GPU in the form of a sequence of commands / instructions contained in a job descriptor. The GPU can then use dedicated circuitry / logic to efficiently process these commands / instructions to perform any of the operations described above or elsewhere herein.
[0162] In at least one embodiment, the graphics multiprocessor 934 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0163] Figure 10A processor 1000 according to at least one embodiment is illustrated. The processor 1000 may include a hybrid architecture processor (e.g., Lunar Lake or Meteor Lake) from Intel Corporation, Santa Clara, California, or other processors sharing at least some of the components described herein. The processor 1000 may include one or more central processing units (CPU 1002), one or more graphics processing units (GPU 1006), and / or one or more neural processing units (NPU 1008), which may be, for example, dedicated AI accelerators for offloading artificial intelligence (AI) workloads from the CPU 1002 and GPU 1006. The processor 1000 may use instructions that, if executed, cause the processor 1000 and / or any of its components to perform some or all of the processes and techniques described elsewhere herein. The processor 1000 may include any number of memory and cache units 1010 for facilitating processing between different components of the processor 1000. The memory and cache 1010 on processor 1000 may include one or more levels of cache (e.g., L1, L2, L3, and / or last-level cache) and high-bandwidth memory (e.g., HBM2e or HBM3) in any combination. Regarding processor 1000 and any components described above or elsewhere herein, one or more APIs described herein may, for example, be compiled into instructions that may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. APIs (and / or compiled instructions including APIs) may be stored in any storage device (e.g., cache and / or memory) internal or external to processor 1000. The results of APIs may be stored in storage devices internal or external to processor 1000, including registers, DRAM, flash memory, SRAM, cache, or other memory. One or more APIs described herein may include calls.
[0164] Processor 1000 may include a computing engine as CPU 1002, and may include any number of cores, such as, but not limited to, up to 16 cores / 22 threads. The cores in CPU 1002 may include P-cores (performance), E-cores (high efficiency), and LP-E cores (low-power, high-efficiency). Performance cores can be used for low-latency, single-threaded, computationally intensive workloads, while high-efficiency cores can be used for multi-threaded, less computationally intensive workloads. Low-power, high-efficiency cores can be used for scalable multi-threaded execution and offloading background tasks. P-cores can be used for single-threaded and limited-threaded execution, while E-cores and LP-E cores are used for multi-threaded throughput and power efficiency.
[0165] The GPU 1006 can include any number of graphics engines, such as, but not limited to, those with 8 Xe cores (up to 128 execution units or EUs). Arc TM Graphics engine (Xe LPG). For example... Figure 10 As shown, GPU 1006 may include a vector engine 1010 and a matrix engine 1012, which, for example, can run FP, INT, and matrix operation tasks simultaneously, individually, or in batches. GPU 1006 may include a load / store unit 1014, as well as other memories, such as, but not limited to, an instruction cache (I$) 1016 and an L1 cache / subsystem local memory (SLM) 1018, which may, for example, store instructions for performing any of the operations described above or elsewhere herein.
[0166] NPU 1004 may include one or more AI Boost integrates a Neural Processing Unit (NPU). The NPU 1004 can be enumerated as an integrated PCIe device to the host processor. The NPU 1004 can include one or more (e.g., two) Neural Computation Engine (NCE) tiles 1030. Each tile can be configured with any combination of, but not limited to, the following: (e.g., 2000) Multiply-Accumulate (MAC) engines 1034, a post-processing engine (not shown), an AEP processor (not shown), and memory per tile (2MB dedicated SRAM), such as... Figure 10 As shown. For general computing needs, the neural computing engine 1030 may include a disturbance pipeline 1032, an activation function (AF) 1036, a data transformation 1038, a load / store 1040, and a streaming hybrid architecture vector engine (SHAVE) 1028 for high-performance parallel computing, which may include a DMA (Direct Memory Access) engine 1024 for transporting data between system memory DRAM (Dynamic Random Access Memory) 1026 and a software-managed cache. The built-in device MMU (Memory Management Unit) 1022, plus the IOMMU (Input-Output Memory Management Unit) (not shown), can support multiple concurrent hardware contexts and provide secure isolation between execution contexts according to the MCDM (Microsoft Computing Driver Model) architecture. The processor 1000 may also include a media unit (not shown), which may be included on or separate from the XCD or other components of the processor 1000 to enable video playback and video processing of compressed or uncompressed data, such as using HEVC, AV1, VP9, and AVC hardware-accelerated decoding support and HEVC, VP9, and AVC hardware-accelerated encoding support.
[0167] Thread bootstrap ( The Thread Director (which includes firmware built into the processor 1000) can prioritize and manage the distribution of workloads, thereby sending tasks to optimized cores. For example, the thread director can tie P cores, E cores, and / or LP-E cores (as described above) together with task scheduling capabilities and the ability to send less demanding tasks to E cores or LP-E cores. Deep learning acceleration ( DLBoost (not shown) can provide built-in AI acceleration for training and inference workloads and may include support for VNNI (for CPU) and DP4a (for GPU) instruction sets. This instruction set can be used with OpenVINO. TM The toolkit and oneAPI are optimized to accelerate INT8 inference. For example, the software stack described elsewhere in this document can be used to leverage OpenVINO. TM The toolkit enables AI inference. The processor 1000 can be configured to execute applications, such as, but not limited to, CUDA programs.
[0168] In at least one embodiment, the processor 1000 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. The one or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0169] Processor 1000 may alternatively include a processor based on Qualcomm's AIEngine Direct architecture from Santa Clara, California, or other processors sharing at least some of the components described herein. It may include any number of NPUs, GPUs, CPUs, and other associated components, such as, but not limited to, an NPU 1004 as a Hexagon NPU, a GPU 1006 as an Adreno GPU, a CPU 1002 as a Kryo or Qualcomm Oryon CPU, and a Qualcomm Sensing Hub (not shown) and a memory subsystem 1010, in any combination. Hexagon NPU 1004 may include power rails, micro-tile inference units, hardware acceleration units, tensor units, scalar units, and vector units (all not shown), which may have dedicated or shared memory (e.g., cache or memory, such as HBM3) for storing, for example, instructions for performing any of the operations described above or elsewhere herein. The Adreno GPU 1006 can provide graphics and parallel processing for AI, in formats including but not limited to 32-bit floating-point (FP32), 16-bit floating-point (FP16), and 8-bit integer (INT8). The Kryo or Qualcomm Oryon CPU 1002 can execute AI workloads and handle the contextualization of ubiquitous generative AI applications. The CPU 1002 may also include an instruction fetch unit, a renaming and deprecation unit, a memory management unit, a vector execution unit, an integer execution unit, and a load and store unit for processing and instruction management. Regarding the processor 1000 and any of its components described above or elsewhere herein, one or more APIs described herein may, for example, be compiled into instructions that can be fetched by the instruction fetch unit, decoded by the processor decoder or equivalent, scheduled (e.g., sequentially or out of order) for execution by the scheduler or equivalent, executed by execution logic or equivalent, reordered, and then deprecated by the renaming and deprecation unit. The API (and / or compiled instructions including the API) can be stored in any storage device (e.g., cache and / or memory) inside or outside the processor 1000. An arbitrary number of CPU cores 1002 can be included in an arbitrary number of CPU clusters, which can be coupled to memory and / or cache, such as, but not limited to, a shared L2 cache. Memory can be separate or shared; for example, the CPU clusters of CPU cores 1002 can be coupled to a memory subsystem 1010, which can include structures capable of reading and writing to memory (e.g., DRAM), system-level caches, and an arbitrary number of memory management units.The Qualcomm sensing hub (not shown) includes a miniature NPU, power rails, and conventional sensors (such as gyroscopes, accelerometers, or even barometers) that support voice and data streaming. The memory subsystem 1010 may include memory and cache on the processor 1000, which may include L1 or more levels of cache (e.g., L1, L2, L3, and / or last-level cache) and high-bandwidth memory (e.g., HBM2e or HBM3) in any combination, for example, for storing information and / or instructions for performing any of the operations described above or elsewhere herein. All or part of the memory and / or cache in the memory subsystem 1010 may be shared or used individually by any component or combination of components on the processor 1000 (e.g., GPU 1006, NPU 1004, and CPU 1002).
[0170] The Qualcomm AI Engine 1000 can be programmed and controlled using a software stack to perform some or all of the operations described herein, including, for example... A neural processing SDK is provided for inference on Android, Linux, and Windows. Developer libraries and services support programming languages, virtual platforms, and compilers. At lower levels of the software stack, system software includes a basic real-time operating system (RTOS), system interfaces, and drivers. The software stack supports various operating systems, including Android, Windows, Linux, and QNX, as well as deployment and monitoring infrastructures such as Prometheus, Kubernetes, and Docker. OpenCL and DirectML are supported for direct cross-platform access to the GPU 1006. For the CPU 1002, LLVM compiler infrastructure optimizations enable accelerated and efficient AI inference. Regarding the Qualcomm AI Engine 1000 and any of its components described above or elsewhere herein, one or more APIs described herein can, for example, be compiled into instructions that can be fetched by instruction fetching logic or equivalents, decoded by processor decoders or equivalents, scheduled (e.g., sequentially or out of order) for execution by a scheduler or equivalent, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. The API (and / or compiled instructions including the API) can be stored in any storage device (e.g., cache and / or memory) inside or outside the Qualcomm AI Engine 1000. The results of the API can be stored in storage devices inside or outside the Qualcomm AI Engine 1000, including registers, DRAM, flash memory, SRAM, cache, or other memory.
[0171] In at least one embodiment, the processor 1000 or the Qualcomm AI engine 1000 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0172] Figure 11A A processor 1100 according to at least one embodiment is illustrated. The processor 1100 may include a Scalable family processor from Intel Corporation, Santa Clara, California, or other processors sharing at least some of the components described herein. The processor 1100 may include one or more cores 1112(1)-1112(N) capable of performing the operations described elsewhere herein, where N is any integer greater than 1. Cores 1112(1)-1112(N) may be interconnected using ring and / or mesh interconnects. Utilizing a mesh interconnect architecture, arrays of vertical and horizontal communication paths may allow traversal from one core to another 1112(1)-1112(N) via the shortest path (jumping to the correct row along the vertical path and to the correct column along the horizontal path). For the mesh interconnect, a die may accommodate cores 1112(1)-1112(N) and may include a Converged Mesh Stop Point (CMS) grid that may be associated with cores 1112(1)-1112(N) (e.g., 1:1). Each core can be associated with a low-level cache (LLC) slice 1114(1)-1114(N), or cores 1112(1)-1112(N) can share a cache, such as a low-level cache. LLC 1114(1)-1114(N) can be inclusive or non-inclusive (having blocks not present in the high-level cache) by merging blocks from a higher-level cache (e.g., L2 cache). Each core and LLC slice can include a caching and home agent (CHA) (not shown), which can maintain cache consistency by providing resource scalability via the mesh interconnect. Super Path Interconnect ( UPI 1116 provides cache coherency capabilities. UPI 1116 can provide coherent interconnects for scalable systems and allows multiple processors to share a single shared address space via links, such as, but not limited to, two or three UPI links per processor.
[0173] Processor 1100 may also include system agent 1110, which may house and / or perform various functions, such as, but not limited to, memory management, display functions, and / or input / output (I / O) functions. For example, processor 1100 may include one or more integrated memory controllers (IMCs) 1108. IMCs 1108 may control and manage memory, such as, but not limited to, different memory types, such as DDR RAM, such as DDR4, or other memory described elsewhere herein. System agent 1110 may include a display controller (not shown) for supporting one or more displays. System agent 1110 may also integrate PCIe 1104 (e.g., up to 20 PCIe lanes), which may, for example, be connected to an external dedicated graphics connector via a DMI bus (e.g., Intel's DMI 3.0 bus) 1106. System agent 1110 may include an image processing unit (IPU) (not shown) that integrates an on-die image signal processor (ISP). Structure 1102 provides scalability for connecting to other nodes (e.g., processors, such as processor 1100) and can, for example, be connected to Cornelis Networks (…). It can be used together with elements of a scalable system framework that provides performance for high-performance computing (HPC) workloads and the ability to scale to tens of thousands of nodes.
[0174] Figure 11B Components within a core 1112 according to at least one embodiment are illustrated. The core 1112 may include a front-end 1118, a back-end or execution engine 1132, and a memory subsystem 1142. The front-end 1118 may provide operations (e.g., operations described elsewhere herein) to the execution engine 1132 by decoding instructions stored in memory. For example, the front-end 1118 may include micro-operation (μOps) cache paths and / or traditional paths, and a branch prediction unit 1121 capable of determining path instructions. A traditional path for instructions may include fetching variable-length (e.g., x86) instructions from an L1 instruction cache 1120 and instruction fetch and pre-decode 1122, queuing these instructions into an instruction queue 1124, and decoding the instructions into μOps that can be provided to an allocation queue 1128 using a decoder 1126. Alternatively, the μOps cache path may include a cache that includes decoded μOps (μOps 1130) that can be sent to the allocation queue 1128. The allocation queue 1128 can act as an interface between the front end 1118 and the execution engine 1132, and can provide instructions to the execution engine 1132. For example, one or more APIs described herein can be compiled into instructions that can be stored, processed, and executed by the front end 1118 and the execution engine 1132, and stored in the memory subsystem 1142.
[0175] Execution engine 1132 can receive micro-operations into reordering buffer 1134, which can register, rename, and deregister μOPs. μOPs can be sent from the reordering buffer to scheduler 1136, which can be connected to one or more different execution units 1138, which can be connected to address generation units (AGUs) 1140. Execution units 1138 can perform operations such as basic arithmetic logic unit (ALU) operations, multiplication, division, and / or more complex operations, such as, but not limited to, various vector operations. Scheduler 1136 can manage the queuing of μOPs for one or more execution units 1138 based on, for example, the operations that need to be performed.
[0176] The memory subsystem 1142 can handle load and store requests as well as sorting operations. For example, μOPs may be associated with memory accesses (e.g., load and store), and these μOPs can be sent through dedicated scheduler ports that can perform these memory operations. For example, store and load operations can be sent to load and store buffers 1144. The memory subsystem 1142 may also include shared or separate L1 data and instruction caches 1146, and an L2 cache 1148 that can be used and shared by the L1 data and instruction caches 1146. (As described above regarding...) Figure 11A Each core 1112 can be connected to a slice of a third-level cache (e.g., LLC1114), which can be shared by all cores 1112.
[0177] In at least one embodiment, processor 1100 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0178] Figure 12An AI accelerator 1200 according to at least one embodiment is illustrated. Processor 1200 may include a processor with an AI accelerator architecture manufactured by Intel Corporation, Santa Clara, California, or other processors sharing at least some of the components described herein. AI accelerator 1200 may use instructions that, if executed by AI accelerator 1200, cause AI accelerator 1200 to perform some or all of the processes and techniques described elsewhere herein. For example, with respect to AI accelerator 1200 and any components described above or elsewhere herein, one or more APIs described herein may, for example, be compiled into instructions that may be fetched by instruction fetching logic or equivalents, decoded by processor decoders or equivalents, scheduled (e.g., sequentially or out of order) by a scheduler or equivalent for execution, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. APIs (and / or compiled instructions including APIs) may be stored in any storage device internal or external to AI accelerator 1200 (e.g., in cache and / or memory). The results of the API can be stored in internal or external storage devices of the AI accelerator 1200, including registers, DRAM, flash memory, SRAM, cache, or other memory. The AI accelerator 1200 may include one or more compute dies, which may include homogeneous or heterogeneous processors. The compute dies may include one or more central processing units (CPUs), one or more graphics processing units (GPUs), or a combination of both.
[0179] In at least one embodiment, the computational die may include a computational engine for performing AI computations. In at least one embodiment, the computational die of the AI accelerator 1200 may be split into any number (e.g., four) clusters, which may be referred to as DCORE (Deep Learning Core) 1206, and include any number of matrix multiplication engines (MME) 1208, tensor processor cores (TPC) 1210, memory management units 1212, and L2 caches 1214 in any combination. The MME 1208 may perform operations using matrix multiplication, such as fully connected layers, convolutions, and batch general matrix multiplication (GEMM). The MME 1208 may be equipped with a multiplication-accumulation unit (MAC) (not shown), which may perform general matrix multiplication (GEMM) operations, such as, but not limited to, AxB multiplication, which involves generating a tensor C [NxM] from two input tensors A [NxK] and B [KxN]. The MME 1208 may be programmed with array dimensions, positions, data types, and various operands. The MME 1208 can retrieve tensors A and B from memory and pull them into its streaming buffer for parallel matrix multiplication by the MAC. After completion, the MME 1208 can push tensor C back to memory. The TPC 1210 may include any number of scalar units for performing scalar operations, any number of vector units for performing vector operations, any number of register files or local memory units (e.g., vector local memory), and load and store components for instructions, which may be coupled to memory or caches (e.g., HBM, L3 cache, and / or L2 cache) (all not shown). The TPC can support different types of parallel processing, such as Very Long Instruction Word (VLIW) Single Instruction Multiple Data (SIMD) data types such as, but not limited to, FP32, BF16, FP16, and FP8 (both E4M3 and E5M2), UINT32, INT32, UINT16, INT16, UINT8, and INT8 data types. Any number of computational dies can be interconnected. Interconnects that can connect computing dies can be via intermediate bridges, for example, those intermediate bridges that are transparent to software.
[0180] The memory on the AI accelerator 1200 may include one or more levels of cache (e.g., L1, L2, L3, and / or last-level cache) and high-bandwidth memory (e.g., HBM2e or HBM3) in any combination. The memory and / or cache system may be unified or separate. The compute die of the AI accelerator 1200 may include on-chip memory comprising one or more levels (e.g., two levels) of cache. On-chip SRAM or other memory described elsewhere herein may be used as a unified last-level cache (L3) or split into multiple slices of L2 cache accessible to the MME 1208 and TPC 1210 groups. Using on-chip memory as an L2 or L3 cache is entirely software-configurable, and the software can dynamically determine its optimal cache allocation based on I / O tensors. AI accelerator 1200 may include one or more memory management units (MMUs) 1222 for managing memory, such as allowing the AI accelerator 1200 memory subsystem to run in virtual space when accessing VRAM.
[0181] AI accelerator 1200 may include a communication port (e.g., a PCIe Gen5 x16 port) 1202 for communicating with a host and scheduling and synchronization unit 1204. AI accelerator 1200 may include a media unit 1216, which may include any number or combination of media decoder engines (DECs) 1220 and rotation engines (ROTs) 1218. AI accelerator 1200 may include a network unit 1224, which may include any number or combination of network ports 1226 and an accompanying RDMA engine 1228, L2 cache, and memory (e.g., HBM2e or HBM3) stack. AI accelerator 1200 may include a programmable control path entity (not shown) for managing the parallel and efficient execution of the various engines. The control path may include a submission queue (SQ) that can be issued by the runtime system, a completion queue (CQ) that can be used for job completion reporting, a programmable scheduling mechanism that can be used for task scheduling, a programmable hardware synchronization mechanism or "synchronization manager (SM)" that can be used for hardware synchronization, and a programmable interrupt service mechanism or "interrupt manager (INTR)" that can pass asynchronous events to drivers.
[0182] AI Accelerator 1200 may include media decoding units supporting video formats such as, but not limited to, HEVC, Progressive H.264, SVC base layer, MVC, VP9, JPEG, and Progressive JPEG. AI Accelerator 1200 may support post-processing of the decoded media stream, such as, but not limited to, image downsizing (image resizing), vertical and horizontal scaling at different scaling ratios, image enlargement, image cropping, bilinear scaling, and Lancos scaling. AI Accelerator 1200 may implement two post-processing channels per decoder unit, one for scalar (up and down) and the other solely for outputting the original image. AI Accelerator 1200 may include a hardware rotation engine that performs the following transformations on the input image: 2D rotation, 3D rotation, projection, image warping and de-warping, resampling of the input data at user-defined coordinates, and rescaling.
[0183] The RDMA 1228 based on converged Ethernet on the AI accelerator 1200 enables scaling from a single node (i.e., from a single AI accelerator 1200 to hundreds or thousands of nodes or AI accelerators 1200). The network subsystem 1224 may include... The accelerator 1200 includes an In-Gigabit Ethernet Communication Library (IGCL), a master controller coordinating data movement, and a programmable scheduling mechanism that enables smooth engine activation while maintaining task dependencies. The accelerator network subsystem may include a Gigabit Ethernet NIC port 1226, a Layer 2 MAC (not shown), and an RDMA engine 1228. The AI accelerator 1200 may include an aggregation engine for performing summation activities. All engines in the processor 1200 can run in parallel; for example, the MME 1208, TPC 1210, and NIC 1226 can all operate simultaneously. Dependencies may exist between operations running on different engines; for example, the output of one engine may be used as the input of another engine, and / or the MME, TPC, and NIC may be scheduled to run in parallel. When one engine completes its execution, another engine can be scheduled to begin working on the next operation (executed immediately after its input is ready).
[0184] The AI accelerator 1200 can be operated and controlled using a software layer 1228, which may include low-level components such as, but not limited to, a graph compiler, an automatic kernel fusionist and pre-compiled kernel libraries, and integration with the AI ecosystem such as, but not limited to, PyTorch, DeepSpeed, Hugging Face, vLLM, Ray, etc., or as described elsewhere in this document regarding software and programming platforms. The software layer 1228 may include implementations of algorithms such as, but not limited to, paged attention, flash attention, etc. The software layer 1228 can generate optimized binary code that implements a given model topology, such as, but not limited to, performing operator fusion, data layout management, parallelization, pipeline and memory management, and graph-level optimization.
[0185] In at least one embodiment, the AI accelerator 1200 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. The one or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0186] This paper describes a neuromorphic computing system employing a multi-core architecture, where each core houses computing elements including neurons, synapses with on-chip learning capabilities, and local memory for storing synaptic weights and routing tables. Figure 13This is a simplified block diagram 1300 illustrating at least a portion of an example of such a neuromorphic computing device 1305 according to at least one embodiment. The neuromorphic computing device 1305 may include a neuromorphic processor from Intel Corporation, Santa Clara, California, or other processors that include at least a portion of the components described herein. As shown in this example, the device 1305 may be equipped with a network 1310 consisting of multiple neural network cores interconnected by a network on the device, thereby potentially defining multiple distinct connections between the cores. For example, the device 1305 may provide a network 1310 of spiking neural network cores, each core communicating via short packet pulse messages sent from one core to another through network channels. Each core (e.g., 1315) may have processing and memory resources, as well as logic, for implementing a number of primitive nonlinear time computation elements, such as, but not limited to, multiple (e.g., more than 1000) distinct artificial neurons (referred to herein as “neurons”). For example, each core may be able to implement multiple neurons concurrently, allowing the neuromorphic core to implement many, many neurons using the device 1305. With respect to the neuromorphic computing device 1305 and any components described above or elsewhere herein, one or more APIs or equivalents described herein may, for example, be compiled into instructions or equivalents that may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. The APIs (and / or compiled instructions including the APIs) may be stored in any storage device (e.g., cache and / or memory) internal or external to the neuromorphic computing device 1305. The results of the APIs may be stored in storage devices internal or external to the neuromorphic computing device 1305, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents.
[0187] continue Figure 13For example, neuromorphic computing device 1305 may also include processor 1320 and system memory 1325 for implementing one or more components to manage and provide the functionality of neuromorphic computing device 1305. For instance, a system manager 1330 may be provided to manage the global attributes and operations of neuromorphic computing device 1305 (e.g., attributes affecting core network 1310, multiple cores in network 1310, interconnection of neuromorphic computing device 1305 with other devices, managing access to global system memory 1325, and other potential examples). In one example, system manager 1330 may manage the definition and configuration of specific routing tables for individual routers in network 1310, orchestration of network definitions and attributes to be applied to network 1310 (e.g., weights, attenuation rates, etc.), core synchronization and time multiplexing management, routing input to appropriate cores, and other potential functions.
[0188] As another example, the neuromorphic computing device 1305 may also include a programming interface 1335 through which a user or system can specify the neural network definition to be applied (e.g., via routing tables and individual neuron attributes), implemented by the neuromorphic core grid 1310. A software-based programming tool may be provided with or separately from the neuromorphic computing device 1305, through which a user can define a specific neural network to be implemented using the neuromorphic core network 1310. The programming interface 1335 can receive input from a programmer, then generate the corresponding routing table and populate the specified parameters into the local memory of each neuromorphic core (e.g., 1315) to implement the corresponding custom artificial neural network implemented by the neuromorphic core 1315.
[0189] In certain circumstances, the neuromorphic computing device 1305 can advantageously engage and interoperate with other devices, including general-purpose computing devices, to enable specific applications and use cases. Therefore, in some cases, external interface logic 1340 may be provided to communicate with one or more other devices (e.g., via one or more defined communication protocols). External interface 1340 may be used to accept input data from another device or an external memory controller used as an input data source. External interface 1340 may additionally or alternatively be used to allow the results or outputs of computations performed using the neural network implemented using the neuromorphic computing device 1305 to be provided to another device (e.g., another general-purpose processor implementing machine learning algorithms) to enable additional applications and enhancements, among other examples.
[0190] like Figure 13The diagram illustrates a network 1310 interconnected by a network of multiple neural network cores on a device, showing a portion of a network structure interconnecting multiple neuromorphic cores (e.g., 1315a-d). For example, several neuromorphic cores (e.g., 1315a-d) can be provided in a mesh, each core interconnected via a network including multiple routers (e.g., 1350). In one implementation, each neuromorphic core (e.g., 1315a-d) can be connected to a single router (e.g., 1350) in the router array, and the router can be connected to at least one other router (e.g., [other router]). Figure 13 (As shown at 1310 in the diagram). As an example, in one particular implementation, four neuromorphic cores (e.g., 1315a-d) can be connected to a single router (e.g., 1350), and each router 1350 can be connected to two or more other routers to form a multi-core mesh, thereby allowing each neuromorphic core to interconnect with every other neuromorphic core in the neuromorphic computing device 1305. Furthermore, since each neuromorphic core can be configured to implement multiple different neurons, the router network of the neuromorphic computing device 1305 can similarly implement connections or artificial synapses (or simply "synapses") defined between any two of the potential many (e.g., 30,000+) neurons defined by the network definition using the neuromorphic cores 1310 provided in the neuromorphic computing device 1305.
[0191] Figure 13A block diagram of the internal components of an example implementation of the neuromorphic core 1315 is shown. In one example, a single neuromorphic core may implement a number of neurons (e.g., 1024) that share the architectural resources of the neuromorphic core 1315 in a time-division multiplexing manner. In one example, each neuromorphic core 1315 may include a processor block 1355 capable of executing arithmetic functions and routing related to the implementation of the digitally implemented artificial neurons, such as, but not limited to, those explained herein. Each neuromorphic core 1315 may also provide local memory in which routing tables of the neural network can be stored and accessed, accumulated potentials of each cell body of each neuron implemented using core 1315 can be tracked, parameters of each neuron implemented by core 1315 can be recorded, and other data and usage can be recorded. Components or architectural resources of the neuromorphic core 1315 may also include: an input interface 1365 for receiving input spike messages generated by other neurons on other neuromorphic cores; and an output interface 1370 for sending spike messages to other neuromorphic cores via a mesh network 1310. In some instances, the routing logic of the neuromorphic core 1315 can be implemented at least partially using the output interface 1370. Furthermore, in some cases, the core (e.g., 1315) can implement multiple neurons within an example SNN, and some of these neurons can be interconnected. In this case, spiking messages sent between neurons hosted on the core 1315 can forgo communication via the routing structure of the neuromorphic computing device 1305 and can be managed locally within the specific neuromorphic core 1315.
[0192] Each neuromorphic core may also include logic for implementing artificial dendrites 1380 and artificial cell bodies 1385 (hereinafter referred to herein as “dendrites” and “cell bodies”, respectively) for each neuron 1375. Dendrite 1380 may be a hardware-implemented process for receiving impulses from network 1310. Cell body 1385 may be a hardware-implemented process for receiving the current time-accumulated neurotransmitter mass of each dendrite and evolving the potential state of each dendrite and cell body to generate outgoing impulse messages at appropriate times. Dendrite 1380 may be defined for each connection receiving input from another source (e.g., another neuron). In one implementation, the dendritic process 1380 may receive and process the impulse message as it arrives serially from network 1310 in a time-division multiplexed manner. With the reception of impulses, neuronal activation (tracked using cell body 1385 (and local memory 1360)) may increase. When the activation of a neuron exceeds a threshold set for neuron 1375, neuron 1375 generates a spike message, which is propagated via output interface 1370 to a fixed set of fan-out neurons. The network distributes the spike messages to all destination neurons, which in turn can update their activation in a transient, time-dependent manner in response. This can lead to some of the destination neurons also exceeding their corresponding thresholds and triggering further spike messages, just as in real biological neural networks.
[0193] As described above, the neuromorphic computing device 1305 can reliably implement spiking-based neural computing models. Such models are also referred to as spiking neural networks (SNNs). In addition to neuronal and synaptic states, SNNs incorporate temporal concepts. For example, in SNNs, communication occurs via event-driven action potentials or spiking, which convey no explicit information other than the spiking time and the implicit source and destination neuron pairs corresponding to the spiking transmission. The computation of the result of a dynamic nonlinear integral as a weighted spiking input occurs in each neuron. In some implementations, recurrent and dynamic feedback can be incorporated into the SNN computation model. Furthermore, various network connectivity models can be employed to model a wide range of real-world networks or relationships, including fully connected (all-to-all) networks, feedforward trees, completely random projections, "small-world" networks, and other examples. Isomorphic two-dimensional networks at the neuromorphic core (e.g., but not limited to...) Figure 13The network shown in the example can advantageously support all these network models. Since some or all of the cores of the neuromorphic computing device 1305 can be connected, some or all of the neurons defined in a core can also be fully connected via a certain number of router hops. The neuromorphic computing device 1305 may also include a fully configurable routing table for defining various neural networks by allowing neurons in each core to distribute their spurs to any number of cores in the grid 1310 to achieve a completely arbitrary connection graph.
[0194] In improved implementations of systems capable of supporting SNNs, for example, but not limited to... Figure 13 The example illustrates a very large-scale integrated circuit (VLSI) hardware device that can provide high-speed, reliable circuitry to implement SNNs (Spiritual Neural Networks) to model the information processing algorithms used by the brain, but in a more programmable way. For instance, while a biological brain can only perform a specific set of defined behaviors (a consequence of years of development), a neuromorphic processor device can provide the ability to rapidly reprogram all neural parameters. Therefore, a single neuromorphic processor can be used to implement a wider range of behaviors than a single slice of biological brain tissue. This distinction can be achieved by employing neuromorphic processors with neuromorphic designs that are radically different from those found in natural neural circuitry.
[0195] As an example, a neuromorphic processor can implement a spontaneous neural network (SNN) using time-multiplexed computation in both a spiking communication network and the neuronal mechanism of the neuromorphic computing device 1305. Therefore, the physical circuitry of the neuromorphic computing device 1305 can be shared by many neurons to achieve a higher neuron density. Through time multiplexing, the network can connect N cores with a total wiring length of O(N), while the length of discrete point-to-point wiring will be extended to O(N). 2 This significantly reduces wiring resources to accommodate planar and non-plastic VLSI routing techniques, among other examples. In the neuromorphic core, time multiplexing can be implemented through dense memory allocation, for example, using static random access memory (SRAM) with a shared bus, address decoding logic, and other multiplexed logic elements. The state of each neuron can be stored in the processor's memory, where data describing the state of each neuron includes the state of the collective synapse of each neuron, all currents and voltages on its membrane, and other example information (e.g., but not limited to configuration and other information).
[0196] Neuromorphic processors can be implemented in a “digital” manner, unlike other processors that employ more “analog” or “isomorphic” neuromorphic approaches. For example, a digital implementation can use digital adder and multiplier circuitry to integrate synaptic currents, in contrast to an analog isomorphic neuromorphic approach that accumulates charge on capacitors in a manner similar to how neurons accumulate synaptic charge on their lipid membranes. For instance, the accumulated synaptic charge for each neuron can be stored in the local memory of the corresponding core. Furthermore, at the architectural level of an example digital neuromorphic processor, reliable and deterministic operation can be achieved through time synchronization across the core network, ensuring that any two executions of the design, given the same initial conditions and configuration, will produce the same results. Asynchronicity can be reserved at the circuit level to allow individual cores to operate as quickly and freely as possible while maintaining determinism at the system level. Therefore, in neural computing, the concept of time as a time variable can be abstracted away from the “wall clock” time used by the hardware to perform computations. Thus, in some implementations, a time synchronization mechanism can be provided that globally synchronizes the neuromorphic cores at discrete time intervals. The synchronization mechanism allows neural computation to be completed at the fastest speed allowed by the circuit, and there is a difference between the runtime and the biological time for modeling neuromorphic systems.
[0197] In operation, the neuromorphic computing device 1305 can start in an idle state when all neuromorphic cores are inactive. As each core asynchronously loops through its neurons, it generates impulse messages, which are routed by the mesh interconnect to the appropriate destination core containing all destination neurons. The implementation of multiple neurons on a single neuromorphic core can be time-multiplexed, and time steps can be defined, where all impulses involving multiple neurons can be processed and considered using the shared resources of the respective cores. When each core completes its service to its neurons within the corresponding time step, in some implementations, the core can communicate with neighboring cores using synchronization messages (e.g., using a handshake) to refresh the mesh of all transmitted impulse messages, allowing the core to safely determine that all impulses have been serviced within a certain time step. At this point, all cores can be considered synchronized, allowing them to advance their time steps and return to the initial state to begin the next time step.
[0198] Given this context, as described above, a device (e.g., 1305) can be provided to realize an interconnected neuromorphic core grid 1310, wherein core 1315 can realize multiple artificial neurons capable of interconnecting to realize an SNN. Each neuromorphic core (e.g., 1315) can provide two loosely coupled asynchronous processes: an input dendrite process (e.g., 1380) that receives impulses from network 1310 and applies them to the appropriate destination dendritic compartment at an appropriate future time; and an output cell body process (e.g., 1385) that receives the current-time accumulated neurotransmitter mass of each dendritic compartment and evolves the membrane potential state of each dendrite and cell body to generate an outgoing impulse message at an appropriate time (e.g., when the threshold potential of the cell body is reached). It should be noted that, from a biological perspective, the names of dendrites and cell bodies used herein are only approximate to the function of these features and should not be interpreted too literally.
[0199] In at least one embodiment, the neuromorphic computing device 1305 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. The one or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0200] Figure 14 This is a block diagram of an embodiment of a multi-node network capable of enabling remote memory computing according to any embodiment. System 1400 may represent the node network described herein, for example, the node network may be used to perform some or all of the operations described herein. System 1400 may represent a data center. System 1400 may represent a server farm. System 1400 may represent a data cloud or processing cloud. System 1400 may represent a supercomputer. System 1400 may include tens, hundreds, or thousands of nodes. The nodes of System 1400 may include processors, such as, but not limited to, a central processing unit (CPU), a graphics processing unit (GPU), or any combination of processors described herein, such as, but not limited to, processors described herein. Figure 8-20BOther processors in the system. For any processor in system 1400 and any components described above or elsewhere herein, one or more APIs or equivalents described herein may, for example, be compiled into instructions or equivalents that may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. APIs (and / or compiled instructions including APIs) may be stored in any storage device (e.g., cache and / or memory) inside or outside the processor or node. The results of APIs may be stored in storage devices inside or outside the processor or node, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents. System 1400 may include more than nine thousand nodes, each node comprising two Intel Xeon Max processors, six Intel Max series GPUs, and a unified memory architecture, such as, but not limited to, the architecture used in Intel's Aurora supercomputer in Santa Clara, California, or other supercomputers that share at least some of the components described herein.
[0201] One or more clients 1402 send requests to system 1400 via network 1404. Network 1404 represents one or more local area networks, wide area networks, or a combination of both. Clients 1402 can be human or machine clients that generate requests for operations to be performed by system 1400. System 1400 executes the application or data computation task requested by client 1402.
[0202] System 1400 may include one or more racks, which represent structural and interconnect resources for housing and interconnecting multiple computing nodes. Rack 1410 may include multiple nodes 1430. Rack 1410 may carry multiple blade assemblies 1420(0)-1420(N-1), where N is an integer greater than or equal to 2. Carrying may refer to providing power, structural or mechanical support, and interconnection. Blades 1420(0)-1420(N-1) may refer to computing resources on a printed circuit board (PCB), where the PCB houses the hardware components of one or more nodes 1430. Blades 1420(0)-1420(N-1) may or may not include a chassis, enclosure, or other “box” besides those provided by rack 1410. Blades 1420(0)-1420(N-1) may include an enclosure with exposed connectors for connection to rack 1410. System 1400 may or may not include rack 1410, and each blade (e.g., 1420(0)) may include a chassis or housing that may be stacked or otherwise closely proximate with other blades and allow nodes 1430 to interconnect. System 1400 may include 10,624 compute blades, comprising 63,744 Intel Max series GPUs and 21,248 Intel Xeon Max CPUs on 166 racks.
[0203] System 1400 may include architecture 1470, which represents one or more interconnectors of nodes 1430. Architecture 1470 may include multiple switches 1472 or routers or other hardware for routing signaling between nodes 1430. Additionally, architecture 1470 may couple system 1400 to network 1404 for access by client 1402. Besides routing devices, architecture 1470 may also be considered to include cables or ports or other hardware devices for coupling nodes 1430 together. Architecture 1470 may have one or more associated protocols for managing signaling routing through system 1400. One or more protocols are at least partially dependent on the hardware devices used in system 1400.
[0204] As shown in the figure, rack 1410 may include N blades (e.g., 1420(0)-1420(N-1)). In addition to rack 1410, system 1400 may also include rack 1450. As shown in the figure, rack 1450 may include M blades (e.g., 1460(0)-1460(M-1)). M is not necessarily the same as N; therefore, it is understood that various different hardware device components may be used and coupled together into system 1400 via structure 1470. Blades 1460(0)-1460(M-1) may be the same as or similar to blades 1420(0)-1420(N-1). Node 1430 may be any type of node described herein and is not necessarily of the same type. System 1400 is not limited to homogeneous or non-homogeneous.
[0205] The nodes in blade 1420(0) are shown in detail. However, other nodes in system 1400 may be identical or similar. At least some nodes 1430 may be compute nodes, having processor 1432 and memory 1440. A compute node is a node having processing resources (e.g., one or more processors) that executes an operating system and can receive and process one or more tasks. At least some nodes 1430 may include storage server nodes, which have servers as processing resources 1432 and memory 1440. A storage server is a node having more storage resources than a compute node, and instead of having processors for performing tasks, a storage server includes processing resources for managing access to storage nodes within the storage server.
[0206] Node 1430 may include interface controller 1434, which may represent logic for controlling node 1430's access to structure 1470. The logic may include hardware resources for interconnecting to physical interconnect hardware. The logic may include software or firmware logic for managing the interconnect. Interface controller 1434 may include a host structure interface, which may include a structure interface according to any embodiment described herein.
[0207] Node 1430 may include a memory subsystem 1440. Memory 1440 may include a memory computation resource (comp) 1442, which represents the ability of memory 1440 to perform one or more memory computations. System 1400 supports remote memory operations, such as, but not limited to, those described elsewhere herein. Therefore, node 1430 may request a remote node to perform a memory computation, wherein the data used for the computation remains local to the executing node and is not sent via structure 1470 or from memory to the structure interface. In response to the execution of the memory computation, the executing node may provide the result to the requesting node.
[0208] Processor 1432 may include one or more individual processors. Each individual processor may include a single processing unit, a multi-core processing unit, or a combination thereof. A processing unit may include a main processor, such as, but not limited to, a CPU (Central Processing Unit), a peripheral processor (such as, but not limited to, a GPU (Graphics Processing Unit)), or a combination thereof. Memory 1440 may be or include a memory device and a memory controller.
[0209] The term "memory device" can refer to different types of memory. Memory devices generally refer to volatile memory technology. Volatile memory is memory whose state (and the data stored within it) is uncertain if power is interrupted. Non-volatile memory is memory whose state is deterministic even if power is interrupted. Dynamically volatile memory can refresh the data stored in the device to maintain its state. An example of dynamically volatile memory includes DRAM (Dynamic Random Access Memory) or variations thereof, such as, but not limited to, Synchronous DRAM (SDRAM). The memory subsystem described in this article is compatible with a variety of memory technologies, such as, but not limited to, DDR3 (Double Data Rate version 3, originally released by JEDEC (Joint Electron Device Engineering Committee) on June 27, 2007, currently version 21), DDR4 (DDR version 4, initial specification released by JEDEC in September 2012), DDR4E (DDR version 4, extended version, currently under discussion by JEDEC), LPDDR3 (Low Power DDR version 3, JESD209-3B, released by JEDEC in August 2013), and LPDDR4 (Low Power Double Data Rate (LPDDR) version 4). JESD209-4 (originally released by JEDEC in August 2014), WIO2 (Wide I / O2), JESD229-2 (originally released by JEDEC in August 2014), HBM (High Bandwidth DRAM), JESD235 (originally released by JEDEC in October 2013), DDR5 (DDR version 5, currently under discussion by JEDEC), LPDDR5 (currently under discussion by JEDEC), HBM2 (HBM version 2, currently under discussion by JEDEC) or combinations of other memory technologies, as well as technologies derived from or extended based on such specifications.
[0210] In addition to or as an alternative to volatile memory, in one embodiment, a reference to a memory device may refer to a non-volatile memory device whose state is deterministic even when power is interrupted. In one embodiment, a non-volatile memory device is a block-addressable memory device, such as, but not limited to, NAND or NOR technology. Therefore, the memory device may also include future-generation non-volatile devices, such as, but not limited to, three-dimensional cross-point (3DXP) memory devices, other byte-addressable non-volatile memory devices, or memory devices using chalcogenide phase change materials (e.g., chalcogenide glasses). In one embodiment, the memory device may be or include multi-threshold NAND flash memory, NOR flash memory, single-level or multi-level phase change memory (PCM) or switched phase change memory (PCMS), resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), magnetoresistive random access memory (MRAM) incorporating memristor technology, or spin-transfer torque (STT)-MRAM, or any combination of the foregoing, or other memories.
[0211] In at least one embodiment, system 1400 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0212] Figure 15An accelerated processing unit 1500 according to at least one embodiment is illustrated. The accelerated processing unit 1500 may include a processor based on the CDNA architecture of AMD Inc., Santa Clara, California, or other processors sharing at least some of the components described herein. The accelerated processing unit 1500 may include one or more accelerator complex dies (XCDs) 1504 for performing the operations described elsewhere in this document, such as, but not limited to, graphics processing and / or parallel processing and instruction-level parallel computing, including support for multiple precisions (INT8, FP8, BF16, FP16, TF32, FP32, and FP64) and sparse matrix data (i.e., sparsity). In some cases, the XCD may be referred to as a graphics computing die (GCD). The accelerated processing unit 1500 may include one or more complex computing dies (CCDs) 1506 for performing the operations described elsewhere in this document, such as, but not limited to, operations performed by a host processor. In some cases, the CCD may be referred to as a core complex or CCX, such as, but not limited to, the CCX used in AMD Ryzen processors. XCDs and CCDs can share any type of cache or memory (e.g., one or more memory cells 1502), or a cache or memory can be allocated to each XCD or CCD or group of XCDs or CCDs. For example, AMD Infinity Fabric within the package connects XCDs and CCDs to a shared AMD Infinity Cache 1508, and in some embodiments, to high-bandwidth memory (e.g., HMB3). Accelerated processing unit 1500 may include an AMD MI300a processor comprising three CPU die (or CCD) and six accelerator die (XCD) on top of four input-output dies (IODs) that may be layered on a single silicon die (e.g., via AMD Infinity Fabric) and linked together to eight high-bandwidth DRAM stacks in a ring to form a superchip. For systems using only accelerators, the AMD MI300x processor replaces the CCD with two or more XCDs.
[0213] Accelerated processing unit 1500 may include one or more input / output (I / O) interfaces. For example, XCD 1504 and CCD 1506 may coexist on one or more input-output dies (IODs) 1510, which may include one or more I / O interfaces. IOD 1510 may include any number and type of I / O interfaces (e.g., PCI, PCI expansion (“PCI-X”), PCIe, Gigabit Ethernet (“GBE”), USB, etc.). Various types of peripheral devices may be coupled to I / O interfaces 1570. The I / O interfaces of IOD 1510 may also be used to connect one or more accelerated processing units 1500, for example, in a server architecture.
[0214] Accelerated processing unit 1500 may include one or more memory units 1502 for storing instructions and other information for performing the operations described in other parts of this document. Memory units 1502 may include any volatile memory, such as, but not limited to, the memory types described in other parts of this document, and may include, for example, high-bandwidth memory (e.g., HMB3) or high-bandwidth DRAM. The memory associated with accelerated processing unit 1500 (e.g., memory unit 1502) may include system memory, which can be used for, for example, commands, instructions, and constants, as well as input and output. Memory unit 1502 may also include device memory, which can be used for storage and, for example, for commands, instructions, and constants, as well as input and output, as a return buffer, and for private data. Memory unit 1502 may be linked to one or more IODs 1510. In at least one embodiment, L1 cache 1520 initiates a memory hierarchy including a shared L2 cache 1528 (e.g., within an XCD). AMD Infinity Cache TM It is the last-level cache (LLC) located on the active I / O die (IOD). The CCD 1506 and XCD1504 can have dedicated or shared memory. AMD Infinity architecture and AMD Infinity Fabric TM The technology enables consistent, high-throughput unification of GPU and CPU chip technologies (such as XCD, CCD, and / or CCX) with memory (such as stacked HBM3 memory) in a single device and across multiple device platforms.
[0215] like Figure 15As shown, the XCD 1504 may include a set of shared global resources 1530, which may include a hardware scheduler 1532 and an asynchronous compute engine (ACE) 1524. The ACE 1524 sends tasks (e.g., compute shader workgroups) to compute units (CUs or cores) 1534. Each of the ACEs 1524 (e.g., four) may be associated with a CU 1534 (e.g., 40 CUs), and some CUs 1534 may be disabled for yield management. CUs 1534 may have dedicated caches or shared caches (e.g., L2 caches) 1528 for consolidating all memory traffic on a single die. CU1534 may include threaded and parallel processor cores, including instruction fetching and scheduling using a scheduler (S) 1512, a matrix core unit (MCU) 1516, and a shader core (SC) 1518 (e.g., execution units for scalar, vector, and matrix data types), and a load / store pipeline with an L1 cache 1520 and a local data share (LDS) 1514. The local data share may include, for example, a sticky-note RAM with built-in arithmetic capabilities, allowing data sharing between threads in a workgroup. An instruction cache 1540 (e.g., for storing and providing instructions for performing the operations described elsewhere in this document) and a constant cache 1538 may be connected to one or more CUs and may be shared between two CUs. The matrix core 1516 can handle various data types, such as, but not limited to, INT8, FP8, FP16, BF16, and TF32 data types. Accelerated processing unit 1500 may include computation units 1534, which may be arranged in an array format, such as as a data parallel processor (DPP) array. The hyper-threaded dispatch processor 1542 can communicate with the compute unit 1534, and the command processor 1544 can read commands written by the host to memory-mapped registers in the system memory address space (not shown). When a command is completed, the command processor 1544 can send a hardware-generated interrupt to the host processor (e.g., a CCD). The memory controller 1536 can also directly access all device memories and system memory regions specified by the host. To satisfy read and write requests, the memory controller 1536 can perform the functions of a direct memory access (DMA) controller, including calculating the memory address offset based on the format of the requested data in memory. For example, one or more APIs described herein can be compiled into instructions that can be stored in the instruction cache 1540, then fetched by the instruction fetch logic in the processor 1540, decoded by the processor decoder or equivalent, scheduled (e.g., sequentially or out of order) by the scheduler or equivalent for execution, executed by the execution logic or equivalent, reordered, and then retired by the retirement logic or equivalent.The API (and / or compiled instructions including the API) can be stored in any storage device, either inside or outside the processor 1500 (e.g., in a cache and / or memory). The results of the API can be stored in storage devices, either inside or outside the processor 1500, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents.
[0216] Applications may include programs running on the main processor (e.g., a CCD) and programs running on one or more XCDs (referred to as kernels). Programs can be controlled by host commands that set internal base addresses and other configuration registers, specify data fields on which the accelerator processing unit 1500 can run, invalidate and flush caches on the accelerator processing unit 1500, and cause the accelerator processing unit 1500 to begin executing a program. A kernel can be referred to as a program executed by the accelerator processing unit 1500. Kernels can execute independently on each work item or as a group of work items, referred to as a wavefront, which can execute kernels on all (e.g., 64) work items in a single pass. The computation unit 1534 may include: a scalar arithmetic logic unit (ALU) that can operate on a single value for each wavefront (shared by all work items); a vector ALU that can operate on a unique value for each work item; a local data share 1514 that allows work items within a workgroup to communicate and share data; a scalar memory (not shown) that can transfer data between the scalar general-purpose registers (SGPRs) and memory via cache; and a vector memory that can transfer data between the vector general-purpose registers (VGPRs) and memory, including sampling of texture maps. Kernel control flow can be manipulated using scalar ALU instructions, which may include if / else statements, branches, and loops. Scalar ALU (SALU) and memory instructions can operate on the entire wavefront and operate on one or more SGPRs. Vector memory and ALU instructions can operate on all work items in the wavefront simultaneously.
[0217] In at least one embodiment, the acceleration processing unit 1500 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0218] Figure 16Processor 1600 is shown, including, but not limited to, a Zen architecture-based processor (e.g., Zen1, 2, 3, 4, 5, or other architectures) from AMD Inc. of Santa Clara, California, or other processors that share at least some of the components described herein. Processor 1600 includes one or more CPU dies 1602(1)-1602(N), where N is any integer greater than 1. CPU die 1602 may include any number of processor cores 1616 (e.g., for performing any operations described elsewhere herein) and any number of cache memories (e.g., for storing instructions and other information to perform any operations described elsewhere herein). For example, L2 cache unit 1618 may be coupled to processor core 1616, and processor core 1616 may share and / or be individually coupled to L2 cache unit 1618. Processor core 1616 can be coupled to L3 cache 1622 and / or a shared L3 cache, which can be the lowest level cache (LLC) 1622 used to access data and other information used by processor core 1616. One or more processor cores 1616 and one or more L2 cache units 1618 can be included in a core complex (CCX) 1620, which can include (e.g., 32MB) a shared cache (e.g., L3 cache 1622). Core complex 1620 can be manufactured onto a die (CCD or CPU die) 1602. For example, up to 12 core complexes 1620 can be configured in a processor along with 8 CPU dies 1602, thereby providing up to 96 processor cores 1616 for processor 1600. For example, a "Zen 4c" core complex 1620 can include up to 8 cores 1616 and a shared 16MB L3 cache 1622. Two core complexes from these core complexes 1620 can be combined onto a single CPU die 1602, resulting in 16 cores per die and a total of 32MB of L3 cache per die 1622. Up to eight CPU dies 1602 can be combined with I / O units 1604 to provide up to 128 processor cores 1616 for the CPU. Up to four "Zen 4c" dies mentioned above can be combined to provide up to 64 processor cores 1616 for the CPU.
[0219] Processor 1600 may include various configurations for input / output operations, which will be further described herein. I / O unit 1604 may include one or more memory controllers 1606 capable of managing the memory usage of processor 1600 (e.g., DDR5 memory). I / O unit 1604 may include one or more SATA disk controllers for managing storage device 1612, and one or more Compute Express Links (CXL) for providing CPU-to-device and CPU-to-memory connectivity, which can be flexibly assigned to specific functions during server design. TM 1.1+ Memory controller 1614. I / O unit 1604 may include PCIe controller 1608 for connecting peripherals and other components connected to processor 1600. I / O unit 1604 may also include USB port 1610 for connecting to other components separate from processor 1600. CPU die 1602 may support any number of connections to I / O unit 1604, for example, one or two connections. As shown, I / O unit 1604 may include components further described herein, and I / O unit 1604 may be an I / O die accommodating several different components. Memory controller 1606, PCIe controller 1608, USB port 1610, SATA controller 1612 and / or CXL controller 1614 may be individually integrated into any location within processor 1600, or integrated in any group or combination.
[0220] Processor 1600 may include an Infinity Fabric 1624 interconnect (which may be similar to or based on a PCIe architecture) that provides connectivity between the CPU (e.g., CPU die 1602(1)-1602(N)), graphics processor 1626, inference engine 1632, and other components in a multi-chip architecture (e.g., security processor 1628 and I / O unit 1604). One or more AMD Infinity Fabrics TM Interconnect 1610 can be connected to CPU dies 1602(1)-1602(N) and used as a connection between CPUs. One or more Infinity Fabric connections 1610 can connect each CPU die 1602 to the I / O unit 1610.
[0221] In at least one embodiment, processor 1600 may include a central processing unit (CPU) and other related hardware and software described above and further herein. Processor 1600 may also include a graphics processor 1626. Graphics processor 1626 may be used for image generation and processing, as well as other computations and operations described further herein. Graphics processor 1626 may be based on AMD's RDNA 3 or 3.5 architecture, located in Santa Clara, California. Graphics processor 1626 may include a graphics computing die (GCD) and a memory cache die (MCD). The GCD may include any number of computing units (CUs) for graphics or other processing, such as operations performed by an arithmetic logic unit (ALU) described further herein. Graphics processor 1626 may include an L2 cache available for use by the computing units. The MCD (not shown) may include any number of memory cells and may include a cache (e.g., an L3 cache) and a memory interface for coupling to memory (e.g., memory 1642(1)-(N), where N is an integer). Components within the graphics processor 1626 can be connected using various methods, such as using Infinity Fabric 1624 interconnects, either internally or externally to the graphics processor 1626.
[0222] Inference engine 1632 can provide neural processing capabilities to processor 1600 for computational processes used in neural networks, deep learning, and other artificial intelligence-related operations, which will be further described herein. Processor 1600 may include: a security processor 1628 for managing the security of processor 1600; a display controller 1630 for controlling the display; a system management unit 1634 for managing and operating some or all components on processor 1600; a multimedia engine 1636 for audio and video operations; a fusion controller hub 1638 for managing USB, SATA, and PCIe connections to the processor; and a sensor fusion hub 1640 for managing sensors (e.g., accelerometers). Processor 1600 may also include memory 1642(1)-(N), where N is any integer. Memory may include different memory types, such as LPDDR5 and / or DDR5, or other memory described elsewhere herein.
[0223] To perform the operations further described herein, processor 1600 may include an execution pipeline including a front-end that may include a cache for storing instructions (e.g., an L1 cache) (not shown). A branch predictor may modify the instruction stream. Instructions may be decoded by a decoder, dispatched to a back-end for execution, and renamed. For example, the instruction fetch and decode pipeline may be dispatched to integer or floating-point execution operations, which may be scheduled by a scheduler and passed to vectors and / or general-purpose registers. Floating-point multipliers and / or addition operations may be processed, and an arithmetic logic unit (ALU) may also be used to perform computations, such as arithmetic and logical operations. The output of the computation unit may be coupled to a load / store queue, which may be connected to a cache, such as an L1 cache and / or an L2 cache.
[0224] With respect to processor 1600 and any components described above or elsewhere herein, one or more APIs or equivalents described herein may be compiled, for example, into instructions or equivalents (e.g., AVX-512 instructions based on a SIMD model). These instructions or equivalents may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. APIs (and / or compiled instructions including APIs) may be stored in any storage device (e.g., cache and / or memory) internal or external to processor 1600. The results of the APIs may then be stored in storage devices internal or external to processor 1600, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents.
[0225] In at least one embodiment, processor 1600 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0226] Figure 17 An example of a processing core 1700 is shown, which may implement an Arm architecture (e.g., v9.0-A) or other processors that share at least some of the components described herein. Neoverse TMThe V2 core 1700 can be implemented within a DynamIQ shared unit (DSU) cluster via a DSU-110 interconnect 1754 for connecting one or more cores, for example, for parallel processing. Neoverse TM The V2 core can be implemented as a single core in a DSU cluster configured for direct interconnection, with or without L3 cache, listener filter, or listener control unit (SCU) logic (not shown). Neoverse TMThe V2 core may include a CPU bridge 1752 for connecting core 1700 to the DSU-110 interconnect. This bridge may also connect core 1700 to an external memory system and the remainder of the on-chip system. The L1 instruction memory system 1702 may fetch instructions from instruction cache 1704 and deliver instructions (e.g., one or more compileable APIs described herein) to instruction decoding unit 1710, for example, to perform some or all of the operations described above or elsewhere herein. The L1 instruction memory system 1702 may include L1 instruction cache 1704 (e.g., with 64-byte cache lines), L1 instruction translation back buffer (TLB) 1706 (e.g., natively supporting 4KB, 16KB, 64KB, and 2MB page sizes), and macro operation cache (MOP) 1708 (e.g., a 1536-entry, 4-way skewed associated L0 MOP cache), which may include decoded and optimized instructions for higher performance. Instruction decoding unit 1710 may decode AArch64 instructions into its internal format. The register renaming unit 1712 can perform register renaming to facilitate out-of-order execution and dispatch decoded instructions to various issue queues. The instruction issuing unit 1714 can control when decoded instructions are dispatched to the execution pipeline, and it can include an issue queue for storing instructions to be dispatched to the execution pipeline. The integer execution pipeline 1716 can be included in the execution pipeline and includes an integer execution unit 1718 that can perform arithmetic and logical data processing operations. The vector execution unit 1720 can be included in the execution pipeline and can execute advanced SIMD and floating-point arithmetic (FPU) 1722, execute Scalable Vector Extension (SVE) and Scalable Vector Extension 2 (SVE2) instructions 1724, and can also selectively execute cryptographic instructions 1726. The advanced SIMD can include a media and signal processing architecture that primarily adds instructions for audio, video, 3D graphics, image, and speech processing. The floating-point architecture provides support for single-precision and double-precision floating-point operations. The L1 data memory system 1730 executes load and store instructions, as well as service memory coherence requests. The L1 data memory system 1730 may include an L1 data cache 1732 and a fully associative L1 data TLB 1734, natively supporting 4KB, 16KB, and 64KB page sizes and 2MB and 512MB block sizes. The memory management unit (MMU) 1728 provides fine-grained memory system control through a set of virtual-to-physical address mappings and memory attributes, which are stored in a translation table and saved to the TLB 1734 after address translation. The L2 memory system 1736 may include an L2 cache 1738 and can be connected to the DSU-110 1754 via an asynchronous CPU bridge 1752. TMThe V2 core 1700 supports a range of debugging, testing, and tracing options, including the tracing unit 1742, the tracing buffer 1740, and the embedded logic analyzer (ELA) 1748. Neoverse TM The V2 core 1700 implements the Statistical Analysis Extension (SPE) 1744, which provides a statistical view of the performance characteristics of executed instructions. Software developers can leverage these views to optimize code for better performance. The Performance Monitoring Unit (PMU) 1746 provides a performance monitor that can be configured to collect statistics on the operation of each core and memory system. This information can be used for debugging and code analysis. The General Purpose Interrupt Controller (GIC) CPU interface 1750, when integrated with external allocator components, serves as a resource for supporting and managing interrupts in a cluster system. In a cluster, each Neoverse... TM There can be a CPU bridge 1752 between the V2 core 1700 and the DSU-110 1754. The CPU bridge 1752 can control the buffering and synchronization between the core 1700 and the DSU-110 1754. The CPU bridge 1752 can be asynchronous to allow each core 1700 to use a different frequency, power, and area implementation point. The CPU bridge 1752 can operate synchronously without affecting other interfaces, such as, but not limited to, asynchronous debug and tracing interfaces.
[0227] In at least one embodiment, core 1700 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0228] Figure 18 One or more chips including one or more tensor processing units (TPUs) 1800 are shown according to at least one embodiment. Figure 18 The TPU 1800 may include an application-specific integrated circuit (ASIC), for example, for performing some or all of the operations described above or elsewhere herein, such as, but not limited to, machine learning workloads that accelerate the execution of matrix operations. The TPU 1800 may be an ASIC from Alphabet Corporation, Mountain View, California. The cloud TPU includes a cloud service that enables the TPU to be used as a scalable resource for processing tasks, such as, but not limited to, machine learning workloads that can run on frameworks such as, but not limited to, TensorFlow, PyTorch, and JAX.
[0229] Chip 1800 may include any number of TPUs, which may include a tensor core 1806. The tensor core 1806 may include one or more core sequencers 1808, vector processing units (VPUs) 1810, matrix multiplication units (MXUs) 1812(A)-1814(N) (where N is any integer greater than 1), and transpose permutation units 1816. The core sequencer 1808 may fetch instructions (e.g., VLIW (Very Long Instruction Word)) from the instruction memory (Imem) of core 1806, perform scalar operations using scalar data memory (Smem) and scalar registers (Sregs) (not shown), and forward vector instructions to the vector processing units (VPUs) 1810. For example, an instruction may initiate eight operations: two scalar operations, two vector ALU operations, vector loading and storing, and queuing data into the matrix multiplication and transpose units and a pair of slots for queuing data from them. The VPU1810 can perform vector operations using a large on-chip vector memory (Vmem) and vector registers (Vregs). The VPU1810 can stream data to or from the MXU via a decoupled FIFO. The VPU1810 can collect and distribute data to the Vmem using both data-level parallelism (2D matrix and vector function units) and instruction-level parallelism (8 operations per instruction). Large two-dimensional matrix multiplication units (MXUs) 1812(A)–1812(N) can, for example, use a systolic array to reduce area and power consumption, and use a large, software-controlled on-chip memory instead of a cache. The transpose-reduction-permute unit 1816 can perform (e.g., 128x128) matrix transpose, reduction, and permute on the VPU1810 channels. A high-bandwidth memory 1804 can be used for on-chip applications and can be coupled to the host queue 1802, for example, via PCIe. One or more chips 1800 can be connected together for computing. For example, one or more chips 1800 can be connected to form a torus, such as a 2D torus. Chips 1800 can also include any number (e.g., four) of inter-core interconnect (ICI) links 1818, which can enable direct connections between chips to form a supercomputer.
[0230] For any processor in chip 1800 and any components described above or elsewhere herein, one or more APIs or equivalents described herein may, for example, be compiled into instructions or equivalents that may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. APIs (and / or compiled instructions including APIs) may be stored in any storage device (e.g., cache and / or memory) external to or internal to any processor in chip 1800. The results of the APIs may then be stored in any storage device internal to or external to any processor in chip 1800, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents.
[0231] In at least one embodiment, chip 1800 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0232] Figure 19A vector processor according to at least one embodiment is illustrated. The vector processor 1900 may support the RISC-V standard. The vector processor 1900 may include one or more cores 1910 (e.g., scalar units) and one or more vector processing units (VPUs) 1942 (e.g., vector units), which may, for example, perform some or all of the operations described above or elsewhere herein. The core 1910 may include an Andes Custom Extension (ACE) 1916, which can be used, for example, to deliver custom instructions to the processor 1900 via an ACP 1938. The core 1910 may include a 1-cycle multiplier and a 1-cycle instruction / data local memory (ILM / DLM) for improving parallelism by allowing simultaneous instruction fetching and data access. A memory management unit (MMU) 1924 manages system memory and cache, and provides branch execution, instruction pair issuance, L1 instruction / data cache, and local memory storage. The core 1910 may include a physical memory protection and programmable physical memory attribute unit (PMP / PPMA) 1922. Core 1910 may include a digital signal processor (DSP) 1928 and a floating-point unit (FPU) 1926, as well as a load-memory unit (LSU) 1932 for interaction with memory hierarchies (D$1934 and I$1930). Core 1910 may include a branch prediction unit 1918 and a multiplier unit 1920.
[0233] The Vector Processing Unit (VPU) 1942 may include one or more Vector Functional Units (FUs) 1946(A)-1946(N) (which may be linked together for parallel processing), a separate memory path for loading / storing RISC-V Vectors (RVVs) via the ACE-RVV 1948 and the AndesStreaming Port (ASP) 1944, and a Vector Load / Storage Unit (VLSU) 1950.
[0234] The vector processor 1900 may include bus interfaces such as, but not limited to, a cacheable L2 cache port 1956, a non-cacheable MMIO port 1954, an input-output coherence port (IOCP) 1958 for a cacheless bus master, a local memory access port for accessing the ILM / DLM 1912 (which can be coupled to SRAM 1906) and the high-bandwidth vector memory (HVM) 1936, and a shared peripheral port (SPP) 1952 for external peripherals. Other memory ports include the LM slave port AXI 1902, the HVM subordinate port AXI 1904, the MEM (AXI) 1962, and the AXI 1960. The trace I / F 1914 can be captured, encoded, and transmitted off-chip via the Inst. trace I / F 1908 (e.g., a record of executed processor instructions). Software tools can use the Inst. trace I / F 1908 to reconstruct the exact execution sequence of a program.
[0235] For any processor in processor 1900 and any components described above or elsewhere herein, one or more APIs or equivalents described herein may, for example, be compiled into instructions or equivalents that may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. APIs (and / or compiled instructions including APIs) may be stored in any storage device external to or internal to processor 1900 (e.g., in cache and / or memory). The results of the APIs may then be stored in storage devices internal to or external to processor 1900, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents.
[0236] In at least one embodiment, the vector processor 1900 may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0237] Figure 20A A schematic diagram of an example multi-core tiled processor microarchitecture is shown. Figure 20AMulti-core tiled processors in a system can include language processing processors. For example... Figure 20A As shown, each “tile” in the processor architecture is a processing element bundled together using an on-chip network (NoC), which can be used to perform some or all of the operations described above or elsewhere in this document. For example, each tile may have an instruction dispatch 2004 and integer (INT) units 2006 and floating-point (FP) units 2008, a load-memory unit (LSU) 2012 for engaging with the memory hierarchy (data cache (D$) 2010 and instruction cache (I$) 2014), and a network (NET) interface 2016 for communicating with other tiles. Some tiles in processor 2000 may include a memory controller 2002 for managing and controlling memory, as further described herein. Processor 2000 may have a functionally sliced architecture. Processor 2000 may reside on an application-specific integrated circuit (ASIC). Figure 20A The layout of an ASIC can be represented. Processor 2000 may include a coprocessor designed to execute instructions for a predictive model. A predictive model refers to any model configured to make predictions based on input data. The predictive model can use a classifier for classification predictions. The predictive model can be a machine learning model, such as, but not limited to, a tensor flow model, and processor 2000 is a tensor flow processor.
[0238] The processor 2000 can use different microarchitectures, and these microarchitectures will Figure 20B Each tile in the process represents a separate functional unit. Conversely, the functional tiles 2024 of the processor 2000 can be aggregated into multiple functional processing units (hereinafter referred to as "slices") 2004, each corresponding to a specific functional type (e.g., FP / INT 2018, NET 2020, MEM 2022). For example, as... Figure 20B As shown, each slice can correspond to a row of functional tiles extending in a north-south direction. Furthermore, the processor 2000 may also include communication channels for carrying data between tiles in different slices, each communication channel extending horizontally in an east-west direction. Each communication channel can be connected to each slice 2004 of the processor 2000.
[0239] The slices 2004 of the processor 2000 may each correspond to different functions and may include arithmetic logic slices (e.g., FP / INT 2018), channel switching slices (e.g., NET 2020), and memory slices (e.g., MEM 2022). Arithmetic logic units can perform one or more arithmetic and / or logical operations on data received via a communication channel to generate output data. Examples of arithmetic logic units may be matrix multiplication units and vector multiplication units. Memory slices include memory cells that store data. Memory slices can provide data to other slices via communication channels. Memory slices can also receive data from other slices via communication channels. Channel switching slices can configurably route data from one communication channel to any other communication channel. For example, data from a first channel can be provided to a second channel via a channel switching slice. In some embodiments, a channel switching slice can be implemented as a crossbar switch. Each slice 2004 also includes its own instruction queue (not shown) for storing instructions and an instruction control unit (ICU) for controlling instruction execution. Instructions in a given instruction queue can only be executed by a tile in its associated functional slice and not by other slices of the processor.
[0240] By arranging the tiles of processor 2000 into different functional slices 2004, the on-chip instruction and control flow of processor 2000 can be separated from the data flow. For example, according to some embodiments, Figure 20B One of the arrows illustrates the instruction flow within the processor architecture. According to at least one embodiment, Figure 20B Another arrow in the diagram illustrates the data flow within the processor architecture. As shown, instruction and control flow can flow across the tiles of processor 2000 in a first direction (e.g., north-south along the length of a functional slice, as indicated by the first arrow), while data flow can flow across the tiles of processor 2000 in a second direction (e.g., east-west across a functional slice, as indicated by the second arrow), which is perpendicular to the first direction.
[0241] Different functional slices of the processor 2000 can correspond to MEM 2022 (memory), VXM (vector execution module), MXM (matrix execution module), NIM (numerical interpretation module), and SXM (swapping and permutation module). Each slice can include N tiles, all of which can be controlled by the same instruction control unit (ICU) (not shown). Each slice can operate completely independently and can only be coordinated using barrier-like synchronization primitives or by the compiler using tractable determinism. Each tile of the processor 2000 can correspond to an execution unit organized as ×MSIMD tiles. For example, each tile of the on-chip memory of the processor 2000 can be organized to atomically store L-element vectors. Therefore, a MEM slice with N tiles can work together to store or process large vectors (e.g., with a total of N×M elements).
[0242] Tiles within a slice can execute instructions in an "interleaved" manner, where instructions can be issued tile-by-tile within the slice over N cycle periods. Functional slices can be physically arranged on the chip to allow for efficient data flow for pipelined execution over hundreds of cycles for common patterns. The data flow can perform a single "u-turn" (direction change) corresponding to a single matrix operation before being written back to memory; in some embodiments, a particular data flow can change direction multiple times before writing the resulting data back to memory (due to multiple matrix and vector operations).
[0243] To achieve good single-threaded performance, conventional multi-core processor designs (e.g., ...) Figure 20A As shown, a significant portion of silicon is typically required to be dedicated to exposing and utilizing instruction-level parallelism (ILP). This often involves register renaming schemes and large instruction windows where instructions have no explicit understanding of the hardware on which they will execute, while maintaining the illusion that programs execute sequentially. In contrast, when using processors with a function slice architecture (e.g., TSP), the TSP compiler (not shown) generates an explicit plan of how the processor 2000 can execute programs (e.g., microprograms). The compiler can specify when each operation will execute, which function slices will perform the work, and which STREAM registers will hold operands. The compiler can maintain a high-fidelity (cycle-accurate) model of the processor 2000's (e.g., TSP) hardware state so that microprograms can coordinate data flow.
[0244] Processor 2000 (e.g., TSP) can use a web-hosted compiler that takes a model (e.g., an ML model, such as but not limited to TensorFlow models) as input and issues a proprietary stream of instructions for processor 2000 (e.g., TSP). The compiler is responsible for coordinating the control and data flow of the program and specifying any instruction-level parallelism by explicitly bundling instructions that can and should be executed concurrently so that they can be dispatched together. The main hardware architecture includes the architecture-visible streaming register file (STREAM), which will be described in more detail below, and acts as a conduit for operands to flow from MEM slices (e.g., SRAM) to functional slices (and vice versa).
[0245] The MEM 2022 of the processor 2000 can be used as: (1) a storage for model parameters, microprograms, and data on which they operate; and (2) an on-chip network (NoC) for transferring data operands from the MEM to functional slices and returning computation results to the MEM. In some embodiments, the on-chip memory may consume approximately 75% of the chip area of the processor 2000. In some embodiments, the on-chip memory of the MEM tile may include SRAM instead of DRAM due to the bandwidth requirements of the processor 2000. The on-chip memory capacity of the processor 2000 can be determined by: (i) the number of ML models that can reside on the chip simultaneously, (ii) the size of any given model, and (iii) partitioning of large models to fit a multi-chip system. In some embodiments, the MEM system of the processor 2000 may provide multiple memory slices organized into two distinct hemispheres (referred to as “MEM WEST” and “MEM EAST”, respectively).
[0246] The memory slices in each hemisphere can be mirrored such that the slices are physically numbered {0,...L} in the Eastern Hemisphere and {L,...0} in the Western Hemisphere, such that memory slice 0 in each hemisphere corresponds to the slice of the VXM slice closest to the hemisphere, where each hemisphere comprises L slices. Data transfer towards the chip center can be referred to as inward, while data transfer towards the outer edge of the chip (easternmost or westernmost) can be referred to as outward. Although the memory hemispheres of the processor 2000 can be referred to as East and West, it is understood that other names may be used to refer to different memory hemispheres in other embodiments.
[0247] In some embodiments, streaming register files (referred to as STREAMS) transfer operands and results between the SRAM of a MEM slice of the processor 2000 and the functional slice. In some embodiments, multiple MEM slices (e.g., 2 to 10 adjacent MEM slices) can be physically organized into sets. Each slice set can be located between a pair of STREAMS register files, allowing each slice to read from or write to the STREAMS registers in either direction. By placing STREAMS register files between sets of MEM slices, the number of cycles required to transfer data operands across hemispheres can be reduced (e.g., reduced by a factor corresponding to the number of slices per set). The number of slices per set can be configured based on the distance of data transfer within a single clock cycle.
[0248] for Figure 20A Any processor and any component described above or elsewhere herein, one or more APIs or equivalents described herein may be compiled, for example, into instructions or equivalents that may be fetched by instruction fetching logic or equivalents, decoded by processor decoder or equivalents, scheduled (e.g., sequentially or out of order) for execution by scheduler or equivalents, executed by execution logic or equivalents, reordered, and then retired by retirement logic or equivalents. The API (and / or compiled instructions including the API) may be stored in any storage device internal or external to the processor 2000 (e.g., in cache and / or memory). The results of the API may then be stored in storage devices internal or external to the processor 2000, including registers, DRAM, flash memory, SRAM, cache, or other memory equivalents.
[0249] In at least one embodiment, the processor 2000 may include one or more circuits configured to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be software-configured to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to perform any of the operations described above or elsewhere herein.
[0250] Software Structure
[0251] The following figures illustrate, without limitation, examples of software structures for implementing at least one embodiment.
[0252] Figure 21A software stack of a programming platform according to at least one embodiment is illustrated. The programming platform may include a platform for accelerating computational tasks by utilizing hardware on a computing system. In at least one embodiment, software developers can access the programming platform through libraries, compiler instructions, and / or extensions to programming languages. The programming platform may be CUDA, Radeon Open Computing Platform (“ROCm”), OpenCL (OpenCL... TM Developed by the Khronos Group), SYCL, or Intel OneAPI.
[0253] The software stack 2100 of the programming platform can provide an execution environment for the application 2101. The application 2101 may include any computer software that can be launched on the software stack 2100. The application 2101 may include artificial intelligence (“AI”) / machine learning (“ML”) applications, high-performance computing (“HPC”) applications, virtual desktop infrastructure (“VDI”), or data center workloads.
[0254] Application 2101 and software stack 2100 run on hardware 2108. Hardware 2108 may include one or more GPUs, CPUs, FPGAs, AI engines, and / or other types of computing devices supporting programming platforms. Software stack 2100 may be vendor-specific and compatible only with vendor-specific devices, such as CUDA, ROCm, OneAPI, OpenCL, or other implementations. Hardware 2108 may include a host connected to one or more devices that can be accessed via application programming interface (“API”) calls to perform computational tasks. In at least one embodiment, the devices within hardware 2108 may include GPUs, FPGAs, AI engines, or other computing devices (but may also include CPUs) and their memory, while the host within hardware 2108 may include CPUs (but may also include computing devices) and their memory. For any hardware 2108 described above or elsewhere herein, the one or more APIs described herein may, for example, be compiled into instructions that may be fetched by instruction fetching logic, decoded by a processor decoder, scheduled (e.g., sequentially or out of order) for execution by a scheduler, executed by execution logic, reordered, and then retired by retirement logic. The API (and / or compiled instructions including the API) may be stored in any storage device (e.g., cache and / or memory) internal or external to hardware 2108. The results of the API may be stored in storage devices internal or external to hardware 2108, including registers, DRAM, flash memory, SRAM, cache, or other memory. One or more APIs described herein may receive calls. One or more APIs described herein may communicate with a library or a portion of a library to perform the function described by the call. One or more APIs described herein may receive calls and communicate with a library or a portion of a library to perform the function described by the call.
[0255] The software stack 2100 of the programming platform may include multiple libraries 2103, a runtime 2105, optional drivers / interfaces 2107, and device kernel drivers 2108. Each library 2103 may include data and programming code that can be used by computer programs and utilized during software development. Library 2103 may include pre-written code and subroutines, classes, values, type specifications, configuration data, documentation, help data, and / or message templates. Library 2103 may include functionality optimized for execution on one or more types of devices. Library 2103 may include functionality for performing mathematical, deep learning, and / or other types of operations on the device. Library 2103 may be associated with corresponding APIs 2102, which may include one or more APIs for exposing the functionality implemented in library 2103. A processor (e.g., CPU, GPU) may execute, call, or otherwise use one or more APIs to determine kernel priority. For example, a first kernel (e.g., a parent kernel) may launch a second kernel (e.g., a child kernel), and the processor may use the second kernel to launch an additional kernel (e.g., a grandchild kernel) independent of the first kernel. The processor can execute APIs or call APIs from memory to support dynamic stream priorities (e.g., updating priorities when performing operations using streams). For example, when the processor executes the API, it allows the programmer to copy stream priorities from one stream to one or more other streams.
[0256] Software stack 2100 may include APIs that support dynamic stream prioritization (e.g., updating priority while performing operations on the stream), allowing programmers to set the stream's priority at any time after the stream is created. Software stack 2100 may include APIs that support dynamic stream prioritization (e.g., updating priority while performing operations on the stream), allowing programmers to obtain the stream's current priority, where the priority is one of several attributes of the stream. Software stack 2100 may include APIs that support dynamic stream prioritization (e.g., updating priority while performing operations on the stream), allowing programmers to obtain the stream's current priority as a single attribute. Software stack 2100 may include APIs that support dynamic stream prioritization (e.g., updating priority while performing operations on the stream), allowing programmers to start the kernel to perform operations on the stream at a set priority, which may be different from the stream priority. Software stack 2100 may include an API for indicating whether an object (e.g., a thread synchronization object, such as, but not limited to, a barrier) tracks whether all data movement operations of a set of threads running on the GPU have a specified state after a specified time period, wherein the specified state may be a state indicating that data has been moved and is ready for use, and is specified using expected parity values as input to the API.
[0257] Software stack 2100 may include one or more APIs for updating the kernel. The processor may execute APIs or call APIs from memory to update existing APIs, thereby supporting a context-independent kernel. This allows programmers to add kernel nodes to a graph without a graph context, so that the graph context can be dynamically associated with the kernel at runtime. Software stack 2100 may include one or more APIs that allow programmers to obtain kernel identifiers and graph contexts as separate parameters from kernel nodes, thereby enabling parameter retrieval from both the kernel and the context-independent kernel. Software stack 2100 may include one or more APIs that use parallel processors (e.g., but not limited to one or more graphics processing units) to initiate task graphs (e.g., task graphs) and execute one or more task graphs (e.g., including one or more programs).
[0258] Software stack 2100 may include one or more APIs for associating one or more instructions with one or more memory sorting operations (e.g., but not limited to fence or memory barrier operations). Instructions may be associated with one or more domains, causing memory sorting operations to execute in association with one or more specific domains without interfering with instructions in other domains. APIs may indicate that a thread has reached (e.g., reached a thread synchronization barrier) or completed a certain phase of work associated with an asynchronous data movement operation on the GPU. Software stack 2100 may include one or more APIs that allow programmers to manually indicate an expected transaction count when a thread completes a certain phase of work; this transaction count can be used to update an object used to track whether all data movement operations for a set of threads have been completed.
[0259] Application 2101 can be written as source code and then compiled into executable code, as described below. Figure 23 and Figure 24 As discussed in more detail, the executable code of application 2101 can run, at least partially, in the execution environment provided by software stack 2100. During the execution of application 2101, code that needs to run on the device (rather than the host) may be encountered. In this case, runtime 2105 can be invoked to load and launch the required code on the device. Runtime 2105 can include any technically feasible runtime system capable of supporting the execution of application 2101.
[0260] Runtime 2105 can be implemented as one or more runtime libraries associated with a corresponding API (shown as API 2104). One or more such runtime libraries may include functions for memory management, execution control, device management, error handling, and / or synchronization, etc. Memory management functions may include functions for allocating, dealing with, and copying device memory, as well as functions for transferring data between host memory and device memory. Execution control functions may include functions for launching functions on the device (sometimes referred to as "kernels" when the function is a global function that can be called from the host) and setting attribute values in buffers maintained by the runtime libraries so that a given function can be executed on the device.
[0261] The runtime library and the corresponding API 2104 can be implemented in any technically feasible manner. One (or any number) APIs can expose a set of low-level functions for fine-grained control of the device, while another (or any number) APIs can expose a set of high-level functions for such functions. High-level runtime APIs can be built on top of low-level APIs. One or more runtime APIs can be language-specific APIs, which can be layered on top of language-independent runtime APIs.
[0262] Optional drivers or interfaces 2107 can be implemented, for example, for CUDA and ROCm implementations, which will be described further below. Optional drivers / interfaces 2107 can be associated with optional driver or interface APIs, such as, but not limited to, the CUDA and / or ROCm APIs.
[0263] One or more processors disclosed in the “processing system” may execute, access, or otherwise use software stack 2100. For example, system-on-a-chip 800, parallel processor 900, graphics multiprocessor 934, processor 1000, processor 1100, accelerator 1200, neuromorphic processor 1305, supercomputer 1400, acceleration processing unit 1500, processor 1600, processor 1700, tensor processing unit 1800, processor 1900, and language processing unit 2000 may execute, use, call, or otherwise implement (e.g., by accessing memory) one or more APIs included in software stack 2100.
[0264] Device kernel driver 2108 can be configured to facilitate communication with the underlying device. Device kernel driver 2108 can provide low-level functionality for APIs (such as, but not limited to, API 2104) and / or other software. Device kernel driver 2108 can be configured to compile intermediate representation (“IR”) code into binary code at runtime. For CUDA or other implementations (such as, but not limited to, ROCm, OneAPI, or OpenCL), device kernel driver 2108 can compile non-hardware-specific parallel thread execution (“PTX”) IR code into binary code for a specific target device at runtime (and cache the compiled binary code), which is sometimes referred to as “finalized” code. Doing so allows the finalized code to run on a target device that may not have existed when the source code was initially compiled into PTX code. Alternatively, the device source code can be compiled into binary code offline without device kernel driver 2108 compiling the IR code at runtime.
[0265] Processors described elsewhere in this document (e.g., but not limited to) Figure 8-20B The processor (in the document) may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be configured by software (e.g., software stack 2100) to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0266] According to at least one embodiment, Figure 21 The software stack 2100 can execute in a CUDA implementation. The CUDA software stack 2100, on which the application 2101 can be launched, may include CUDA libraries 2103, CUDA runtime 2105, CUDA drivers 2107, and device kernel drivers 2108. The CUDA software stack 2100 can execute on hardware (e.g., a graphics multiprocessor 934, which may include a CUDA-enabled GPU developed by NVIDIA Corporation in Santa Clara, California).
[0267] Application 2101, CUDA runtime 2105, and device kernel driver 2108 can perform the functions described above and elsewhere in this document. CUDA driver 2107 may include a library (libcuda.so) that implements CUDA driver API 2106. Similar to CUDA runtime API 2104 implemented by the CUDA runtime library (cudart), CUDA driver API 2106 exposes functions for memory management, execution control, device management, error handling, synchronization, and / or graphics interoperability. CUDA driver API 2106 differs from CUDA runtime API 2104 in that CUDA runtime API 2104 simplifies device code management by providing implicit initialization, context (similar to processes), and module (similar to dynamically loaded libraries) management. Compared to the high-level CUDA runtime API 2104, CUDA driver API 2106 can serve as a low-level API, providing finer-grained device control, especially in terms of context and module loading. The CUDA driver API 2106 can expose context management functions not exposed in the CUDA runtime API 2104. The CUDA driver API 2106 can also be language-agnostic, supporting technologies such as OpenCL in addition to the CUDA runtime API 2104. Furthermore, development libraries, including the CUDA runtime 2105, can be considered separate from driver components, including the user-mode CUDA driver 2107 and the kernel-mode device driver 2108 (sometimes referred to as the "display" driver).
[0268] CUDA library 2103 may include mathematical libraries, deep learning libraries, parallel algorithm libraries, and / or signal / image / video processing libraries, which parallel computing applications (such as, but not limited to, application 2101) may utilize. CUDA library 2103 may include mathematical libraries, such as, but not limited to, the cuBLAS library (an implementation of the basic linear algebra subroutine (“BLAS”) for performing linear algebra operations), the cuFFT library (for computing the Fast Fourier Transform (“FFT”)), and the cuRAND library (for generating random numbers), etc. CUDA library 2103 may include deep learning libraries, such as, but not limited to, the cuDNN primitive library for deep neural networks and the TensorRT platform for high-performance deep learning inference, etc.
[0269] In at least one embodiment, the processor described elsewhere herein (e.g., but not limited to Figure) Figure 8-20BThe processor (in the document) may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be configured by software (e.g., software stack 2100) to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0270] According to at least one embodiment, Figure 21 The software stack 2100 can execute in the ROCm implementation. An application 2101 can be launched on the ROCm software stack 2100, which includes a language runtime 2103, a system runtime 2105, a thunk 2107, and a ROCm kernel driver 2108. The ROCm software stack 2100 executes on hardware 2109, which may include a GPU that supports ROCm, developed by AMD Inc. of Santa Clara, California.
[0271] Application 2101 is executable and can be combined with the above. Figure 21 Similar functionality to that discussed. Furthermore, the language runtime 2103 and system runtime 2105 can execute functions related to the above. Figure 21 The runtime 2105 discussed here has similar functionality. The difference between the language runtime 2103 and the system runtime 2105 is that the system runtime 2105 is a language-independent runtime that implements the ROCr System Runtime API 2104 and uses the Heterogeneous System Architecture (“HSA”) runtime API. The HSA runtime API may include a streamlined user-mode API that exposes interfaces for accessing and interacting with the AMD GPU, including functions for memory management, execution control via kernel architecture dispatch, error handling, system and agent information, and runtime initialization and shutdown. Unlike the system runtime 2105, the language runtime 2103 may be an implementation of the language-specific runtime API 2102, which sits in a layer above the ROCr System Runtime API 2104. The language runtime API may include the Heterogeneous Computing Portable Interface (“HIP”) language runtime API, the Heterogeneous Computing Compiler (“HCC”) language runtime API, or the OpenCL API, etc. HIP is an extension of the C++ programming language, offering a functionally similar version to the CUDA mechanism. Furthermore, the HIP runtime API can include features related to the above. Figure 21The discussion covers functions similar to those in the CUDA runtime API, such as, but not limited to, memory management, execution control, device management, error handling, and synchronization.
[0272] Thunk(ROCt)2107 can be interface 2106, which is used to interact with the underlying ROCm driver 2108. ROCm driver 2108 can be the ROCk driver, which is a combination of the AMD GPU driver and the HSA core driver (amdkfd). The AMD GPU driver can be a device core driver developed by AMD for GPUs, performing functions in conjunction with the above. Figure 21 The device kernel driver discussed is similar to the 2109. An HSA kernel driver can be a driver that allows different types of processors to share system resources more efficiently through hardware features.
[0273] Various libraries (not shown) may be included in the ROCm software stack 2100 on top of the language runtime 2103, and provide integration with the above. Figure 21 The discussion focuses on CUDA library 2103 and similar functionality. Various libraries can include mathematical libraries, deep learning libraries, and / or other libraries, such as, but not limited to, the hipBLAS library which implements functionality similar to CUDA cuBLAS, the rocFFT library for computing FFTs similar to CUDA cuFFT, etc.
[0274] Processors described elsewhere in this document (e.g., but not limited to) Figure 8-20B The processor (in the document) may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be configured by software (e.g., software stack 2100) to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0275] According to at least one embodiment, Figure 21 The software stack 2100 can execute in an OpenCL implementation. The OpenCL software stack 2100, on which the application 2101 can be launched, may include the OpenCL framework 2103, the OpenCL runtime 2105, and a driver 2108. The OpenCL software stack 2100 can execute on non-vendor-specific hardware 2109. Because devices developed by different vendors support OpenCL, specific OpenCL drivers may be required for interoperability with the hardware of those vendors.
[0276] Application 2101, OpenCL runtime 2105, device kernel driver 2108, and hardware 2109 can execute in combination with the above. Figure 21 Other implementations of the application 2101, runtime 2105, device kernel driver 2108, and hardware 2109 discussed above have similar functionality. The application 2101 may also include an OpenCL kernel (not shown), whose code will be executed on the device.
[0277] OpenCL can define a "platform" that allows a host to control devices connected to it. The OpenCL framework provides platform-level APIs and runtime APIs, shown as Platform API 2102 and Runtime API 2104, respectively. Runtime API 2104 uses a context to manage kernel execution on a device. Each identified device can be associated with a corresponding context, which Runtime API 2104 uses to manage the device's command queue, program objects, kernel objects, shared memory objects, and so on. Platform API 2102 exposes functions that allow the use of a device context to select and initialize devices, submit work to devices via command queues, and enable data transfer with devices. In addition, the OpenCL framework provides various built-in functions (not shown), including mathematical functions, relational functions, and image processing functions.
[0278] The OpenCL framework 2103 may also include a compiler (not shown). Source code can be compiled offline before application execution or online during application execution. Unlike CUDA and ROCm, OpenCL applications can be compiled online by a compiler representing any number of compilers that can be used to compile source code and / or IR code (e.g., but not limited to Standard Portable Intermediate Representation (“SPIR-V”) code) into binary code. Alternatively, OpenCL applications can be compiled offline before execution.
[0279] In at least one embodiment, the processor described elsewhere herein (e.g., but not limited to) Figure 8-20B The processor (in the document) may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be configured by software (e.g., software stack 2100) to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0280] According to at least one embodiment, the software may be supported by a programming platform configured to support various programming models, middleware and / or libraries, and frameworks that the application may rely on. The application may be an AI / ML application implemented using, for example, a deep learning framework (e.g., but not limited to, MXNet, PyTorch, or TensorFlow), which may rely on libraries such as, but not limited to, cuDNN, the NVIDIA Collective Communication Library (“NCCL”), and / or the NVIDIA Developer Data Loading Library (“DALI”) CUDA library to provide accelerated computation on the underlying hardware.
[0281] The programming platform can be a combination of the above. Figure 21 The platform described is one of CUDA, ROCm, or OpenCL. The programming platform can support various programming models, which can be abstractions of the underlying computing system that allow the expression of algorithms and data structures. Programming models can expose features of the underlying hardware to improve performance. Programming models may include CUDA, HIP, OpenCL, C++ Accelerated Massive Parallelism (“C++AMP”), Open Multiprocessing (“OpenMP”), Open Accelerators (“OpenACC”), and / or Vulkan Compute.
[0282] Libraries and / or middleware can provide abstract implementations of programming models. Such libraries may include data and programming code that computer programs can use and leverage during software development. Such middleware may include software that provides services to applications beyond those offered by the programming platform. Libraries and / or middleware may include cuBLAS, cuFFT, cuRAND, and other CUDA libraries, or rocBLAS, rocFFT, rocRAND, and other ROCm libraries. Furthermore, libraries and / or middleware may include the NCCL and ROCm communication collection library (“RCCL”) libraries that provide communication routines for GPUs, the MIOpen library for accelerating deep learning, and / or the Eigen library for linear algebra, matrix and vector operations, geometric transformations, numerical solvers, and related algorithms.
[0283] Application frameworks may depend on libraries and / or middleware. Each application framework can be a software framework that provides a standard structure for implementing application software. Returning to the AI / ML example discussed above, AI / ML applications can be implemented using frameworks such as, but not limited to, deep learning frameworks like Caffe, Caffe2, TensorFlow, Keras, PyTorch, or MxNet.
[0284] In at least one embodiment, the processor described elsewhere herein (e.g., but not limited to) Figure 8-20BThe processor (in the document) may include one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or otherwise perform any of the operations described above or elsewhere herein. One or more circuits may be configured by software (e.g., the programming platform described herein) to execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or perform any of the operations described above or elsewhere herein.
[0285] Figure 22 The method for using at least one embodiment in the above is shown. Figure 21 Compiled code executed on one of the programming platforms shown. Compiler 2201 is configured to receive source code 2200, compile source code 2200, and output executable file 2210. Compiler 2201 can be configured to convert source code 2200 into host executable code 2207 for execution on a host and device executable code 2208 for execution on a device. Source code 2200 can be compiled offline before executing the application or can be compiled online during application execution. Source code 2200 can include code in any programming language supported by compiler 2201, such as, but not limited to, C++, C, Fortran, etc. Source code 2200 can be included in a single source file containing both host code and device code, indicating the location of the device code. The single source file can be a .cu file including CUDA code, or a .hip.cpp file including HIP code, or a file in other formats including both host code and device code. Alternatively, source code 2200 can include multiple source code files instead of a single source file, with host code and device code separated into these files. Compiler 2201 includes or can access one or more libraries to identify API call sequences to execute a single fused API, wherein the single fused API is a combination of two or more APIs. In at least one embodiment, compiler 2201 may be an NVIDIA CUDA compiler (“NVCC”) for compiling CUDA code in a .cu file, or an HCC compiler for compiling HIP code in a .hip.cpp file, or other compilers.
[0286] Compiler 2201 can be configured to compile source code 2200 into host executable code 2207 for execution on a host and device executable code 2208 for execution on a device. The operations performed by compiler 2201 include parsing source code 2200 into an abstract system tree (AST), performing optimizations, and generating executable code. When source code 2200 comprises a single source file, compiler 2201 can separate the device code from the host code in that single source file, compile the device code and host code into device executable code 2208 and host executable code 2207 respectively, and link the device executable code 2208 and host executable code 2207 together to form a single file.
[0287] Compiler 2201 may include compiler front-end 2202, host compiler 2205, device compiler 2206, and linker 2209. Compiler front-end 2202 may be configured to separate device code 2204 from host code 2203 in source code 2200. In at least one embodiment, device code 2204 may be compiled by device compiler 2206 into device executable code 2208, which may include binary code or IR code as described above. Separately, host code 2203 may be compiled by host compiler 2205 into host executable code 2207. For other compilers such as NVCC (e.g., but not limited to oneAPI, ROCm, and OpenCL compilers), host compiler 2205 may be a general-purpose C / C++ compiler that outputs native object code, while device compiler 2206 may be a low-level virtual machine (“LLVM”) based compiler that forks the LLVM compiler infrastructure and outputs PTX code or binary code. For HCC, both host compiler 2205 and device compiler 2206 may be LLVM based compilers that output object binary code.
[0288] After compiling source code 2200 into host executable code 2207 and device executable code 2208, linker 2209 can link the host executable code 2207 and device executable code 2208 together to form executable file 2210. The host's native object code and the device's PTX or binary code can be linked together in an executable and linkable format (“ELF”) file, a container format for storing object code. Host executable code 2207 and device executable code 2208 can take any suitable format, such as, but not limited to, binary code and / or IR code. In at least one embodiment, for CUDA, host executable code 2207 may include native object code, while device executable code 2208 may include code in a PTX intermediate representation. In at least one embodiment, for ROCm, both host executable code 2207 and device executable code 2208 can include object binary code. Other implementations (e.g., but not limited to oneAPI, OpenCL) are considered and can be performed similarly to the CUDA and ROCm implementations described above.
[0289] Source code 2200 can be transformed before compilation. The source code is passed through a transformation tool (not shown), which transforms source code 2200 into transformed source code. Compiler 2201 can be used to compile the transformed source code into host-executable code 2207 and device-executable code 2208, a process similar to compiler 2201 compiling source code 2200 into host-executable code 2207 and device-executable code 2208, as described above. Figure 22 The discussion.
[0290] The transformations performed by the transformation tools can be used to port source code 2200 to environments different from those originally intended to run in. The transformation tools may include a HIP converter, which "hipify" CUDA code intended for the CUDA platform into HIP code that can be compiled and executed on the ROCm platform. The transformation of source code 2200 may include parsing source code 2200 and translating calls to APIs provided by one programming model (e.g., CUDA) into corresponding calls to APIs provided by another programming model (e.g., HIP), as described below. Figure 23 Let's discuss this in more detail. Returning to the example of HIP-enhanced CUDA code, calls to the CUDA runtime API, CUDA driver API, and / or CUDA libraries can be translated into corresponding HIP API calls. The automatic conversion performed by conversion tool 2201 may sometimes be incomplete, requiring additional manual intervention to fully port the source code 2200.
[0291] One or more techniques described herein can leverage other methods of converting one type of code into another to achieve interoperability between different device architectures. In at least one embodiment, an application for one platform (e.g., a CUDA application) can be compiled into code for implementation on another platform (e.g., an AMD processor, an Intel processor, or another processor). For example, source code 2200 may include source code for one platform (e.g., CUDA). Compiler 2201 can compile source code 2200 into an executable file 2210 that can be used by another platform (e.g., AMD or Intel). Programming toolkits can allow applications for one platform (e.g., CUDA) to be compiled (e.g., natively compiled) against another platform (e.g., AMD or Intel). For example, a GPGPU programming toolkit can allow CUDA applications to be natively compiled against AMD GPUs. Programs (e.g., CUDA programs) or their build systems do not require modification or conversion into other languages before being compiled into code for another platform. The compiler can accept the same command-line options and programming cognates (e.g., CUDA cognates) as another compiler (e.g., nvcc for CUDA), serving as a direct replacement for the installation of emulation toolkits (e.g., the NVIDIA CUDA Toolkit), so existing build tools and scripts (e.g., cmake) will work without further modification. In at least one embodiment, an nvcc-cognate CUDA can be compiled for AMD GPUs (including PTX asm) using an nvcc-compatible compiler. Implementations of the CUDA runtime and driver APIs for AMD GPUs can be used. Libraries (e.g., open-source wrapper libraries) can provide APIs by delegating to the corresponding ROCm libraries, such as the "CUDA-X" API. Example implementations include SCALE from Spectral Compute in London, UK. SCALE does not provide a new method for writing GPGPU software, but rather allows programs written in the popular CUDA language to be directly compiled for AMD GPUs. Additional implementations could include the Clang compiler, which provides language front-ends and tooling infrastructure for languages in the C family of languages (C, C++, Objective C / C++, OpenCL, CUDA, and RenderScript).In at least one embodiment, the compiler described herein (e.g., but not limited to compiler 2201, compiler 2205, and / or compiler 2206) may include one or more circuits for compiling code (e.g., CUDA, HIP, OpenCL, OneAPI, or others) to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel or to otherwise perform any of the operations described above or elsewhere herein.
[0292] Figure 23 A system 2300 is shown, configured to compile and execute CUDA source code 2310 using different types of processing units according to at least one embodiment. The system 2300 includes CUDA source code 2310, CUDA compiler 2350, host executable code 2370(1), host executable code 2370(2), CUDA device executable code 2384, CPU 2390, CUDA-enabled GPU 2394, GPU 2392, CUDA to HIP conversion tool 2320, HIP source code 2330, HIP compiler driver 2340, HCC 2360, and HCC device executable code 2382.
[0293] CUDA source code 2310 can be a collection of human-readable code in the CUDA programming language. The CUDA programming language can be an extension of the C++ programming language, including mechanisms for defining device code and distinguishing between device code and host code. Device code can include source code that, after compilation, can be executed in parallel on a device. A device can be a processor optimized for parallel instruction processing, such as, but not limited to, a CUDA-enabled GPU 2390, GPU 2392, or other GPGPUs. Host code is source code that, after compilation, can be executed on a host machine. A host machine is a processor optimized for sequential instruction processing, such as, but not limited to, a CPU 2390.
[0294] CUDA source code 2310 may include any number (including zero) of global functions 2312, any number (including zero) of device functions 2314, any number (including zero) of host functions 2316, and any number (including zero) of host / device functions 2318. Global functions 2312, device functions 2314, host functions 2316, and host / device functions 2318 can be mixed within CUDA source code 2310. Each global function 2312 can be executed on the device and can be called from the host. Therefore, one or more global functions 2312 can act as entry points for the device. Each global function 2312 can be a kernel. In a technique called dynamic parallelism, one or more global functions 2312 can define a kernel that can be executed on the device and called from the device. During execution, the kernel can be executed in parallel N times by N different threads on the device (where N is any positive integer).
[0295] Each device function 2314 can be executed on a device and can only be called from that device. Each host function 2316 can be executed on a host and can only be called from that host. Each host / device function 2316 can define a host version of a function that can be executed on a host and called only from that host, and a device version of a function that can be executed on a device and called only from that device.
[0296] CUDA source code 2310 can also include any number of calls to any number of functions, which can be defined through CUDA runtime API 2302. CUDA runtime API 2302 can include any number of functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. CUDA source code 2310 can also include any number of calls to any number of functions, which can be specified in any number of other CUDA APIs. CUDA APIs can be any API designed for use by CUDA code. CUDA APIs can include CUDA runtime API 2302, CUDA driver APIs, APIs for any number of CUDA libraries, etc., including any APIs described elsewhere in this document. Compared to CUDA runtime API 2302, CUDA driver APIs can be lower-level APIs but can provide finer-grained control over devices. Examples of CUDA libraries include cuBLAS, cuFFT, cURAND, cuDNN, etc.
[0297] CUDA compiler 2350 can compile input CUDA code (e.g., CUDA source code 2310) to generate host executable code 2370(1) and CUDA device executable code 2384. CUDA compiler 2350 can be, but is not limited to, NVCC. Host executable code 2370(1) can be a compiled version of the host code included in the input source code, which can be executed on CPU 2390. CPU 2390 can be any processor optimized for sequential instruction processing.
[0298] CUDA device executable code 2384 may be a compiled version of the device code included in the input source code, which can be executed on a CUDA-enabled GPU 2394. CUDA device executable code 2384 may include binary code. CUDA device executable code 2384 may include IR code (e.g., but not limited to PTX code), which is further compiled at runtime by the device driver into binary code for a specific target device (e.g., a CUDA-enabled GPU 2394). CUDA-enabled GPU 2394 may include any processor optimized for parallel instruction processing and supporting CUDA. CUDA-enabled GPU 2394 may be developed by NVIDIA Corporation, located in Santa Clara, California.
[0299] The CUDA to HIP conversion tool 2320 can be configured to convert CUDA source code 2310 into functionally similar HIP source code 2330. The HIP source code 2330 may include a collection of human-readable code written in the HIP programming language. The HIP code may include human-readable code written in the HIP programming language. The HIP programming language may include extensions to the C++ programming language that include functionally similar versions of the CUDA mechanisms for defining device code and distinguishing it from host code. The HIP programming language may include a subset of the functionality of the CUDA programming language. For example, the HIP programming language may include mechanisms for defining global functions 2312, but such HIP programming languages may lack support for dynamic parallelism, so the global function 2312 defined in the HIP code may only be called from the host.
[0300] HIP source code 2330 may include any number (including zero) of global functions 2312, any number (including zero) of device functions 2314, any number (including zero) of host functions 2316, and any number (including zero) of host / device functions 2318. HIP source code 2330 may also include any number of calls to any number of functions specified in the HIP runtime API 2332. The HIP runtime API 2332 may include functionally similar versions of a subset of functions contained in the CUDA runtime API 2302. HIP source code 2330 may also include any number of calls to any number of functions specified in any number of other HIP APIs. The HIP API may be any API designed for use by HIP code and / or ROCm. The HIP API may include the HIP runtime API 2332, the HIP driver API, APIs for any number of HIP libraries, APIs for any number of ROCm libraries, etc.
[0301] The CUDA to HIP conversion tool 2320 can convert each kernel call in CUDA code from CUDA syntax to HIP syntax, and can convert any number of other CUDA calls in CUDA code into any number of other functionally similar HIP calls. CUDA calls can include calls to functions specified in the CUDA API, and HIP calls can include calls to functions specified in the HIP API. The CUDA to HIP conversion tool 2320 can convert any number of calls to functions specified in the CUDA runtime API 2302 into any number of calls to functions specified in the HIP runtime API 2332.
[0302] The CUDA to HIP conversion tool 2320 may include a tool called hipify-perl, which performs a text-based conversion process. The CUDA to HIP conversion tool 2320 may also include a tool called hipify-clang, which, compared to hipify-perl, performs a more complex and robust conversion process, including parsing the CUDA code using clang (a compiler front-end) and then converting the resulting symbols. Converting CUDA code to HIP code may include modifications beyond those performed by the CUDA to HIP conversion tool 2320 (e.g., manual editing).
[0303] HIP compiler driver 2340 may include a front-end that identifies target device 2346 and then configures a compiler compatible with target device 2346 to compile HIP source code 2330. Target device 2346 may include a processor optimized for parallel instruction processing. HIP compiler driver 2340 may identify target device 2346 in any technically feasible manner.
[0304] If the target device 2346 is CUDA compatible (e.g., a CUDA-enabled GPU 2394), the HIP compiler driver 2340 can generate HIP / NVCC compilation commands 2342. The HIP / NVCC compilation commands 2342 can configure the CUDA compiler 2350 to compile the HIP source code 2330 using a HIP-to-CUDA translation header and the CUDA runtime library. In response to the HIP / NVCC compilation commands 2342, the CUDA compiler 2350 can generate host executable code 2370(1) and CUDA device executable code 2384.
[0305] If the target device 2346 is incompatible with CUDA, the HIP compiler driver 2340 can generate HIP / HCC compilation command 2344. HIP / HCC compilation command 2344 can configure HCC 2360 to compile HIP source code 2330 using the HCC header and HIP / HCC runtime library. In response to HIP / HCC compilation command 2344, HCC 2360 can generate host executable code 2370(2) and HCC device executable code 2382. HCC device executable code 2382 can be a compiled version of the device code included in HIP source code 2330, which can be executed on GPU 2392. GPU 2392 can be any processor optimized for parallel instruction processing, incompatible with CUDA, and compatible with HCC. GPU 2392 can be developed by AMD Inc., located in Santa Clara, California. GPU 2392 can include GPU 2392 that does not support CUDA.
[0306] For illustrative purposes only, Figure 23Three different flows, which can be implemented in at least one embodiment, are described for compiling CUDA source code 2310 for execution on CPU 2390 and various devices. A direct CUDA stream can compile CUDA source code 2310 for execution on CPU 2390 and CUDA-enabled GPU 2394 without converting CUDA source code 2310 to HIP source code 2330. An indirect CUDA stream can convert CUDA source code 2310 to HIP source code 2330 and then compile HIP source code 2330 for execution on CPU 2390 and CUDA-enabled GPU 2394. A CUDA / HCC stream can convert CUDA source code 2310 to HIP source code 2330 and then compile HIP source code 2330 for execution on CPU 2390 and GPU 2392.
[0307] The achievable direct CUDA stream is represented by dashed lines and a series of bubbles labeled A1-A3. As shown in bubble A1, CUDA compiler 2350 can receive CUDA source code 2310 and CUDA compilation command 2348, which configures CUDA compiler 2350 to compile CUDA source code 2310. The CUDA source code 2310 available for direct CUDA stream can be written in a CUDA programming language based on a programming language other than C++ (e.g., C, Fortran, Python, Java, etc.). In response to CUDA compilation command 2348, CUDA compiler 2350 can generate host executable code 2370(1) and CUDA device executable code 2384 (shown in bubble A2). As shown in bubble A3, host executable code 2370(1) and CUDA device executable code 2384 can be executed on CPU 2390 and CUDA-enabled GPU 2394, respectively. CUDA device executable code 2384 may include binary code. CUDA device executable code 2384 may include PTX code and may be further compiled at runtime into binary code for a specific target device.
[0308] The achievable indirect CUDA streams are represented by dashed lines and a series of bubbles labeled B1-B6. As shown in bubble B1, the CUDA-to-HIP conversion tool 2320 can receive CUDA source code 2310. As shown in bubble B2, the CUDA-to-HIP conversion tool 2320 can convert CUDA source code 2310 into HIP source code 2330. As shown in bubble B3, the HIP compiler driver 2340 can receive HIP source code 2330 and determine that the target device 2346 supports CUDA.
[0309] As indicated by bubble B4, the HIP compiler driver 2340 can generate HIP / NVCC compilation command 2342 and transfer the HIP / NVCC compilation command 2342 and HIP source code 2330 to the CUDA compiler 2350. The HIP / NVCC compilation command 2342 can configure the CUDA compiler 2350 to compile the HIP source code 2330 using a HIP-to-CUDA translation header and a CUDA runtime library. The HIP-to-CUDA translation header can convert any number of mechanisms (e.g., functions) specified in any number of HIP APIs into any number of mechanisms specified in any number of CUDA APIs. The CUDA compiler 2350 can combine the HIP-to-CUDA translation header with the CUDA runtime library corresponding to the CUDA runtime API 2302 to generate host executable code 2370(1) and CUDA device executable code 2384. In response to HIP / NVCC compilation command 2342, CUDA compiler 2350 can generate host executable code 2370(1) and CUDA device executable code 2384 (represented by bubble B5). As shown in bubble B6, host executable code 2370(1) and CUDA device executable code 2384 can be executed on CPU 2390 and CUDA-enabled GPU 2394, respectively. CUDA device executable code 2384 may include binary code. CUDA device executable code 2384 may include PTX code and can be further compiled at runtime into binary code for a specific target device.
[0310] The implementable CUDA / HCC streams are represented by solid lines and a series of bubbles labeled C1-C6. As shown in bubble C1, the CUDA-to-HIP conversion tool 2320 can receive CUDA source code 2310. As shown in bubble C2, the CUDA-to-HIP conversion tool 2320 can convert CUDA source code 2310 into HIP source code 2330. As shown in bubble C3, the HIP compiler driver 2340 can receive HIP source code 2330 and can determine that the target device 2346 does not support CUDA.
[0311] HIP compiler driver 2340 can generate HIP / HCC compilation command 2344 and transfer HIP / HCC compilation command 2344 and HIP source code 2330 to HCC 2360 (shown by bubble labeled C4). HIP / HCC compilation command 2344 can configure HCC 2360 to compile HIP source code 2330 using HCC header files and HIP / HCC runtime library. HIP / HCC runtime library can correspond to HIP runtime API 2332. HCC header can include any number and type of HIP and HCC interoperability mechanisms. In response to HIP / HCC compilation command 2344, HCC 2360 can generate host executable code 2370(2) and HCC device executable code 2382 (shown by bubble labeled C5). As shown in the bubble labeled C6, host executable code 2370(2) and HCC device executable code 2382 can be executed on CPU 2390 and GPU 2392, respectively.
[0312] After converting CUDA source code 2310 to HIP source code 2330, executable code for a CUDA-enabled GPU 2394 or GPU 2392 can be generated using HIP compiler driver 2340 without re-executing the CUDA to HIP conversion tool 2320. CUDA to HIP conversion tool 2320 can convert CUDA source code 2310 to HIP source code 2330 and then store the HIP source code 2330 in memory. HIP compiler driver 2340 can then configure HCC 2360 to generate host executable code 2370(2) and HCC device executable code 2382 based on the HIP source code 2330. In at least one embodiment, HIP compiler driver 2340 subsequently configures CUDA compiler 2350 to generate host executable code 2370(1) and CUDA device executable code 2384 based on the stored HIP source code 2330.
[0313] According to at least one embodiment, the example kernel may be derived from... Figure 23 The CUDA to HIP conversion tool 2320 is used for conversion. The CUDA source code 2310 divides the overall problem that a given kernel is designed to solve into relatively coarse subproblems, which can be solved independently using thread blocks. Each thread block contains any number of threads. Each subproblem can be divided into relatively fine pieces, which can be solved collaboratively in parallel by threads within the thread block. Threads within a thread block can cooperate by sharing data via shared memory and synchronizing execution to coordinate memory access.
[0314] CUDA source code 2310 can organize thread blocks associated with a given kernel into a one-dimensional, two-dimensional, or three-dimensional thread block grid. Each thread block contains any number of threads, and the grid contains any number of thread blocks.
[0315] A kernel can be a function defined in device code using the "__global__" declaration specifier. The dimensions of the raster of kernels and their associated streams that execute a given kernel call can be specified using the CUDA kernel startup syntax. The CUDA kernel startup syntax is specified as "KernelName <<<GridSize,BlockSize,SharedMemorySize,Stream> >>
[0316] (KernelArguments);". The execution configuration syntax can include a "<<<...>>>" structure between the kernel name ("KernelName") and the parenthesized list of kernel parameters ("KernelArguments"). The CUDA kernel boot syntax can include the CUDA boot function syntax, instead of the execution configuration syntax.
[0317] "GridSize" can be of type dim3 and specifies the dimensions and size of the grid. The dim3 type can be a CUDA-defined structure containing unsigned integers x, y, and z. If z is not specified, it defaults to 1. If y is not specified, it defaults to 1. The number of thread blocks in the grid can be equal to the product of GridSize.x, GridSize.y, and GridSize.z. "BlockSize" can be of type dim3 and specifies the dimensions and size of each thread block. The number of threads per thread block can be equal to the product of BlockSize.x, BlockSize.y, and BlockSize.z. Each thread executing the kernel can be assigned a unique thread ID, which can be accessed in the kernel via built-in variables such as "threadIdx".
[0318] Regarding the CUDA kernel startup syntax, "SharedMemorySize" is an optional argument that, in addition to statically allocated memory, can specify the number of bytes in shared memory dynamically allocated per thread block for a given kernel call. The default value for SharedMemorySize is zero. Regarding the CUDA kernel startup syntax, "Stream" is also an optional argument that specifies the associated stream; the default value is zero to specify the default stream. A stream can be a sequence of commands executed sequentially (which may be emitted by different host threads). Different streams can execute commands out of order or concurrently.
[0319] CUDA source code 2310 may include the kernel definition and main function of the example kernel "MatAdd". The main function may be host code that executes on the host machine and includes a kernel call that causes the MatAdd kernel to execute on the device. The MatAdd kernel can add two NxN matrices A and B, where N is a positive integer, and store the result in matrix C. The main function can define the threadsPerBlock variable as 16x16 and the numBlocks variable as N / 16 x N / 16. Then, the main function can specify the kernel call "MatAdd<<<numBlocks,threadsPerBlock> >>(A,B,C);”. According to the CUDA kernel startup syntax, the kernel MatAdd can be executed using a thread block grid of dimensions N / 16 x N / 16, where each thread block is 16x16. Each thread block can contain 256 threads, and a grid with enough thread blocks can be created so that there is one thread per matrix element, and each thread in such a grid can execute the kernel MatAdd to perform a pairwise addition.
[0320] When converting CUDA source code 2310 to HIP source code 2330, the CUDA-to-HIP conversion tool 2320 can convert each kernel call in the CUDA source code 2310 from CUDA kernel startup syntax to HIP kernel startup syntax, and can convert any number of other CUDA calls in the source code 2310 into any number of other functionally similar HIP calls. The HIP kernel startup syntax can be specified as "hipLaunchKernelGGL(KernelName,GridSize,BlockSize,SharedMemorySize,Stream,KernelArguments);". Each of KernelName, GridSize, BlockSize, ShareMemorySize, Stream, and KernelArguments has the same meaning in the HIP kernel startup syntax as it does in the CUDA kernel startup syntax (described earlier in this document). The arguments SharedMemorySize and Stream can be required in the HIP kernel startup syntax, but optional in the CUDA kernel startup syntax.
[0321] A portion of the HIP source code 2330 can be identical to a portion of the CUDA source code 2310 shown, except for the kernel call that causes the kernel MatAdd to execute on the device. The kernel MatAdd can be defined in the HIP source code 2330 using the same "__global__" declaration specifier as used in the CUDA source code 2310. The kernel call in the HIP source code 2330 can be "hipLaunchKernelGGL(MatAdd,numBlocks,threadsPerBlock,0,0,A,B,C);", while the corresponding kernel call in the CUDA source code 2310 is "MatAdd<<<numBlocks,threadsPerBlock> >>(A,B,C);
[0322] Other implementations are conceivable, and these can be implemented similarly to the CUDA and HIP implementations described above, such as oneAPI, OpenCL, and other programming platforms. Code can be converted in any direction. For example, CUDA can be converted to HIP, and CUDA can be converted to OpenCL. SnuCL-Tr and CUCL can be used to convert OpenCL to CUDA or CUDA to OpenCL, respectively. Compiled code or intermediate representations (such as CUDA PTX code) can also be converted to run on other processor platforms (such as AMD or Intel). For example, conversion tools (such as ZLUDA) can be used to convert PTX code to run on Intel or AMD processors.
[0323] The techniques described herein can utilize the oneAPI programming model. The oneAPI programming model can refer to a programming model used to interact with various computing accelerator architectures. OneAPI can refer to an application programming interface (API) designed to interact with various computing accelerator architectures. The oneAPI programming model can use the DPC++ programming language. The DPC++ programming language can refer to a high-level language used for data-parallel programming productivity. The DPC++ programming language can be at least partially based on the C and / or C++ programming languages. The oneAPI programming model can be, but is not limited to, a programming model developed by Intel Corporation of Santa Clara, California.
[0324] OneAPI and / or the oneAPI programming model can be used to interact with a variety of accelerators, GPUs, processors, and / or their variant architectures. OneAPI may include a set of libraries that implement various functions. OneAPI may include at least the oneAPIDPC++ library, the oneAPI math kernel library, the oneAPI data analysis library, the oneAPI deep neural network library, the oneAPI collection communication library, the oneAPI thread building block library, the oneAPI video processing library, and / or their variants.
[0325] The oneAPIDPC++ library (also known as oneDPL) can be a library that implements algorithms and functions to accelerate DPC++ kernel programming. oneDPL can implement one or more Standard Template Library (STL) functions. OneDPL can implement one or more parallel STL functions. OneDPL can provide a set of library classes and functions, such as, but not limited to, parallel algorithms, iterators, function object classes, range-based APIs, and / or variations thereof. OneDPL can implement one or more classes and / or functions from the C++ Standard Library. OneDPL can implement one or more random number generator functions.
[0326] The oneAPI math kernel library (also known as oneMKL) can be a library that implements various optimized and parallelized routines for various mathematical functions and / or operations. OneMKL can implement one or more Basic Linear Algebra Subroutines (BLAS) and / or Linear Algebra Packages (LAPACK) dense linear algebra routines. OneMKL can implement one or more sparse BLAS linear algebra routines. OneMKL can implement one or more random number generators (RNGs). OneMKL can implement one or more vector math (VM) routines for performing mathematical operations on vectors. OneMKL can implement one or more Fast Fourier Transform (FFT) functions.
[0327] The OneAPI data analysis library (also known as oneDAL) can include libraries for implementing various data analysis applications and distributed computing. OneDAL can implement various algorithms for data analysis preprocessing, transformation, analysis, modeling, validation, and decision-making, including batch, online, and distributed computing processing modes. OneDAL can implement various C++ and / or Java APIs as well as various connectors for connecting to one or more data sources. OneDAL can implement the DPC++ API extension to the traditional C++ interface and support GPU usage for various algorithms.
[0328] The OneAPI Deep Neural Network Library (also known as oneDNN) can include libraries that implement various deep learning functions. OneDNN can implement various neural networks, machine learning and deep learning functions, algorithms and / or variations thereof.
[0329] The OneAPI collection communication library (also known as oneCCL) can include libraries that implement a variety of applications for deep learning and machine learning workloads. OneCCL can be built on lower-level communication middleware, such as, but not limited to, message passing interfaces (MPI) and libfabrics. OneCCL can support a set of deep learning-specific optimizations, such as, but not limited to, priority ordering, persistence operations, out-of-order execution, and / or variations thereof. OneCCL can implement a variety of CPU and GPU functionalities.
[0330] The OneAPI Thread Building Blocks library (also known as oneTBB) can include libraries that implement various parallel processes for a wide range of applications. OneTBB can be used for task-based shared parallel programming on a host machine. OneTBB can implement general-purpose parallel algorithms. OneTBB can implement concurrent containers. OneTBB can implement scalable memory allocators. OneTBB can implement work-stealing task schedulers. OneTBB can implement low-level synchronization primitives. OneTBB can be compiler-independent and can be used on a variety of processors, such as, but not limited to, GPUs, PPUs, CPUs, and / or variants thereof.
[0331] The OneAPI video processing library (also known as oneVPL) can include libraries for accelerating video processing in one or more applications. OneVPL can implement a variety of video decoding, encoding, and processing functions. OneVPL can implement various functions for media pipelines on CPUs, GPUs, and other accelerators. OneVPL can implement device discovery and selection in media-centric and video analytics workloads. OneVPL can implement API primitives for zero-copy buffer sharing.
[0332] The oneAPI programming model can use the DPC++ programming language. The DPC++ programming language can include a version of a programming language that is functionally similar to the CUDA mechanism to define device code and distinguish it from host code. The DPC++ programming language can include a subset of the functionality of the CUDA programming language. One or more CUDA programming model operations can be performed using the oneAPI programming model utilizing the DPC++ programming language.
[0333] Any application programming interface (API) described herein can be compiled by a compiler, interpreter, or other software tool into one or more instructions, operations, or other signals. Compilation may include generating one or more machine-executable instructions, operations, or other signals from source code. An API compiled into one or more instructions, operations, or other signals, when executed, may enable one or more processors (e.g., but not limited to...) Figure 8-20BThe processor described herein or any other logic circuit further described herein performs one or more computational operations.
[0334] In at least one embodiment, the conversion tools described elsewhere herein (e.g., but not limited to) may include one or more circuitry for converting CUDA code used to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel into HIP, oneAPI, OpenCL, or any other language for performing any of the operations described above or elsewhere herein. One or more circuitry may be software-configurable to convert CUDA code used to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel into HIP, oneAPI, OpenCL, or any other language for performing any of the operations described above or elsewhere herein.
[0335] Autonomous vehicles
[0336] Figure 24 An example of an autonomous vehicle 2400 according to at least one embodiment is shown. The autonomous vehicle 2400 (alternately referred to herein as "vehicle 2400") can be a passenger vehicle, such as, but not limited to, a sedan, truck, bus, and / or other type of vehicle capable of accommodating one or more passengers. In at least one embodiment, vehicle 2400 can be a semi-trailer tractor for transporting goods. Vehicle 2400 can be an aircraft, robotic vehicle, or other type of vehicle.
[0337] Autonomous vehicles can be described according to their automation levels, which are defined by the National Highway Traffic Safety Administration (NHTSA) (a division of the U.S. Department of Transportation) and the Society of Automotive Engineers (SAE) in their "Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles" (e.g., standard number J3016-201806, published June 15, 2018; standard number J3016-201609, published September 30, 2016; and previous and future versions of this standard). In at least one embodiment, vehicle 2400 can achieve functionality at one or more levels of autonomy, from Level 1 to Level 5. For example, in at least one embodiment, depending on the embodiment, vehicle 2400 can achieve conditional automation (Level 3), high automation (Level 4), and / or full automation (Level 5).
[0338] Vehicle 2400 may include components such as, but not limited to, chassis, body, wheels (e.g., 2, 4, 6, 8, 18, etc.), tires, axles, and other vehicle parts. Vehicle 2400 may include a propulsion system 2450, such as, but not limited to, an internal combustion engine, a hybrid power plant, an all-electric motor, and / or other propulsion system types. Propulsion system 2450 may be connected to the drivetrain of vehicle 2400 (which may include a transmission) to propel vehicle 2400. Propulsion system 2450 may be controlled based on signals received from throttle / accelerator 2452.
[0339] A steering system 2454 (which may include a steering wheel) is used to maneuver the vehicle 2400 (e.g., to travel along a desired path or route) while the propulsion system 2450 is in operation (e.g., when the vehicle 2400 is in motion). The steering system 2454 may receive signals from the steering actuator 2456. The steering wheel is optional for fully automated (level 5) functionality. A brake sensor system 2446 may be used to operate the vehicle brakes in response to signals received from the brake actuator 2448 and / or brake sensors.
[0340] Controller 2436 may include one or more system-on-a-chip (“SoC”) and / or graphics processing units (“GPUs”) that can provide signals (e.g., signals representing commands) to one or more components and / or systems of vehicle 2400. For example, controller 2436 may send signals to operate vehicle brakes via brake actuator 2448, steering system 2454 via steering actuator 2456, and propulsion system 2450 via throttle / accelerator 2452. Controller 2436 may include one or more onboard (e.g., integrated) computing devices for processing sensor signals and outputting operational commands (e.g., signals representing commands) to enable autonomous driving and / or assist a human driver in driving vehicle 2400. Controller 2436 may include a first controller for autonomous driving functions, a second controller for functional safety functions, a third controller for artificial intelligence functions (e.g., computer vision), a fourth controller for infotainment functions, a fifth controller for redundancy in emergency situations, and / or other controllers. A single controller may handle two or more of the above functions, two or more controllers may handle a single function, and / or any combination thereof.
[0341] The controller 2436 may provide signals for controlling one or more components and / or systems of the vehicle 2400 in response to sensor data (e.g., sensor inputs) received from one or more sensors. Sensor data can be received from, for example, a Global Navigation Satellite System (“GNSS”) sensor 2458 (e.g., a Global Positioning System sensor), a RADAR (radar) sensor 2460, an ultrasonic sensor 2462, a LIDAR (light radar) sensor 2464, an Inertial Measurement Unit (“IMU”) sensor 2466 (e.g., an accelerometer, a gyroscope, one or more magnetic compasses, a magnetometer, etc.), a microphone 2496, a stereo camera 2468, a wide-angle camera 2470 (e.g., a fisheye camera), an infrared camera 2472, a surround-view camera 2474 (e.g., a 360-degree camera), a long-range camera 2498, a medium-range camera 2476, a speed sensor 2444 (e.g., for measuring the speed of vehicle 2400), a vibration sensor 2442, a steering sensor 2440, a braking sensor (e.g., as part of a braking sensor system 2446), and / or other types of sensors.
[0342] One or more controllers 2436 may receive input (e.g., represented by input data) from the instrument panel 2432 of the vehicle 2400 and provide output (e.g., represented by output data, display data, etc.) via a human-machine interface (“HMI”) display 2434, an audible annunciator, a speaker, and / or via other components of the vehicle 2400. Output may include, but is not limited to, vehicle speed, rate, time, map data (e.g., a high-resolution map (not shown), location data (e.g., the location of the vehicle 2400, such as, but not limited to, its location on a map), direction, the location of other vehicles (e.g., occupying a grid), information about objects, and the status of objects perceived by the controllers 2436. For example, the HMI display 2434 may display information about the presence of one or more objects (e.g., street signs, warning signs, traffic light changes, etc.) and / or information about driving actions that the vehicle has performed, is performing, or will perform (e.g., changing lanes now, exiting from exit 34B in two miles, etc.).
[0343] Figure 24 Each component, feature, and system of vehicle 2400 can be connected via bus 2402. Bus 2402 may include a CAN data interface (also referred to herein as the "CAN bus"). CAN can be a network within vehicle 2400 used to assist in the control of various features and functions of vehicle 2400, such as, but not limited to, the activation of brakes, acceleration, braking, steering, windshield wipers, etc. Bus 2402 can be configured to have dozens or even hundreds of nodes, each with its own unique identifier (e.g., a CAN ID). Bus 2402 can be read to locate steering wheel angle, ground speed, engine revolutions per minute ("RPM"), button positions, and / or other vehicle status indicators. Bus 2402 can be an ASIL B compliant CAN bus.
[0344] In addition to CAN, or as an alternative to CAN, FlexRay and / or Ethernet protocols may be used. Bus 2402 may consist of any number of buses, including zero or more CAN buses, zero or more FlexRay buses, zero or more Ethernet buses, and / or zero or more other types of buses using different protocols. Two or more buses may be used to perform different functions and / or for redundancy. For example, a first bus may be used for collision avoidance functions, and a second bus may be used for actuation control. Each bus in bus 2402 may communicate with any component of vehicle 2400, and two or more buses of bus 2402 may communicate with corresponding components. Any number of System-on-Chip (“SoC”) 2404 (e.g., but not limited to SoC 2404(A) and SoC 2404(B)), each controller 2436, and / or each computer within the vehicle may access the same input data (e.g., input from sensors of vehicle 2400) and may be connected to a common bus, such as a CAN bus.
[0345] According to at least one embodiment, for Figure 24 The autonomous vehicle 2400 can place any number of cameras at any chosen camera locations and within any field of view. The cameras and corresponding fields of view are an exemplary embodiment and are not intended to be limiting. For example, additional and / or alternative cameras may be included, and / or the cameras may be located at different locations within the vehicle 2400.
[0346] The camera type may include a digital camera suitable for components and / or systems in vehicle 2400. The camera may operate under Automotive Safety Integrity Level (“ASIL”) B and / or other ASILs. According to embodiments, the camera type may support any image capture rate, such as, but not limited to, 60 frames per second (fps), 1220 fps, 240 fps, etc. The camera may be able to use a rolling shutter, a global shutter, other types of shutters, or combinations thereof. In at least one embodiment, the color filter array may include a red-to-clear-to-clear (“RCCC”) color filter array, a red-to-clear-to-blue (“RCCB”) color filter array, a red-blue-green (“RBGC”) color filter array, a Foveon X3 color filter array, a Bayer sensor (“RGGB”) color filter array, a monochrome sensor color filter array, and / or other types of color filter arrays. Transparent pixel cameras (e.g., but not limited to cameras with RCCC, RCCB, and / or RBGC color filter arrays) may be used to enhance sensitivity.
[0347] One or more cameras may be used to perform advanced driver assistance system (“ADAS”) functions (e.g., as part of a redundancy or fail-safe design). For example, a multi-function monochrome camera may be installed to provide functions including lane departure warning, traffic sign assist, and intelligent headlight control. One or more cameras (e.g., all cameras) may simultaneously record and provide image data (e.g., video).
[0348] One or more cameras can be mounted in mounting assemblies, such as, but not limited to, custom-designed (3D-printed) assemblies, to eliminate stray light and reflections inside the vehicle 2400 (e.g., dashboard reflections in the windshield mirror) that could interfere with the camera's image data capture capabilities. Regarding rearview mirror mounting assemblies, these assemblies can be custom-3D printed to match the shape of the camera mounting plate to the rearview mirror. The camera can be integrated into the rearview mirror. For side-view cameras, the cameras can also be integrated into the four pillars in each corner of the cab.
[0349] The field of view includes a portion of the environment in front of the vehicle 2400 that can be viewed by cameras (e.g., a front-facing camera) to help identify the path and obstacles ahead, and, with the assistance of one or more controllers 2436 and / or control SoCs, to provide information crucial for generating an occupancy grid and / or determining a preferred vehicle path. The front-facing camera can be used to perform many ADAS functions similar to LiDAR, including emergency braking, pedestrian detection, and collision avoidance. The front-facing camera can also be used in ADAS functions and systems, including Lane Departure Warning (“LDW”), Automatic Cruise Control (“ACC”), and / or other functions such as, but not limited to, traffic sign recognition.
[0350] Various front-mounted cameras can be used, including, for example, monocular camera platforms including CMOS (“complementary metal-oxide-semiconductor”) color imagers. Wide-angle camera 2470 can be used to perceive objects entering the field of view from the periphery (e.g., pedestrians, crossing vehicles, or bicycles). Any number (including zero) of wide-angle cameras 2470 can be mounted on vehicle 2400. Any number of remote cameras 2498 (e.g., long-view stereo camera pairs) can be used for depth-based object detection, particularly for objects on which neural networks have not yet been trained. Remote cameras 2498 can also be used for object detection and classification, as well as basic object tracking.
[0351] It may also include any number of front-mounted stereo cameras 2468. One or more stereo cameras 2468 may include an integrated control unit that includes a scalable processing unit that can provide programmable logic (“FPGA”) and a multi-core microprocessor with an integrated controller area network (“CAN”) or Ethernet interface on a single chip. Such a unit can be used to generate a 3D map of the environment of the vehicle 2400, including distance estimates for all points in the image. One or more stereo cameras 2468 may include a compact stereo vision sensor that may include two camera lenses (one on each side) and an image processing chip that measures the distance from the vehicle 2400 to a target object and uses the generated information (e.g., metadata) to activate autonomous emergency braking and lane departure warning functions. Other types of stereo cameras 2468 may be used in addition to, or as an alternative to, the stereo cameras described herein.
[0352] Cameras (e.g., side-view cameras) that include portions of the environment along the sides of vehicle 2400 can be used in the surround view to provide information for creating and updating occupancy grids and generating side collision warnings. For example, surround view cameras 2474 (e.g., four surround view cameras) can be mounted on vehicle 2400. Surround view cameras 2474 can include any number and combination of wide-angle cameras, fisheye cameras, 360-degree cameras, and / or similar cameras. For example, four fisheye cameras can be mounted at the front, rear, and sides of vehicle 2400, respectively. Vehicle 2400 can use three surround view cameras 2474 (e.g., left, right, and rear) and can utilize one or more other cameras (e.g., front-facing cameras) as a fourth surround view camera.
[0353] A field of view including a portion of the environment behind the vehicle 2400 (e.g., a rear-view camera) can be used for parking assistance, surround view, rear collision warning, and creating and updating an occupancy grid. A wide variety of cameras can be used, including but not limited to cameras that are also suitable as front-facing cameras (e.g., long-range camera 2498 and / or mid-range camera 2476, stereo camera 2468, infrared camera 2472, etc.), as described herein.
[0354] Vehicle 2400 may include any number of SoC 2404 or other processors described elsewhere herein, such as, but not limited to, those described elsewhere. Figure 8-20BThe processors and / or components shown and described. Each SoC 2404 may include a central processing unit (“CPU”) 2406, a graphics processing unit (“GPU”) 2408, a processor 2410, a cache 2412, an accelerator 2414, a data storage 2416, and / or other components and features not shown. SoC 2404 can be used to control vehicle 2400 in various platforms and systems. For example, SoC 2404 may be combined with a high-definition (“HD”) map 2422 in a system (e.g., the system of vehicle 2400), which may obtain map refreshes and / or updates from one or more servers (not shown) via network interface 2424. SoC 2404 may include logic 2415, which may include any combination of software logic, hardware logic, and / or firmware logic for providing the functions or operations described herein, wherein the logic may be collectively or separately embodied as circuitry forming part of a larger system, such as an integrated circuit (IC), a system-on-a-chip (SoC), or one or more processors (e.g., CPU, GPU).
[0355] The CPU 2406 may include a CPU cluster or CPU complex (referred to herein as “CCPLEX”). The CPU 2406 may include multiple cores and / or a secondary (“L2”) cache. For example, the CPU 2406 may include eight cores in a consistent multiprocessor configuration. The CPU 2406 may include four dual-core clusters, each with a dedicated L2 cache (e.g., 2 megabytes (MB) of L2 cache). The CPU 2406 (e.g., CCPLEX) may be configured to support simultaneous cluster operation, allowing any combination of CPU 2406 clusters to be active at any given time.
[0356] One or more of the CPU 2406 can implement power management capabilities, including one or more of the following features: individual hardware blocks can automatically clock-gated when idle to save dynamic power; clock gating can be applied to cores when each core is not actively executing instructions due to executing Wait for Interrupt (“WFI”) / Wait for Event (“WFE”) instructions; each core can independently perform power gating; each core cluster can independently clock-gated when all cores can perform clock or power gating; and / or each core cluster can independently power-gated when all cores can perform power gating. The CPU 2406 can also implement enhanced algorithms for managing power states, where allowed power states and expected wake-up times can be specified, and the hardware / microcode determines which optimal power state the core, cluster, and CCPLEX enter. Processing cores can enter a simplified power state sequence via software support, offloading work to the microcode.
[0357] The GPU 2408 may include an integrated GPU (referred to herein as an "iGPU"). The GPU 2408 may be programmable and capable of efficiently handling parallel workloads. The GPU 2408 may use an enhanced tensor instruction set. The GPU 2408 may include one or more streaming microprocessors, each of which may include a Level 1 ("L1") cache (e.g., an L1 cache with a storage capacity of at least 96KB), and two or more streaming microprocessors may share an L2 cache (e.g., an L2 cache with a storage capacity of 512KB). The GPU 2408 may include at least eight streaming microprocessors. The GPU 2408 may use a computation application programming interface (API). The GPU 2408 may use one or more parallel computing platforms and / or programming models (e.g., NVIDIA's CUDA model). A streaming microprocessor may be referred to as a streaming multiprocessor (“SM”), a streaming processor (“SP”), a streaming processing unit (“SPU”), a compute unit (“CU”), an execution unit (“EU”), and / or a slice, wherein, in this context, a slice may refer to a portion of the processing resources within a processing unit (e.g., 16 cores, a ray tracing unit, a thread bootstrap, or a scheduler).
[0358] One or more of the GPU 2408 can be power-optimized for optimal performance in automotive and embedded use cases. For example, the GPU 2408 can be fabricated on FinFET (“FinFET”) circuitry. Each streaming microprocessor can include multiple mixed-precision processing cores partitioned into blocks. For example, 64 PF32 cores and 32 FP64 cores can be partitioned into four processing blocks. Each processing block can be allocated 16 FP32 cores, 8 FP64 cores, 16 INT32 cores, two mixed-precision NVIDIA Tensor cores for deep learning matrix operations, a level-zero (“L0”) instruction cache, a scheduler (e.g., a warp scheduler) or sequencer, dispatch units, and / or a 64KB register file. Streaming microprocessors can include independent parallel integer and floating-point data paths for efficiently performing workloads involving mixed computation and addressing computation. Streaming microprocessors can include independent thread scheduling capabilities to enable finer-grained synchronization and cooperation between parallel threads. Streaming microprocessors may include a combination of L1 data cache and shared memory units to improve performance while simplifying programming.
[0359] One or more GPUs 2408 may include high-bandwidth memory (“HBM”) and / or a 16GB HBM2 memory subsystem, used in some examples to provide a peak memory bandwidth of approximately 900GB / s. In addition to HBM memory, or as an alternative to HBM memory, synchronous graphics random access memory (“SGRAM”) may be used, such as, but not limited to, graphics double data rate type 5 synchronous random access memory (“GDDR5”).
[0360] The GPU 2408 may include unified memory technology. An address translation service (“ATS”) supports page tables that allow the GPU 2408 to directly access the CPU 2406. When a memory miss occurs in one or more GPU Memory Management Units (“MMUs”) of the GPU 2408, an address translation request can be sent to the CPU 2406. In response, one or more CPUs of the CPU 2406 can look up the virtual-to-physical mapping of the address in their page tables and transfer the translation back to the GPU 2408. Unified memory technology can provide a single, unified virtual address space for the memory of both the CPU 2406 and the GPU 2408, thereby simplifying GPU 2408 programming and application porting to the GPU 2408.
[0361] The GPU 2408 may include any number of access counters that can track the frequency of the GPU 2408's access to the memory of other processors. These access counters help ensure that memory pages are moved to the physical memory of the processor that accesses those pages most frequently, thereby improving the efficiency of shared memory ranges between processors.
[0362] One or more SoCs 2404 may include any number of caches 2412, including the caches described herein. For example, cache 2412 may include a Level 3 (“L3”) cache available for use by both CPU 2406 and GPU 2408 (e.g., connected to both CPU 2406 and GPU 2408). Cache 2412 may include a write-back cache that can track the state of rows, for example, but not limited to, by using a cache coherence protocol (e.g., MEI, MESI, MSI, etc.). According to embodiments, the L3 cache may include 4 MB or more of memory, although a smaller cache size may be used.
[0363] One or more of the SoC 2404 may include one or more accelerators 2414 (e.g., hardware accelerators, software accelerators, or a combination thereof). The SoC 2404 may include a hardware acceleration cluster that may include optimized hardware accelerators and / or large on-chip memory. Large on-chip memory (e.g., 4MB SRAM) enables the hardware acceleration cluster to accelerate neural networks and other computations. The hardware acceleration cluster can be used to supplement the GPU 2408 and offload some of the tasks from the GPU 2408 (e.g., freeing up more cycles of the GPU 2408 to perform other tasks). Accelerators 2414 can be used for targeted workloads (e.g., perceptrons, convolutional neural networks (“CNN”), recurrent neural networks (“RNN”), etc.) that are sufficiently stable to be easily accelerated. CNNs may include region-based or regional convolutional neural networks (“RCNN”) and fast RCNNs (e.g., for object detection) or other types of CNNs.
[0364] Accelerator 2414 (e.g., a hardware acceleration cluster) may include one or more deep learning accelerators (“DLAs”). A DLA may include one or more tensor processing units (“TPUs”) that can be configured to provide trillions of operations per second for deep learning applications and inference, such as those described in this paper (e.g., Figure 18 The TPU described in the text is a TPU that can be configured and optimized for performing image processing functions, such as CNNs, RCNNs, etc. The DLA can be further optimized for a specific set of neural network types, floating-point operations, and inference. The DLA is designed to provide higher performance per millimeter than typical general-purpose GPUs and often significantly outperforms CPUs. The TPU can perform several functions, including single-instance...
Claims
1. A processor, comprising: One or more circuits, said one or more circuits for executing an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel.
2. The processor of claim 1, wherein the one or more circuits are configured to execute the API to generate the one or more addresses of the one or more instructions using the starting address of the memory block.
3. The processor of claim 1, wherein the one or more circuits are further configured to execute the API to launch one or more threads including the one or more instructions to access the one or more addresses.
4. The processor of claim 1, wherein the one or more instructions are included in one or more parallel threads in one or more blocks of a thread grid.
5. The processor of claim 1, wherein the one or more circuits are configured to execute the API in response to a request to initiate one or more threads from the central processing unit.
6. The processor of claim 1, wherein the software kernel is declared using a reference type parameter.
7. The processor of claim 1, wherein the one or more circuits are configured to execute the API on the graphics processing unit.
8. A method comprising: Execute the application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel.
9. The method of claim 8, wherein the one or more addresses are generated using a pointer to the beginning address of a memory block.
10. The method of claim 8, further comprising: Execute the API to start one or more threads, including the one or more instructions, to access the one or more addresses.
11. The method of claim 8, wherein the one or more instructions are included in one or more parallel threads in one or more blocks of a thread grid.
12. The method of claim 8, wherein the API is executed in response to a request to start one or more threads from the central processing unit.
13. The method of claim 8, wherein one or more addresses are storage locations on global memory on a graphics processing unit.
14. The method of claim 8, wherein the API is executed on a graphics processing unit.
15. A system comprising: One or more processors, said one or more processors being used to execute an application programming interface (API) to generate one or more addresses of one or more instructions corresponding to the same software kernel.
16. The system of claim 15, wherein the one or more addresses are generated using pointers to the beginning address of a memory block.
17. The system of claim 15, wherein the one or more processors are further configured to execute the API to launch one or more threads including the one or more instructions to access the one or more addresses in global memory on the graphics processing unit.
18. The system of claim 15, wherein the API is executed in response to a request to initiate one or more threads from the central processing unit.
19. The system of claim 15, wherein the one or more instructions reside in one or more parallel threads within the same thread block.
20. The system of claim 15, wherein the one or more processors are configured to execute the API to generate the one or more addresses using one or more of the following: the starting address of a memory block, the data type of data stored in the memory block, the stride dimension, the grid dimension, or the block dimension.