Processing to accelerate distributed matrix multiplication operations

By using a DPU to accelerate matrix multiplication operations, the performance bottleneck that CPUs/GPUs face when performing matrix multiplication in existing technologies is addressed, achieving more efficient computation-communication overlap and improved performance.

CN122240059A · Pending · Publication Date: 2026-06-19 · MELLANOX TECHNOLOGIES LTD (IL)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: MELLANOX TECHNOLOGIES LTD(IL)
Filing Date: 2025-12-10
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, when performing matrix multiplication operations, the central processing unit (CPU) or graphics processing unit (GPU) must spend main processing cycles on additional operations (such as accumulation and reduction) besides the computationally intensive operations. This reduces overall performance and makes it difficult to perform computationally intensive and network-intensive operations efficiently at the same time.

Method used

A Data Processing Unit (DPU) is used to accelerate General Matrix Multiplication (GEMM) operations. Data transfer, computation, and accumulation steps are executed concurrently in a pipeline, and the accumulation operation is offloaded to the DPU. Because the accumulation operation is asynchronous, asynchronous prefetching can overlap with local GEMM computation.

Benefits of technology

It improves the overall performance of matrix multiplication operations, reduces synchronization overhead, frees up CPU/GPU cores for other calculations, and achieves more efficient computation-communication overlap.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

This disclosure relates to accelerating the processing of distributed matrix multiplication operations. The method proposed herein can efficiently perform operations such as General Matrix Multiplication (GEMM). The data required for such operations can be prefetched as needed, for example, immediately before a specific computation is performed on a pair of data blocks. Prefetching for a subsequent computation can be performed during the current computation. The results of the current computation can then be accumulated while a subsequent computation is performed; for example, the accumulation operation can be offloaded to a data processing unit, thus freeing up one or more processing cores for other computations. Such operations can be performed in parallel until each block of the resulting matrix has been processed. In at least one embodiment, a single computational task can also be partitioned among multiple worker processes (e.g., processing units).

Description

Technical Field

[0001] This disclosure relates to distributed processing of data and, in at least one embodiment, to using data prefetching and offloading to one or more acceleration and/or processing units (e.g., data processing units (DPUs)) to accelerate generalized matrix multiplication (GEMM) type operations.

Background Technology

[0002] Various computational operations (such as those related to scientific simulations or machine learning) require mathematical operations, including matrix multiplication or similar calculations. Most existing methods utilize computational cores on central processing units (CPUs) or graphics processing units (GPUs) to perform not only computationally intensive operations (e.g., submatrix multiplication) but also less computationally demanding additional operations (e.g., accumulation and reduction operations). This approach requires dedicating a number of main processing cycles of these processing units to the additional operations, rather than using them for computationally intensive operations, which can negatively impact overall performance. Furthermore, performing both types of operations simultaneously in a time-efficient manner is very challenging, because network-intensive accumulation and/or reduction operations often fail to make any progress while a given processing unit is performing computationally intensive submatrix multiplication.

Description of the Drawings

[0003] Various embodiments of this disclosure will now be described with reference to the accompanying drawings, in which:

[0004] Figure 1 shows an example network architecture that can be used to perform computational operations on behalf of a user, according to at least one embodiment;

[0005] Figure 2 shows a method for distributing matrix multiplication operations across a set of worker nodes, according to at least one embodiment;

[0006] Figure 3A shows a matrix multiplication pipeline including concurrently executed operations, according to at least one embodiment;

[0007] Figure 3B shows an example process flow for matrix multiplication with DPU offloading, according to at least one embodiment;

[0008] Figure 4 shows an example process for performing matrix multiplication operations, according to at least one embodiment;

[0009] Figure 5 illustrates an example data center system, according to at least one embodiment;

[0010] Figure 6 is a block diagram illustrating a computer system, according to at least one embodiment;

[0011] Figure 7 is a block diagram illustrating a computer system, according to at least one embodiment;

[0012] Figure 8 illustrates a computer system, according to at least one embodiment;

[0013] Figure 9 illustrates a computer system, according to at least one embodiment;

[0014] Figure 10 illustrates an exemplary integrated circuit and an associated graphics processing unit, according to at least one embodiment;

[0015] Figures 11A and 11B illustrate an exemplary integrated circuit and an associated graphics processing unit, according to at least one embodiment;

[0016] Figure 12 illustrates a computer system, according to at least one embodiment;

[0017] Figure 13A illustrates a parallel processor, according to at least one embodiment;

[0018] Figure 13B illustrates a partitioning unit, according to at least one embodiment;

[0019] Figure 14 illustrates at least a portion of a graphics processing unit, according to one or more embodiments.

Detailed Description

[0020] Various embodiments are described in the following description. Specific configurations and details are set forth for illustrative purposes in order to provide a thorough understanding of the embodiments. However, it will also be apparent to those skilled in the art that the embodiments can be practiced without these specific details. Furthermore, well-known features may be omitted or simplified so as not to obscure the described embodiments.

[0021] The systems and methods described herein can be used by, without limitation, non-autonomous vehicles or machines, semi-autonomous or autonomous vehicles or machines (e.g., in one or more advanced driver assistance systems (ADAS), one or more in-vehicle infotainment systems, one or more emergency vehicle detection systems), manned and unmanned robots or robotic platforms, warehouse vehicles, off-road vehicles, vehicles coupled to one or more trailers, aircraft, boats, shuttles, emergency response vehicles, motorcycles, electric or motorized bicycles, construction vehicles, trains, underwater vehicles, remotely operated vehicles such as drones, and/or other types of vehicles. Furthermore, the systems and methods described herein can be used for a wide range of purposes, including but not limited to machine control, machine motion, machine driving, synthetic data generation, generative artificial intelligence (AI), model training or updating, perception, augmented reality, virtual reality, mixed reality, robotics, security and surveillance, simulation and digital twins, autonomous or semi-autonomous machine applications, deep learning, environmental simulation, data center processing, conversational AI, light transport simulation (e.g., ray tracing, path tracing, etc.), collaborative content creation for 3D assets, cloud computing, and/or any other suitable applications.

[0022] The disclosed embodiments can be included in a variety of different systems, such as automotive systems (e.g., in-vehicle infotainment systems for autonomous or semi-autonomous machines, perception systems for autonomous or semi-autonomous machines), systems implemented using robots, aviation systems, medical systems, marine systems, smart area monitoring systems, systems for performing deep learning operations, systems for performing simulation operations, systems for performing digital twin operations, systems implemented using edge devices, systems containing one or more virtual machines (VMs), systems for performing synthetic data generation operations, systems implemented at least partially in a data center, systems for performing conversational AI operations, systems implementing one or more language models (e.g., large language models (LLMs)), systems for performing generative AI operations (e.g., using one or more language models), systems for performing light transport simulations, systems for performing collaborative content creation for 3D assets, systems implemented at least partially using cloud computing resources, and/or other types of systems.

[0023] Methods according to various exemplary embodiments can utilize one or more data processing units (DPUs) or other such distributed processing units to provide acceleration for operations of the Generalized Matrix Multiplication (GEMM) type. The accelerated processing is implemented in part through a pipeline in which (prefetched) data transfer, computation, and accumulation steps can be performed concurrently and in parallel using different processing units. Partly due to the asynchronous nature of the accumulation operation, the accumulation operation for an update can be offloaded to the target DPU. In at least one embodiment, the matrix is partitioned into blocks such that a GEMM request will involve a set of operations to be performed at the block granularity. The processing pipeline can perform asynchronous prefetch (fetch) requests for data required for future matrix block computations and perform local GEMM computations on previously prefetched data, at least partially overlapping in time. If at least one computation has been performed previously, the accumulation operation can be performed concurrently and in parallel with the current computation and prefetching, wherein the accumulation operation can be performed on the corresponding DPU. This process can continue, performing prefetching, computation, and accumulation operations in parallel until the GEMM operation is completed for each block of the resulting matrix, with the data locally stored in the DPU to utilize caching. In at least some embodiments, a single GEMM computation can also be partitioned across multiple worker nodes (e.g., processing units).

[0024] Based on the teachings and suggestions herein, those skilled in the art will understand that such functionality and various variations thereof may also be used within the scope of various embodiments.

[0025] Environments such as data centers can be used to perform various computing operations on behalf of several different entities, for example, by utilizing pools of available resource capacity. Figure 1 shows an example of such an architecture 100 that may be used according to at least one embodiment. In this example, a user is able to submit one or more requests using client device 102 to access one or more resources, or to perform a task using one or more resources, among other such options. Such requests may be submitted via at least one network 104 (e.g., the Internet or a cellular network) and received at an interface, address, or endpoint in a shared resource environment 106. The request may be received at an interface, such as an application programming interface (API) of interface layer 108, which may also include other networked devices, such as routers, network switches, load balancers, etc. In this example, a request from client device 102 may first need to be analyzed to determine whether the client device, the user, or another entity associated with the request has the right to access one or more resources to be used to process the request, and to determine whether the permitted access type allows the requested operation to be executed.

[0026] In this example, the information used for the request may be directed to Access Control Manager 112 or other such components, systems, or services. Access Control Manager 112 may be used alone or in conjunction with Account Manager 120 to perform various tasks to determine and / or manage access permissions to a set of shared resources, such as extracting relevant information from a received request and comparing the information used for the request with information in Account Repository 116 or other similar locations. Such operations may be used to determine whether the request is associated with a valid account associated with the shared resource environment, such as an account maintained by a user of the provider of Shared Resource Environment 106. Once determined, the account information may be used to determine the type of access permitted to perform one or more operations associated with the request. For example, this may include determining (or verifying) an authorized user identifier associated with the request, and then using that user identifier to determine the access permissions associated with that user identifier, which may be stored in Access Control Data Repository 118 or other similar locations. In at least one embodiment, Access Control Manager 112 may include various modules for performing specific tasks, such as authorization and authentication modules; or Access Control Manager 112 may run on a web server that also contains these modules for use by Access Control Manager 112, and other similar options.

[0027] Once a set of access permissions associated with a request is identified, the access control manager 112 (or associated process) determines whether any of those permissions are present to process the request received from the client device and associated with a user identifier. If appropriate permissions are determined to exist or are available, the access control manager 112 can direct the request information to one or more shared resources 114 (and / or potential private resources) within the shared resource environment 106. In some embodiments, the access control manager 112 may work in conjunction with the resource manager 110 to determine a specific instance of a resource to be used to perform the request-related operation, wherein the resource manager 110 may perform other types of operations as needed, such as allocating additional capacity to a resource, starting a new computing instance, or performing other such tasks related to the request.

[0028] In many cases, a request involving several pending operations may cause these operations, or portions of those operations, to be distributed across a set of processing resources. This could include distribution across several physical computing resources, such as a set of shared servers, and/or could include multiple processing resources (physical or virtual) within a given physical resource. As an example, Figure 1 illustrates example components that may be included in a given server 122, which may be allocated to perform one or more processing tasks related to a request. The server 122 in this example contains different types of processing units, including one or more central processing units (CPUs) 124A-N, one or more graphics processing units (GPUs) 126A-N, and one or more data processing units (DPUs) 128A-N, which are interconnected via at least one internal bus 134. In this example, at least the DPUs 128A-N may be connected to local storage 130 within the server and to at least one remote storage instance 132, which may be located outside the server but within the shared resource environment 106. The CPUs 124A-N are typically used for tasks such as single-threaded user applications, while the GPUs 126A-N are typically used to execute multiple small but related operations in parallel. The DPUs can be used to offload from these CPUs and GPUs processing tasks that may not be optimally executed on those processing units, for example, tasks related to heterogeneous data center processing that could benefit from different types of accelerated processing. A data processing unit (DPU) is a programmable processor or system-on-a-chip that combines one or more multi-core high-performance CPUs, a set of acceleration engines capable of offloading and boosting the performance of data-centric tasks, and one or more high-performance interfaces capable of parsing, processing, and efficiently transferring data at high speeds.
A DPU can be used as a standalone embedded processor, integrated into a server's SmartNIC, or other similar options. Offloading appropriate tasks to such a DPU can help improve performance and reduce power consumption, among many other advantages. Processing units like DPUs are particularly useful for data-centric tasks such as those related to artificial intelligence and machine learning. This can include performing various matrix multiplication tasks, and other such operations.

[0029] Methods according to various exemplary embodiments can utilize one or more data processing units (DPUs) or other such distributed processing units to provide acceleration of generalized matrix multiplication (GEMM) operations. As previously described, accelerated processing can be achieved in part through the use of a pipeline in which (prefetched) data transfer, computation, and accumulation steps can be performed in parallel and concurrently using different processing units. Accumulation operations for updates can be offloaded to the DPU, and matrices can be partitioned into blocks, thereby allowing GEMM (and other such) operations to be performed at a block granularity.

[0030] As shown in Figure 2, in an example GEMM operation, two multidimensional matrices (or tensors) 202 and 204 are multiplied together. Multiplying two tensors together produces a tensor output (or output matrix) 206, the dimensions of which depend on the dimensions of the matrices being multiplied. To improve performance, the individual operations can be distributed across multiple processors, because the element or block values of the output tensor can be computed individually and independently of the other values of that output tensor. Several techniques are available for optimizing GEMM operations on distributed processors. These techniques typically involve computing partial data for a portion of the GEMM operation and then performing an accumulation for each element or block, often involving local matrix multiplication and distributed summation operations to obtain the result. In at least one embodiment, an accumulation or reduction operation can be used to perform the summation. The target process can be polled to determine the progress of the GEMM operation. If DPU offloading is used for such operations, progress determination and target computation can be offloaded to the DPU, freeing up the target core to perform other computations or operations. In at least one embodiment, all local operations can be performed on the target core, while the accumulation and progress operations are offloaded to the DPU.

[0031] As mentioned earlier, General Matrix Multiplication (GEMM) is one of the fundamental linear algebra operations, in which two matrices are multiplied using the following formula:

[0032] C = α(A × B) + βC,

[0033] where A, B, and C can be multidimensional dense tensors, and α and β are scalar inputs. Tensor A 202, tensor B 204, and tensor C 206 can be divided into several blocks, which can be evenly distributed across the processes. For example, this could include distribution across one or more worker processes 208, 210, as in the configuration 200 shown in Figure 2. Each participating process can obtain blocks of A and B from other processes or from itself, which may be based partly on locality. Each process can also perform local block-wise computations (e.g., GEMM computations) and generate partial results. Since the C tensor is also distributed, partial results can be accumulated into the corresponding target process.
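As a minimal sketch of the formula above, the following pure-Python example computes C = α(A × B) + βC once directly and once tile by tile, with each partial tile product accumulated into its target tile. The helper names (`matmul`, `gemm`, `tile`, `blocked_gemm`), the sample matrices, and the tile size are illustrative assumptions and do not come from the disclosure.

```python
def matmul(a, b):
    """Plain dense product of two row-major list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def gemm(alpha, a, b, beta, c):
    """Reference GEMM: returns alpha * (A x B) + beta * C."""
    ab = matmul(a, b)
    return [[alpha * ab[i][j] + beta * c[i][j] for j in range(len(c[0]))]
            for i in range(len(c))]


def tile(m, r, c, t):
    """Extract the t x t tile at tile coordinates (r, c)."""
    return [row[c * t:(c + 1) * t] for row in m[r * t:(r + 1) * t]]


def blocked_gemm(alpha, a, b, beta, c, t):
    """Same result as gemm(), computed tile by tile with accumulation.

    Assumes square matrices whose size is a multiple of the tile size t.
    """
    nt = len(a) // t
    out = [[beta * v for v in row] for row in c]
    for bi in range(nt):
        for bj in range(nt):
            for bk in range(nt):
                # Local per-tile GEMM produces one partial result ...
                partial = matmul(tile(a, bi, bk, t), tile(b, bk, bj, t))
                # ... which is accumulated into the target tile of C.
                for i in range(t):
                    for j in range(t):
                        out[bi * t + i][bj * t + j] += alpha * partial[i][j]
    return out


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C = [[1] * 4 for _ in range(4)]
direct = gemm(2, A, B, 3, C)
tiled = blocked_gemm(2, A, B, 3, C, t=2)
```

Because the tile products for one output tile are independent of each other, the inner accumulation is the step that a distributed implementation could hand to a separate unit.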

[0034] Operations such as GEMM can be used in a wide range of applications, from large-scale scientific simulations to big data analytics and deep neural networks. In such distributed settings, various methods can be used to efficiently perform GEMM operations, including MapReduce-based GEMM and Distributed BlockCyclic Decomposition. Summation or accumulation is one of the fundamental steps in many GEMM-based methods because it is necessary to merge partial results computed in different processes into the final output tensor C 206. Furthermore, the output tensor C 206 is typically distributed across multiple processes, thus requiring reduction or accumulation operations, such as MPI_Reduce or MPI_Accumulate, which can be used in message-passing interface (MPI)-based implementations involving a hybrid of put and update semantics. Compared to MPI_Reduce, MPI_Accumulate can often be expressed in a more fine-grained, non-blocking manner. Most existing methods use computing cores on the CPU or GPU to perform the accumulation operation. Furthermore, communication progress mechanisms waste additional cycles, and the unavailability of CPU / GPU cores for performing the summation operation can lead to latency due to synchronization issues.

[0035] According to at least one embodiment, the method can free up the core used for communication and accumulation operations, thereby allowing it to be used for localized GEMM computations. As previously mentioned, in at least one such method, GEMM-type operations can be partially executed using a DPU (or other programmable processor) capable of handling data-intensive tasks, specifically accelerating and optimizing data-centric operations. One approach that can be used to perform block-level GEMM computations in a distributed deployment is to use patterns, such as the Get-Compute-Update pattern, which will be discussed in detail below.

[0036] At least the update step of the GEMM operation can be optimized. In at least one embodiment, optimization can be achieved by offloading the accumulation operation of the update to a data processing unit (DPU) (e.g., a BlueField DPU from NVIDIA). This approach can leverage the asynchronous nature of the accumulation operation. In at least one embodiment, the accumulation operation can be configured as a Remote Direct Memory Operation (RDMO), which leverages DPU offloading to achieve finer-grained computation-communication overlap. An RDMO can be viewed as a restricted offloading of Active Message (AM) callbacks to a communication device (e.g., the corresponding DPU). Offloading the accumulation operation can free up host resources, at least in part, by transferring progress control to the DPU cores, while allowing the offloaded operation to proceed independently and quickly, thereby reducing synchronization overhead. During an example accumulation operation, the source process can send an RDMO request to a worker process corresponding to the target process. In at least one embodiment, this could be the DOCA Unified Resources and Offload Manager (UROM) RDMO worker process of NVIDIA's DOCA software framework. This worker process can perform the corresponding accumulation operation. Depending in part on the message size, the source data can be sent with the RDMO request or retrieved by the RDMO worker process at the target, for example, using a one-sided get request. The target worker process can exploit the temporal locality of the operations by caching intermediate results in the DPU's memory. To keep the accumulation operation non-blocking, multiple buffers can be used at the source to delay local flush operations. This helps ensure that the source buffer is reusable and that the appropriate callback function is called to add the source data to the target data. The target data can be retrieved and cached in the DPU during the initial call and remain cached until a flush-all call (or a similar event) is invoked. For example, a flush-all call can be invoked once at the end of the GEMM loop, which helps ensure that the DPU cache is synchronized with the target memory. After this global synchronization is complete, the updated result of the entire GEMM operation can be used to modify the elements in C tensor 206.
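The caching and flush-all behavior described above can be sketched as a small in-process simulation. This is not the DOCA or UROM API: the class `DpuAccumulator`, its methods, and the dictionary standing in for target host memory are all hypothetical names chosen for illustration.

```python
class DpuAccumulator:
    """Toy stand-in for a DPU-side worker that accumulates into cached tiles."""

    def __init__(self, target_memory):
        # Hypothetical model of host memory: tile id -> list of values.
        self.target_memory = target_memory
        # Working copies held "on the DPU": tile id -> list of values.
        self.cache = {}

    def accumulate(self, tile_id, source):
        # On first touch, fetch the target tile once and keep it cached.
        if tile_id not in self.cache:
            self.cache[tile_id] = list(self.target_memory[tile_id])
        cached = self.cache[tile_id]
        # Add the source data to the cached target data.
        for i, value in enumerate(source):
            cached[i] += value

    def flush_all(self):
        # Write every cached tile back so host memory matches the cache.
        for tile_id, cached in self.cache.items():
            self.target_memory[tile_id] = list(cached)
        self.cache.clear()


host_c = {"C11": [0, 0, 0, 0]}
dpu = DpuAccumulator(host_c)
dpu.accumulate("C11", [1, 2, 3, 4])
dpu.accumulate("C11", [10, 20, 30, 40])
before_flush = list(host_c["C11"])  # host copy is still stale here
dpu.flush_all()
after_flush = list(host_c["C11"])   # host copy now holds the summed tile
```

In this sketch the host copy only changes at `flush_all`, which mirrors the single global synchronization at the end of the GEMM loop.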

[0037] In the example of Figure 2, matrices A 202 and B 204 (and the output matrix C 206) are all 3x3 matrices. In this block-based approach, each entry is not a single element but rather a tile of size i x j. In this example, shaded tiles are managed by worker process A 208, and unshaded tiles are managed by worker process B 210. Work can then be distributed using some logic (e.g., a cyclic distribution of tiles). For tile C11 in output matrix 206, each tile in the top row of matrix A 202 must be multiplied with the corresponding tile in the first column of matrix B 204, and the results summed into the final tile C11. The two assigned worker processes can generate these intermediate results, which are then summed into the output matrix values.
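A minimal sketch of a cyclic tile assignment for the 3x3 tile grid, together with the tile pairs whose products sum into tile C11, might look as follows. The round-robin rule in `owner` is an assumed example of "some logic" for distributing tiles; the disclosure does not specify the exact rule. Indices in the code are 0-based, so tile C11 is `(0, 0)`.

```python
NUM_WORKERS = 2  # worker process A (index 0) and worker process B (index 1)
GRID = 3         # 3 x 3 grid of tiles, as in the Figure 2 example


def owner(row, col):
    """Assumed cyclic (round-robin) assignment of tile (row, col) to a worker."""
    return (row * GRID + col) % NUM_WORKERS


def contributions(row, col):
    """Tile pairs (A[row][k], B[k][col]) whose products sum into C[row][col]."""
    return [((row, k), (k, col)) for k in range(GRID)]


ownership = [[owner(r, c) for c in range(GRID)] for r in range(GRID)]
c11_terms = contributions(0, 0)
```

With two workers and an odd grid width, this rule produces a checkerboard pattern, and `c11_terms` lists the three A-tile and B-tile pairs needed for the first output tile.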

[0038] Figure 3A illustrates an exemplary prefetch-based pipeline 300 that can be used according to at least one embodiment. Such a pipeline 300 consists of non-blocking get operations, local GEMM computation operations, and non-blocking update operations, thereby enabling computation-communication overlap. As shown in this example, the pipeline can involve performing these operations at least partially in parallel using prefetched data, wherein these operations can be performed within (or by) a given worker process 302. In such a process, data required for the next computation can be prefetched during the current computation, thus distributing data retrieval throughout the computation sequence and ensuring the data is available for the next computation when needed. This approach can also perform an accumulation operation on the results of previous computations while the current computation is being performed. As previously described, in at least one embodiment, the get operations, computation operations, and accumulation operations can be performed by different processing units. This approach allows communication progress and accumulation operations on the DPU to overlap with block-level GEMM computations performed on the processor (CPU/GPU) cores. This can significantly reduce overall runtime compared to implementations that do not utilize such DPU offloading. After each computation, a synchronization point 308 can be established to ensure that the prefetched data is available for the next computation.
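The overlap of get, compute, and accumulate steps can be sketched with two single-thread executors standing in for the communication path and the offloaded accumulator. This is a simplified host-only simulation under stated assumptions: `fetch`, `local_gemm`, and `run_pipeline` are hypothetical names, and scalars stand in for tiles.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(k, a_tiles, b_tiles):
    """Stand-in for a non-blocking get of the k-th pair of tiles."""
    return a_tiles[k], b_tiles[k]


def local_gemm(a, b):
    """Scalar stand-in for a per-tile GEMM."""
    return a * b


def run_pipeline(a_tiles, b_tiles):
    total = [0]

    def accumulate(partial):
        # Runs only on the single accumulator thread, so no lock is needed.
        total[0] += partial

    with ThreadPoolExecutor(max_workers=1) as getter, \
         ThreadPoolExecutor(max_workers=1) as accumulator:
        pending_get = getter.submit(fetch, 0, a_tiles, b_tiles)
        pending_acc = None
        for k in range(len(a_tiles)):
            a, b = pending_get.result()  # synchronization point
            if k + 1 < len(a_tiles):
                # Prefetch the next pair while the current one is computed.
                pending_get = getter.submit(fetch, k + 1, a_tiles, b_tiles)
            partial = local_gemm(a, b)
            # Hand the partial result off; the next iteration's compute
            # proceeds without waiting for this accumulation to finish.
            pending_acc = accumulator.submit(accumulate, partial)
        if pending_acc is not None:
            pending_acc.result()  # final synchronization
    return total[0]


result = run_pipeline([1, 2, 3], [4, 5, 6])
```

Each loop iteration waits only for its own input data, so the fetch for step k+1 and the accumulation for step k-1 can both be in flight while step k is computed.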

[0039] In at least one embodiment, a "fire and forget" approach can be employed, whereby after a local GEMM computation 306, a partial result is used to trigger an accumulation operation 304 without requiring an immediate target memory update for the corresponding block of the C tensor. A memory caching mechanism can be used to copy the target memory segment of the C matrix onto the DPU. Accumulation operations for specific C blocks can be executed independently on the DPU. Simultaneously, the next block-level local GEMM computation can be performed on the CPU/GPU computing cores, resulting in better computation-communication overlap. Furthermore, with this approach, local block-level GEMM operations can utilize more cores without being blocked by accumulation operations, as the communication progress engine and reduction/accumulation operations are offloaded to the DPU. A caching mechanism on the target DPU can be used to cache data during these operations, and a global flush operation can be performed as a single synchronization operation after accumulation.

[0040] Figure 3B shows an example process flow diagram 350 that can be executed by a prefetching matrix multiplication pipeline (e.g., the pipeline shown in Figure 3A). Such a process may include at least one target process and several (e.g., 3) source processes, as well as at least one target DPU that can be used to offload the accumulation operation. A pair of get requests can be executed for each source process at appropriate times, wherein each pair of get requests for prefetching data may be followed by a corresponding GEMM computation, and then an accumulation operation is performed using the result of the computation performed on the data retrieved by that pair of requests. As previously described, the computation result can be sent to the target DPU so that the accumulation operation can be performed concurrently with other GEMM computations to be performed by the source processes. In at least one embodiment, prefetching may fetch two blocks at a time, shortly before the next computation requires data from these blocks. After synchronization is complete, prefetching of the next two data blocks can be performed (along with accumulation for the previous computation, if available). Such a process can hide the get operation for the next block and the accumulation operation for the previous block. Buffers or caches can be reused. If a partial flush is performed, data is transferred from the source to the target DPU, at which point the buffer (X1 or X2) is available for reuse. In the example of Figure 3A, two buffers, X1 and X2, are used, and the pipeline can alternate between them. Furthermore, synchronization can be performed between the target DPU and the target host at the end of the process. A flush operation can then be performed on all buffers and/or caches used for that process.
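The alternation between the two source buffers, with a partial flush before a buffer is reused and a full flush at the end, can be sketched as a small bookkeeping class. The class `SourceBuffers`, its methods, and the event log are hypothetical illustrations; only the buffer names X1 and X2 come from the text.

```python
class SourceBuffers:
    """Tracks which of two source buffers still holds unsent partial results."""

    NAMES = ("X1", "X2")

    def __init__(self):
        # buffer name -> step whose data has not yet left the buffer
        self.in_flight = {name: None for name in self.NAMES}
        self.log = []

    def acquire(self, step):
        """Pick the buffer for this step, flushing it first if still in use."""
        name = self.NAMES[step % 2]
        if self.in_flight[name] is not None:
            # Partial flush: the earlier data must reach the target DPU
            # before the buffer can be overwritten.
            self.log.append(("flush", name, self.in_flight[name]))
            self.in_flight[name] = None
        return name

    def send(self, name, step):
        """Record a non-blocking send of this step's partial result."""
        self.in_flight[name] = step
        self.log.append(("send", name, step))

    def flush_all(self):
        """Final flush of every buffer that still holds unsent data."""
        for name in self.NAMES:
            if self.in_flight[name] is not None:
                self.log.append(("flush", name, self.in_flight[name]))
                self.in_flight[name] = None


bufs = SourceBuffers()
for step in range(3):
    buffer_name = bufs.acquire(step)
    bufs.send(buffer_name, step)
bufs.flush_all()
```

With two buffers, a flush is only forced when a step needs the buffer used two steps earlier, which is what lets the local flush be delayed.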

[0041] In at least one embodiment, the tile size used in GEMM operations can be an adjustable parameter. For example, the tile size can be chosen to balance throughput and storage, and to ensure that the process can hide transfer and accumulation costs by overlapping them with per-tile GEMM operations. The appropriate value may then need to be tuned separately for different types of computations or operations. For similar reasons, the number of processing nodes used can also be an adjustable parameter. As an example, the optimal block size and number of nodes for scientific simulations may differ from the optimal block size and number of nodes for performing matrix operations on linear or prediction layers used in training transformers or other deep learning models. Other optimizations can also be used for different computations, such as modifying the number of prefetch operations performed and fetching (and caching) more data for each such operation. The amount or type of overlap may also differ. This pipeline allows computational operations to overlap with accumulation operations, providing performance improvements because the operations do not need to be performed sequentially. However, in at least one embodiment, if the computation results can be cached or buffered as needed, it may not be necessary to perform accumulation after each computation.
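One way to see why tile size affects how well transfers can be hidden is a simple back-of-the-envelope model: a product of two t x t tiles needs about 2t³ floating-point operations, while fetching the two input tiles moves 2t² elements. The function below is an illustrative sketch only; the 8-byte element size and the omission of accumulation traffic and hardware limits are simplifying assumptions, not details from the disclosure.

```python
def tile_costs(tile, bytes_per_element=8):
    """Rough per-tile-pair cost model (assumes square tiles, 8-byte elements)."""
    flops = 2 * tile ** 3                            # multiply-adds for one tile product
    bytes_moved = 2 * tile ** 2 * bytes_per_element  # two input tiles fetched
    return flops, bytes_moved, flops / bytes_moved


small = tile_costs(64)
large = tile_costs(256)
```

In this model the ratio of computation to data moved grows linearly with the tile size (t/8 for 8-byte elements), so larger tiles give each transfer more computation to hide behind, at the cost of more buffer memory per tile.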

[0042] Figure 4 illustrates an example process 400, according to at least one embodiment, that can be executed to perform an operation such as matrix multiplication using one or more offloaded operations. It should be understood that, for this and other processes discussed herein, there may be additional, fewer, or alternative steps performed in a similar or alternative order, or at least partially in parallel, within the scope of various embodiments. Furthermore, although matrix multiplication is discussed herein as an example, the advantages of this process also apply to other types of operations or computations within the scope of various embodiments. In this example process, a request may be received 402 on behalf of (or otherwise associated with) a user or other entity with certain access rights to resources in a shared resource environment. The request in this example relates to performing a general matrix multiplication (or similar) operation. As part of the operation, an input matrix may be identified and then divided 404, or segmented, into blocks of defined dimensions, producing a block-based input matrix with fewer blocks than the input matrix has elements. For example, each block may contain 9 elements (3x3) of the input matrix. As part of the matrix multiplication, blocks of the first matrix need to be multiplied by blocks of the second matrix. To do this, each such pair of blocks can be directed to a worker process (e.g., a compute instance) for a local GEMM operation, with the results accumulated into the corresponding output block of the output (or result) matrix. A set of computation and accumulation operations can then be performed for each block of the output matrix. Accordingly, one or more worker processes can be assigned to the current block of the output matrix, where the same or different worker processes can be assigned to different blocks of the output matrix, particularly when at least some computations for different output blocks are to be performed concurrently or at least partially overlapping in time.
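The block decomposition described above can be sketched in a few lines. This is an illustrative, single-process sketch only: the helper names `matmul`, `get_block`, and `blocked_gemm` are hypothetical, the plain triple loop stands in for whatever local GEMM a worker process would run, and no distribution or offloading is modeled.

```python
def matmul(a, b):
    """Plain triple-loop product, standing in for a local GEMM on one block pair."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def get_block(mat, bi, bj, size):
    """Return block (bi, bj) of mat; blocks on the edge may be smaller than size."""
    return [row[bj * size:(bj + 1) * size]
            for row in mat[bi * size:(bi + 1) * size]]

def blocked_gemm(a, b, size):
    """Compute a @ b block by block, accumulating into each output block."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]

    def num_blocks(dim):
        return (dim + size - 1) // size

    for bi in range(num_blocks(m)):          # output block row
        for bj in range(num_blocks(n)):      # output block column
            for bp in range(num_blocks(k)):  # one local GEMM per block pair
                partial = matmul(get_block(a, bi, bp, size),
                                 get_block(b, bp, bj, size))
                # Accumulate the partial result into the output block.
                for i, row in enumerate(partial):
                    for j, value in enumerate(row):
                        c[bi * size + i][bj * size + j] += value
    return c

if __name__ == "__main__":
    a = [[i + j for j in range(5)] for i in range(4)]
    b = [[i * j + 1 for j in range(3)] for i in range(5)]
    print(blocked_gemm(a, b, 3) == matmul(a, b))
```

The two outer loops enumerate output blocks, which is where different worker processes could be assigned; the innermost loop is the sequence of computation and accumulation steps for a single output block.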

[0043] For each block of the output matrix, one or more worker processes can be assigned to handle the computation for that block. Data to be used for the current computation on the current block can be prefetched 408 by, or on behalf of, the corresponding worker process. Depending on the stage of the computation and accumulation process, one or more steps may be performed after this prefetch. As an example, once the data is available, the current computation can be performed 410 using the prefetched data. If subsequent computations are to be performed for the current block, additional data to be used by those computations can be prefetched 411 while the current computation is being performed. Furthermore, if a result of a previous computation exists, that previous result can be accumulated 412 using at least one DPU to which the accumulation is offloaded, and the accumulation can be performed concurrently with the current computation. This process can continue, with prefetch, computation, and accumulation operations performed in parallel and at least partially overlapping in time, and with the accumulation operation offloaded to a target DPU. In at least one embodiment, the prefetch operation can also be offloaded, and can rely on, for example, an RDMA GET operation. A determination can be made 414 as to whether further operations should be performed for the current block, for example where at least one additional computation or accumulation remains. If so, the process can continue with the next prefetch, computation, and/or accumulation operation for the current block. As previously mentioned, synchronization points can be used between these parallel operations. If processing of the current block has been completed, for example because the final accumulation has been performed, the accumulation result can be written 416 to the corresponding block of the output matrix, and a refresh can be performed as appropriate. In some embodiments, if the entire output matrix fits in DPU memory or another similar location, the write of the accumulation result (step 416) can be omitted. A determination can be made 418 as to whether there are additional blocks of the output matrix for which computations should be performed. As previously mentioned, in some embodiments, computations can be performed for different blocks of the output matrix sequentially, in parallel, or in a combination of both. If there is at least one such block, the process can continue with the next block. If results have been obtained for all blocks of the output matrix, the final output matrix can be provided 420 in response to the initially received request. The results can then be used as needed, for example to help train a transformer network.
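The ordering of steps 408 to 414 for a single output block can be sketched as a three-stage software pipeline. This is only an illustration of the overlap pattern under stated assumptions: the names `fetch_pair`, `local_gemm`, `accumulate`, and `compute_output_block` are hypothetical, Python threads stand in for the network prefetch and for the DPU, and no real RDMA or DPU interface is used. Because of Python's global interpreter lock, the sketch shows the scheduling order, not an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_pair(a_tiles, b_tiles, step):
    """Stand-in for prefetching (e.g., a remote get of) the tile pair for one step."""
    return a_tiles[step], b_tiles[step]

def local_gemm(a, b):
    """Stand-in for the local GEMM performed on one prefetched tile pair."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def accumulate(acc, partial):
    """Stand-in for the accumulation offloaded to a DPU: acc += partial."""
    for i, row in enumerate(partial):
        for j, value in enumerate(row):
            acc[i][j] += value

def compute_output_block(a_tiles, b_tiles, rows, cols):
    """Pipeline prefetch, compute, and accumulate for one output block."""
    acc = [[0] * cols for _ in range(rows)]
    steps = len(a_tiles)
    with ThreadPoolExecutor(max_workers=1) as prefetcher, \
         ThreadPoolExecutor(max_workers=1) as dpu:
        next_pair = prefetcher.submit(fetch_pair, a_tiles, b_tiles, 0)
        pending = None
        for step in range(steps):
            a, b = next_pair.result()      # data for this step is available
            if step + 1 < steps:           # prefetch for the next step now
                next_pair = prefetcher.submit(fetch_pair, a_tiles, b_tiles, step + 1)
            partial = local_gemm(a, b)     # may overlap prefetch and prior accumulate
            if pending is not None:
                pending.result()           # synchronization point
            pending = dpu.submit(accumulate, acc, partial)
        if pending is not None:
            pending.result()               # final accumulation complete
    return acc

if __name__ == "__main__":
    a_tiles = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    b_tiles = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
    print(compute_output_block(a_tiles, b_tiles, 2, 2))
```

In this sketch, the worker thread blocks only when the next tile pair has not yet arrived or when the previous accumulation has not yet finished, which corresponds to the synchronization points mentioned above.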

[0044] In at least some examples, computing and / or electronic devices that can request or obtain access to various resources to perform GEMM-type operations can include a variety of different devices, such as desktop computers, laptops, set-top boxes, streaming media devices, game consoles, smartphones, tablets, VR headsets, AR glasses, wearable computers, or smart TVs. In one embodiment, such a system can be used to perform graphics rendering operations. In other embodiments, such a system can be used for other purposes, such as providing image or video content to test or validate autonomous machine applications, or performing deep learning operations. In one embodiment, such a system can be implemented using edge devices or may include one or more virtual machines (VMs). In one embodiment, such a system can be implemented at least partially in a data center or at least partially using cloud computing resources.

[0045] Data Center

[0046] Figure 5 shows an exemplary data center 500 in which at least one embodiment can be used. In at least one embodiment, the data center 500 includes a data center infrastructure layer 510, a framework layer 520, a software layer 530, and an application layer 540.

[0047] In at least one embodiment, as shown in Figure 5, the data center infrastructure layer 510 may include a resource coordinator 512, grouped computing resources 514, and node computing resources (“node CRs”) 516(1)-516(N), where “N” represents a positive integer (which may be a different integer “N” than is used in other figures). In at least one embodiment, node CRs 516(1)-516(N) may include, but are not limited to, any number of central processing units (“CPUs”) or other processors (including accelerators, field-programmable gate arrays (FPGAs), graphics processors, etc.), memory storage devices 518(1)-518(N) (e.g., dynamic read-only memory, solid-state storage, or disk drives), network input/output (“NW I/O”) devices, network switches, virtual machines (“VMs”), power modules, and cooling modules. In at least one embodiment, one or more of node CRs 516(1)-516(N) may be a server having one or more of the aforementioned computing resources.

[0048] In at least one embodiment, the grouped computing resources 514 may include individual groups of node CRs housed within one or more racks (not shown), or a plurality of racks housed within data centers (also not shown) in various geographical locations. In at least one embodiment, the individual groups of node CRs within the grouped computing resources 514 may include computing, networking, memory, or storage resources that can be configured or allocated to support groups of one or more workloads. In at least one embodiment, several node CRs, including CPUs or processors, may be grouped within one or more racks to provide computing resources to support one or more workloads. In at least one embodiment, one or more racks may also include any number of power modules, cooling modules, and network switches in any combination.

[0049] In at least one embodiment, resource coordinator 512 may configure or otherwise control one or more node CRs 516(1)-516(N) and/or grouped computing resources 514. In at least one embodiment, resource coordinator 512 may include a software design infrastructure (“SDI”) management entity for data center 500. In at least one embodiment, resource coordinator 512 may include hardware, software, or some combination thereof.

[0050] In at least one embodiment, as shown in Figure 5, framework layer 520 includes a job scheduler 522, a configuration manager 524, a resource manager 526, and a distributed file system 528. In at least one embodiment, framework layer 520 may include a framework to support software 532 of software layer 530 and/or one or more applications 542 of application layer 540. In at least one embodiment, software 532 or applications 542 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 520 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark™ (hereinafter “Spark”), which can utilize distributed file system 528 for large-scale data processing (e.g., "big data"). In at least one embodiment, the job scheduler 522 may include a Spark driver to facilitate the scheduling of workloads supported by various layers of the data center 500. In at least one embodiment, the configuration manager 524 may be able to configure different layers, such as the software layer 530 and the framework layer 520, which includes Spark and the distributed file system 528 for supporting large-scale data processing. In at least one embodiment, the resource manager 526 may be able to manage clustered or grouped computing resources mapped to or allocated for support of the distributed file system 528 and the job scheduler 522. In at least one embodiment, the clustered or grouped computing resources may include grouped computing resources 514 at the data center infrastructure layer 510. In at least one embodiment, the resource manager 526 may coordinate with the resource coordinator 512 to manage these mapped or allocated computing resources.

[0051] In at least one embodiment, the software 532 included in the software layer 530 may include software used by at least portions of node CRs 516(1)-516(N), grouped computing resources 514, and/or the distributed file system 528 of the framework layer 520. In at least one embodiment, one or more types of software may include, but are not limited to, Internet web page search software, email virus scanning software, database software, and streaming video content software.

[0052] In at least one embodiment, one or more applications 542 included in application layer 540 may include one or more types of applications used by at least portions of node CRs 516(1)-516(N), grouped computing resources 514, and/or the distributed file system 528 of framework layer 520. In at least one embodiment, one or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine learning applications, including training or inference software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.

[0053] In at least one embodiment, any of the configuration manager 524, resource manager 526, and resource coordinator 512 can perform any number and type of self-modification actions based on any amount and type of data acquired in any technically feasible manner. In at least one embodiment, self-modification actions can mitigate potentially poor configuration decisions by data center operators of data center 500 and can prevent underutilization and / or poor performance of the data center.

[0054] In at least one embodiment, data center 500 may include tools, services, software, or other resources for training one or more machine learning models or using one or more machine learning models to predict or infer information according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model can be trained by calculating weight parameters based on a neural network architecture using the software and computing resources described above with respect to data center 500. In at least one embodiment, information can be inferred or predicted using trained machine learning models corresponding to one or more neural networks using the resources described above with respect to data center 500 by using weight parameters calculated through one or more training techniques described herein.

[0055] In at least one embodiment, the data center may use a CPU, application-specific integrated circuit (ASIC), GPU, FPGA, or other hardware to utilize the aforementioned resources to perform training and / or inference. Furthermore, one or more of the aforementioned software and / or hardware resources may be configured as a service to allow a user to train or perform information inference, such as image recognition, speech recognition, or other artificial intelligence services.

[0056] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the system of Figure 5 for inference or prediction operations based at least in part on weight parameters calculated using the neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0057] The embodiments presented herein can prefetch data for each block computation used in GEMM operations and can offload the accumulation to at least one target DPU, thereby allowing concurrent and efficient execution of fetching, computation, and accumulation.

[0058] Computer System

[0059] Figure 6 is a block diagram illustrating an exemplary computer system according to at least one embodiment. The exemplary computer system may be a system of interconnected devices and components, a system-on-a-chip (SoC), or some combination thereof, formed with a processor that may include an execution unit for executing instructions. In at least one embodiment, computer system 600 may include, but is not limited to, components such as a processor 602 that employs execution units, including logic, to execute algorithms for processing data in accordance with this disclosure, such as in the embodiments described herein. In at least one embodiment, computer system 600 may include a processor from one of the processor families available from Intel Corporation of Santa Clara, California, such as a Xeon™, XScale™ and/or StrongARM™, Core™, or Nervana™ microprocessor, although other systems (including PCs with other microprocessors, engineering workstations, set-top boxes, etc.) can also be used. In at least one embodiment, the computer system 600 can execute a version of the Windows operating system available from Microsoft Corporation of Redmond, Washington, although other operating systems (such as UNIX and Linux), embedded software, and/or graphical user interfaces can also be used.

[0060] The embodiments can be used in other devices, such as handheld devices and embedded applications. Some examples of handheld devices include cellular phones, Internet Protocol (IP) devices, digital cameras, personal digital assistants (“PDAs”), and handheld PCs. In at least one embodiment, the embedded application may include a microcontroller, a digital signal processor (“DSP”), a system-on-a-chip (SoC), a network computer (“NetPC”), a set-top box, a network hub, a wide area network (“WAN”) switch, or any other system capable of executing one or more instructions according to at least one embodiment.

[0061] In at least one embodiment, the computer system 600 may include, but is not limited to, a processor 602, which may include, but is not limited to, one or more execution units 608 for performing machine learning model training and/or inference according to the techniques described herein. In at least one embodiment, the computer system 600 is a single-processor desktop or server system, but in another embodiment, the computer system 600 may be a multiprocessor system. In at least one embodiment, the processor 602 may include, but is not limited to, a Complex Instruction Set Computer (“CISC”) microprocessor, a Reduced Instruction Set Computing (“RISC”) microprocessor, a Very Long Instruction Word (“VLIW”) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor. In at least one embodiment, the processor 602 may be coupled to a processor bus 610, which can carry data signals between the processor 602 and other components in the computer system 600.

[0062] In at least one embodiment, processor 602 may include, but is not limited to, a Level 1 (“L1”) internal cache memory (“cache”) 604. In at least one embodiment, processor 602 may have a single internal cache or multiple levels of internal caches. In at least one embodiment, the cache memory may reside externally to processor 602. Depending on specific implementation and requirements, other embodiments may also include a combination of internal and external caches. In at least one embodiment, register file 606 may store different types of data in various registers, including but not limited to integer registers, floating-point registers, status registers, and instruction pointer registers.

[0063] In at least one embodiment, an execution unit 608, including but not limited to logic for performing integer and floating-point operations, also resides in the processor 602. In at least one embodiment, the processor 602 may also include a microcode (“ucode”) read-only memory (“ROM”) that stores microcode for certain macro instructions. In at least one embodiment, the execution unit 608 may include logic for handling a packed instruction set 609. In at least one embodiment, by including the packed instruction set 609 in the instruction set of a general-purpose processor, along with associated circuitry to execute the instructions, operations used by many multimedia applications can be performed using packed data in the processor 602. In at least one embodiment, many multimedia applications can be accelerated and executed more efficiently by using the full width of the processor's data bus to perform operations on packed data, which can eliminate the need to transfer smaller units of data across the processor's data bus to perform one or more operations one data element at a time.

[0064] In at least one embodiment, execution unit 608 may also be used in a microcontroller, embedded processor, graphics device, DSP, and other types of logic circuitry. In at least one embodiment, computer system 600 may include, but is not limited to, memory 620. In at least one embodiment, memory 620 may be a dynamic random access memory (“DRAM”) device, a static random access memory (“SRAM”) device, a flash memory device, or other memory device. In at least one embodiment, memory 620 may store one or more instructions 619 and / or data 621 represented by data signals executable by processor 602.

[0065] In at least one embodiment, the system logic chip may be coupled to the processor bus 610 and the memory 620. In at least one embodiment, the system logic chip may include, but is not limited to, a memory controller hub (“MCH”) 616, and the processor 602 may communicate with the MCH 616 via the processor bus 610. In at least one embodiment, the MCH 616 may provide a high-bandwidth memory path 618 to the memory 620 for instruction and data storage, as well as for storage of graphics commands, data, and textures. In at least one embodiment, the MCH 616 may direct data signals between the processor 602, the memory 620, and other components in the computer system 600, and bridge data signals between the processor bus 610, the memory 620, and the system I / O interface 622. In at least one embodiment, the system logic chip may provide a graphics port for coupling to a graphics controller. In at least one embodiment, the MCH 616 may be coupled to the memory 620 via the high-bandwidth memory path 618, and the graphics / video card 612 may be coupled to the MCH 616 via an Accelerated Graphics Port (“AGP”) interconnect 614.

[0066] In at least one embodiment, the computer system 600 may use the system I / O interface 622 as a proprietary hub interface bus to couple the MCH 616 to the I / O controller hub (“ICH”) 630. In at least one embodiment, the ICH 630 may provide direct connectivity to certain I / O devices via a local I / O bus. In at least one embodiment, the local I / O bus may include, but is not limited to, a high-speed I / O bus for connecting peripheral devices to the memory 620, chipset, and processor 602. Examples may include, but are not limited to, an audio controller 629, a firmware hub (“Flash BIOS”) 628, a wireless transceiver 626, a data storage 624, a conventional I / O controller 623 including a user input and keyboard interface 625, a serial expansion port 627 (such as a Universal Serial Bus (“USB”) port), and a network controller 634. In at least one embodiment, the data storage 624 may include a hard disk drive, a floppy disk drive, a CD-ROM device, a flash memory device, or other mass storage device.

[0067] In at least one embodiment, Figure 6 shows a system including interconnected hardware devices or "chips," while in other embodiments, Figure 6 may show an exemplary SoC. In at least one embodiment, the devices shown in Figure 6 can be interconnected using proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of the computer system 600 are interconnected using a Compute Express Link (CXL) interconnect.

[0068] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the system of Figure 6 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases as described herein.

[0069] The embodiments presented herein can prefetch data for each block computation used in GEMM operations and can offload the accumulation to at least one target DPU, thereby allowing concurrent and efficient execution of fetching, computation, and accumulation.

[0070] Figure 7 is a block diagram illustrating an electronic device 700 that utilizes a processor 710, according to at least one embodiment. In at least one embodiment, the electronic device 700 may be, for example, but not limited to, a notebook computer, tower server, rack server, blade server, laptop computer, desktop computer, tablet computer, mobile device, telephone, embedded computer, or any other suitable electronic device.

[0071] In at least one embodiment, the electronic device 700 may include, but is not limited to, a processor 710 communicatively coupled to any suitable number or type of components, peripherals, modules, or devices. In at least one embodiment, the processor 710 is coupled using a bus or interface, such as an I²C bus, a System Management Bus (“SMBus”), a Low Pin Count (LPC) bus, a Serial Peripheral Interface (“SPI”), a High Definition Audio (“HDA”) bus, a Serial Advanced Technology Attachment (“SATA”) bus, a Universal Serial Bus (“USB”) (versions 1, 2, 3, etc.), or a Universal Asynchronous Receiver/Transmitter (“UART”) bus. In at least one embodiment, Figure 7 shows a system including interconnected hardware devices or "chips," while in other embodiments, Figure 7 may show an exemplary SoC. In at least one embodiment, the devices shown in Figure 7 can be interconnected using proprietary interconnects, standardized interconnects (e.g., PCIe), or some combination thereof. In at least one embodiment, one or more components of Figure 7 are interconnected using Compute Express Link (CXL) interconnects.

[0072] In at least one embodiment, the system of Figure 7 may include a display 724, a touchscreen 725, a touchpad 730, a near-field communication unit (“NFC”) 745, a sensor hub 740, a thermal sensor 746, an Express Chipset (“EC”) 735, a trusted platform module (“TPM”) 738, a BIOS/firmware/flash memory (“BIOS, FW Flash”) 722, a DSP 760, a drive 720 (such as a solid-state drive (“SSD”) or a hard disk drive (“HDD”)), a wireless local area network unit (“WLAN”) 750, a Bluetooth unit 752, a wireless wide area network unit (“WWAN”) 756, a global positioning system (GPS) unit 755, a camera (“USB 3.0 camera”) 754 (such as a USB 3.0 camera), and/or a low-power double data rate (“LPDDR”) memory unit (“LPDDR3”) 715 implemented in, for example, the LPDDR3 standard. These components may each be implemented in any suitable manner.

[0073] In at least one embodiment, other components may be communicatively coupled to processor 710 via the components described herein. In at least one embodiment, accelerometer 741, ambient light sensor (“ALS”) 742, compass 743, and gyroscope 744 may be communicatively coupled to sensor hub 740. In at least one embodiment, thermal sensor 739, fan 737, keyboard 736, and touchpad 730 may be communicatively coupled to EC 735. In at least one embodiment, speaker 763, earphone 764, and microphone (“mic”) 765 may be communicatively coupled to audio unit (“audio codec and Class D amplifier”) 762, which in turn may be communicatively coupled to DSP 760. In at least one embodiment, audio unit 762 may include, for example, but not limited to, audio encoder / decoder (“codec”) and Class D amplifier. In at least one embodiment, SIM card (“SIM”) 757 may be communicatively coupled to WWAN unit 756. In at least one embodiment, components such as WLAN unit 750, Bluetooth unit 752, and WWAN unit 756 may be implemented as next-generation form factors (“NGFF”).

[0074] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the system of Figure 7 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases as described herein.

[0075] The embodiments presented herein can prefetch data for each block computation used in GEMM operations and can offload the accumulation to at least one target DPU, thereby allowing concurrent and efficient execution of fetching, computation, and accumulation.

[0076] Figure 8 shows a computer system 800 according to at least one embodiment. In at least one embodiment, the computer system 800 is configured to implement the various processes and methods described throughout this disclosure.

[0077] In at least one embodiment, the computer system 800 includes, but is not limited to, at least one central processing unit (“CPU”) 802 connected to a communication bus 810 implemented using any suitable protocol, such as PCI (“Peripheral Component Interconnect”), Peripheral Component Interconnect Express (“PCI-Express”), AGP (“Accelerated Graphics Port”), HyperTransport, or any other bus or point-to-point communication protocol. In at least one embodiment, the computer system 800 includes, but is not limited to, main memory 804 and control logic (e.g., implemented in hardware, software, or a combination thereof), and data is stored in the main memory 804, which may take the form of random access memory (“RAM”). In at least one embodiment, a network interface subsystem (“network interface”) 822 provides an interface to other computing devices and networks for receiving data from and sending data to other systems using the computer system 800.

[0078] In at least one embodiment, the computer system 800 includes, but is not limited to, an input device 808, a parallel processing system 812, and a display device 806, which may be implemented using conventional cathode ray tube (“CRT”), liquid crystal display (“LCD”), light-emitting diode (“LED”) display, plasma display, or other suitable display technologies. In at least one embodiment, user input is received from the input device 808 (such as a keyboard, mouse, touchpad, microphone, etc.). In at least one embodiment, each module described herein may reside on a single semiconductor platform to form the processing system.

[0079] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the system of Figure 8 for inference or prediction operations based at least in part on weight parameters calculated using the neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0080] The embodiments presented herein can prefetch data for each block computation used in GEMM operations and can offload the accumulation to at least one target DPU, thereby allowing concurrent and efficient execution of fetching, computation, and accumulation.

[0081] Figure 9 illustrates a computer system 900 according to at least one embodiment. In at least one embodiment, the computer system 900 includes, but is not limited to, a computer 910 and a USB flash drive 920. In at least one embodiment, the computer 910 may include, but is not limited to, any number and type of processors (not shown) and memory (not shown). In at least one embodiment, the computer 910 includes, but is not limited to, a server, a cloud instance, a laptop computer, and a desktop computer.

[0082] In at least one embodiment, the USB flash drive 920 includes, but is not limited to, a processing unit 930, a USB interface 940, and USB interface logic 950. In at least one embodiment, the processing unit 930 can be any instruction execution system, device, or apparatus capable of executing instructions. In at least one embodiment, the processing unit 930 can include, but is not limited to, any number and type of processing cores (not shown). In at least one embodiment, the processing unit 930 includes an application-specific integrated circuit (“ASIC”) optimized to perform any number and type of operations associated with machine learning. For example, in at least one embodiment, the processing unit 930 is a tensor processing unit (“TPC”) optimized to perform machine learning inference operations. In at least one embodiment, the processing unit 930 is a vision processing unit (“VPU”) optimized to perform machine vision and machine learning inference operations.

[0083] In at least one embodiment, the USB interface 940 can be any type of USB connector or USB receptacle. For example, in at least one embodiment, the USB interface 940 is a USB 3.0 Type-C receptacle for data and power. In at least one embodiment, the USB interface 940 is a USB 3.0 Type-A connector. In at least one embodiment, the USB interface logic 950 may include any amount and type of logic enabling the processing unit 930 to interface with a device (e.g., computer 910) via the USB interface 940.

[0084] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the system of Figure 9 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases as described herein.

[0085] The embodiments presented herein can prefetch data for each block computation used in GEMM operations and can offload the accumulation to at least one target DPU, thereby allowing concurrent and efficient execution of fetching, computation, and accumulation.

[0086] Figure 10 illustrates exemplary integrated circuits and associated graphics processors, according to various embodiments described herein, that can be manufactured using one or more IP cores. In addition to those illustrated, at least one embodiment may also include other logic and circuitry, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores.

[0087] Figure 10 is a block diagram illustrating an exemplary system-on-chip (SOC) integrated circuit 1000 that can be manufactured using one or more IP cores, according to at least one embodiment. In at least one embodiment, the SOC integrated circuit 1000 includes one or more application processors 1005 (e.g., CPUs), at least one graphics processor 1010, and may additionally include an image processor 1015 and/or a video processor 1020, any of which may be a modular IP core. In at least one embodiment, the SOC integrated circuit 1000 includes peripheral or bus logic, which includes a USB controller 1025, a UART controller 1030, an SPI/SDIO controller 1035, and an I²S/I²C controller 1040. In at least one embodiment, the SOC integrated circuit 1000 may include a display device 1045 coupled to one or more of a High Definition Multimedia Interface (HDMI) controller 1050 and a Mobile Industry Processor Interface (MIPI) display interface 1055. In at least one embodiment, storage may be provided by a flash memory subsystem 1060, which includes flash memory and a flash memory controller. In at least one embodiment, a memory interface may be provided via a memory controller 1065 for accessing an SDRAM or SRAM memory device. In at least one embodiment, some integrated circuits also include an embedded security engine 1070.

[0088] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the SOC integrated circuit 1000 for inference or prediction operations based at least in part on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0089] The embodiments presented herein can prefetch the data used for each block computation of a GEMM operation and can offload the accumulation to at least one target DPU, thereby allowing fetching, computation, and accumulation to execute concurrently and efficiently.

[0090] Figures 11A-11B illustrate exemplary integrated circuits and associated graphics processors, according to various embodiments described herein, which may be fabricated using one or more IP cores. In addition to what is illustrated, at least one embodiment may include other logic and circuitry, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores.

[0091] Figures 11A-11B are block diagrams illustrating exemplary graphics processors for use within a SoC, according to embodiments described herein. Figure 11A shows an exemplary graphics processor 1110, which can be fabricated using one or more IP cores, according to at least one embodiment. Figure 11B shows an additional exemplary graphics processor 1140, which can be fabricated using one or more IP cores, according to at least one embodiment. In at least one embodiment, the graphics processor 1110 of Figure 11A is a low-power graphics processor core. In at least one embodiment, the graphics processor 1140 of Figure 11B is a higher-performance graphics processor core. In at least one embodiment, each of the graphics processors 1110, 1140 may be a variant of the computer system 900 of Figure 9.

[0092] In at least one embodiment, the graphics processor 1110 includes a vertex processor 1105 and one or more fragment processors 1115A-1115N (e.g., 1115A, 1115B, 1115C, 1115D to 1115N-1 and 1115N). In at least one embodiment, the graphics processor 1110 may execute different shader programs via separate logic, such that the vertex processor 1105 is optimized to perform operations for vertex shader programs, while the one or more fragment processors 1115A-1115N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. In at least one embodiment, the vertex processor 1105 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. In at least one embodiment, the one or more fragment processors 1115A-1115N use the primitive and vertex data generated by the vertex processor 1105 to produce a framebuffer that is displayed on a display device. In at least one embodiment, the one or more fragment processors 1115A-1115N are optimized to execute fragment shader programs as provided in the OpenGL API, which can be used to perform operations similar to those of pixel shader programs provided in the Direct3D API.

[0093] In at least one embodiment, the graphics processor 1110 additionally includes one or more memory management units (MMUs) 1120A-1120B, one or more caches 1125A-1125B, and one or more circuit interconnects 1130A-1130B. In at least one embodiment, the one or more MMUs 1120A-1120B provide virtual-to-physical address mapping for the graphics processor 1110 (including for the vertex processor 1105 and/or the fragment processors 1115A-1115N), and may reference vertex or image/texture data stored in memory in addition to vertex or image/texture data stored in the one or more caches 1125A-1125B. In at least one embodiment, the one or more MMUs 1120A-1120B may be synchronized with other MMUs within the system, including one or more MMUs associated with the one or more application processors 1105, graphics processors 1115, and/or video processors 1120 of Figure 11A, such that each processor 1105-1120 can participate in a shared or unified virtual memory system. In at least one embodiment, the one or more circuit interconnects 1130A-1130B enable the graphics processor 1110 to interface with other IP cores within the SoC via the SoC's internal bus or via a direct connection.
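As a non-limiting illustration, virtual-to-physical address mapping through a page table that several MMUs share can be sketched as follows. The page size, the class name `SimpleMMU`, and the dictionary-based page table are assumptions made for illustration only and are not taken from the disclosure.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

class SimpleMMU:
    """Hypothetical MMU model that translates through a page table."""

    def __init__(self, page_table=None):
        # Maps a virtual page number to a physical frame number. Passing the
        # same dictionary to several MMUs models a unified virtual memory.
        self.page_table = {} if page_table is None else page_table

    def map_page(self, vpn, pfn):
        self.page_table[vpn] = pfn

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            raise KeyError(f"page fault at virtual address {vaddr:#x}")
        return self.page_table[vpn] * PAGE_SIZE + offset

# Two MMUs sharing one page table resolve the same virtual address
# to the same physical address.
shared = {}
gpu_mmu = SimpleMMU(shared)
cpu_mmu = SimpleMMU(shared)
gpu_mmu.map_page(2, 7)
```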

[0094] In at least one embodiment, the graphics processor 1140 includes one or more shader cores 1155A-1155N (e.g., 1155A, 1155B, 1155C, 1155D, 1155E, 1155F to 1155N-1 and 1155N), as shown in Figure 11B, which provide a unified shader core architecture in which a single core or type of core can execute all types of programmable shader code, including shader program code for implementing vertex shaders, fragment shaders, and/or compute shaders. In at least one embodiment, the number of shader cores can vary. In at least one embodiment, the graphics processor 1140 includes an inter-core task manager 1145, which acts as a thread dispatcher for dispatching execution threads to the one or more shader cores 1155A-1155N, and a tile unit 1158 to accelerate tile-based rendering operations, in which rendering operations for a scene are subdivided in image space, for example, to exploit local spatial coherence within the scene or to optimize the use of internal caches.

[0095] The embodiments presented herein can prefetch the data used for each block computation of a GEMM operation and can offload the accumulation to at least one target DPU, thereby allowing fetching, computation, and accumulation to execute concurrently and efficiently.

[0096] Figure 12 is a block diagram illustrating a computing system 1200 according to at least one embodiment. In at least one embodiment, the computing system 1200 includes a processing subsystem 1201 having one or more processors 1202 and a system memory 1204 communicating via an interconnect path that may include a memory hub 1205. In at least one embodiment, the memory hub 1205 may be a separate component within a chipset assembly or may be integrated within the one or more processors 1202. In at least one embodiment, the memory hub 1205 is coupled to an I/O subsystem 1211 via a communication link 1206. In at least one embodiment, the I/O subsystem 1211 includes an I/O hub 1207 that enables the computing system 1200 to receive input from one or more input devices 1208. In at least one embodiment, the I/O hub 1207 enables a display controller included in the one or more processors 1202 to provide output to one or more display devices 1210A. In at least one embodiment, the one or more display devices 1210A coupled to the I/O hub 1207 may include local, internal, or embedded display devices.

[0097] In at least one embodiment, the processing subsystem 1201 includes one or more parallel processors 1212 coupled to the memory hub 1205 via a bus or other communication link 1213. In at least one embodiment, the communication link 1213 may use any of a number of standards-based communication link technologies or protocols (such as, but not limited to, PCI Express), or may be a vendor-specific communication interface or communication architecture. In at least one embodiment, the one or more parallel processors 1212 form a computationally focused parallel or vector processing system, which may include a large number of processing cores and/or processing clusters, such as many integrated core (MIC) processors. In at least one embodiment, some or all of the parallel processors 1212 form a graphics processing subsystem that can output pixels to the one or more display devices 1210A coupled via the I/O hub 1207. In at least one embodiment, the one or more parallel processors 1212 may also include a display controller and a display interface (not shown) for enabling a direct connection to one or more display devices 1210B. In at least one embodiment, the parallel processor 1212 includes one or more cores, such as the graphics core 1200 discussed herein.

[0098] In at least one embodiment, a system storage unit 1214 may be connected to the I/O hub 1207 to provide a storage mechanism for the computing system 1200. In at least one embodiment, an I/O switch 1216 may be used to provide an interface mechanism for enabling connectivity between the I/O hub 1207 and other components, such as a network adapter 1218 and/or a wireless network adapter 1219 integrated into the platform, and various other devices that can be added via one or more additional devices 1220. In at least one embodiment, the network adapter 1218 may be an Ethernet adapter or another wired network adapter. In at least one embodiment, the wireless network adapter 1219 may include one or more of Wi-Fi, Bluetooth, near field communication (NFC), or other network devices including one or more wireless devices.

[0099] In at least one embodiment, the computing system 1200 may include other components, not explicitly shown, that may also be connected to the I/O hub 1207, including USB or other port connections, optical storage drives, video capture devices, and the like. In at least one embodiment, the communication paths interconnecting the components of Figure 12 can be implemented using any suitable protocol, such as a PCI (Peripheral Component Interconnect) based protocol (e.g., PCI-Express) or another bus or point-to-point communication interface and/or protocol (e.g., an NV-Link high-speed interconnect or other interconnect protocols).

[0100] In at least one embodiment, the one or more parallel processors 1212 include circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU), such as one or more parallel processors 1212 that include a graphics core 1200. In at least one embodiment, the one or more parallel processors 1212 include circuitry optimized for general-purpose processing. In at least one embodiment, components of the computing system 1200 may be integrated with one or more other system elements on a single integrated circuit. For example, in at least one embodiment, the one or more parallel processors 1212, the memory hub 1205, the one or more processors 1202, and the I/O hub 1207 may be integrated into a system-on-a-chip (SoC) integrated circuit. In at least one embodiment, components of the computing system 1200 may be integrated into a single package to form a system-in-package (SIP) configuration. In at least one embodiment, at least a portion of the components of the computing system 1200 may be integrated into a multi-chip module (MCM), which may be interconnected with other multi-chip modules to form a modular computing system.

[0101] Inference and/or training logic 515 is used to perform inference and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 515 may be used in the system of Figure 12 for inference or prediction operations based at least in part on weight parameters computed using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

[0102] The embodiments presented herein can prefetch the data used for each block computation of a GEMM operation and can offload the accumulation to at least one target DPU, thereby allowing fetching, computation, and accumulation to execute concurrently and efficiently.

[0103] Processor

[0104] Figure 13A illustrates a parallel processor 1300 according to at least one embodiment. In at least one embodiment, the various components of the parallel processor 1300 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). In at least one embodiment, the illustrated parallel processor 1300 is a variant of the one or more parallel processors 1212 shown in Figure 12, according to an exemplary embodiment. In at least one embodiment, the parallel processor 1300 includes one or more graphics cores 1200.

[0105] In at least one embodiment, the parallel processor 1300 includes a parallel processing unit 1302. In at least one embodiment, the parallel processing unit 1302 includes an I/O unit 1304 that enables communication with other devices, including other instances of the parallel processing unit 1302. In at least one embodiment, the I/O unit 1304 can be directly connected to other devices. In at least one embodiment, the I/O unit 1304 is connected to other devices via a hub or switch interface (e.g., a memory hub 1305). In at least one embodiment, the connection between the memory hub 1305 and the I/O unit 1304 forms a communication link 1313. In at least one embodiment, the I/O unit 1304 is connected to a host interface 1306 and a memory crossbar switch 1316, wherein the host interface 1306 receives commands for performing processing operations, and the memory crossbar switch 1316 receives commands for performing memory operations.

[0106] In at least one embodiment, when the host interface 1306 receives a command buffer via the I/O unit 1304, the host interface 1306 can route work operations for executing those commands to a front end 1308. In at least one embodiment, the front end 1308 is coupled to a scheduler 1310 (which may be referred to as a sequencer), which is configured to assign commands or other work items to a processing cluster array 1312. In at least one embodiment, the scheduler 1310 ensures that the processing cluster array 1312 is correctly configured and in an active state before tasks are assigned to clusters in the processing cluster array 1312. In at least one embodiment, the scheduler 1310 is implemented via firmware logic executed on a microcontroller. In at least one embodiment, the microcontroller-implemented scheduler 1310 can be configured to perform complex scheduling and work assignment operations at both coarse and fine granularity, thereby enabling fast preemption and context switching of threads executing on the processing cluster array 1312. In at least one embodiment, host software can submit workloads for scheduling on the processing cluster array 1312 via one of multiple graphics processing paths. In at least one embodiment, the workloads can then be automatically distributed across the processing cluster array 1312 by the scheduler 1310 logic within the microcontroller that includes the scheduler 1310.

[0107] In at least one embodiment, the processing cluster array 1312 may include up to "N" processing clusters (e.g., clusters 1314A, 1314B to 1314N), where "N" represents a positive integer (which may be an integer "N" different from the integers used in other diagrams). In at least one embodiment, each cluster 1314A-1314N of the processing cluster array 1312 may execute a large number of concurrent threads. In at least one embodiment, the scheduler 1310 may use various scheduling and / or work allocation algorithms to allocate work to clusters 1314A-1314N in the processing cluster array 1312, which may vary depending on the workload generated for each type of program or computation. In at least one embodiment, scheduling may be handled dynamically by the scheduler 1310, or may be partially assisted by compiler logic during the compilation of program logic configured to be executed by the processing cluster array 1312. In at least one embodiment, different clusters 1314A-1314N in the processing cluster array 1312 may be assigned to process different types of programs or to perform different types of computations.

[0108] In at least one embodiment, the processing cluster array 1312 can be configured to perform various types of parallel processing operations. In at least one embodiment, the processing cluster array 1312 is configured to perform general-purpose parallel computing operations. For example, in at least one embodiment, the processing cluster array 1312 may include logic for performing processing tasks, including filtering video and/or audio data, performing modeling operations (including physics operations), and performing data transformations.

[0109] In at least one embodiment, the processing cluster array 1312 is configured to perform parallel graphics processing operations. In at least one embodiment, the processing cluster array 1312 may include additional logic for supporting the execution of such graphics processing operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. In at least one embodiment, the processing cluster array 1312 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. In at least one embodiment, the parallel processing unit 1302 may transfer data from system memory via I / O unit 1304 for processing. In at least one embodiment, during processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 1322) and then written back to system memory.

[0110] In at least one embodiment, when the parallel processing unit 1302 is used to perform graphics processing, the scheduler 1310 can be configured to divide the processing workload into tasks of approximately equal size to better distribute graphics processing operations among multiple clusters 1314A-1314N in the processing cluster array 1312. In at least one embodiment, portions of the processing cluster array 1312 can be configured to perform different types of processing. For example, in at least one embodiment, a first portion can be configured to perform vertex shading and topology generation, a second portion can be configured to perform tessellation and geometry shading, and a third portion can be configured to perform pixel shading or other screen-space operations to produce a rendered image for display. In at least one embodiment, intermediate data generated by one or more of the clusters 1314A-1314N can be stored in a buffer to allow intermediate data to be transferred between the clusters 1314A-1314N for further processing.
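As a non-limiting illustration, dividing a workload into tasks of approximately equal size can be sketched as follows. The function name `split_evenly` and the representation of a task as a half-open index range are assumptions made for illustration only; an actual scheduler may use any suitable work allocation algorithm.

```python
def split_evenly(num_items, num_clusters):
    """Split num_items work items into num_clusters contiguous ranges
    whose sizes differ by at most one item."""
    base, extra = divmod(num_items, num_clusters)
    ranges = []
    start = 0
    for cluster in range(num_clusters):
        # The first `extra` clusters each take one additional item.
        size = base + (1 if cluster < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

tasks = split_evenly(10, 4)
```

In this sketch, ten work items spread over four clusters yield two ranges of three items followed by two ranges of two items.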

[0111] In at least one embodiment, the processing cluster array 1312 may receive processing tasks to be executed via the scheduler 1310, which receives commands defining the processing tasks from the front end 1308. In at least one embodiment, a processing task may include indices of data to be processed, such as surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). In at least one embodiment, the scheduler 1310 may be configured to fetch the indices corresponding to a task, or may receive the indices from the front end 1308. In at least one embodiment, the front end 1308 may be configured to ensure that the processing cluster array 1312 is configured to be active before the workload specified by an incoming command buffer (e.g., a batch buffer, push buffer, etc.) is initiated.

[0112] In at least one embodiment, each of one or more instances of the parallel processing unit 1302 may be coupled to the parallel processor memory 1322. In at least one embodiment, the parallel processor memory 1322 may be accessed via the memory crossbar switch 1316, which may receive memory requests from the processing cluster array 1312 and the I/O unit 1304. In at least one embodiment, the memory crossbar switch 1316 may access the parallel processor memory 1322 via a memory interface 1318. In at least one embodiment, the memory interface 1318 may include a plurality of partition units (e.g., partition units 1320A, 1320B to 1320N), each of which may be coupled to a portion (e.g., a memory unit) of the parallel processor memory 1322. In at least one embodiment, the number of partition units 1320A-1320N is configured to be equal to the number of memory units, such that the first partition unit 1320A has a corresponding first memory unit 1324A, the second partition unit 1320B has a corresponding second memory unit 1324B, and the Nth partition unit 1320N has a corresponding Nth memory unit 1324N. In at least one embodiment, the number of partition units 1320A-1320N may not be equal to the number of memory units.

[0113] In at least one embodiment, the memory units 1324A-1324N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. In at least one embodiment, the memory units 1324A-1324N may also include 3D stacked memory, including but not limited to high-bandwidth memory (HBM), HBM2e, and HBM3. In at least one embodiment, render targets such as framebuffers or texture maps can be stored across the memory units 1324A-1324N, allowing the partition units 1320A-1320N to write portions of each render target in parallel and thereby use the available bandwidth of the parallel processor memory 1322 efficiently. In at least one embodiment, local instances of the parallel processor memory 1322 may be excluded in favor of a unified memory design that utilizes system memory together with local cache memory.
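As a non-limiting illustration, storing a render target across several memory units so that portions can be written in parallel can be sketched as round-robin striping. The stripe size of 256 bytes, the function name `unit_for_address`, and the round-robin policy are assumptions made for illustration only; the disclosure does not specify an interleaving scheme.

```python
def unit_for_address(addr, num_units, stripe_bytes=256):
    """Map a byte address in a render target to (memory unit index,
    byte offset within that unit) under round-robin striping."""
    stripe = addr // stripe_bytes
    unit = stripe % num_units
    # Each unit holds every num_units-th stripe, packed contiguously.
    local = (stripe // num_units) * stripe_bytes + addr % stripe_bytes
    return unit, local

# Consecutive stripes land on different units, so writes to neighboring
# stripes can proceed through different partition units at the same time.
placements = [unit_for_address(a, 4) for a in (0, 256, 512, 768, 1024)]
```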

[0114] In at least one embodiment, any of the clusters 1314A-1314N in the processing cluster array 1312 can process data to be written to any of the memory units 1324A-1324N within the parallel processor memory 1322. In at least one embodiment, the memory crossbar switch 1316 can be configured to transfer the output of each cluster 1314A-1314N to any partition unit 1320A-1320N or to another cluster 1314A-1314N, which can perform additional processing operations on the output. In at least one embodiment, each cluster 1314A-1314N can communicate with the memory interface 1318 via the memory crossbar switch 1316 to read from or write to various external memory devices. In at least one embodiment, the memory crossbar switch 1316 has a connection to the memory interface 1318 for communicating with the I/O unit 1304, and a connection to a local instance of the parallel processor memory 1322, enabling processing units within different processing clusters 1314A-1314N to communicate with system memory or other memory that is not local to the parallel processing unit 1302. In at least one embodiment, the memory crossbar switch 1316 can use virtual channels to separate traffic flows between the clusters 1314A-1314N and the partition units 1320A-1320N.

[0115] In at least one embodiment, multiple instances of the parallel processing unit 1302 may be provided on a single add-in card, or multiple add-in cards may be interconnected. In at least one embodiment, different instances of the parallel processing unit 1302 may be configured to interoperate, even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, in at least one embodiment, some instances of the parallel processing unit 1302 may include higher-precision floating-point units relative to other instances. In at least one embodiment, a system including one or more instances of the parallel processing unit 1302 or the parallel processor 1300 may be implemented in various configurations and form factors, including but not limited to desktop computers, laptop or handheld personal computers, servers, workstations, game consoles, and/or embedded systems.

[0116] Figure 13B is a block diagram of a partition unit 1320 according to at least one embodiment. In at least one embodiment, the partition unit 1320 is an instance of one of the partition units 1320A-1320N of Figure 13A. In at least one embodiment, the partition unit 1320 includes an L2 cache 1321, a frame buffer interface 1325, and a ROP 1326 (raster operation unit). In at least one embodiment, the L2 cache 1321 is a read/write cache configured to perform load and store operations received from the memory crossbar switch 1316 and the ROP 1326. In at least one embodiment, the L2 cache 1321 outputs read misses and urgent write-back requests to the frame buffer interface 1325 for processing. In at least one embodiment, updates can also be sent to the frame buffer for processing via the frame buffer interface 1325. In at least one embodiment, the frame buffer interface 1325 interfaces with one of the memory units in the parallel processor memory, such as the memory units 1324A-1324N of Figure 13A (e.g., within the parallel processor memory 1322).

[0117] In at least one embodiment, the ROP 1326 is a processing unit that performs raster operations such as stenciling, z-testing, blending, and the like. In at least one embodiment, the ROP 1326 then outputs processed graphics data that is stored in graphics memory. In at least one embodiment, the ROP 1326 includes compression logic for compressing depth or color data written to memory and decompressing depth or color data read from memory. In at least one embodiment, the compression logic may be lossless compression logic utilizing one or more of a variety of compression algorithms. In at least one embodiment, the type of compression performed by the ROP 1326 may vary based on the statistical characteristics of the data to be compressed. For example, in at least one embodiment, delta color compression is performed on depth and color data on a per-tile basis.
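As a non-limiting illustration, one lossless scheme for per-tile data stores a base value followed by the differences between neighboring values, which are small when neighboring pixels are similar. The function names and the flat list representation of a tile are assumptions made for illustration only; the disclosure does not specify the compression algorithm used by the ROP 1326.

```python
def delta_encode(tile):
    """Encode a tile as its first value followed by successive differences."""
    if not tile:
        return []
    encoded = [tile[0]]
    for prev, cur in zip(tile, tile[1:]):
        encoded.append(cur - prev)
    return encoded

def delta_decode(encoded):
    """Recover the original tile by summing the differences back up."""
    tile = []
    total = 0
    for i, value in enumerate(encoded):
        total = value if i == 0 else total + value
        tile.append(total)
    return tile

# A smooth tile yields small differences, which need fewer bits to store.
tile = [100, 101, 103, 103]
encoded = delta_encode(tile)
```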

[0118] In at least one embodiment, the ROP 1326 is included within each processing cluster (e.g., the clusters 1314A-1314N of Figure 13A) instead of within the partition unit 1320. In at least one embodiment, read and write requests for pixel data, rather than pixel fragment data, are transmitted via the memory crossbar switch 1316. In at least one embodiment, the processed graphics data can be displayed on a display device (such as one of the one or more display devices 1210A of Figure 12), routed by the processor 1302 for further processing, or routed for further processing by one of the processing entities within the parallel processor 1300 of Figure 13A.

[0119] Figure 14 illustrates a processing system according to at least one embodiment. In at least one embodiment, system 1400 includes one or more processors 1402 and one or more graphics processors 1408, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 1402 or processor cores 1407. In at least one embodiment, system 1400 is a processing platform included within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices. In at least one embodiment, the one or more graphics processors 1408 include one or more graphics cores 1200.

[0120] In at least one embodiment, system 1400 may include or be integrated into a server-based gaming platform, a game console including a game and media console, a mobile game console, a handheld game console, or an online game console. In at least one embodiment, system 1400 is a mobile phone, smartphone, tablet computing device, or mobile internet device. In at least one embodiment, processing system 1400 may also include components coupled to or integrated into a wearable device, such as a smartwatch wearable device, smart glasses device, augmented reality device, or virtual reality device. In at least one embodiment, processing system 1400 is a television or set-top box device having one or more processors 1402 and a graphical interface generated by one or more graphics processors 1408.

[0121] In at least one embodiment, each of the one or more processors 1402 includes one or more processor cores 1407 for processing instructions that, when executed, perform operations for system and user software. In at least one embodiment, each of the one or more processor cores 1407 is configured to process a specific instruction set 1409. In at least one embodiment, the instruction set 1409 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computation via a Very Long Instruction Word (VLIW). In at least one embodiment, each processor core 1407 may process a different instruction set 1409, which may include instructions that facilitate the emulation of other instruction sets. In at least one embodiment, the processor core 1407 may also include other processing devices, such as a digital signal processor (DSP).

[0122] In at least one embodiment, the processor 1402 includes cache memory 1404. In at least one embodiment, the processor 1402 may have a single internal cache or multiple levels of internal cache. In at least one embodiment, the cache memory is shared among various components of the processor 1402. In at least one embodiment, the processor 1402 also uses an external cache (e.g., a Level 3 (L3) cache or a last-level cache (LLC)) (not shown), which can be shared among the processor cores 1407 using known cache coherence techniques. In at least one embodiment, the processor 1402 additionally includes a register file 1406, which may include different types of registers (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers) for storing different types of data. In at least one embodiment, the register file 1406 may include general-purpose registers or other registers.

[0123] In at least one embodiment, one or more processors 1402 are coupled to one or more interface buses 1410 to transmit communication signals, such as address, data, or control signals, between the processors 1402 and other components in the system 1400. In at least one embodiment, the interface bus 1410 may be a processor bus, such as a version of the Direct Media Interface (DMI) bus. In at least one embodiment, the interface bus 1410 is not limited to the DMI bus and may include one or more peripheral component interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. In at least one embodiment, one or more processors 1402 include an integrated memory controller 1416 and a platform controller hub 1430. In at least one embodiment, the memory controller 1416 facilitates communication between memory devices and other components of the system 1400, while the platform controller hub (PCH) 1430 provides connectivity to I / O devices via a local I / O bus.

[0124] In at least one embodiment, the memory device 1420 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device with suitable performance for use as processor memory. In at least one embodiment, the memory device 1420 may operate as system memory of system 1400 for storing data 1422 and instructions 1421 for use when the one or more processors 1402 execute an application or process. In at least one embodiment, the memory controller 1416 is also coupled to an optional external graphics processor 1412, which may communicate with the one or more graphics processors 1408 of the processor 1402 to perform graphics and media operations. In at least one embodiment, a display device 1411 may be connected to the one or more processors 1402. In at least one embodiment, the display device 1411 may include one or more internal display devices, such as in mobile electronic devices or laptop devices, or external display devices attached via a display interface (e.g., DisplayPort). In at least one embodiment, the display device 1411 may include a head-mounted display (HMD), such as a stereoscopic display device for virtual reality (VR) or augmented reality (AR) applications.

[0125] In at least one embodiment, the platform controller hub 1430 enables peripheral devices to connect to the memory device 1420 and the processor 1402 via a high-speed I/O bus. In at least one embodiment, the I/O peripheral devices include, but are not limited to, an audio controller 1446, a network controller 1434, a firmware interface 1428, a wireless transceiver 1426, a touch sensor 1425, and a data storage device 1424 (e.g., a hard disk drive, flash memory, etc.). In at least one embodiment, the data storage device 1424 may be connected via a storage interface (e.g., SATA) or via a peripheral bus, such as a peripheral component interconnect bus (e.g., PCI, PCIe). In at least one embodiment, the touch sensor 1425 may include a touchscreen sensor, a pressure sensor, or a fingerprint sensor. In at least one embodiment, the wireless transceiver 1426 may be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver, such as a 3G, 4G, or LTE transceiver. In at least one embodiment, the firmware interface 1428 enables communication with the system firmware and may be, for example, a Unified Extensible Firmware Interface (UEFI). In at least one embodiment, the network controller 1434 can enable a network connection to a wired network. In at least one embodiment, a high-performance network controller (not shown) is coupled to the interface bus 1410. In at least one embodiment, the audio controller 1446 is a multi-channel high-definition audio controller. In at least one embodiment, system 1400 includes an optional legacy I/O controller 1440 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to system 1400. In at least one embodiment, the platform controller hub 1430 can also be connected to one or more Universal Serial Bus (USB) controllers 1442, which connect input devices such as keyboard and mouse 1443 combinations, a camera 1444, or other USB input devices.

[0126] In at least one embodiment, instances of the memory controller 1416 and platform controller hub 1430 may be integrated into a discrete external graphics processor, such as external graphics processor 1412. In at least one embodiment, the platform controller hub 1430 and/or the memory controller 1416 may be external to one or more processors 1402. For example, in at least one embodiment, system 1400 may include an external memory controller 1416 and platform controller hub 1430, which may be configured as a memory controller hub and peripheral controller hub in a system chipset communicating with one or more processors 1402.

[0127] The embodiments presented herein can prefetch the data for each block computation used in a GEMM operation and can offload the accumulation to at least one target DPU, thereby allowing the fetching, computation, and accumulation to execute concurrently and efficiently.
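As a non-limiting illustration only, the overlap of fetching, computation, and accumulation may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function names (`get_tile`, `tile_matmul`, `accumulate`, `pipelined_gemm`) are hypothetical, a local tile copy stands in for a remote prefetch, and a single-worker thread pool stands in for the offload target (e.g., a DPU). Python threads show the structure of the pipeline rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def get_tile(m, r0, c0, bs):
    """Copy a tile of at most bs x bs out of m (stand-in for a remote prefetch)."""
    return [row[c0:c0 + bs] for row in m[r0:r0 + bs]]


def tile_matmul(a, b):
    """Local block-level multiply of two small tiles."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def accumulate(c, r0, c0, partial):
    """Add a partial product into the result block (the offloaded step)."""
    for i, prow in enumerate(partial):
        for j, v in enumerate(prow):
            c[r0 + i][c0 + j] += v


def pipelined_gemm(a, b, bs):
    """Blocked C = A x B for non-empty matrices given as lists of rows."""
    n, kdim, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # One (i, j, k) step per partial product of result block (i, j).
    steps = [(i, j, k) for i in range(0, n, bs)
             for j in range(0, m, bs) for k in range(0, kdim, bs)]

    def fetch(step):
        i, j, k = step
        return get_tile(a, i, k, bs), get_tile(b, k, j, bs)

    # One worker per stage; the single offload worker serialises updates to c.
    with ThreadPoolExecutor(1) as fetcher, ThreadPoolExecutor(1) as offload:
        pending = fetcher.submit(fetch, steps[0])
        acc_futures = []
        for idx, (i, j, k) in enumerate(steps):
            ta, tb = pending.result()              # wait for prefetched tiles
            if idx + 1 < len(steps):               # start prefetching the next step
                pending = fetcher.submit(fetch, steps[idx + 1])
            partial = tile_matmul(ta, tb)          # compute the current step
            # Hand accumulation to the offload worker and move on immediately.
            acc_futures.append(offload.submit(accumulate, c, i, j, partial))
        for f in acc_futures:
            f.result()                             # drain and surface any errors
    return c
```

In this sketch, while step k is being multiplied, the tiles for step k+1 are being fetched and the partial product of step k-1 is being accumulated, which mirrors the three overlapped stages described above.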

[0128] Other variations are within the spirit of this disclosure. Therefore, although the disclosed technology is readily adaptable to various modifications and alternative constructions, certain embodiments thereof are illustrated in the accompanying drawings and have been described in detail above. However, it should be understood that the disclosure is not intended to be limited to one or more specific forms disclosed, but rather, it is intended to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of this disclosure as defined in the appended claims.

[0129] Unless otherwise stated or clearly contradicted by context, the terms “a,” “an,” and “the,” and similar referents, used in the context of describing the disclosed embodiments (particularly in the context of the appended claims), should be interpreted as covering both the singular and the plural, and not as a definition of a term. Unless otherwise stated, the terms “comprising,” “having,” “including,” and “containing” should be interpreted as open-ended terms (meaning “including, but not limited to”). The term “connected,” when unmodified and referring to physical connections, should be interpreted as partially or wholly contained within, attached to, or joined together, even if there is something intervening. Unless otherwise indicated herein, recitation of ranges of values herein is intended only as a shorthand for referring individually to each separate value falling within the range, and each separate value is incorporated into the specification as if it were individually recited herein. In at least one embodiment, unless otherwise indicated or contradicted by context, the use of the term “set” (e.g., “a set of items”) or “subset” should be interpreted as a non-empty collection comprising one or more members. Furthermore, unless otherwise indicated or contradicted by context, the term “subset” of a corresponding set does not necessarily denote a proper subset of the corresponding set; rather, the subset and the corresponding set may be equal.

[0130] Unless otherwise explicitly stated or clearly contradicted by context, conjunctive phrases of the form “at least one of A, B, and C” are understood in context to indicate that an item, term, etc., may be A or B or C, or any non-empty subset of the set of A, B, and C. For example, in an illustrative example of a set with three members, the conjunctive phrase “at least one of A, B, and C” refers to any of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Therefore, such conjunctive language is generally not intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present. Additionally, unless otherwise stated or contradicted by context, the term “plurality” indicates a state of being plural (e.g., “a plurality of items” indicates multiple items). In at least one embodiment, the number of items in a plurality is at least two, but may be more when so indicated either explicitly or by context. Furthermore, unless otherwise stated or clearly understood from context, the phrase “based on” means “based at least in part on” rather than “based solely on”.

[0131] Unless otherwise indicated herein or clearly contradicted by context, the operations of the processes described herein may be performed in any suitable order. In at least one embodiment, processes such as those described herein (or variations and/or combinations thereof) are performed under the control of one or more computer systems configured with executable instructions and are implemented as code (e.g., executable instructions, one or more computer programs, or one or more application programs) executing collectively on one or more processors, by hardware, or by combinations thereof. In at least one embodiment, the code is stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. In at least one embodiment, the computer-readable storage medium is a non-transitory computer-readable storage medium that excludes transient signals (e.g., propagating transient electrical or electromagnetic transmissions) but includes non-transitory data storage circuitry (e.g., buffers, caches, and queues) within transceivers of transient signals. In at least one embodiment, code (e.g., executable code or source code) is stored on a set of one or more non-transitory computer-readable storage media (or other memory for storing executable instructions) having stored thereon executable instructions that, when executed by one or more processors of a computer system (i.e., as a result of being executed), cause the computer system to perform the operations described herein. In at least one embodiment, the set of non-transitory computer-readable storage media comprises a plurality of non-transitory computer-readable storage media, and one or more of the individual non-transitory storage media lack all of the code, while the plurality of non-transitory computer-readable storage media collectively store all of the code. In at least one embodiment, the executable instructions are executed such that different instructions are executed by different processors; for example, the non-transitory computer-readable storage media store the instructions, and a main central processing unit (“CPU”) executes some instructions while a graphics processing unit (“GPU”) executes other instructions. In at least one embodiment, different components of the computer system have separate processors, and the different processors execute different subsets of the instructions.

[0132] In at least one embodiment, an arithmetic logic unit is a set of combinational logic circuits that takes one or more inputs to produce a result. In at least one embodiment, a processor uses an arithmetic logic unit to implement mathematical operations, such as addition, subtraction, or multiplication. In at least one embodiment, an arithmetic logic unit is used to implement logical operations, such as logical AND/OR or XOR. In at least one embodiment, an arithmetic logic unit is stateless and made of physical switching elements, such as semiconductor transistors arranged to form logic gates. In at least one embodiment, an arithmetic logic unit may operate internally as a stateful logic circuit with an associated clock. In at least one embodiment, an arithmetic logic unit may be configured as an asynchronous logic circuit whose internal state is not maintained in an associated set of registers. In at least one embodiment, a processor uses an arithmetic logic unit to combine operands stored in one or more registers of the processor and produce an output that can be stored by the processor in another register or memory location.

[0133] In at least one embodiment, as a result of processing an instruction retrieved by the processor, the processor presents one or more inputs or operands to the arithmetic logic unit (ALU), causing the ALU to produce a result based at least in part on an instruction code provided to inputs of the ALU. In at least one embodiment, the instruction code provided by the processor to the ALU is based at least in part on the instruction executed by the processor. In at least one embodiment, combinational logic in the ALU processes the inputs and produces an output that is placed on a bus within the processor. In at least one embodiment, the processor selects a destination register, memory location, output device, or output storage location on the output bus, so that clocking the processor causes the result produced by the ALU to be sent to the desired location.
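As a non-limiting illustration, the relationship between an instruction code, the operands, and the destination may be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: the opcode names, the 32-bit width, the register names, and the functions `alu` and `execute` are hypothetical and do not correspond to any particular instruction set.

```python
import operator

# Hypothetical opcode table; a real instruction set defines its own encodings.
OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "MUL": operator.mul,
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
}


def alu(opcode, a, b, width=32):
    """Stateless model: the result depends only on the opcode and operands."""
    mask = (1 << width) - 1          # wrap the result to the register width
    return OPS[opcode](a, b) & mask


def execute(registers, instruction):
    """Processor side: read operands, drive the ALU, write the destination."""
    opcode, dst, src1, src2 = instruction
    registers[dst] = alu(opcode, registers[src1], registers[src2])
    return registers


# Example: r2 <- r0 - r1
regs = execute({"r0": 6, "r1": 3, "r2": 0}, ("SUB", "r2", "r0", "r1"))
```

Here `alu` plays the role of the combinational logic, while `execute` plays the role of the processor that selects the operands and the destination register.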

[0134] Within the scope of this application, the term Arithmetic Logic Unit or ALU is used to refer to any computational logic circuit that processes operands to produce results. For example, in this document, the term ALU may refer to a floating-point unit, a DSP, a tensor core, a shader core, a coprocessor, or a CPU.

[0135] Therefore, in at least one embodiment, the computer system is configured to implement one or more services that individually or collectively perform the operations of the processes described herein, and such a computer system is configured with suitable hardware and/or software that enables the performance of those operations. Furthermore, a computer system implementing at least one embodiment of this disclosure is a single device and, in another embodiment, is a distributed computer system comprising multiple devices that operate differently, such that the distributed computer system performs the operations described herein and no single device performs all of the operations.

[0136] The use of any and all examples or exemplary language (e.g., “such as”) provided herein is intended only to better illustrate embodiments of this disclosure and does not impose a limitation on the scope of the disclosure unless otherwise claimed. No language in the specification should be construed as indicating that any unclaimed element is essential to the practice of the disclosure.

[0137] All references cited herein, including publications, patent applications, and patents, are incorporated herein by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth herein in its entirety.

[0138] The terms “coupled” and “connected”, and their derivatives, may be used in the specification and claims. It should be understood that these terms may not be intended to be synonyms with each other. Rather, in certain examples, “connected” or “coupled” may be used to indicate that two or more elements are in direct or indirect physical or electrical contact with each other. “Coupled” may also mean that two or more elements are not in direct contact with each other, but still cooperate or interact with each other.

[0139] Unless otherwise expressly stated, it will be understood that throughout this specification, terms such as “processing,” “calculating,” “operating,” “determining,” etc., refer to the actions and/or processes of a computer or computing system, or a similar electronic computing device, that manipulate and/or transform data represented as physical quantities (e.g., electronic quantities) in the registers and/or memory of the computing system into other data similarly represented as physical quantities in the memory, registers, or other such information storage, transmission, or display devices of the computing system.

[0140] Similarly, the term “processor” can refer to any device or part of a device that processes electronic data from registers and/or memory and converts that electronic data into other electronic data that can be stored in registers and/or memory. As a non-limiting example, a “processor” can be a CPU or a GPU. A “computing platform” can include one or more processors. As used herein, a “software” process can include, for example, software and/or hardware entities that perform work over time, such as tasks, threads, and intelligent agents. Likewise, each process can refer to multiple processes that execute instructions sequentially or in parallel, continuously or intermittently. In at least one embodiment, the terms “system” and “method” are used interchangeably herein, insofar as a system can embody one or more methods and a method can be considered a system.

[0141] In this document, reference may be made to acquiring, collecting, receiving, or inputting analog or digital data into a subsystem, computer system, or computer-implemented machine. In at least one embodiment, the process of acquiring, collecting, receiving, or inputting analog or digital data can be accomplished in various ways, such as by receiving data as a parameter of a function call or of a call to an application programming interface. In at least one embodiment, the process of acquiring, collecting, receiving, or inputting analog or digital data can be accomplished by transmitting data via a serial or parallel interface. In at least one embodiment, the process of acquiring, collecting, receiving, or inputting analog or digital data can be accomplished by transmitting data from a providing entity to an acquiring entity via a computer network. In at least one embodiment, reference may also be made to providing, outputting, transmitting, sending, or presenting analog or digital data. In various examples, the process of providing, outputting, transmitting, sending, or presenting analog or digital data can be implemented by transmitting data as an input or output parameter of a function call, an application programming interface, or an inter-process communication mechanism.

[0142] While this document describes example implementations of the described technologies, other architectures can be used to implement the described functionality and are intended to fall within the scope of this disclosure. Furthermore, although specific assignments of responsibilities have been defined above for descriptive purposes, various functions and responsibilities may be assigned and divided in different ways depending on the circumstances.

[0143] Furthermore, although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter claimed in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.

Claims

1. A processor, comprising: one or more logic units to: for a matrix multiplication operation to be performed, prefetch first data to be used for a first calculation for an identified block of a result matrix; perform the first calculation while prefetching second data to be used for a second calculation for the identified block of the result matrix; perform the second calculation while prefetching third data to be used for a third calculation for the identified block of the result matrix; during performance of the second calculation, perform an accumulation operation for the identified block using a result of the first calculation; and continue to perform the respective prefetching, matrix calculation, and accumulation operations concurrently and in parallel for the identified block and other blocks of the result matrix until the matrix multiplication operation is completed.

2. The processor according to claim 1, wherein the accumulation operation is performed using at least one of an offload processor, a central processing unit (CPU), or a graphics processing unit (GPU).

3. The processor according to claim 1, wherein the matrix multiplication operation is performed on blocks of at least a first matrix and a second matrix, and the number of calculations to be performed for a respective block of the result matrix depends in part on the dimensions of the first matrix and the second matrix.

4. The processor according to claim 1, wherein at least the first calculation, the second calculation, and the third calculation for the identified block of the result matrix are performed in parallel.

5. The processor according to claim 1, wherein the one or more logic units are further to use one or more local buffers to store the results of the calculations to be used in the accumulation operation.

6. The processor according to claim 1, wherein the one or more logic units are further to refresh the stored data after the respective prefetching, calculation, and accumulation operations.

7. The processor according to claim 1, wherein the calculations include local block-level matrix calculations, and the prefetching and the accumulation operation are non-blocking.

8. The processor according to claim 1, wherein the prefetching is performed to obtain a data instance before any calculation is performed using the obtained data.

9. The processor according to claim 1, wherein the calculations for each data block are evenly distributed across a set of processing units.

10. A computer-implemented method, comprising: prefetching, before performing a block multiplication calculation on at least a first matrix and a second matrix, a data instance for the block multiplication calculation; performing the block multiplication calculation using the prefetched data instance and one or more processing units; performing, in parallel with the prefetching and the performing of the block multiplication calculation, an accumulation operation using a result of the block multiplication calculation and at least one offload processing unit; and continuing to perform the respective prefetching, block multiplication calculation, and accumulation operation concurrently and in parallel for a current block and remaining blocks of a result matrix until a matrix multiplication operation has been completed.

11. The computer-implemented method according to claim 10, wherein the at least one offload processing unit includes at least one data processing unit (DPU).

12. The computer-implemented method according to claim 10, wherein the performing of the block multiplication calculations for each data block is distributed uniformly or non-uniformly across the one or more processing units.

13. The computer-implemented method according to claim 10, further comprising: using one or more local buffers to store the results of the block multiplication calculations for use in the accumulation operation.

14. The computer-implemented method according to claim 10, wherein the block multiplication calculation includes a local block-level matrix calculation, and the prefetching and the accumulation operation are non-blocking.

15. The computer-implemented method according to claim 10, wherein the block multiplication calculation is performed on respective blocks of the first matrix and the second matrix, and the blocks of the first matrix and the second matrix are multiplied as part of a general matrix multiplication (GEMM) operation.

16. A system comprising one or more processing units, the one or more processing units being configured to perform a matrix multiplication by, at least partially concurrently and in parallel: prefetching data to be used for respective matrix calculations for blocks of a result matrix; performing the respective matrix calculations using one or more target processing units; and performing an accumulation operation using results of the respective matrix calculations and at least one offload processing unit.

17. The system according to claim 16, wherein the at least one offload processing unit includes at least one data processing unit (DPU).

18. The system according to claim 16, wherein the performing of the matrix calculations for each data block is evenly distributed across the one or more target processing units.

19. The system according to claim 16, wherein the one or more processing units are further configured to use one or more local buffers to store the results of the respective matrix calculations to be used for the accumulation operation, wherein the respective matrix calculations are performed on respective blocks of a first matrix and a second matrix, and the respective blocks of the first matrix and the second matrix are multiplied as part of a general matrix multiplication (GEMM) operation.

20. The system according to claim 16, wherein the system is at least one of the following: a system for performing simulation operations; a system for performing simulations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering graphics output; a system for performing deep learning operations; a system for performing generative AI operations using a large language model (LLM); a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system containing one or more virtual machines (VMs); a system implemented at least partially in a data center; a system for using simulation to perform hardware testing; a system for performing generative operations using a language model (LM); a system for generating synthetic data; a collaborative content creation platform for 3D assets; or a system implemented at least partially using cloud computing resources.