Rectangular matrix multiplier

The system with unequal dimensioned matrix multipliers and coupled memory devices addresses inefficiencies in large-scale matrix computations by reducing memory requirements and data transfer overhead, enhancing throughput and efficiency.

WO2026128268A1 · PCT designated stage · Publication Date: 2026-06-18 · MAJESTIC LABS

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: MAJESTIC LABS
Filing Date: 2025-12-03
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

Current matrix multipliers face inefficiencies due to increased memory requirements and data transfer overhead when handling large-scale matrix computations, particularly in AI workloads, leading to reduced performance and throughput.

Method used

A system comprising a first memory device and a second memory device, coupled with a general-purpose AI unit, utilizes a plurality of matrix multipliers with unequal dimensions to perform multiply-and-accumulate operations efficiently, reducing the need for increased batch sizes and cache capacities.

Benefits of technology

This approach enhances computational throughput and reduces memory utilization, allowing for efficient processing of large-scale matrix operations without increasing batch size, thereby improving overall system efficiency.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

Systems and methods are provided herein for rectangular matrix multiplication. The system may include a first matrix having a first and second dimension and stored on a first memory device. The system may further include a second matrix having the second and a third dimension and stored on a second memory device. The system may also include a general-purpose artificial intelligence unit (GP-AIU) coupled to the first memory device and the second memory device. The system may additionally include a plurality of matrix multipliers coupled to the GP-AIU and having a fourth dimension and a fifth dimension, which are configured to generate a third matrix having the first and third dimension.

Description

Agent Ref.: 000472-0006-W01

RECTANGULAR MATRIX MULTIPLIER

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application No. 63/730,883, filed December 11, 2024, which is hereby incorporated by reference herein in its entirety.

BACKGROUND

[0002] The present disclosure relates to advanced computation in matrix multiplier engines, and in particular, to the size of matrix multiplication units.

SUMMARY

[0003] In recent years, artificial intelligence (AI) and machine learning (ML) have undergone rapid advancement, due in part to the emergence of large-scale deep learning models such as large language models (LLMs). These models rely heavily on matrix multiplication operations that demand significant computational and memory resources. For example, LLMs use extensive neural networks that require substantial computing power and large memory capacities for optimal performance. As the scale and complexity of AI workloads continue to increase, there is a growing demand for improved hardware architectures and processing techniques capable of efficiently performing high-dimensional matrix computations.

[0004] Currently, input data is represented as one or more vectors (e.g., embedding vectors, token vectors) that are arranged (e.g., batched) into matrix form. The input matrix may be multiplied by a corresponding weight matrix (e.g., which may include parameters that have been configured to perform a transformation or inference task). The input matrix and weight matrix may be partitioned into smaller submatrices that are processed by one or more matrix multipliers (e.g., matrix multiplication units) of a fixed dimension (e.g., 16x16, 64x64, etc.) of an array of processing elements (PEs). Each matrix multiplier may perform multiply-and-accumulate (MAC) operations on the submatrices to generate partial results, which are combined to form the output matrix.

[0005] In some approaches, to meet the growing computational demand of AI workloads, the sizes of the matrix multipliers have increased to allow more MAC operations to be performed concurrently. However, while increasing the matrix multiplier dimensions may allow greater parallel computation, it requires increasing the batch size. Increasing the batch size, in turn, requires larger cache capacities and higher memory bandwidth to transfer data between processing units (e.g., PEs) and memory, which can lead to reduced overall system efficiency and diminished performance gains. For example, in transformer-based architectures, doubling the batch size may double the amount of data that must be fetched or written for associated key and value (KV) matrices used in attention computations, thereby further increasing memory capacity requirements, bandwidth requirements, and data transfer overhead.

[0006] In certain implementations, matrix multiplication operations involving token vectors and attention projection matrices (e.g., query, key, and value matrices) may account for the highest number of floating-point operations (FLOPs) within a model. For example, a token vector may be represented as a 1 x 16,000 vector, which, for 16-bit floating-point (FP16) precision, may require 32,000 bytes. For a batch of 64 token vectors, the resulting input matrix may have dimensions of 64 x 16,000 and occupy approximately 2 MB (2,000,000 bytes) in FP16 format. A weight matrix may have dimensions of 16,000 x 16,000. In some examples, query, key, value, and/or output projection matrices may be square matrices with dimensions equal to the model size (e.g., with both the number of rows and columns equal to the size of the model’s hidden representation). For example, in the LLaMA 405B model, where the model dimension is 16,000 and assuming a bfloat16 (BF16) data type, each of these projection matrices may require approximately 512 MB of memory. In such an example, the feed-forward network (FFN) matrices W1 and W2 may have dimensions corresponding to the model size and the FFN size, where the FFN size is up to four times the model dimension. Accordingly, for the same model, each of W1 and W2 may require up to 2 GB of memory. These large memory requirements for operations on token vectors and attention projection matrices may pose challenges for efficient processing and storage during inference. Memory bandwidth and cache capacity may become limiting factors, causing stalls in data movement and reducing achievable throughput. Efficient use of memory and computational resources may be critical to enable practical deployment of large models.
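As an illustrative, non-limiting sketch, the memory figures in the preceding paragraph can be reproduced with simple arithmetic. The helper name `matrix_bytes` and the constant names are hypothetical and chosen only for this example; the sketch assumes dense storage at 2 bytes per element (FP16 or BF16) and takes the FFN size at its stated upper bound of four times the model dimension.

```python
BYTES_PER_ELEMENT = 2  # assumption: FP16 and BF16 both use 16 bits per value


def matrix_bytes(rows: int, cols: int, bytes_per_element: int = BYTES_PER_ELEMENT) -> int:
    """Return the storage needed for a dense rows x cols matrix."""
    return rows * cols * bytes_per_element


MODEL_DIM = 16_000        # model (hidden) dimension used in the text
BATCH = 64                # token vectors per batch
FFN_DIM = 4 * MODEL_DIM   # upper bound on the FFN size given in the text

token_vector = matrix_bytes(1, MODEL_DIM)        # one 1 x 16,000 token vector
input_matrix = matrix_bytes(BATCH, MODEL_DIM)    # 64 x 16,000 batch
projection = matrix_bytes(MODEL_DIM, MODEL_DIM)  # one Q, K, V, or output projection
ffn_weight = matrix_bytes(MODEL_DIM, FFN_DIM)    # W1 or W2 at the upper bound

for name, size in [
    ("token vector", token_vector),
    ("input matrix", input_matrix),
    ("projection matrix", projection),
    ("FFN weight matrix", ffn_weight),
]:
    print(f"{name}: {size:,} bytes")
```

The exact values (32,000 bytes, 2,048,000 bytes, 512,000,000 bytes, and 2,048,000,000 bytes) agree with the rounded figures of approximately 2 MB, 512 MB, and 2 GB given above.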

[0007] In view of these deficiencies, there exists a need for matrix multipliers that can efficiently perform large-scale multiply-accumulate operations. Accordingly, techniques are disclosed herein for a system to provide increased performance in matrix multiplication engines without increasing the batch size required (e.g., without increasing the KV cache size required). In accordance with certain representative embodiments, a system comprises a first memory device (e.g., on-chip static random-access memory (SRAM), scratchpad memory, a register file, on-chip cache such as L1, L2, or L3 cache, etc.), a second memory device (e.g., off-chip high bandwidth memory (HBM), dynamic random-access memory (DRAM), non-volatile or volatile storage, etc.), and a general-purpose artificial intelligence unit (GP-AIU) coupled to the first memory device and the second memory device. The first memory device may store a first matrix (e.g., an input activation matrix, embedding matrix, token representation matrix, or the like) having a first dimension and a second dimension. For example, the first matrix may correspond to a plurality of instances. In some embodiments, the first matrix may be transposed from an input matrix such that the first and second dimensions are interchanged. The second memory device may store a second matrix (e.g., a weight matrix, parameter matrix, coefficient matrix, or the like) having the second dimension and a third dimension. In some examples, the second matrix may correspond to feature dimensions associated with the plurality of input instances.

[0008] A plurality of matrix multipliers may be coupled to the GP-AIU. The plurality of matrix multipliers may be configured to generate a third matrix having the first dimension and the third dimension (e.g., the first dimension from the first matrix and the third dimension from the second matrix). In some embodiments, each matrix multiplier of the plurality of matrix multipliers may be configured to perform multiply-and-accumulate operations between respective portions of the first matrix and the second matrix. Each matrix multiplier of the plurality of matrix multipliers may have a fourth dimension and a fifth dimension. In some embodiments, the fifth dimension is greater than (e.g., twice) the fourth dimension. In some embodiments, at least one or more of the first dimension, the second dimension, or the third dimension, is divisible by the fourth dimension. In some embodiments, each matrix multiplier of the plurality of matrix multipliers is configured as a rectangular array (e.g., systolic array) of processing elements. The fourth and fifth dimension may correspond to the number of processing elements. For example, the fourth dimension may correspond to 64 processing elements and the fifth dimension may correspond to 128 processing elements. A processing element may correspond to a node in the rectangular systolic array, and may be configured to perform the MAC operations and to transmit partial results to one or more adjacent processing elements.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] The accompanying drawings provide additional details related to some embodiments of the disclosure described herein. The drawings are provided for purposes of illustration only and merely depict typical or example embodiments. For example, FIGS. 1-5 may display one or more systems, portions of systems, appliances, portions of appliances, microelectronic devices, portions of microelectronic devices, computing systems, processor architectures, artificial intelligence accelerators, portions of matrix multiplication engines, processing elements, memory hierarchies, communication fabrics, connectors, portions of connectors, interconnects, and/or portions of interconnects in accordance with embodiments of the disclosure. These drawings are provided to facilitate an understanding of the concepts disclosed herein and should not be considered limiting of the breadth, scope, or applicability of these concepts. It should be noted that for clarity and ease of illustration, these drawings are not necessarily made to scale.

[0010] FIG. 1 is a diagram depicting a circuit topology of a matrix multiplier, in accordance with certain embodiments.

[0011] FIG. 2A is a diagram depicting an example of matrix-matrix multiplication, in accordance with certain embodiments.

[0012] FIG. 2B is a diagram depicting an example of a striped matrix-matrix multiplication, in accordance with certain embodiments.

[0013] FIG. 3 is a diagram depicting another example of matrix-matrix multiplication, in accordance with certain embodiments.

[0014] FIG. 4 is a diagram depicting an example of parallelized matrix-matrix multiplication across multiple matrix multiplication engines (MMEs), in accordance with certain embodiments.

[0015] FIG. 5 depicts a system of devices for implementing matrix-matrix multiplication, in accordance with certain embodiments.

DETAILED DESCRIPTION

[0016] FIG. 1 is a diagram depicting a circuit topology of a matrix multiplier 100, in accordance with certain embodiments. In some embodiments, matrix multiplier 100 may be implemented as a two-dimensional systolic array of processing elements (PEs) configured to perform matrix-matrix multiplication operations. As shown in FIG. 1, for purposes of illustration and simplicity, matrix multiplier 100 is depicted as a 4x4 array of PEs. Each PE may correspond to a node in the systolic array. Each cell (e.g., PE 101) of the matrix multiplier 100 may be configured to perform multiply-and-accumulate (MAC) operations using input data received from adjacent PEs or from an input activation or weight matrix stream. The partial results generated by any respective PE may be propagated through the array and accumulated to produce elements of an output matrix. For example, PE 101 receives input 102 (e.g., element IN left [0]) and 104 (e.g., element IN up [3]) from the adjacent cells. Inputs 102 and 104 may be multiplied together and added to the locally accumulated sum, as shown in 106 (e.g., element C[0,3]). The input value may then shift to the next cell or propagate to a neighboring PE (e.g., downward or to an adjacent PE within the array) and the partial product continues to accumulate to generate a complete output matrix element.
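As an illustrative, non-limiting sketch, the dataflow described for FIG. 1 can be modeled in software as an output-stationary systolic array: each PE keeps a local accumulator, multiplies the value arriving from the left by the value arriving from above, and forwards both operands to its right and lower neighbors. The one-cycle-per-row and one-cycle-per-column input skew, and the function names `systolic_matmul` and `reference_matmul`, are assumptions of this sketch and are not taken from the disclosure.

```python
def reference_matmul(a, b):
    """Plain triple-loop product, used only to check the systolic model."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def systolic_matmul(a, b):
    """Cycle-level model of a rows x cols output-stationary PE array.

    a is rows x inner (fed from the left), b is inner x cols (fed from above).
    Works for rectangular arrays (rows != cols) as well as square ones.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]      # local accumulator per PE
    a_reg = [[0] * cols for _ in range(rows)]    # operand moving left to right
    b_reg = [[0] * cols for _ in range(rows)]    # operand moving top to bottom

    for t in range(inner + rows + cols - 2):     # cycles until the last PE finishes
        for i in range(rows):
            for j in range(cols - 1, 0, -1):     # shift a-operands one PE to the right
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                            # row i is delayed by i cycles
            a_reg[i][0] = a[i][k] if 0 <= k < inner else 0
        for j in range(cols):
            for i in range(rows - 1, 0, -1):     # shift b-operands one PE downward
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                            # column j is delayed by j cycles
            b_reg[0][j] = b[k][j] if 0 <= k < inner else 0
        for i in range(rows):
            for j in range(cols):                # every PE performs one MAC per cycle
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


a = [[1, 2, 3], [4, 5, 6]]                                   # 2 x 3
b = [[7, 8, 9, 10], [11, 12, 13, 14], [15, 16, 17, 18]]      # 3 x 4
print(systolic_matmul(a, b) == reference_matmul(a, b))
```

In this model, a 2 x 4 array of PEs produces a 2 x 4 output tile, which is the small-scale analogue of a rectangular multiplier whose column count exceeds its row count.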

[0017] Although FIG. 1 illustrates a 4x4 array for simplicity, the matrix multiplier 100 may include any suitable array size or dimension (e.g., 8x8, 16x16, 64x64) depending on implementation requirements. In certain representative embodiments, the matrix multiplier 100 may be set as a rectangle such that a first dimension is larger than (e.g., twice as large as) a second dimension. Further aspects of this configuration are shown and described with respect to FIG. 2B.

[0018] FIG. 2A is a diagram depicting an example of matrix-matrix multiplication, in accordance with certain embodiments. In matrix-matrix multiplication example 200, matrix A 202 has a first dimension (Dim 1) and a second dimension (Dim 2), matrix B 204 has the second dimension (Dim 2) and a third dimension (Dim 3), and matrix C 206 has the third dimension (Dim 3) and the first dimension (Dim 1). Matrix C 206 may be generated according to a matrix multiplication operation of the form $C = A^T \times B$, where each element $C(m,n)$ is computed as the sum of products of corresponding elements from matrix $A^T$ (matrix A 202 transposed) and matrix B 204. In such embodiments, index m of element $C(m,n)$ may correspond to a row position of the transposed matrix $A^T$ (e.g., matrix A 202 transposed), and index n may correspond to a column position of matrix B 204. In some embodiments, the first dimension is 256 units, the second dimension is 1024 units, and the third dimension is 1024 units. The units may correspond to individual data elements (e.g., numeric values, feature activation values, coefficients, tokens, parameters, intermediate computation values, and/or similar such elements) represented within the respective matrices. In some embodiments, the units correspond to data values processed by respective PEs of a matrix multiplier (e.g., matrix multiplier 100 of FIG. 1), where each unit represents a single operand or accumulation element handled by the compute array.
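As an illustrative, non-limiting sketch, the product of the transposed matrix A and matrix B can be written directly from the index definitions above. The sketch assumes matrix A is stored with Dim 2 rows and Dim 1 columns, so that its transpose has Dim 1 rows; the function name `transpose_times` and the small example values are hypothetical and stand in for the 256 and 1024 unit dimensions.

```python
def transpose_times(a, b):
    """Compute C = A^T x B for a of shape Dim 2 x Dim 1 and b of shape Dim 2 x Dim 3."""
    dim2, dim1, dim3 = len(a), len(a[0]), len(b[0])
    assert len(b) == dim2, "A and B must share Dim 2"
    c = [[0] * dim3 for _ in range(dim1)]
    for m in range(dim1):        # row of A^T, i.e. column of A
        for n in range(dim3):    # column of B
            # sum of products along the shared dimension (Dim 2)
            c[m][n] = sum(a[k][m] * b[k][n] for k in range(dim2))
    return c


a = [[1, 2], [3, 4], [5, 6]]                        # Dim 2 = 3, Dim 1 = 2
b = [[1, 0, 2, 1], [0, 1, 1, 2], [1, 1, 0, 0]]      # Dim 2 = 3, Dim 3 = 4
c = transpose_times(a, b)
print(len(c), "x", len(c[0]))                       # Dim 1 x Dim 3
```

The result has Dim 1 rows and Dim 3 columns, with the shared dimension (Dim 2) summed away.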

[0019] In certain implementations, the dimensions of matrix A 202 and matrix B 204 may be sufficiently large such that direct computation is inefficient and/or exceeds available on-chip memory capacity. Accordingly, the matrices may be divided (or “striped”) into smaller submatrices or tiles that may be independently processed by a plurality of matrix multipliers of fixed dimension, as shown in FIG. 2B. For example, the overall multiplication operation may be decomposed into a sequence of smaller matrix multiplications whose partial results are accumulated to form the full output matrix C 206. As an example, matrix A 202 may include 16,000 rows of elements and 64 columns of elements, and matrix B 204 may include 16,000 rows of elements and 16,000 columns of elements.

[0020] FIG. 2B is a diagram depicting an example of a striped matrix-matrix multiplication, in accordance with certain embodiments. In matrix-matrix multiplication example 201, matrix A 208 (e.g., which may be the transpose of matrix A 202, such that the dimensions are interchanged) and matrix B 204 may be partitioned (e.g., striped, tiled) such that the dimensions of the partitions correspond to the dimensions of the matrix multiplier to allow each submatrix pair (e.g., tile 214 and tile 216) to be fully mapped onto the processing elements of the hardware array during a given compute pass. For example, matrix A 208 may be partitioned into a tile 214 having a fourth dimension (Dim 4) and the second dimension (Dim 2). Matrix B 204 may be partitioned into a tile 216 having a fifth dimension (Dim 5) and the second dimension (Dim 2). In such an embodiment, each submatrix (e.g., tile) of matrix A 208 and matrix B 204 may be processed by a corresponding matrix multiplier (e.g., matrix multiplier 217) having the fourth dimension (Dim 4) and the fifth dimension (Dim 5) to generate the output submatrix 218 of matrix C 206.

[0021] In some embodiments, matrix A 202 and/or matrix A 208 is an input activation matrix, embedding matrix, token representation matrix, or the like. For example, matrix A 202 and/or matrix A 208 may correspond to a collection of batch vectors of size 1 x Dim 2 (e.g., for LLM engines). In some examples, matrix B 204 is a weight matrix, parameter matrix, coefficient matrix, or the like. In some examples, matrix C 206 is an output matrix. As such, Dim 1 may correspond to the batch size, and Dim 2 and Dim 3 may correspond to the model (e.g., embedding) dimension associated with a ML model.

[0022] In some embodiments, data for matrix A 208 may be fetched from one or more local memory sources, such as on-chip cache (e.g., L1, L2, L3 cache), SRAM, scratchpad memory, register file, shared memory, or the like. In other embodiments, data for matrix A 208 may be fetched from off-chip memory, such as DRAM, HBM, double data rate (DDR) memory, or the like. In some examples, data for matrix A 208 may be obtained through a cache path (e.g., L1 or L2 cache). In some examples, the data may be routed through a core-local port (e.g., an uncached path), which may depend on the data locality or access latency.

[0023] In some embodiments, matrix B 204 of FIG. 2A and FIG. 2B may be stored in long-term storage (e.g., off-chip storage) such as DRAM, DDR, HBM, Flash, or the like, and is fetched when needed. In some embodiments, matrix B 204 may be stored in short-term storage (e.g., on-chip local memory) such as SRAM, L1/L2/L3 cache, shared memory/scratchpad, accelerator local buffer, or the like.

[0024] In some implementations, Dim 4 corresponds to a matrix multiplier internal dimension. The dimensions of the A and B matrices may be integer multiples of the matrix multiplier internal dimension. For example, in some embodiments, at least one or more of Dim 1, Dim 2, Dim 3, or Dim 5 is an integer multiple of (e.g., divisible by) Dim 4.

[0025] In some embodiments, each submatrix (e.g., tile 214) of matrix A 208 and corresponding submatrix (e.g., tile 216) of matrix B 204 are multiplied by a matrix multiplier (e.g., matrix multiplier 217) to produce a partial output submatrix (e.g., 218), and the partial outputs are subsequently combined and/or accumulated to form the complete output matrix C 206. As illustrated in FIG. 2B, in some embodiments, each respective tile of matrix A 208 and respective tile of matrix B 204 may contribute to a corresponding submatrix (e.g., 218) of the final output matrix C 206.
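As an illustrative, non-limiting sketch, the striping described above can be expressed as a loop over row stripes of the transposed input (Dim 4 rows each) and column stripes of the weight matrix (Dim 5 columns each), with each stripe pair producing one Dim 4 x Dim 5 output submatrix. The function names `matmul` and `tiled_matmul`, the small dimensions, and the requirement in this sketch that Dim 1 be divisible by Dim 4 and Dim 3 by Dim 5 are assumptions made for the example.

```python
def matmul(x, y):
    """Plain product of x (p x q) and y (q x r)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def tiled_matmul(a_t, b, dim4, dim5):
    """Multiply a_t (Dim 1 x Dim 2) by b (Dim 2 x Dim 3) one rectangular tile at a time."""
    dim1, dim3 = len(a_t), len(b[0])
    assert dim1 % dim4 == 0 and dim3 % dim5 == 0, "sketch assumes exact tiling"
    c = [[0] * dim3 for _ in range(dim1)]
    for r0 in range(0, dim1, dim4):
        row_stripe = a_t[r0:r0 + dim4]                     # Dim 4 x Dim 2 (like tile 214)
        for c0 in range(0, dim3, dim5):
            col_stripe = [row[c0:c0 + dim5] for row in b]  # Dim 2 x Dim 5 (like tile 216)
            tile = matmul(row_stripe, col_stripe)          # Dim 4 x Dim 5 (like submatrix 218)
            for i in range(dim4):
                c[r0 + i][c0:c0 + dim5] = tile[i]          # place the tile in the output
    return c


# Small stand-ins: Dim 1 = 4, Dim 2 = 6, Dim 3 = 8, Dim 4 = 2, Dim 5 = 4 (Dim 5 is twice Dim 4).
a_t = [[i + k for k in range(6)] for i in range(4)]
b = [[k - j for j in range(8)] for k in range(6)]
print(tiled_matmul(a_t, b, dim4=2, dim5=4) == matmul(a_t, b))
```

Because every output submatrix depends on exactly one row stripe and one column stripe, the tiles can be computed in any order or on separate multipliers.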

[0026] In some embodiments, the striping dimensions (e.g., Dim 4 and Dim 5) correspond to the internal array dimensions of the matrix multipliers (e.g., matrix multiplier 100 of FIG. 1), such as the number of processing elements in each row and column of a systolic array. For example, the size of the tiles 214 and 216 may be determined by the size of a matrix multiplier having the fourth dimension (Dim 4) and the fifth dimension (Dim 5).

[0027] In some implementations, the computation of submatrix 218 performed using a matrix multiplier (e.g., matrix multiplier 217) having unequal row and column dimensions (e.g., Dim 4 ≠ Dim 5) may utilize the same batch configuration and processing sequence as a computation performed using a matrix multiplier with equal row and column dimensions (e.g., Dim 4 = Dim 5). Further, in such embodiments, the computation of submatrix 218 performed using a matrix multiplier having unequal row and column dimensions (e.g., Dim 4 ≠ Dim 5) may require less memory utilization (e.g., less internal caches required) than that of a computation performed using a matrix multiplier with equal row and column dimensions (e.g., Dim 4 = Dim 5). For example, the reuse of operands and improved data locality may require less cache bandwidth and thereby increase computational throughput.
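As an illustrative, non-limiting sketch, one rough way to see the operand-reuse effect is to compare multiply-accumulate operations performed per operand delivered in a single step. The metric, the function name `macs_per_operand`, and the model that each step delivers one Dim 4-element column and one Dim 5-element row are assumptions of this sketch and not figures stated in the disclosure.

```python
from fractions import Fraction


def macs_per_operand(rows: int, cols: int) -> Fraction:
    """MACs per delivered operand for one step of a rows x cols array.

    Assumed model: each step feeds `rows` values from one operand and `cols`
    values from the other, and every PE performs one MAC.
    """
    return Fraction(rows * cols, rows + cols)


square = macs_per_operand(64, 64)    # square 64 x 64 multiplier
rect = macs_per_operand(64, 128)     # rectangular 64 x 128 multiplier

print(f"64 x 64 : {float(square):.2f} MACs per operand")
print(f"64 x 128: {float(rect):.2f} MACs per operand")
```

Under this assumed model the rectangular array does more work per delivered operand while the 64-element side, which may correspond to the batch-facing dimension, is unchanged.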

[0028] For example, the first dimension (Dim 1) may correspond to 256 units, the second dimension (Dim 2) may correspond to 1024 units, the third dimension (Dim 3) may correspond to 1024 units, the fourth dimension (Dim 4) may correspond to 64 units, and the fifth dimension (Dim 5) may correspond to 128 units. In such an example, to calculate the result for submatrix 218 of output matrix 206, a row-stripe (e.g., tile 214) of size 64x1024 from matrix A 208 may be multiplied by a column-stripe (e.g., tile 216) of size 1024x128 from matrix B 204. To continue in such an example, a column of matrix A 208 (e.g., the transposed matrix of matrix A 202) of size 64x1, and a row from matrix B 204 of size 1x128 may be pushed through a matrix multiplier, for example, matrix multiplier 217 (e.g., for k steps). In some embodiments, the value of the submatrix (e.g., 218) may be updated according to Equation (1) below:

$$C_{m,n} = C_{m,n} + A^{T}_{m,k} \times B_{k,n} \qquad \text{Equation (1)}$$

where m corresponds to a row index of C and $A^T$ (e.g., where C may be output matrix 206, and $A^T$ may be matrix A 208 or the transpose of matrix A 202), n corresponds to a column index of C and B (e.g., where C may be output matrix 206, and B may be matrix B 204), and k corresponds to an iteration index over the shared inner dimension between matrix $A^T$ and matrix B (e.g., Dim 2 = Dim 3). The shared inner dimension between matrix $A^T$ and matrix B may correspond to the dimension along which elementwise multiplication and accumulation are performed.
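As an illustrative, non-limiting sketch, the per-step update described above, in which one Dim 4 x 1 column and one 1 x Dim 5 row are pushed through the multiplier at each of k steps, amounts to accumulating a sequence of outer products. The function name `accumulate_tile` and the small tile sizes are hypothetical and stand in for the 64 x 1024 and 1024 x 128 stripes.

```python
def accumulate_tile(a_t_tile, b_tile):
    """Build a Dim 4 x Dim 5 output tile from K outer-product steps.

    a_t_tile is Dim 4 x K (row stripe of the transposed input),
    b_tile is K x Dim 5 (column stripe of the weight matrix).
    """
    dim4, steps, dim5 = len(a_t_tile), len(b_tile), len(b_tile[0])
    c = [[0] * dim5 for _ in range(dim4)]
    for k in range(steps):
        a_col = [a_t_tile[m][k] for m in range(dim4)]   # Dim 4 x 1 column for step k
        b_row = b_tile[k]                               # 1 x Dim 5 row for step k
        for m in range(dim4):
            for n in range(dim5):
                c[m][n] += a_col[m] * b_row[n]          # one MAC per PE per step
    return c


a_t_tile = [[1, 2, 3], [4, 5, 6]]                          # Dim 4 = 2, K = 3
b_tile = [[1, 0, 2, 1], [0, 1, 1, 2], [1, 1, 0, 0]]        # K = 3, Dim 5 = 4
print(accumulate_tile(a_t_tile, b_tile))
```

After the final step, every entry of the tile holds the complete sum over the shared inner dimension.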

[0029] In some embodiments, after k steps (e.g., where k corresponds to the value of Dim 2 = Dim 3), m may correspond to a value defined by the following inequality:

RB * Dim 4 < m < (RB * Dim 4) + Dim 4

where RB corresponds to the current row block (e.g., row tile) being calculated and Dim 4 corresponds to a dimension (e.g., internal dimension) of a matrix multiplier (e.g., Dim 4 of FIG. 2). For example, if matrix A 208 has dimensions Dim 1 = 256, and Dim 2 = 1024, and tile 214 has dimensions Dim 4 = 64, then four tiles would be calculated (e.g., Dim 1 divided by Dim 4), and values 1 through 4 would be input for the RB variable for the respective current row tile. In some embodiments, after k steps (e.g., where k corresponds to the value of Dim 2 = Dim 3), n may correspond to a value defined by the following inequality:

CB * Dim 4 < n < (CB * Dim 4) + Dim 4

where CB corresponds to the current column block (e.g., column tile) being calculated and Dim 4 corresponds to a dimension of a matrix multiplier (e.g., fourth dimension (Dim 4) of FIG. 2). For example, if matrix B 204 has dimensions Dim 3 = 1024, and Dim 2 = 1024, and tile 216 has dimensions Dim 5 = 128, then sixteen tiles would be calculated (e.g., Dim 3 divided by Dim 4), and values 1 through 16 would be input for the CB variable for the respective current column tile.

[0030] In the previous example of Dim 1 = 256, Dim 2 = 1024, Dim 3 = 1024, Dim 4 = 64, and Dim 5 = 128, after k = 1024 (e.g., after 1024 iterations) the calculation of $C_{m,n}$ for (64 * row) < m < (64 * row) + 64, and (128 * column) < n < (128 * column) + 128, may be completed.
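As an illustrative, non-limiting sketch, the output indices covered by one block in the Dim 4 = 64, Dim 5 = 128 example can be enumerated as follows. Zero-based block indices and half-open index ranges are assumptions of this sketch, chosen so that consecutive blocks neither overlap nor leave gaps; the function name `block_ranges` is hypothetical.

```python
def block_ranges(row_block: int, col_block: int, dim4: int = 64, dim5: int = 128):
    """Return the row indices m and column indices n covered by one output block.

    Assumes zero-based block indices and half-open ranges [start, start + size).
    """
    m_range = range(row_block * dim4, row_block * dim4 + dim4)
    n_range = range(col_block * dim5, col_block * dim5 + dim5)
    return m_range, n_range


m_range, n_range = block_ranges(row_block=1, col_block=2)
print(f"m covers {m_range[0]}..{m_range[-1]} ({len(m_range)} rows)")
print(f"n covers {n_range[0]}..{n_range[-1]} ({len(n_range)} columns)")
```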

[0031] In some embodiments, and under the assumption that the dimensions of the matrix A 202, matrix A 208, and matrix B 204 are integer multiples of the matrix multiplier internal dimension (e.g., Dim 4), the number of row blocks (RB total) and column blocks (CB total) may be defined by the following Equation (2) and Equation (3):

Equation (2)

Equation (3)

In some embodiments, one matrix multiplier engine (MME) can be used RB x CB times to perform all block calculations. In some embodiments, many MMEs may be used in parallel.

[0032] In some examples, each submatrix (e.g., 218) of output matrix 206 may take a number of steps equal to Dim 2 to stream the matrix A tiles (e.g., tile 214) and the matrix B tiles (e.g., tile 216) into the MME. In some examples, matrix A may have to be streamed through the L2 cache tile by tile (e.g., Dim 4 x Dim 5 at a time), so it may be transposed before entering an MME.

[0033] In some embodiments, the larger dimension of a rectangular matrix multiplier may be twice the size of the smaller dimension of the rectangular matrix multiplier (e.g., the fifth dimension of 128 units is twice the fourth dimension of 64 units). A matrix multiplier with one dimension larger than the other dimension may increase performance of a matrix engine without increasing the batch size required, and therefore may not require greater cache size. Overall latency may be defined by Equation (4):

Equation (4)

where #MME is the number of MMEs.

[0034] Matrix multiplication may take Dim 4 x Dim 5 elements and perform Dim 4 x Dim 4 x Dim 5 steps (Dim 4 x Dim 5 per step over Dim 4 steps). For example, calculating one submatrix (e.g., 218) may reuse matrix A 208’s tile 214, Dim 2 / Dim 1 number of times. In some examples, it may read through the entire matrix B 204 only once.

[0035] In some embodiments, the order of block calculation may instead be column-first. The selection between row-major and column-major order may be determined based on an internal representation of the model weights.

[0036] FIG. 3 is a diagram depicting an example of matrix-matrix multiplication, in accordance with certain embodiments. As shown in matrix-matrix multiplication example 300, input matrix 302 (T) is multiplied by weight matrix 304 (W), producing an output matrix 306 (R). In some embodiments, the operations illustrated in FIG. 3 may correspond to a projection computation within a transformer-based model, such as a query, key, or value projection (e.g., using Wq, Wk, Wv). In some embodiments, the matrix multiplication operations may be performed for other layers or components of a neural network model.

[0037] In some embodiments, the input matrix 302 (e.g., which may be the same as matrix A 202 and matrix A 208 of FIG. 2) may represent a batch of token vectors, where each token vector corresponds to a row of matrix 302. The weight matrix 304 (e.g., which may be the same as matrix B 204 of FIG. 2) may include a plurality of submatrices or elements Wi,j, where each element represents a portion of the overall weight matrix. The matrix multiplication engine (MME) may perform the multiplication of the token matrix 302 by the weight matrix 304 to generate a result matrix 306 (e.g., which may be the same as matrix C 206 of FIG. 2) comprising output elements (e.g., R0 through R255) that each correspond to a transformed token representation.

[0038] In some examples, matrix 302 may be divided into smaller submatrices (e.g., blocks, tiles, for example T0, T1, T2, etc.) to align with the width and compute granularity of the MME. Likewise, the weight matrix 304 may be partitioned into smaller block matrices (e.g., Wi,j). The resulting matrix 306 may also be organized in a corresponding block format (e.g., RBi, where Bi corresponds to a given block index) to facilitate subsequent processing stages, such as attention computations or feed-forward projections. In some embodiments, the individual blocks or tiles of the input matrix 302 (e.g., T0, T1, T2, etc.) may have different dimensions than corresponding blocks or tiles of the weight matrix 304 (for example, Wi,j). For example, an input tile T0 may be sized to align with an MME tile (e.g., 64 x k), while a weight tile Wi,j may have dimensions (k x m), resulting in a rectangular (non-square) block multiplication between T0 and Wi,j. Block-level rectangular configurations may allow flexible tiling strategies, better fit to local memory or compute resources, support for operations (e.g., certain FFN projections) where operand dimensions naturally differ, combinations of the same, or the like.

[0039] While FIG. 3 illustrates an example using 256 token elements for simplicity, in some examples, the input, weight, and result matrices may have dimensions corresponding to the model configuration. For example (e.g., in some transformer architectures), the input matrix may include 64 (e.g., independent) token vectors having a feature dimension of 16,000, and the weight matrices (e.g., Wq, Wk, Wv) may each have dimensions of 16,000 x 16,000. The described matrix-matrix multiplication may thus represent a generalized example of a high-dimensional operation executed by the MME during inference or training of large-scale models. In certain implementations, the rectangular configuration of the MME may allow for multiplication between matrices of differing dimensions (e.g., 64 x 16,000 multiplied by 16,000 x 4,000), which may occur in feed-forward network (FFN) layers or projection stages.

[0040] In some embodiments, the input matrix 302 may be reused across multiple projection operations (e.g., when being multiplied by Wq, Wk, Wv matrices). The input matrix 302 may be stored in a local memory (e.g., on-chip cache such as L1 cache, L2 cache, or L3 cache, shared memory, SRAM, SRAM buffer, register file, scratchpad memory, or the like) to allow such reuse. In some embodiments, the input matrix 302 may be stored in an off-chip memory (e.g., DRAM, HBM, or the like).

[0041] In some examples, an MME may perform a single outer-product matrix multiplication step in two cycles (e.g., at 1 GHz). For example, for a 64x64 matrix using an FP16 data type, an MME may perform a [64x2] x [2x64] calculation in two cycles. In such an example, a single output tile RBi matrix may require (64 / 2) x 2 x 256 = 16K cycles at 1 GHz to compute.
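As an illustrative, non-limiting sketch, the cycle estimate above can be checked by direct arithmetic. The factors are taken verbatim from the expression in the text; the constant names are hypothetical, and the text does not state what the factor of 256 represents, so it is carried through unchanged.

```python
STEP_DEPTH = 2         # inner-dimension elements consumed per [64x2] x [2x64] step
CYCLES_PER_STEP = 2    # cycles per outer-product step, as stated in the text
REPEAT_FACTOR = 256    # factor from the text; its interpretation is not stated there
CLOCK_HZ = 1_000_000_000

cycles = (64 // STEP_DEPTH) * CYCLES_PER_STEP * REPEAT_FACTOR
seconds = cycles / CLOCK_HZ

print(f"{cycles:,} cycles, about {seconds * 1e6:.1f} microseconds at 1 GHz")
```

The exact count is 16,384 cycles, which the text rounds to 16K.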

[0042] FIG. 4 is a diagram depicting an example of parallelized matrix-matrix multiplication across multiple matrix multiplication engines (MMEs), in accordance with certain embodiments. As shown in the matrix-matrix multiplication example 400 of FIG. 4, an input matrix 402 (e.g., comprising tiles T0 through T255) is multiplied by a weight matrix 404 (e.g., comprising submatrices W0,0 through W255,255) to produce an output matrix 412 (e.g., R0 through R255). Portions of the operation may be distributed among multiple MMEs (e.g., 406, 408, 410). The weight matrix 404 may be partitioned into a plurality of column blocks (e.g., W_C0, W_C1, ..., W_C255), each corresponding to a distinct matrix module (e.g., M0, M1, M255). Each column block W_Ci may include a set of smaller submatrices Wi,j, representing tile-level partitions used for distributed or parallel multiplication within an MME. Input matrix 402 may be the same as matrix A 202 and matrix A 208 of FIG. 2, matrix 302 of FIG. 3, or the like. Weight matrix 404 may be the same as matrix B 204 of FIG. 2, matrix 304 of FIG. 3, or the like. Output matrix 412 may be the same as output matrix 206 of FIG. 2, matrix 306 of FIG. 3, or the like.

[0043] In some embodiments, each MME (e.g., MME0 406, MME1 408, MME255 410) may comprise one or more matrix multipliers, buffer memories (e.g., input buffers, output buffers, accumulator buffers), control logic, and interconnect circuitry configured to coordinate data flow between the matrix multipliers and memory devices. Each MME may operate in parallel with other MMEs to perform tiled portions of a larger matrix-matrix multiplication. The matrix multipliers of each MME (e.g., MME0 406, MME1 408, MME255 410) may be set as a rectangle such that one dimension of the matrix multiplier is larger than the other (e.g., twice the size). Accordingly, the operations of rectangular matrix-matrix multiplication described with respect to FIG. 2 may also be performed and/or implemented by MMEs of matrix-matrix multiplication example 400.

[0044] For example, each MME (e.g., MME0 406, MME1 408, MME255 410) may receive a corresponding column block of the weight matrix 404 and the appropriate input tile(s) from the input matrix 402. As shown in FIG. 4, a single input tile (T2) of the input matrix 402 is provided to multiple MMEs (e.g., MME0 406, MME1 408, MME255 410). Each MME may include one or more matrix multipliers configured to perform independent partial matrix-matrix multiplication operations (e.g., T2 x W_C0, T2 x W_C1, T2 x W_C255). The results of these partial operations (e.g., R0 through R255) may collectively form the output matrix 412.
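By way of non-limiting illustration, the distribution described above may be sketched in software as follows. The sketch is an assumption-laden model rather than the disclosed hardware: the function names (`matmul`, `split_column_blocks`, `multiply_by_column_blocks`) are hypothetical, each loop iteration merely stands in for one MME, and the "parallel" work is executed sequentially. It shows only that multiplying one input tile by each column block of the weight matrix, and then placing the partial results side by side, reproduces the full product.

```python
def matmul(a, b):
    """Plain product of two row-major matrices stored as lists of lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_column_blocks(w, block_width):
    """Partition weight matrix w into column blocks of block_width columns."""
    n_cols = len(w[0])
    return [[row[start:start + block_width] for row in w]
            for start in range(0, n_cols, block_width)]


def multiply_by_column_blocks(tile, w, block_width):
    """Multiply one input tile by every column block, then join the results.

    Each entry of `partials` models the output of one hypothetical MME that
    received the same input tile and its own column block of the weights.
    """
    blocks = split_column_blocks(w, block_width)
    partials = [matmul(tile, block) for block in blocks]
    # Concatenate the partial results horizontally, row by row.
    return [sum((p[r] for p in partials), []) for r in range(len(tile))]


if __name__ == "__main__":
    tile = [[1, 2], [3, 4]]
    weights = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(multiply_by_column_blocks(tile, weights, block_width=2))
```

Because every column block depends on the same input tile and on no other block, the partial products carry no data dependencies on one another, which is what permits the MMEs to operate concurrently.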

[0045] In some implementations, the input tile(s) (e.g., tiles T0 through T255) may be streamed to multiple MMEs from a local memory (e.g., L1 cache, L2 cache, L3 cache, SRAM, scratchpad memory, register file, or the like), which may allow reuse across multiple column-block multiplications. In some examples, by streaming the weight matrix 404 through the compute logic pipeline (CLP) to multiple MMEs in parallel, high throughput and efficient utilization of the available bandwidth may be achieved. For example, the CLP may support a bandwidth of 512 GB/s at 1 GHz (~512 bytes per cycle), such that transferring a [2 x 64] block (~256 bytes) of the weight matrix 404 to an MME can be completed in a single cycle. For a 64-element column Ci within a column block W_Ci of the weight matrix 404, the MME may perform (16,000 / 2) steps x 2 cycles per step = 16 K cycles at 1 GHz (~16 µs). As an example, when 256 MMEs operate in parallel and input tiles may be reused, the entire T x W computation may be completed in approximately 16 K cycles at 1 GHz.
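The arithmetic in the preceding example may be checked with the short sketch below. The constant names are hypothetical, and two points are assumptions rather than statements of the disclosure: an element size of 2 bytes (chosen because it makes a [2 x 64] block equal 256 bytes) and decimal gigabytes (10^9 bytes). The 16,000 figure is taken from the text as given; treating it as the length of the dimension consumed two rows per step is likewise an assumption.

```python
CLOCK_HZ = 1_000_000_000               # 1 GHz, from the text
BANDWIDTH_BYTES_PER_S = 512 * 10**9    # 512 GB/s, decimal GB assumed

# Bytes the pipeline can move in one clock cycle.
bytes_per_cycle = BANDWIDTH_BYTES_PER_S // CLOCK_HZ

BYTES_PER_ELEMENT = 2                  # assumption: 16-bit elements
block_bytes = 2 * 64 * BYTES_PER_ELEMENT

# Ceiling division: cycles needed to move one [2 x 64] block.
cycles_per_block_transfer = -(-block_bytes // bytes_per_cycle)

STREAMED_LENGTH = 16_000               # figure quoted in the text
ROWS_PER_STEP = 2
CYCLES_PER_STEP = 2
steps = STREAMED_LENGTH // ROWS_PER_STEP
total_cycles = steps * CYCLES_PER_STEP

# Convert cycles to microseconds at the stated clock rate.
elapsed_us = total_cycles * 1_000_000 / CLOCK_HZ

if __name__ == "__main__":
    print(bytes_per_cycle, block_bytes, cycles_per_block_transfer)
    print(steps, total_cycles, elapsed_us)
```

Under these assumptions a block transfer occupies half of the per-cycle bandwidth, and 16,000 cycles at 1 GHz correspond to 16 microseconds.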

[0046] FIG. 5 depicts a system of devices for implementing matrix-matrix multiplication, in accordance with certain embodiments. System 500 comprises compute device 502 (e.g., GP-AIU, GP-AIU server, GPU, TPU, CPU, custom AI accelerator, a processor, or the like) coupled to a first memory device 504 (e.g., HBM on-package, SRAM, L1, L2, or L3 cache, scratchpad memory, DRAM, register file, any other local memory, any other off-chip memory, or the like), and a second memory device 506 (e.g., HBM, DRAM, on-chip cache such as L1, L2, or L3 cache, block RAM (BRAM), other non-volatile or volatile storage, any other local memory, any other off-chip memory, or the like). While FIG. 5 illustrates a single compute device 502 coupled to two memory devices 504, 506, in some embodiments, system 500 may comprise any suitable number of compute devices (e.g., GP-AIUs) and memory devices, each compute device being coupled to one or more memory devices within a larger computing system.

[0047] In some embodiments, the first memory device may comprise a high-bandwidth memory (HBM), cache memory, an on-chip SRAM configured to provide low-latency access to data (e.g., token data), scratchpad memory, combinations of the same, or the like. The first memory device may be configured to store input data (e.g., token vectors, activations) and/or stream such data to the compute device 502 (e.g., a GP-AIU) for processing. In some embodiments, the second memory device may comprise a high-capacity memory (e.g., HBM, Graphics Double Data Rate (GDDR), BRAM, cache memory, DDR memory, other non-volatile or volatile storage, or the like) configured to store model parameters and/or stream data (e.g., weight data) to the compute device 502.

[0048] In some embodiments, first memory device 504 may store an input matrix (e.g., a token or activation matrix). For example, matrix A 202 and matrix A 208 of FIG. 2, matrix 302 of FIG. 3, input matrix 402 of FIG. 4, or the like, may be stored on first memory device 504. In some embodiments, second memory device 506 may store a model matrix (e.g., a weight, parameter, or coefficient matrix). For example, matrix B 204 of FIG. 2, matrix 304 of FIG. 3, weight matrix 404 of FIG. 4, or the like, may be stored on the second memory device 506.

[0049] In some implementations, each of the first memory device 504 and the second memory device 506 may comprise any number of memory chips (e.g., 8, 16, 32) disposed on a substrate or integrated within a memory appliance. As shown in FIG. 5, first memory device 504 may comprise eight chips (e.g., chip 508), and second memory device 506 may comprise eight chips (e.g., 510). Each chip may include local controllers, buffers, or interconnect interfaces configured to facilitate high-speed data transfer with compute device 502.

[0050] Chip 508 may correspond to an input matrix (e.g., matrix A 202 and matrix A 208 of FIG. 2, matrix 302 of FIG. 3, input matrix 402 of FIG. 4) and chip 510 may correspond to a model matrix (e.g., matrix B 204 of FIG. 2, matrix 304 of FIG. 3, weight matrix 404 of FIG. 4). In some examples, data stored across the chips (e.g., chip 508, chip 510) of first memory device 504 and second memory device 506 may be partitioned into tiles or blocks corresponding to submatrices (e.g., input tiles T0-Tn and weight tiles Wi,j of FIG. 4). Each tile may be streamed or transmitted to one or more MMEs of compute device 502 for processing. For example, respective chips may concurrently supply distinct matrix tiles to different MMEs, allowing simultaneous computation of multiple submatrix multiplications.
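As a non-limiting sketch of the partitioning described above, the following code cuts a stored matrix into equally sized tiles and assigns the tiles to engines. The function names are hypothetical, and the round-robin assignment is an illustrative assumption only; the disclosure does not state how tiles are scheduled onto MMEs. The example uses tiles twice as wide as they are tall purely to echo the rectangular proportions discussed elsewhere in this description.

```python
def partition_into_tiles(matrix, tile_rows, tile_cols):
    """Return a dict mapping (tile_row, tile_col) to a tile (list of lists)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % tile_rows or n_cols % tile_cols:
        raise ValueError("matrix dimensions must be divisible by tile dimensions")
    tiles = {}
    for i in range(0, n_rows, tile_rows):
        for j in range(0, n_cols, tile_cols):
            tiles[(i // tile_rows, j // tile_cols)] = [
                row[j:j + tile_cols] for row in matrix[i:i + tile_rows]
            ]
    return tiles


def assign_round_robin(tile_keys, num_engines):
    """Assign tile keys to engine indices in round-robin order (assumed policy)."""
    schedule = {engine: [] for engine in range(num_engines)}
    for index, key in enumerate(tile_keys):
        schedule[index % num_engines].append(key)
    return schedule


if __name__ == "__main__":
    # A 4 x 8 matrix whose entries encode their own position.
    stored = [[r * 8 + c for c in range(8)] for r in range(4)]
    tiles = partition_into_tiles(stored, tile_rows=2, tile_cols=4)
    print(assign_round_robin(sorted(tiles), num_engines=2))
```

The divisibility check mirrors the condition, recited in the claims, that matrix dimensions be divisible by a multiplier dimension; a real implementation might instead pad the final tiles.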

[0051] Compute device 502 may include one or more matrix multiplier engines configured to perform matrix-matrix multiplication computations (e.g., for an LLM). For example, MME0 through MME255 of FIG. 4 (e.g., 406, 408, 410) may be disposed on, integrated within, included in, or coupled to compute device 502. For example, compute device 502 may comprise one or more chiplets (e.g., chiplet 512). An MME capable of performing matrix-matrix multiplication may be implemented within a chiplet (e.g., chiplet 512). Accordingly, the operations of rectangular matrix-matrix multiplication described with respect to FIG. 2B may also be performed and/or implemented by MMEs within system 500. For simplicity, FIG. 5 illustrates compute device 502 comprising two chiplets (e.g., chiplet 512). Each chiplet may be connected to any number of CPC connectors (e.g., connector 514). For simplicity, FIG. 5 illustrates chiplet 512 connected to two CPC connectors (e.g., connector 514).

[0052] In some examples, the MMEs (e.g., implemented within one or more chiplets such as chiplet 512) may be integrated within the compute device 502 (e.g., GP-AIU) and then coupled to both the first and second memory devices via co-packaged copper (CPC) connectors.

[0053] In some implementations, a plurality of connectors (e.g., connector 516) are placed along an edge of compute device 502 (e.g., the East edge). In some implementations, a plurality of connectors is placed along an edge of first memory device 504 and / or second memory device 506 (e.g., the West edge), for example, connector 518 of first memory device 504. In some embodiments, first memory device 504 and / or second memory device 506 are connected to compute device 502 via wires (e.g., 224G Twinax cables).

[0054] In some embodiments, the compute device 502 and the memory devices 504, 506 may be coupled via interconnects (e.g., CPCs and/or 224G Twinax cables), which may be configured to support the bandwidth required for matrix-matrix multiplication operations. For example, the interconnects may connect compute device 502 to one or more memory devices (e.g., first memory device 504 and second memory device 506) via CPC connectors. In some embodiments, one or more interconnects comprise at least one copper wire. For example, interconnect 520 may comprise sixty-four wires, and interconnect 522 may comprise one hundred and twenty-eight wires.

[0055] As shown in FIG. 5, compute device 502 comprises one or more chiplets (e.g., chiplet 512), each chiplet having one or more connectors (e.g., connector 514). Interconnects may connect chiplet connectors (e.g., 514) to compute device connectors (e.g., 516); for example, interconnect 520 bridges connector 514 and connector 516. Interconnects may connect compute device connectors (e.g., 516) to memory device connectors (e.g., 518); for example, interconnect 522 bridges connector 516 and connector 518. Interconnects may also connect memory device connectors (e.g., 518) to connectors of a memory device chip (e.g., chip 508).

[0056] In some implementations, the connectors (e.g., connectors 514, 508, 510) may be arranged with only receive signals (Rx) on a first wafer. For example, connectors 514, 508, 510 may be CPC connectors that comprise a wafer having only Rx signals. In some embodiments, the connectors (e.g., connectors 514, 508, 510) may be arranged with only transmit signals (Tx) on a second wafer. For example, connectors 514, 508, 510 may be CPC connectors that comprise a wafer having only Tx signals.

[0057] In some embodiments, compute device 502 (e.g., GP-AIU) comprises a first substrate, an interposer placed directly on a portion of the first substrate, and a first chip (e.g., chiplet 512) placed on a portion of the interposer. A second chip may be placed on a second portion of the interposer. The second chip (e.g., a logic chip, a serializer/deserializer chip, or the like) may be adjacent to the first chip and may be configured to access memory devices (e.g., memory device 504, memory device 506) that are not located on the first substrate. In some embodiments, the second chip may be electrically coupled to the first chip.

[0058] The interconnects 520, 522 and connectors 514, 516, 518, as described with respect to compute device 502 and memory device 504, may likewise be used to couple compute device 502 to memory device 506, or to couple any additional memory or compute devices within the system 500.

[0059] In some embodiments, the compute device 502 (e.g., GP-AIU) may include or interface with one or more matrix multiplication engines, tensor cores, or processing arrays configured to perform the MAC operations. In some embodiments, the coupling between the compute device 502 and the memory devices 504, 506 may be implemented through one or more memory buses, interconnect fabrics, system-on-chip (SoC) architectures, combinations of the same, or the like.

Claims

What is claimed is:

1. A system comprising: a first memory device, wherein a first matrix having a first dimension and second dimension is stored on the first memory device; a second memory device, wherein a second matrix having the second dimension and a third dimension is stored on the second memory device; and a general-purpose artificial intelligence unit (GP-AIU) coupled to the first memory device and the second memory device, wherein: a plurality of matrix multipliers is coupled to the GP-AIU, wherein each matrix multiplier in the plurality of matrix multipliers has a fourth dimension and a fifth dimension; the plurality of matrix multipliers is configured to generate a third matrix having the first dimension and the third dimension; and the fifth dimension is twice the fourth dimension.

2. The system of claim 1, wherein: each matrix multiplier of the plurality of matrix multipliers is configured as a rectangular systolic array of processing elements.

3. The system of claim 2, wherein a processing element corresponds to a node in the rectangular systolic array, and wherein the processing element is configured to perform multiply-and-accumulate (MAC) operations and to transmit partial results to one or more adjacent processing elements.

4. The system of claim 3, wherein the first dimension, second dimension, and third dimension are measured in units of individual data elements representing any one or more of: numeric values, tokens, feature activations, coefficients, parameters, or intermediate computation values.

5. The system of claim 4, wherein the MAC operations between respective portions of the first matrix and the second matrix comprise: computing elements of the third matrix according to a relationship: $C(m, n) = \sum_{k} A^{T}(k, m) \times B(k, n)$, wherein $A^{T}$ corresponds to the first matrix, $B$ corresponds to the second matrix, and $C$ corresponds to the third matrix.

6. The system of claim 5, wherein at least one or more of: the first dimension, the second dimension, or the third dimension, is divisible by the fourth dimension.

7. The system of claim 6, wherein: the first matrix corresponds to a plurality of input instances; the first matrix is a transpose of an input data matrix associated with the plurality of input instances; and the second matrix corresponds to feature dimensions associated with the plurality of input instances.

8. The system of claim 7, further comprising: a first connector connected to the GP-AIU; a second connector connected to the first memory device; a third connector connected to the second memory device; a first interconnect, wherein a first side of the first interconnect is attached to the first connector and a second side of the first interconnect is attached to the second connector; and a second interconnect, wherein a first side of the second interconnect is attached to the first connector and a second side of the second interconnect is attached to the third connector.

9. The system of claim 8, wherein: one or more of the first connector, second connector, or third connector comprises a co-packaged copper (CPC) connector; and one or more of the first connector, second connector, or third connector comprises a plurality of wafers; wherein each wafer in the plurality of wafers is configured to send only one of: (i) Rx signals or (ii) Tx signals.

10. The system of claim 1, wherein: the fourth dimension corresponds to 64 processing elements; and the fifth dimension corresponds to 128 processing elements.

11. A system comprising: a first memory device, wherein a first matrix having a first dimension and second dimension is stored on the first memory device; a second memory device, wherein a second matrix having the second dimension and a third dimension is stored on the second memory device; and a processor connected to the first memory device and the second memory device, wherein: a plurality of matrix multipliers is stored on the processor, and wherein each matrix multiplier in the plurality of matrix multipliers has a fourth dimension and a fifth dimension; the plurality of matrix multipliers is configured to generate a third matrix having the first dimension and the third dimension; and the fifth dimension is greater than the fourth dimension.

12. A system comprising: a first memory device, wherein a first matrix having a first dimension and second dimension is stored on the first memory device; a second memory device, wherein a second matrix having the second dimension and a third dimension is stored on the second memory device; and a general-purpose artificial intelligence unit (GP-AIU) coupled to the first memory device and the second memory device, wherein: a plurality of matrix multipliers is coupled to the GP-AIU, wherein each matrix multiplier in the plurality of matrix multipliers has a fourth dimension and a fifth dimension; the plurality of matrix multipliers is configured to generate a third matrix having the first dimension and the third dimension; the fifth dimension is greater than the fourth dimension; and the first dimension, the second dimension, the third dimension, and the fifth dimension, are divisible by the fourth dimension.