Systems and methods for ultra memory-efficient on-FPGA training of transformers using tensor-compressed optimization

Tensor-compressed optimization using TT and TTM decompositions and bi-directional contraction allows transformer model training on resource-constrained devices, reducing memory and computational demands while preserving accuracy.

WO2026128915A1 · PCT designated stage · Publication Date: 2026-06-18 · RGT UNIV OF CALIFORNIA

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
RGT UNIV OF CALIFORNIA
Filing Date
2025-12-15
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Transformer model training requires substantial computational resources and memory capacity, limiting deployment to powerful server-based systems due to quadratic computational complexity and large memory demands, which are challenging for resource-constrained environments.

Method used

Implement tensor-compressed optimization using tensor-train (TT) and tensor-train-matrix (TTM) decompositions for weight matrices and embedding tables, enabling bi-directional tensor-train contraction and on-chip memory management to maintain compressed representations throughout training, reducing memory and computational requirements.

Benefits of technology

Enables efficient transformer model training on resource-constrained devices with reduced memory footprints and computational overhead, maintaining training accuracy comparable to uncompressed models.

✦ Generated by Eureka AI based on patent content.


Abstract

One embodiment includes a method for training a transformer model on a resource-constrained device. The method includes encoding input tokens into hidden representations using tensor-compressed embedding tables stored in tensor-train-matrix (TTM) format. The method includes performing forward propagation on the hidden representations through tensor-compressed layers including tensor-train (TT) decomposed weight matrices to generate model outputs, wherein the forward propagation maintains all weight parameters in compressed tensor format without reconstructing dense weight matrices. The method includes computing gradients of loss from the model outputs. The method includes backpropagating the gradients through the tensor-compressed layers to compute gradients with respect to tensor cores of the TTM format embedding tables and the TT format weight matrices using tensor network contractions. The method includes updating the tensor cores of the tensor-compressed embedding tables and weight matrices directly in compressed format using the computed gradients with respect to the tensor cores.

Description

SYSTEMS AND METHODS FOR ULTRA MEMORY-EFFICIENT ON-FPGA TRAINING OF TRANSFORMERS USING TENSOR-COMPRESSED OPTIMIZATION

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Application No. 63/733,881, titled "Systems and Methods for Ultra Memory-Efficient On-FPGA Training of Transformers Using Tensor-Compressed Optimization", filed Dec. 13, 2024, which is hereby incorporated by reference in its entirety.

FIELD OF INVENTION

[0002] The present disclosure relates to neural network training systems for resource-constrained computing devices, and more particularly to systems and methods for ultra memory-efficient on-FPGA training of transformer models using tensor-compressed optimization with bidirectional tensor-train contraction algorithms.

BACKGROUND

[0003] Transformer models have become a cornerstone of modern machine learning, particularly in natural language processing and computer vision applications. These models utilize attention mechanisms to process sequential data and have demonstrated state-of-the-art performance across a wide range of tasks including language translation, text classification, and image recognition. The architecture consists of multiple layers containing self-attention mechanisms and feed-forward networks, with residual connections and layer normalization components that enable effective training of deep networks.

[0004] The training of transformer models involves processing large datasets through forward propagation, backward propagation, and parameter update stages. During forward propagation, input data flows through embedding layers, attention mechanisms, and feed-forward networks to generate predictions. The backward propagation stage computes gradients with respect to model parameters using chain rule differentiation, while parameter updates apply optimization algorithms such as stochastic gradient descent to adjust model weights. This training process requires substantial computational resources and memory capacity to store model parameters, intermediate activations, and gradient information throughout the training iterations.

[0005] Traditional transformer training approaches utilize dense matrix representations for weight parameters and embedding tables, which can consume hundreds of megabytes or gigabytes of memory depending on model size. The computational complexity of attention mechanisms scales quadratically with sequence length, while feed-forward networks require large matrix-vector multiplication operations during both forward and backward propagation stages. These computational and memory requirements have led to the development of various optimization techniques including gradient accumulation, mixed-precision training, and distributed training across multiple devices.

[0006] The substantial resource requirements of transformer training have traditionally limited deployment to powerful server-based systems with high-performance graphics processing units or specialized hardware accelerators. The memory and computational demands present challenges for training operations in environments with limited resources, where conventional approaches may exceed available memory capacity or computational capabilities. These constraints have motivated research into alternative training methodologies that can reduce resource requirements while maintaining training effectiveness and model accuracy.

SUMMARY

[0007] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

[0008] One embodiment includes a method for training a transformer model on a resource-constrained device. The method includes encoding input tokens into hidden representations using tensor-compressed embedding tables stored in tensor-train-matrix (TTM) format. The method includes performing forward propagation on the hidden representations through tensor-compressed layers including tensor-train (TT) decomposed weight matrices to generate model outputs, wherein the forward propagation maintains all weight parameters in compressed tensor format without reconstructing dense weight matrices. The method includes computing gradients of loss from the model outputs. The method includes backpropagating the gradients through the tensor-compressed layers to compute gradients with respect to tensor cores of the TTM format embedding tables and the TT format weight matrices using tensor network contractions. The method includes updating the tensor cores of the tensor-compressed embedding tables and weight matrices directly in compressed format using the computed gradients with respect to the tensor cores.

[0009] In another embodiment, performing forward propagation includes utilizing bidirectional tensor-train contraction that performs contractions from both left and right directions toward a middle point in parallel.

[0010] In yet another embodiment, the bi-directional tensor-train contraction reduces a total number of computation stages from 2d to d+1, where d represents a decomposition depth of the tensor-compressed layers.

[0011] In a further embodiment, backpropagating the gradients includes computing activation gradients and parameter gradients simultaneously through fused parallel tensor-train contraction techniques.

[0012] In an additional embodiment, the method further includes storing all tensor cores representing the tensor-compressed embedding tables and weight matrices in on-chip memory to eliminate off-chip memory access during parameter operations.

[0013] In still another embodiment, the on-chip memory includes block RAM (BRAM) with capacity less than 6MB and ultra RAM (URAM) with capacity of 22.5MB.

[0014] In yet a further embodiment, performing forward propagation through tensor-compressed layers includes processing query, key, and value transformations through TT-format linear layers, and computing attention mechanisms using compressed representations without reconstructing dense weight matrices during matrix-vector multiplication operations.

[0015] In another embodiment, backpropagating the gradients includes computing gradients for each tensor core $G_k$ through tensor network contractions that eliminate a target core from a tensor network while maintaining connections between remaining tensor cores.

[0016] In still a further embodiment, updating the tensor cores includes applying gradient descent operations directly to individual tensor cores using learning rates while maintaining all parameter representations in compressed tensor formats throughout optimization steps.

[0017] In yet still another embodiment, the tensor-compressed format is maintained throughout forward propagation, backward propagation, and parameter updates without recovering original uncompressed matrices.

[0018] One embodiment includes a tensorized transformer training accelerator. The accelerator includes on-chip memory and computation kernels configured to store tensor cores representing tensor-compressed model parameters including TTM cores for embedding tables and TT cores for weight matrices. The accelerator includes off-chip memory configured to store training data, activations, and labels. The accelerator includes a forward propagation engine configured to receive input tokens from the off-chip memory, encode the input tokens into hidden representations using the tensor-compressed embedding tables, and process the hidden representations through the tensor-compressed weight matrices to generate model outputs while maintaining all parameters in compressed format without reconstructing dense matrices. The accelerator includes a back propagation engine configured to receive gradients of loss computed from the model outputs, and backpropagate the gradients through tensor-compressed layers to compute gradients with respect to the tensor cores through tensor network contractions. The accelerator includes a parameter update kernel configured to update the tensor cores stored in the on-chip memory directly in compressed format using the gradients computed by the back propagation engine.

[0019] In another embodiment, the forward propagation engine is configured to utilize bidirectional tensor-train contraction that performs contractions from both left and right directions toward a middle point in parallel.

[0020] In yet another embodiment, the bi-directional tensor-train contraction reduces a total number of computation stages from 2d to d+1, where d represents a decomposition depth of the tensor-compressed layers.

[0021] In a further embodiment, the back propagation engine is configured to utilize fused parallel tensor-train contraction techniques that compute activation gradients and parameter gradients simultaneously while eliminating intermediate tensor storage requirements during gradient calculation.

[0022] In an additional embodiment, the on-chip memory includes block RAM (BRAM) with capacity less than 6MB and ultra RAM (URAM) with capacity of 22.5MB, wherein all tensor cores are stored entirely on-chip.

[0023] In still another embodiment, the on-chip memory and computation kernels implement array partitioning that partitions tensor core data into multiple smaller arrays mapped to separate BRAM blocks to enable parallel data access.

[0024] In yet a further embodiment, the forward propagation engine includes a TTM-FP kernel configured to process tensor-train-matrix embedding operations, a TT-FP kernel configured to process tensor-train linear layer operations through bi-directional tensor network contractions, and an MM kernel configured to perform matrix multiplication operations for attention mechanisms.

[0025] In another embodiment, the back propagation engine includes a TTM-BP kernel configured to compute gradients for tensor-train-matrix embedding layers, a TT-BP kernel configured to compute gradients for tensor-train linear layers through bi-directional tensor network contractions, and an MM kernel configured to handle matrix multiplication operations during backward propagation for attention mechanisms.

[0026] In still a further embodiment, the accelerator further includes task scheduling circuitry configured to optimize parallel tensor operations by moving non-urgent operations to later time steps without increasing total latency and enabling hardware resource sharing through temporal multiplexing.
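The scheduling idea in this embodiment can be illustrated with a small list-scheduling sketch. Everything here is an assumption made for illustration: the operation names (`A1`, `B1`, and so on), the dependency graph, the unit latency per operation, and the greedy policy are hypothetical and are not taken from the disclosure. The sketch only shows that operations with slack can be delayed so that fewer hardware units are busy at once, while the schedule length stays equal to the critical path.

```python
def critical_path_len(deps):
    """Length, in steps, of the longest dependency chain (unit latency per op)."""
    memo = {}

    def depth(op):
        if op not in memo:
            memo[op] = 1 + max((depth(p) for p in deps[op]), default=0)
        return memo[op]

    return max(depth(op) for op in deps)


def list_schedule(deps, units):
    """Greedy list scheduling with at most `units` operations per time step."""
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    tail = {}

    def tail_len(op):
        # Longest chain from op to a sink; a longer tail means a more urgent op.
        if op not in tail:
            tail[op] = 1 + max((tail_len(s) for s in succs[op]), default=0)
        return tail[op]

    start, step = {}, 0
    while len(start) < len(deps):
        ready = [op for op in deps if op not in start
                 and all(p in start and start[p] < step for p in deps[op])]
        ready.sort(key=lambda op: (-tail_len(op), op))
        for op in ready[:units]:      # non-urgent ops wait for a later step
            start[op] = step
        step += 1
    return start


# Hypothetical graph: a chain A1 -> A2 -> A3 plus two independent ops B1, B2,
# all feeding a final op F.
deps = {"A1": [], "A2": ["A1"], "A3": ["A2"], "B1": [], "B2": [], "F": ["A3", "B1", "B2"]}
schedule = list_schedule(deps, units=2)
print(schedule, critical_path_len(deps))
```

Started as early as possible, `A1`, `B1`, and `B2` would all run in the first step and need three units. With two units, `B2` is deferred by one step and the total length is still four steps, which is the length of the critical path.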

[0027] In yet still another embodiment, the tensor cores remain in compressed format throughout forward propagation, backward propagation, and parameter update operations.

[0028] The foregoing general description of the illustrative embodiments and the following detailed description thereof are merely exemplary aspects of the teachings of this disclosure and are not restrictive.

BRIEF DESCRIPTION OF FIGURES

[0029] Non-limiting and non-exhaustive examples are described with reference to the following figures.

[0030] FIG. 1 illustrates a transformer structure for classification tasks in accordance with an embodiment of the invention.

[0031] FIG. 2 illustrates tensor decomposition formats and operations used in neural network compression in accordance with an embodiment of the invention.

[0032] FIG. 3 illustrates a flowchart for a tensor-compressed transformer training process in accordance with an embodiment of the invention.

[0033] FIG. 4 illustrates a detailed training flow of one encoder block in a tensor-compressed transformer training system in accordance with an embodiment of the invention.

[0034] FIG. 5 illustrates tensor graph representations for forward propagation, activation gradient computation, and TT-core gradient computation in a tensor-train format linear layer in accordance with an embodiment of the invention.

[0035] FIG. 6 illustrates a comparison of computational and memory complexity between tensor-train format and bi-directional tensor-train format forward propagation in accordance with an embodiment of the invention.

[0036] FIG. 7 illustrates a block diagram of a tensorized transformer training accelerator in accordance with an embodiment of the invention.

[0037] FIG. 8 illustrates a system diagram of a tensorized transformer training accelerator during backward propagation in accordance with an embodiment of the invention.

[0038] FIG. 9 illustrates a system diagram of a tensorized transformer training accelerator during forward propagation in accordance with an embodiment of the invention.

[0039] FIG. 10 illustrates a comparison of task scheduling for BTT-format forward propagation before and after optimization in accordance with an embodiment of the invention.

[0040] FIG. 11 illustrates a sequence diagram representing a training process with fused parallel BTT in accordance with an embodiment of the invention.

[0041] FIG. 12 illustrates a computing device architecture for performing tensor-compressed training in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

[0042] The following description sets forth exemplary aspects of the present disclosure. It should be recognized, however, that such description is not intended as a limitation on the scope of the present disclosure. Rather, the description also encompasses combinations and modifications to those exemplary aspects described herein.

[0043] In many embodiments of the invention, tensor-compressed training systems achieve transformer training through software hardware co-design in which model weights are represented in tensor-compressed form using tensor-train decomposition and tensor-train-matrix decomposition, and trained directly using a bi-directional tensor-train contraction algorithm, with parameter updates performed in compressed space and supported by on-chip memory-optimized hardware. Tensor-compressed training systems in accordance with several embodiments of the invention enable end-to-end transformer training on resource-constrained edge devices by maintaining model parameters in compressed tensor formats throughout the entire training process, eliminating the need to reconstruct dense weight matrices during forward propagation, backward propagation, or parameter updates.

[0044] In various embodiments of the invention, tensor-compressed training systems are distinguished from prior art approaches that may compress models for inference or apply post-training compression techniques but do not support end-to-end training with tensor-train space training or bi-directional tensor-train contraction. The model may be trained directly in tensor-compressed form using tensor-train (TT) decomposition for weight matrices and tensor-train-matrix (TTM) decomposition for embedding tables, avoiding the computational and memory overhead associated with reconstructing full-rank weight matrices. Tensor-compressed training systems in accordance with many embodiments of the invention implement bi-directional tensor-train contraction that enables parallel forward and backward propagation with reduced floating-point operations and reduced intermediate memory requirements compared to conventional sequential contraction approaches.

[0045] The bi-directional tensor contraction technique may perform contractions from both left and right directions towards a middle point in parallel, reducing the total number of computation stages from 2d to d+1, where d represents the decomposition depth. In several embodiments of the invention, tensor-compressed training systems utilize fused core-gradient computation that removes intermediate tensors during gradient calculation, enabling efficient gradient updates to be performed directly on tensor-train cores without storing large intermediate activation tensors. The software hardware co-design approach may maintain all model parameters and optimizer states in on-chip memory, reducing off-chip memory access and associated latency and energy costs during training operations.

[0046] Tensor-compressed training systems in accordance with various embodiments of the invention provide robustness through the integration of algorithmic tensor compression techniques with specialized hardware architectures designed to support tensor network operations. The codesign approach may enable transformer models with memory footprints that exceed the capacity of edge devices to be trained locally while maintaining training accuracy comparable to uncompressed models. In many embodiments of the invention, tensor-compressed training systems achieve memory reduction ratios that allow models requiring hundreds of megabytes in uncompressed form to be trained within the on-chip memory constraints of field-programmable gate arrays and other resource-limited computing platforms.

[0047] Turning now to the drawings, systems and methods for implementing tensor-compressed transformer training architectures configured in accordance with various embodiments of the invention are illustrated. Such tensor-compressed transformer training architectures may enhance the memory efficiency and computational performance of transformer model training on resource-constrained edge devices. Relevant systems may involve, but are not limited to, tensor-train decomposition formats, tensor-train-matrix decomposition structures, bidirectional contraction algorithms, on-chip memory management, embedding layer compression, attention mechanism optimization, and feed-forward network tensorization.

[0048] A transformer structure for classification tasks in accordance with an embodiment of the invention is illustrated in FIG. 1. The transformer structure demonstrates a standard approach for classification tasks. The structure includes an embedding layer (EMBD) that processes three types of embeddings: token embeddings ($E_{tok}$), segment embeddings ($E_{seg}$), and positional embeddings ($E_{pos}$). These embeddings may be combined through summation and layer normalization (LN) operations to produce intermediate representations that feed into subsequent processing stages. In tensor-compressed training systems in accordance with various embodiments of the invention, the embedding tables may utilize tensor-train-matrix (TTM) format with specific tensor ranks to achieve substantial memory reduction compared to conventional dense embedding representations.

[0049] The encoder blocks (ENC x N) contain attention layers that perform query (Q), key (K), and value (V) transformations using weight matrices $W_q$, $W_k$, and $W_v$. The attention mechanism computes attention scores through matrix multiplication of the Q and K matrices followed by softmax operations, and the scores are then used to weight the V matrix. The attention output may be processed through another linear transformation using weight matrix $W_o$ and combined with residual connections and layer normalization to produce intermediate outputs $Y_{attn}$. Classification-based transformer architectures in accordance with several embodiments of the invention may compress these weight matrices using tensor-train (TT) format with rank 12, enabling the linear layers to maintain computational accuracy while reducing memory requirements substantially compared to uncompressed implementations.
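The attention dataflow described above can be sketched in NumPy as follows. This is a minimal single-head illustration under stated assumptions: the weights are held as small dense matrices purely for readability (in the disclosed system they would be TT cores), tokens are stored as rows, the 1/sqrt(d) score scaling is the conventional choice and is not stated in the text, and the layer normalization omits learned scale and shift. The function names are hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(h, eps=1e-5):
    """Layer normalization without learned scale or shift (simplification)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def attention_block(X, Wq, Wk, Wv, Wo):
    """Q/K/V projections, softmax attention, output projection, residual + LN."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scaling by sqrt of the feature size is an assumption (standard practice).
    weights = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
    Y_attn = layer_norm(X + (weights @ V) @ Wo)
    return Y_attn, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                      # 4 tokens, hidden size 8
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) * 0.1 for _ in range(4))
Y_attn, weights = attention_block(X, Wq, Wk, Wv, Wo)
print(Y_attn.shape)
```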

[0050] Feed-forward networks within the encoder blocks apply sequential linear transformations using weight matrices $W_1$ and $W_2$ with GELU activation functions between the transformations. The feed-forward network outputs may be combined with residual connections and layer normalization to produce final encoder outputs $Y_{ffn}$. In many embodiments of the invention, classification-based transformer architectures compress these feed-forward weight matrices using TT format with rank 12, allowing the networks to operate within the memory constraints of edge devices while preserving the representational capacity needed for complex classification tasks.

[0051] The classifier block (CLS) processes encoder outputs to generate final predictions through hyperbolic tangent (Tanh) activation functions and weight matrices $W_{pool}$ and $W_{cls}$. The classifier may compute logits for classification tasks by applying these transformations to the processed encoder representations. In accordance with various embodiments of the invention, classification-based transformer architectures may compress classifier weight matrices using tensor-train decomposition while maintaining the final task-specific linear layer in uncompressed form to preserve classification accuracy.

[0052] Although specific examples of transformer structures are described above with reference to FIG. 1, alternative implementations are possible that are appropriate to the requirements of specific applications in accordance with various embodiments of the invention. Classification-based transformer architectures may incorporate different numbers of encoder blocks, alternative attention mechanisms, or modified feed-forward network configurations while maintaining the tensor compression principles that enable efficient training on resource-constrained platforms.

[0053] Tensor decomposition formats provide mathematical frameworks for representing high-dimensional data structures in compressed forms that maintain computational efficiency while reducing memory requirements. These formats enable the decomposition of large matrices and tensors into sequences of smaller, interconnected components that preserve the mathematical relationships of the original structures. Tensor graph representations utilize visual notation systems to illustrate the connections and operations between tensor components, facilitating the understanding of complex tensor network architectures and their computational flows.

[0054] Tensor decomposition formats and operations in accordance with an embodiment of the invention are illustrated in FIG. 2. The tensor graph representations demonstrate various mathematical structures used in tensor-compressed training systems, ranging from basic matrix representations to complex tensor network decompositions. These representations may provide the foundational mathematical framework for implementing memory-efficient transformer training on resource-constrained devices.

[0055] A matrix representation shows a standard two-dimensional array denoted as $A \in \mathbb{R}^{n_1 \times n_2}$, which may serve as the baseline format for weight matrices in conventional neural network implementations. The matrix representation utilizes a single rectangular node in the tensor graph notation, indicating a two-dimensional structure with modes $n_1$ and $n_2$. In tensor-compressed training systems in accordance with various embodiments of the invention, these standard matrices may be decomposed into tensor-train formats to achieve substantial memory reduction while maintaining computational accuracy.

[0056] An order-4 tensor representation extends the matrix concept to higher dimensions, denoted as $A \in \mathbb{R}^{n_1 \times n_2 \times n_3 \times n_4}$. The tensor graph representation displays this structure as a single node with four extending edges, each representing one of the tensor modes. Higher-order tensors in accordance with several embodiments of the invention may provide increased flexibility for representing complex weight relationships in neural networks while enabling more aggressive compression ratios through tensor decomposition techniques.

[0057] A tensor contraction operation between two order-3 tensors demonstrates the mathematical process of combining tensors along shared dimensions. The operation $A \times_3^1 B$ represents the contraction of tensor A along its third mode with tensor B along its first mode. The tensor graph representation shows two connected nodes with a shared edge indicating the contracted dimension. In many embodiments of the invention, tensor contraction operations form the computational foundation for forward propagation, backward propagation, and gradient computation in tensor-compressed training systems.
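As a concrete illustration of the contraction just described, the following NumPy sketch contracts the third mode of one order-3 tensor with the first mode of another. The function name `contract_3_1` and the tensor sizes are hypothetical and chosen only for the example.

```python
import numpy as np

def contract_3_1(a, b):
    """Contract the third mode of `a` with the first mode of `b`."""
    # a: (p, q, r) and b: (r, s, t) share the bond of size r.
    # The result keeps the free modes in order: (p, q, s, t).
    return np.tensordot(a, b, axes=([2], [0]))

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))
B = rng.standard_normal((4, 5, 6))
C = contract_3_1(A, B)
print(C.shape)  # (2, 3, 5, 6)
```

In tensor graph terms, the shared edge of size 4 disappears and the four remaining edges become the modes of the result.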

[0058] The tensor-train-matrix (TTM) decomposition format represents a specialized tensor network structure for compressing embedding tables and other matrix-like data structures. The TTM format decomposes a matrix $W \in \mathbb{R}^{M \times N}$ into a sequence of tensor cores $F_k \in \mathbb{R}^{r_{k-1} \times m_k \times n_k \times r_k}$, where the decomposition may be expressed as $W = F_1 \times F_2 \times \cdots \times F_d$. The tensor graph representation shows nodes connected by rank parameters $r_1$, $r_2$, through $r_d$, with vertical edges representing the matrix dimensions $m_i$ and $n_i$ and horizontal edges representing the rank connections between cores. TTM decomposition in accordance with various embodiments of the invention may reduce the number of parameters from $\prod_{k=1}^{d} m_k n_k$ to $\sum_{k=1}^{d} r_{k-1} m_k n_k r_k$, achieving compression ratios of $O(m^{d-1} n^{d-1} / (d r^2))$ for embedding tables.
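The TTM parameter count can be checked with a few lines of arithmetic. The table size, the factorization, and the rank of 30 used here are hypothetical values picked for illustration; the text does not specify them. The helper name `ttm_param_count` is likewise an assumption.

```python
from math import prod

def ttm_param_count(m, n, ranks):
    """Total entries in TTM cores F_k of shape (r_{k-1}, m_k, n_k, r_k)."""
    if not (len(m) == len(n) == len(ranks) - 1):
        raise ValueError("need d row factors, d column factors, and d + 1 ranks")
    return sum(ranks[k] * m[k] * n[k] * ranks[k + 1] for k in range(len(m)))

# Hypothetical table: 1000 rows by 768 columns, factored as below.
m = (10, 10, 10)         # M = 1000
n = (8, 8, 12)           # N = 768
ranks = (1, 30, 30, 1)   # boundary ranks r_0 = r_d = 1

dense = prod(m) * prod(n)
compressed = ttm_param_count(m, n, ranks)
print(dense, compressed)  # 768000 78000
```

For these assumed values the compressed form holds roughly one tenth of the dense entries; lower ranks or more factors would change the ratio.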

[0059] The tensor-train (TT) decomposition format provides a sequential tensor network structure for compressing weight matrices in linear layers. The TT format decomposes a tensor $A \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ into a sequence of three-dimensional tensor cores $G_k \in \mathbb{R}^{r_{k-1} \times n_k \times r_k}$, where the decomposition may be expressed as $A = G_1 \times_3^1 G_2 \times_3^1 \cdots \times_3^1 G_d$. The tensor graph representation displays sequential nodes connected by horizontal rank edges $r_1$ through $r_d$, with single vertical edges representing the tensor dimensions $n_1$ through $n_d$. TT decomposition in accordance with several embodiments of the invention may reduce the number of parameters from $\prod_{k=1}^{d} m_k n_k$ to $\sum_{k=1}^{d} (r_{k-1} m_k r_k + r_{k-1+d} n_k r_{k+d})$, enabling substantial memory reduction, with a compressed parameter count of $O(d r^2 (m + n))$ for linear layer weight matrices.
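A short worked count makes the TT formula concrete. Rank 12 is the value mentioned elsewhere in the text; the 768 by 768 layer size and its factorization into (8, 8, 12) are assumptions chosen only for this example, as is the helper name `tt_param_count`.

```python
from math import prod

def tt_param_count(m, n, ranks):
    """Total entries in 2d TT cores: d cores over m_k followed by d cores over n_k."""
    d = len(m)
    if len(n) != d or len(ranks) != 2 * d + 1:
        raise ValueError("need d + d mode sizes and 2d + 1 ranks")
    dims = tuple(m) + tuple(n)
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(2 * d))

# Hypothetical 768 x 768 linear layer, TT rank 12 on every internal bond.
m = (8, 8, 12)
n = (8, 8, 12)
ranks = (1, 12, 12, 12, 12, 12, 1)

dense = prod(m) * prod(n)
compressed = tt_param_count(m, n, ranks)
print(dense, compressed)  # 589824 5424
```

Under these assumptions the six cores hold 5,424 values in place of 589,824 dense entries.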

[0060] In tensor-compressed training systems in accordance with many embodiments of the invention, large weight matrices may be converted directly into TT cores without reconstructing the original dense matrices during training operations. The conversion process involves reshaping weight matrices $W \in \mathbb{R}^{M \times N}$ into higher-order tensors $\mathcal{W} \in \mathbb{R}^{m_1 \times \cdots \times m_d \times n_1 \times \cdots \times n_d}$, where $M = \prod_{i=1}^{d} m_i$ and $N = \prod_{j=1}^{d} n_j$. The tensorized weight matrices may then be decomposed into 2d TT cores that represent both input and output dimensions of the original matrix, enabling efficient tensor network contractions during forward and backward propagation.
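To show how a linear layer can be applied from the 2d cores alone, here is a minimal NumPy sketch of a sequential (one-directional) contraction computing y = W x. The convention that the first d cores carry the output modes and the last d cores carry the input modes, the row-major reshaping, and all function names are assumptions for illustration. `tt_to_dense` exists only to check the result on a tiny example and is not part of the compressed path.

```python
import numpy as np

def random_tt_cores(m, n, rank, rng):
    """2d random cores; core k has shape (r_{k-1}, dim_k, r_k) with r_0 = r_2d = 1."""
    dims = list(m) + list(n)
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return [rng.standard_normal((ranks[k], dims[k], ranks[k + 1]))
            for k in range(len(dims))]

def tt_matvec(cores, m, n, x):
    """y = W x computed core by core; the M x N matrix W is never formed."""
    d = len(m)
    t = x.reshape(*n, 1)                      # (n_1, ..., n_d, r_2d = 1)
    for core in reversed(cores[d:]):          # input-side cores G_2d ... G_{d+1}
        t = np.tensordot(t, core, axes=([-2, -1], [1, 2]))
    for core in reversed(cores[:d]):          # output-side cores G_d ... G_1
        t = np.tensordot(core, t, axes=([2], [0]))
    return t.reshape(-1)                      # length M

def tt_to_dense(cores, m, n):
    """Reference only: rebuild W to validate the sketch on small sizes."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(int(np.prod(m)), int(np.prod(n)))

rng = np.random.default_rng(0)
m, n = (2, 3, 2), (4, 2, 3)                   # M = 12, N = 24
cores = random_tt_cores(m, n, rank=3, rng=rng)
x = rng.standard_normal(int(np.prod(n)))
y = tt_matvec(cores, m, n, x)
print(y.shape)
```

Each of the 2d loop iterations depends on the previous one, which is the sequential baseline that a bi-directional scheme aims to shorten.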

[0061] Embedding tables in tensor-compressed training systems may be converted into TTM cores to achieve substantial memory reduction for vocabulary representations. The conversion process involves reshaping embedding matrices $E_{tok} \in \mathbb{R}^{M \times N}$ into higher-order tensor format $\mathcal{E}_{tok} \in \mathbb{R}^{m_1 \times n_1 \times \cdots \times m_d \times n_d}$, followed by TTM decomposition into d tensor cores $F_k \in \mathbb{R}^{r_{k-1} \times m_k \times n_k \times r_k}$. The TTM representation may enable efficient lookup operations through tensor network contractions while maintaining the semantic relationships encoded in the original embedding space.
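A lookup on TTM cores can be sketched as follows. The token index is split into one sub-index per core, each core contributes one slice, and the slices are chained over the rank bonds. The row-major index split, the sizes, and the function names are assumptions for illustration; `ttm_to_dense` is a reference used only to check the tiny example.

```python
import numpy as np

def ttm_lookup(cores, m, token_id):
    """Return one embedding row from TTM cores F_k of shape (r_{k-1}, m_k, n_k, r_k)."""
    idx = np.unravel_index(token_id, m)       # token_id -> (i_1, ..., i_d), row-major
    row = np.ones((1, 1))                     # (columns so far, current rank)
    for core, i_k in zip(cores, idx):
        piece = core[:, i_k, :, :]            # (r_{k-1}, n_k, r_k)
        row = np.tensordot(row, piece, axes=([1], [0]))
        row = row.reshape(-1, piece.shape[-1])
    return row.reshape(-1)                    # length N

def ttm_to_dense(cores):
    """Reference only: rebuild the M x N table to validate the sketch."""
    d = len(cores)
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    full = full[0, ..., 0]                    # drop boundary ranks: (m_1, n_1, ..., m_d, n_d)
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    full = full.transpose(perm)               # (m_1, ..., m_d, n_1, ..., n_d)
    rows = int(np.prod(full.shape[:d]))
    return full.reshape(rows, -1)

rng = np.random.default_rng(0)
m, n, ranks = (4, 5, 3), (2, 3, 2), (1, 3, 3, 1)   # M = 60 tokens, N = 12 features
cores = [rng.standard_normal((ranks[k], m[k], n[k], ranks[k + 1])) for k in range(3)]
vec = ttm_lookup(cores, m, token_id=37)
print(vec.shape)
```

Only one slice per core is touched for each token, so the work per lookup depends on the ranks and the column factors rather than on the vocabulary size.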

[0062] Model parameters in tensor-compressed training systems in accordance with various embodiments of the invention may be stored and trained directly as TT and TTM cores without reconstructing the original dense matrices during any stage of the training process. The tensor cores may be updated through gradient descent operations applied directly to the compressed representations, eliminating the computational and memory overhead associated with dense matrix reconstruction. Parameter updates may be performed using gradients computed through tensor network contractions, where the gradient with respect to each tensor core $G_k$ may be calculated as $G'_k = \partial L / \partial G_k$ through efficient tensor contraction operations involving the remaining tensor cores and activation gradients.
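The core-gradient computation can be sketched with one `einsum` per core, where the target core is left out of the network and the remaining cores are contracted with the input and the upstream gradient. This is an illustrative sketch, not the disclosed kernel: it assumes a layer y = W x with g = dL/dy, uses the same hypothetical core layout as a TT linear layer with 2d cores, and lets NumPy choose the contraction order. The dense helper is a reference for checking only.

```python
import string
import numpy as np

def tt_core_grads(cores, m, n, x, g):
    """dL/dG_k for y = W x with g = dL/dy, without forming W or the outer product g x^T."""
    D, d = len(cores), len(m)                 # D = 2d cores
    rank_ids = list(string.ascii_lowercase[:D])
    rank_ids.append(rank_ids[0])              # r_0 and r_2d both have size 1: share a label
    mode_ids = list(string.ascii_uppercase[:D])
    subs = [rank_ids[k] + mode_ids[k] + rank_ids[k + 1] for k in range(D)]
    g_sub, x_sub = "".join(mode_ids[:d]), "".join(mode_ids[d:])
    g_t, x_t = g.reshape(m), x.reshape(n)
    grads = []
    for k in range(D):
        # Remove core k from the network; its open bonds become the output modes.
        ops = [c for j, c in enumerate(cores) if j != k]
        op_subs = [s for j, s in enumerate(subs) if j != k]
        expr = ",".join(op_subs + [g_sub, x_sub]) + "->" + subs[k]
        grads.append(np.einsum(expr, *ops, g_t, x_t, optimize=True))
    return grads

def tt_to_dense(cores, m, n):
    """Reference only: rebuild W to validate the sketch on small sizes."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(int(np.prod(m)), int(np.prod(n)))

rng = np.random.default_rng(0)
m, n = (2, 3), (3, 2)
dims, ranks = list(m) + list(n), [1, 3, 3, 3, 1]
cores = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1])) for k in range(4)]
x, g = rng.standard_normal(6), rng.standard_normal(6)

grads = tt_core_grads(cores, m, n, x, g)
lr = 0.01                                      # hypothetical learning rate
updated = [G - lr * dG for G, dG in zip(cores, grads)]   # update stays in core form
```

Each gradient has the same shape as its core, so the update touches only the compressed parameters.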

[0063] Although specific examples of tensor decomposition formats are described above with reference to FIG. 2, alternative implementations are possible that are appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

Training Process and Dataflow

[0064] A flowchart for a tensor-compressed transformer training process in accordance with an embodiment of the invention is illustrated in FIG. 3. The tensor-compressed transformer training process 300 demonstrates a sequential workflow that leverages tensor compression throughout forward propagation, gradient computation, backpropagation, and parameter update stages. Processes in accordance with various embodiments of the invention may maintain tensor-compressed representations of model parameters and perform computations directly on these compressed formats without recovering the original uncompressed weight matrices during any stage of the training operation.

[0065] Process 300 encodes (310) input tokens into hidden representations. In many embodiments of the invention, processes utilize tensor-train-matrix (TTM) decomposed embedding tables to convert discrete input tokens into dense vector representations that capture semantic relationships while maintaining substantial memory reduction compared to conventional dense embedding approaches. Processes in accordance with various embodiments generate tensorized input representations that serve as compressed feature vectors for subsequent processing stages, where the embedding lookup operations are performed directly on TTM cores through efficient tensor contraction sequences.

[0066] Process 300 performs (320) forward propagation with tensor-compressed layers. In several embodiments of the invention, processes utilize tensor-train (TT) decomposed weight matrices in attention layers and feed-forward networks to compute forward propagation without reconstructing dense weight matrices during matrix-vector multiplication operations. In certain embodiments, processes execute bi-directional tensor-train contraction algorithms that reduce computational complexity and memory requirements compared to conventional sequential contraction approaches. Processes in accordance with some embodiments process query, key, and value transformations through TT-format linear layers, compute attention mechanisms using compressed representations, and generate intermediate activations through tensorized feedforward networks while maintaining all computations in compressed tensor space.
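
A minimal sketch of a TT-format linear layer that never forms the dense weight matrix is given below, using the conventional right-to-left contraction order. It assumes a decomposition depth d = 2 (four cores), a core layout of (1, m1, r1), (r1, m2, r2), (r2, n1, r3), (r3, n2, 1), and the convention y = x W^T; these choices, and the name `tt_linear_forward`, are illustrative assumptions.

```python
import numpy as np

def tt_linear_forward(x, cores, in_factors):
    """Compute y = x @ W.T with W held only as four TT cores (assumed d = 2)."""
    g1, g2, g3, g4 = cores
    k = x.shape[0]
    x3 = x.reshape(k, *in_factors)              # (K, n1, n2)
    z1 = np.einsum("kst,ctz->ksc", x3, g4)      # contract n2 with G4
    z2 = np.einsum("ksc,bsc->kb", z1, g3)       # contract n1 and r3 with G3
    z3 = np.einsum("kb,aqb->kaq", z2, g2)       # contract r2, introduce m2
    y = np.einsum("kaq,zpa->kpq", z3, g1)       # contract r1, introduce m1
    return y.reshape(k, -1)                     # (K, m1 * m2)

# Hypothetical sizes: M = 3*4, N = 4*5, ranks (2, 3, 2), sequence length K = 6.
rng = np.random.default_rng(1)
out_factors, in_factors = (3, 4), (4, 5)
cores = [
    rng.standard_normal((1, 3, 2)),
    rng.standard_normal((2, 4, 3)),
    rng.standard_normal((3, 4, 2)),
    rng.standard_normal((2, 5, 1)),
]
x = rng.standard_normal((6, 20))
y = tt_linear_forward(x, cores, in_factors)
```

Each step contracts the running activation with exactly one core, which is why this order needs 2d dependent stages.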

[0067] Process 300 computes (330) gradients of loss from model outputs. In various embodiments of the invention, processes calculate loss gradients with respect to model predictions and propagate these gradients backward through the tensor-compressed network architecture without expanding compressed parameters to their original dense forms. Processes in accordance with a variety of embodiments utilize standard loss functions such as cross-entropy for classification tasks while maintaining gradient computations in formats compatible with tensor-compressed parameter representations. In a number of embodiments, processes generate gradient signals that flow backward through the network, where the gradient computation process maintains compatibility with tensor network contraction operations used in subsequent backpropagation stages.

[0068] Process 300 backpropagates (340) loss through the tensor-compressed network. In many embodiments of the invention, processes compute gradients with respect to activations and tensor cores through tensor network contractions that eliminate intermediate tensor storage requirements during gradient calculation. Processes in accordance with several embodiments utilize fused parallel tensor-train contraction techniques that compute activation gradients and parameter gradients simultaneously while reducing memory overhead compared to conventional backpropagation approaches. In some embodiments, processes propagate gradients through attention layers, feed-forward networks, and embedding layers using tensor contraction operations that maintain computational efficiency while preserving gradient accuracy for parameter updates.

[0069] Process 300 updates (350) tensor-compressed model parameters on device. In accordance with several embodiments of the invention, processes perform parameter updates directly on TT and TTM cores using gradients computed through tensor network contractions, eliminating the computational overhead associated with dense matrix reconstruction during optimization steps. Processes may apply gradient descent operations to individual tensor cores Gk and Fk using learning rates and optimization algorithms such as stochastic gradient descent while maintaining all parameter representations in compressed tensor formats. In various embodiments, processes may store updated tensor cores in on-chip memory resources, enabling subsequent training iterations to access compressed parameters without off-chip memory transfers that would increase latency and energy consumption during training operations.
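
The update step can be sketched as plain stochastic gradient descent applied core by core. The function name `sgd_update_cores`, the core shapes, and the learning rate below are hypothetical; the specification does not prescribe a particular optimizer implementation.

```python
import numpy as np

def sgd_update_cores(cores, core_grads, learning_rate):
    """Apply plain SGD to each tensor core in place; no dense matrix is formed."""
    for core, grad in zip(cores, core_grads):
        core -= learning_rate * grad  # the update touches only the compressed parameters
    return cores

# Hypothetical two-core example with constant values for easy inspection.
cores = [np.ones((1, 3, 2)), np.ones((2, 4, 1))]
core_grads = [np.full((1, 3, 2), 0.5), np.full((2, 4, 1), -1.0)]
sgd_update_cores(cores, core_grads, learning_rate=0.1)
```

Because the update is in place, the storage that holds the cores before the step holds them after it, which is consistent with keeping the parameters resident in on-chip memory.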

[0070] Processes in accordance with various embodiments of the invention may achieve substantial memory reduction ratios that enable transformer models requiring hundreds of megabytes in uncompressed form to be trained within the on-chip memory constraints of field-programmable gate arrays and other resource-limited computing platforms. The tensor-compressed training approach may maintain training accuracy comparable to uncompressed models while reducing memory requirements by factors ranging from 30× to 74× compared to conventional training approaches. In many embodiments of the invention, processes utilize bidirectional tensor contraction algorithms that reduce the total number of computation stages from 2d to d+1, where d represents the decomposition depth, enabling parallel computation of tensor network operations during both forward and backward propagation phases.
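
As a rough illustration of where the savings come from, the sketch below counts the parameters of one TT-compressed layer and evaluates the stage counts 2d and d+1 quoted above. The 768 x 768 layer, its factorization, and the rank of 12 are hypothetical values chosen for the example; the resulting per-layer ratio is not the same quantity as the whole-model reduction of 30× to 74×, which also reflects other stored data.

```python
def tt_parameter_count(dims, ranks):
    """Number of entries in TT cores of shape (ranks[k], dims[k], ranks[k + 1])."""
    return sum(ranks[k] * dims[k] * ranks[k + 1] for k in range(len(dims)))

def contraction_stages(d):
    """Stage counts quoted in the text: sequential 2d versus bi-directional d + 1."""
    return {"sequential": 2 * d, "bi_directional": d + 1}

# Hypothetical 768 x 768 layer factored as (12, 8, 8) x (8, 8, 12), all TT ranks 12.
dims = (12, 8, 8, 8, 8, 12)
ranks = (1, 12, 12, 12, 12, 12, 1)
dense_params = 768 * 768
tt_params = tt_parameter_count(dims, ranks)
layer_ratio = dense_params / tt_params
stages = contraction_stages(d=3)
```

With these assumed shapes the cores hold 4,896 values against 589,824 for the dense matrix, and the stage count falls from 6 to 4.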

[0071] Various processes for tensor-compressed training are discussed above with reference to FIG. 3. Alternative processes can be utilized as appropriate to the requirements of specific applications, including different tensor decomposition formats, alternative contraction algorithms, or modified parameter update strategies. These alternative processes also can be used for tensor-compressed training in accordance with various embodiments of the invention.

[0072] A detailed training flow of one encoder block in accordance with an embodiment of the invention is illustrated in FIG. 4. The tensor-compressed encoder block training architectures demonstrate sequential workflows that leverage tensor compression throughout forward propagation, gradient computation, backpropagation, and parameter update stages. Tensor-compressed encoder block training architectures in accordance with various embodiments of the invention may maintain tensor-compressed representations of model parameters and perform computations directly on compressed formats without recovering original uncompressed weight matrices during any stage of the training operation.

[0073] In the forward propagation stage, input X flows through multiple tensor-compressed weight matrices to generate attention representations. The input may be processed through tensor-train decomposed weight matrices Wq, Wk, and Wv to generate query (Q), key (K), and value (V) representations using tensor network contractions rather than conventional matrix-vector multiplications. Tensor-compressed encoder block training architectures in accordance with several embodiments of the invention may utilize bi-directional tensor-train contraction algorithms to compute these transformations efficiently while maintaining all weight parameters in compressed TT core formats throughout the computation process.

[0074] The Q and K matrices may undergo multiplication operations followed by softmax functions to compute attention scores in tensor-compressed encoder block training architectures. These attention scores may then be multiplied with V representations through attention mechanisms that operate directly on tensor-compressed data structures. The attention output may be processed through tensor-train decomposed weight matrix Wo and combined with the original input X through residual connections. In many embodiments of the invention, tensor-compressed encoder block training architectures apply layer normalization (LN) operations to produce intermediate outputs Yattn while maintaining computational efficiency through compressed parameter representations.

[0075] The intermediate output Yattn may be fed into feed-forward networks consisting of tensor-train decomposed weight matrices W1 and W2 with GELU activation functions applied between the transformations. The feed-forward processing may generate intermediate results O1 that pass through GELU activation functions to produce O2 outputs. In several embodiments of the invention, tensor-compressed encoder block training architectures combine O2 outputs with residual connections and apply layer normalization operations to produce final encoder outputs Yffn while maintaining all weight matrices in compressed tensor-train formats throughout the forward propagation process.
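
The forward dataflow of the encoder block described above can be sketched as follows. Dense matrices stand in for the TT-format layers purely to keep the example short; in the described embodiments each product with a weight would be a tensor network contraction. The single attention head, the 1/sqrt(H) score scaling, the tanh approximation of GELU, and layer normalization without learned scale and shift are assumptions of this sketch.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumed variant)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract the max for stability
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block_forward(x, w):
    """Single-head encoder block; each product with w[...] stands in for a TT contraction."""
    q, k, v = x @ w["q"].T, x @ w["k"].T, x @ w["v"].T
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    attention = scores @ v
    y_attn = layer_norm(x + attention @ w["o"].T)   # residual connection, then LN
    o1 = y_attn @ w["1"].T
    o2 = gelu(o1) @ w["2"].T
    return layer_norm(y_attn + o2)                  # Yffn

# Hypothetical sizes: sequence length 5, hidden size 8, feed-forward size 16.
rng = np.random.default_rng(2)
w = {name: rng.standard_normal((8, 8)) * 0.1 for name in ("q", "k", "v", "o")}
w["1"] = rng.standard_normal((16, 8)) * 0.1
w["2"] = rng.standard_normal((8, 16)) * 0.1
x = rng.standard_normal((5, 8))
y_ffn = encoder_block_forward(x, w)
```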

[0076] In the backward propagation stage, gradients flow in reverse direction through the tensor-compressed network architecture. The gradient of the final output dYffn may propagate backward through the network, where gradients are computed for each weight matrix and activation using tensor-network contractions involving transposed weight matrices and intermediate activation values. Tensor-compressed encoder block training architectures in accordance with various embodiments of the invention may compute gradients dO2, dO1, dO, dAttention, dScore, dV, dQ, and dK through efficient tensor contraction operations that eliminate the need for reconstructing dense weight matrices during gradient computation processes.

[0077] The gradient computation process may utilize fused parallel tensor-train contraction techniques that compute activation gradients and parameter gradients simultaneously while reducing memory overhead compared to conventional backpropagation approaches. In many embodiments of the invention, tensor-compressed encoder block training architectures propagate gradients through attention layers and feed-forward networks using tensor contraction operations that maintain computational efficiency while preserving gradient accuracy for subsequent parameter updates. The gradient flow may maintain compatibility with tensor network contraction operations used throughout the backpropagation stage, where intermediate gradient tensors are processed directly in compressed formats.

[0078] In the parameter update stage, compressed model parameters may be updated using gradients computed during backward propagation. Weight matrices Wv, Wq, Wk, Wo, W1, and W2 may be stored and updated using their respective gradients computed through tensor network contractions. Tensor-compressed encoder block training architectures in accordance with several embodiments of the invention may perform parameter updates directly on TT cores using gradients computed through tensor network contractions, eliminating computational overhead associated with dense matrix reconstruction during optimization steps. The parameter update process may apply gradient descent operations to individual tensor cores G using learning rates and optimization algorithms while maintaining all parameter representations in compressed tensor formats.

[0079] Tensor-compressed encoder block training architectures in accordance with various embodiments of the invention may demonstrate that activations are represented in specific formats, activation gradients are represented in alternative formats, and model parameters are maintained in compressed tensor formats throughout the training process. The training flow may illustrate bidirectional data flow between various components, with computational paths indicating the direction of processing during forward and backward passes. In a number of embodiments, tensor-compressed encoder block training architectures integrate tensor-compressed layers with standard neural network operations to enable memory-efficient training on resource-constrained hardware platforms while maintaining training accuracy comparable to uncompressed implementations.

[0080] Various tensor-compressed encoder block training architectures are discussed above with reference to FIG. 4. Alternative tensor-compressed encoder block training architectures can be utilized as appropriate to the requirements of specific applications. These alternative architectures also provide training capabilities in accordance with various embodiments of the invention.

[0081] Tensor graph representations for forward propagation, activation gradient computation, and TT-core gradient computation in a tensor-train format linear layer in accordance with an embodiment of the invention are illustrated in FIG. 5. The tensor graph representations demonstrate computational stages where interconnected nodes represent tensor operations, with circles representing tensor cores and rectangles representing input or output tensors. These representations may provide visual frameworks for understanding the mathematical operations involved in tensor-compressed training systems.

[0082] The forward propagation representation shows the contraction of input tensors with tensor-train cores through a bi-directional contraction technique. In many embodiments of the invention, tensor graph representations utilize input tensor X that flows through multiple tensor cores G1, G2, G3, and G4 in a parallel contraction sequence. The bi-directional tensor contraction technique may perform contractions from both left and right directions towards a middle point in parallel, reducing the total number of computation stages from 2d to d+1, where d represents the decomposition depth. Tensor graph representations in accordance with several embodiments of the invention may demonstrate how the input tensor contracts with tensor cores from multiple directions simultaneously, enabling parallel processing of tensor network operations during forward propagation.
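
One possible reading of the bi-directional contraction for four cores (d = 2) is sketched below: the output-side cores G1 and G2 and the input-side cores G3 and G4 are merged independently, after which the input is contracted toward the middle rank. The core layout, the convention y = x W^T, and the name `btt_linear_forward` are assumptions; the sketch runs the two merges one after the other, whereas hardware could run them concurrently.

```python
import numpy as np

def btt_linear_forward(x, cores):
    """Bi-directional contraction for an assumed d = 2 layout: d + 1 = 3 stages."""
    g1, g2, g3, g4 = cores
    # Stage 1: the two merges are independent of each other and of the input.
    left = np.einsum("zpa,aqb->pqb", g1, g2)     # (m1, m2, r2)
    right = np.einsum("bsc,ctz->bst", g3, g4)    # (r2, n1, n2)
    left = left.reshape(-1, left.shape[-1])      # (M, r2)
    right = right.reshape(right.shape[0], -1)    # (r2, N)
    # Stage 2: project the input onto the middle rank.
    mid = x @ right.T                            # (K, r2)
    # Stage 3: expand to the output dimension.
    return mid @ left.T                          # (K, M)

# Hypothetical sizes: M = 3*4, N = 4*5, ranks (2, 3, 2), sequence length K = 6.
rng = np.random.default_rng(3)
cores = [
    rng.standard_normal((1, 3, 2)),
    rng.standard_normal((2, 4, 3)),
    rng.standard_normal((3, 4, 2)),
    rng.standard_normal((2, 5, 1)),
]
x = rng.standard_normal((6, 20))
y = btt_linear_forward(x, cores)
```

The only intermediate whose size depends on the sequence length K is `mid`, of shape (K, r2).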

[0083] The activation gradient computation representation illustrates how gradients with respect to activations are computed through tensor network contractions. In various embodiments of the invention, tensor graph representations show the flow of gradient information backward through the tensor network, where activation gradients are calculated using the same tensor cores involved in forward propagation but with reversed computational flow. The gradient computation process may utilize tensor contraction operations that maintain the structural relationships between tensor cores while computing derivatives with respect to intermediate activations. Tensor graph representations in accordance with a number of embodiments may demonstrate how activation gradients flow through the network without requiring reconstruction of dense weight matrices during the gradient calculation process.
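
Under the same assumed four-core layout and the convention y = x W^T, the activation gradient is dX = dY W, and it can be formed from the cores by reversing the forward flow: first contract dY with the merged output-side cores, then with the merged input-side cores. The name `tt_activation_grad` and all sizes are hypothetical.

```python
import numpy as np

def tt_activation_grad(dy, cores):
    """Compute dX = dY @ W from TT cores, using the forward merges in reverse order."""
    g1, g2, g3, g4 = cores
    left = np.einsum("zpa,aqb->pqb", g1, g2)     # (m1, m2, r2)
    right = np.einsum("bsc,ctz->bst", g3, g4)    # (r2, n1, n2)
    left = left.reshape(-1, left.shape[-1])      # (M, r2)
    right = right.reshape(right.shape[0], -1)    # (r2, N)
    mid = dy @ left                              # (K, r2): output side first
    return mid @ right                           # (K, N): then the input side

# Hypothetical sizes: M = 3*4, N = 4*5, ranks (2, 3, 2), sequence length K = 6.
rng = np.random.default_rng(4)
cores = [
    rng.standard_normal((1, 3, 2)),
    rng.standard_normal((2, 4, 3)),
    rng.standard_normal((3, 4, 2)),
    rng.standard_normal((2, 5, 1)),
]
dy = rng.standard_normal((6, 12))
dx = tt_activation_grad(dy, cores)
```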

[0084] The TT-core gradient computation representation details the process of computing gradients with respect to individual tensor cores. In several embodiments of the invention, tensor graph representations show how gradients are computed for specific tensor cores by eliminating the target core from the tensor network while maintaining connections between all other components. The core-wise gradient derivation process may involve tensor contractions between input activations, output gradients, and the remaining tensor cores to compute the gradient with respect to the eliminated core. Tensor graph representations in accordance with many embodiments of the invention may illustrate how parameter gradients are calculated directly for tensor cores G2 through tensor network operations that preserve the compressed parameter format throughout the gradient computation process.
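
The "eliminate the target core" rule can be sketched for core G2 as follows. The network is contracted with G2 left out, so that the open edges of the result are exactly the three edges of G2. The four-core layout, the convention y = x W^T, and the name `tt_core2_grad` are assumptions of this sketch.

```python
import numpy as np

def tt_core2_grad(x, dy, cores, in_factors, out_factors):
    """Gradient of the loss with respect to G2, obtained by leaving G2 out of the network."""
    g1, _, g3, g4 = cores
    k = x.shape[0]
    x3 = x.reshape(k, *in_factors)                 # (K, n1, n2)
    dy3 = dy.reshape(k, *out_factors)              # (K, m1, m2)
    right = np.einsum("bsc,ctz->bst", g3, g4)      # merge G3 and G4
    mid = np.einsum("kst,bst->kb", x3, right)      # input side, open edge r2
    left = np.einsum("kpq,zpa->kaq", dy3, g1)      # output side, open edges r1 and m2
    return np.einsum("kaq,kb->aqb", left, mid)     # contract over K; shape of G2

# Hypothetical sizes: M = 3*4, N = 4*5, ranks (2, 3, 2), sequence length K = 6.
rng = np.random.default_rng(5)
cores = [
    rng.standard_normal((1, 3, 2)),
    rng.standard_normal((2, 4, 3)),
    rng.standard_normal((3, 4, 2)),
    rng.standard_normal((2, 5, 1)),
]
x = rng.standard_normal((6, 20))
dy = rng.standard_normal((6, 12))
grad_g2 = tt_core2_grad(x, dy, cores, in_factors=(4, 5), out_factors=(3, 4))
```

The result has the same shape as G2, so it can be passed to a core-wise update without any reshaping through a dense matrix.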

[0085] The interconnected nodes in the tensor graph representations demonstrate the data flow and computational dependencies between different stages of the tensor-compressed training process. In accordance with various embodiments of the invention, tensor graph representations utilize circular nodes to represent three-dimensional tensor cores that store compressed weight parameters, while rectangular nodes represent input and output tensors that carry activation data through the network. The connections between nodes may indicate tensor contraction operations where shared dimensions are contracted to produce intermediate results or final outputs. Tensor graph representations in accordance with several embodiments may show how the bi-directional contraction flow enables efficient computation of both forward propagation and gradient calculations while maintaining all operations in compressed tensor space.

[0086] Free edges indicate the dimensions of result tensors produced by tensor network contractions. In many embodiments of the invention, tensor graph representations utilize these visual indicators to show how tensor dimensions are preserved or modified during contraction operations. The free edges may represent the output dimensions of tensor network operations, where the number and size of free edges determine the shape of the resulting tensor after contraction. Tensor graph representations in accordance with various embodiments of the invention may demonstrate how the dimensional structure of tensors is maintained throughout the computation process, enabling efficient memory management and computational optimization during training operations.
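
The role of free edges can be seen in a two-node example written with `numpy.einsum`, where an index shared by both operands is a contracted edge and an index that appears only once is a free edge. The sizes below are hypothetical.

```python
import numpy as np

# Hypothetical sizes: a rank-3 bond shared by two nodes, free edges of size 4 and 5.
node_a = np.ones((4, 3))   # edges: free (4), bond (3)
node_b = np.ones((3, 5))   # edges: bond (3), free (5)

# The bond index "r" appears in both operands and is contracted away;
# the free indices "i" and "j" survive and fix the shape of the result.
result = np.einsum("ir,rj->ij", node_a, node_b)
```

Here the result has two free edges and therefore shape (4, 5); each entry equals 3 because the bond of size 3 sums over products of ones.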

[0087] The bi-directional contraction flow shown in the tensor graph representations enables parallel computation of tensor network operations during both forward and backward propagation phases. In accordance with several embodiments of the invention, tensor graph representations demonstrate how contractions can be performed simultaneously from multiple directions, reducing the sequential dependencies that limit parallelization in conventional tensor network implementations. The parallel contraction approach may enable more efficient utilization of computational resources while reducing the total number of computation stages required for tensor network operations. Tensor graph representations in accordance with many embodiments of the invention may show how the bi-directional approach maintains mathematical accuracy while improving computational efficiency compared to sequential contraction methods.

[0088] Parameter updates in tensor graph representations operate directly on tensor cores without requiring reconstruction of dense weight matrices during optimization steps. In various embodiments of the invention, tensor graph representations show how gradients computed through tensor network contractions can be applied directly to individual tensor cores using gradient descent operations. The parameter update process may maintain all weight representations in compressed tensor formats throughout the optimization process, eliminating the computational and memory overhead associated with dense matrix reconstruction. Tensor graph representations in accordance with a number of embodiments may demonstrate how optimizer updates preserve the compressed structure of model parameters while enabling effective learning through gradient-based optimization algorithms.

[0089] Various tensor train tensor graph representations are discussed above with reference to FIG. 5. Alternative tensor train tensor graph representations can be utilized as appropriate to the requirements of specific applications. These alternative representations also provide training capabilities in accordance with various embodiments of the invention.

Bi-directional Tensor-train Contraction

[0090] A comparison of computational and memory complexity between tensor-train format and bi-directional tensor-train format forward propagation in accordance with an embodiment of the invention is illustrated in FIG. 6. The comparison demonstrates two distinct approaches for performing tensor network contractions during forward propagation, where conventional sequential methods may be contrasted with bi-directional parallel techniques that achieve reduced computational complexity and memory requirements. Forward propagation comparison systems in accordance with various embodiments of the invention may utilize bi-directional tensor contraction strategies to enable more efficient training operations on resource-constrained hardware platforms.

[0091] The TT-FP approach illustrates a sequential contraction flow that represents conventional tensor-train forward propagation methods. In many embodiments of the invention, the sequential contraction flow processes tensor operations in a right-to-left manner, where input tensor X connects through a series of intermediate tensors Z1, Z2, and Z3 before producing output tensor Y. The conventional approach may require sequential processing of tensor cores, where each contraction step depends on the completion of the previous operation, limiting opportunities for parallel computation. Forward propagation comparison systems in accordance with several embodiments of the invention may demonstrate how conventional TT contraction creates large intermediate tensors during the sequential processing stages, resulting in increased memory requirements and computational overhead compared to alternative approaches.

[0092] The gradient inputs G4, G3, G2, and G1 feed into the computation chain from below in the sequential approach, where each gradient component may be processed in sequence with the corresponding tensor core operations. In various embodiments of the invention, the sequential contraction flow maintains dependencies between computation stages that prevent parallel processing of tensor operations. The conventional approach may require storage of all intermediate results throughout the computation process, leading to memory overhead that scales with the sequence length and tensor dimensions. Forward propagation comparison systems in accordance with a number of embodiments may show how the sequential nature of conventional TT contraction limits the efficiency of tensor network operations during forward propagation.

[0093] The BTT-FP approach demonstrates a bi-directional contraction strategy where computations flow from both ends toward the middle point of the tensor network. In several embodiments of the invention, the bi-directional approach enables input tensor X to feed into tensor core Z1, which then branches to connect with tensor core Z2, while tensor core Z3 receives inputs from both G1 and G2 simultaneously. The bi-directional contraction strategy may enable parallel processing of tensor operations by eliminating sequential dependencies between computation stages. Forward propagation comparison systems in accordance with many embodiments of the invention may demonstrate how BTT contracts from both ends simultaneously, enabling parallel and memory-efficient computation of tensor network operations.

[0094] In accordance with various embodiments of the invention, the contraction operations in the bi-directional approach can be executed in parallel, reducing the total computation time compared to sequential processing methods. The parallel contraction capability may enable more efficient utilization of computational resources while maintaining mathematical accuracy of tensor network operations. Forward propagation comparison systems in accordance with several embodiments may show how the bi-directional approach reduces the number of sequential computation stages while preserving the mathematical relationships between tensor components.

[0095] In many embodiments of the invention, the bi-directional approach reduces the number of computational stages compared to the sequential method, enabling faster completion of forward propagation operations. The stage separation may demonstrate how the bi-directional contraction strategy eliminates intermediate dependencies that limit parallelization in conventional tensor network implementations. Forward propagation comparison systems in accordance with various embodiments of the invention may utilize the reduced stage count to achieve improved computational efficiency while maintaining compatibility with tensor-compressed training operations.

[0096] The computational complexity comparison demonstrates mathematical expressions that quantify the efficiency improvements achieved through bi-directional tensor contraction. The reduction in computational complexity may be expressed as ΔComp = O((K-m)mr² + (K-n)nr²), where K represents the sequence length, m and n represent tensor dimensions, and r represents the tensor rank. In several embodiments of the invention, the computational complexity reduction applies when conditions K > m and K > n are satisfied, which commonly occurs in natural language processing and computer vision applications. Forward propagation comparison systems in accordance with a number of embodiments may achieve substantial reductions in floating-point operations through the bi-directional contraction approach, enabling more efficient training on resource-constrained platforms.

[0097] The memory complexity comparison shows how bi-directional tensor contraction reduces intermediate memory requirements during forward propagation operations. The reduction in memory complexity may be expressed as ΔMemory = O((K-m)mr + (K-n)nr), where the reduction in memory overhead enables larger models to be trained within the constraints of edge computing devices. In accordance with various embodiments of the invention, the memory complexity reduction eliminates the need to store large intermediate tensors that accumulate during sequential contraction operations. Forward propagation comparison systems in accordance with several embodiments may demonstrate how reduced intermediate memory requirements enable scalable training operations that can accommodate varying sequence lengths and model sizes while maintaining computational efficiency.

[0098] The conditions K > m and K > n specify the parameter ranges where bi-directional tensor contraction provides computational and memory advantages over conventional sequential approaches. In many embodiments of the invention, these conditions are commonly satisfied in transformer training scenarios where sequence lengths exceed the individual tensor dimensions used in tensor-train decomposition. The parameter conditions may ensure that the bi-directional approach achieves meaningful efficiency improvements compared to sequential contraction methods. Forward propagation comparison systems in accordance with various embodiments of the invention may utilize these parameter relationships to determine when bi-directional contraction provides optimal performance for specific training configurations.
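
The two expressions and the stated conditions can be evaluated directly, as in the sketch below. The function names and the sample values K = 128, m = 12, n = 8, r = 12 are hypothetical, and because the expressions are asymptotic bounds with constant factors dropped, the numbers indicate scaling only and are not operation or byte counts.

```python
def btt_savings(K, m, n, r):
    """Evaluate the expressions inside the O(.) bounds; constant factors are dropped."""
    return {
        "compute": (K - m) * m * r**2 + (K - n) * n * r**2,
        "memory": (K - m) * m * r + (K - n) * n * r,
    }

def btt_is_beneficial(K, m, n):
    """Condition stated in the text for the reduction to apply."""
    return K > m and K > n

# Hypothetical values, chosen only to exercise the formulas.
savings = btt_savings(K=128, m=12, n=8, r=12)
applies = btt_is_beneficial(K=128, m=12, n=8)
short_sequence = btt_is_beneficial(K=8, m=12, n=8)
```

When K falls to or below m or n, the corresponding term is no longer positive, which is why the conditions K > m and K > n mark the range where the bi-directional order is the better choice.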

[0099] The convergence points in both computational approaches show how tensor operations combine to produce final output tensors Y, where the bi-directional method may achieve the same mathematical result as the sequential approach while reducing computational and memory overhead. In several embodiments of the invention, the convergence behavior demonstrates that bi-directional tensor contraction maintains mathematical equivalence with conventional methods while providing efficiency improvements. The output generation process may preserve the accuracy of tensor network operations while enabling faster computation through parallel processing capabilities. Forward propagation comparison systems in accordance with many embodiments of the invention may demonstrate how the bi-directional approach achieves equivalent mathematical results with improved computational efficiency compared to sequential contraction methods.
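
The equivalence of the two flows can be checked numerically. The sketch below runs a sequential and a bi-directional contraction on the same hypothetical four-core layer and records the number of elements in each intermediate tensor; the core layout, the sizes, and the function names are assumptions, and the element counts apply only to these shapes.

```python
import numpy as np

def sequential_forward(x3, g1, g2, g3, g4):
    """Right-to-left contraction; every intermediate carries the K dimension."""
    z1 = np.einsum("kst,ctz->ksc", x3, g4)
    z2 = np.einsum("ksc,bsc->kb", z1, g3)
    z3 = np.einsum("kb,aqb->kaq", z2, g2)
    y = np.einsum("kaq,zpa->kpq", z3, g1)
    return y, [z1.size, z2.size, z3.size]

def bidirectional_forward(x3, g1, g2, g3, g4):
    """Merge both ends first; only the middle intermediate carries K."""
    left = np.einsum("zpa,aqb->pqb", g1, g2)
    right = np.einsum("bsc,ctz->bst", g3, g4)
    mid = np.einsum("kst,bst->kb", x3, right)
    y = np.einsum("kb,pqb->kpq", mid, left)
    return y, [left.size, right.size, mid.size]

# Hypothetical sizes: K = 64, M = 3*4, N = 4*5, ranks (2, 3, 2).
rng = np.random.default_rng(6)
g1 = rng.standard_normal((1, 3, 2))
g2 = rng.standard_normal((2, 4, 3))
g3 = rng.standard_normal((3, 4, 2))
g4 = rng.standard_normal((2, 5, 1))
x3 = rng.standard_normal((64, 4, 5))

y_seq, seq_sizes = sequential_forward(x3, g1, g2, g3, g4)
y_btt, btt_sizes = bidirectional_forward(x3, g1, g2, g3, g4)
```

Both flows contract the same network, so the outputs agree to floating-point tolerance, while for these shapes the bi-directional intermediates are smaller in total.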

[0100] Various BTT computing flows are discussed above with reference to FIG. 6. Alternative BTT computing flows can be utilized as appropriate to the requirements of specific applications. These alternative implementations also provide computing capabilities in accordance with various embodiments of the invention.

Hardware Architecture

[0101] A block diagram of a tensorized transformer training accelerator 700 in accordance with an embodiment of the invention is illustrated in FIG. 7. The tensorized transformer training accelerator 700 includes on-chip memory and computation kernels 710, a back propagation engine 715, a parameter update kernel 720, parameters 725, a forward propagation engine 730, activations / gradients 735, and off-chip memory 740. The off-chip memory 740 stores training data 745, activations 750, labels 755, and loss / prediction 760. Tensorized transformer training accelerators in accordance with various embodiments of the invention enable end-to-end transformer training on resource-constrained edge devices by maintaining model parameters in compressed tensor formats throughout the entire training process while utilizing specialized hardware architectures optimized for tensor network operations.

[0102] On-chip memory and computation kernels 710 house the core computational elements of tensorized transformer training accelerators, including specialized engines for forward and backward propagation operations. In certain embodiments of the invention, on-chip memory and computation kernels include both block RAM (BRAM) and ultra RAM (URAM) components, with BRAM capacity less than 6MB and URAM capacity of 22.5MB. The on-chip memory may store all tensor-train cores and tensor-train-matrix cores in compressed formats, eliminating the need for off-chip memory access during parameter operations. On-chip memory and computation kernels in accordance with several embodiments of the invention may maintain all highly compressed model parameters and gradient information on chip, creating an on-chip-memory-only framework for each stage in training that reduces off-chip communication and minimizes latency and energy costs.

[0103] In various embodiments of the invention, on-chip memory and computation kernels include custom computing kernels for various operations including tensor contraction in TTM-format embedding tables, TT-format linear layers, matrix multiplication in attention parts, and non-linear functions like softmax, GELU, Tanh, and LayerNorm. The custom computing kernels may be optimized for tensor operations and maintain parameters in high-speed on-chip memory for efficient access during training operations. On-chip memory and computation kernels in accordance with a number of embodiments may support different data formats including floating-point 32-bit (FP32) and floating-point 16-bit (FP16) formats, where the accelerator operates at different clock frequencies including 100MHz for FP32 format and 125MHz for FP16 format.
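
A simple budget check against the on-chip capacities mentioned above might look as follows. The sketch treats the BRAM figure as a 6 MB upper bound, uses decimal megabytes, and assumes hypothetical footprints; none of the value counts below are taken from the specification.

```python
MB = 1_000_000  # assumption: decimal megabytes

BRAM_BYTES = 6 * MB      # the text says "less than 6MB"; 6 MB is used as an upper bound
URAM_BYTES = 22.5 * MB

def fits_on_chip(num_values, bytes_per_value, budget_bytes=BRAM_BYTES + URAM_BYTES):
    """Return True if num_values stored at the given width fit in the on-chip budget."""
    return num_values * bytes_per_value <= budget_bytes

# Hypothetical footprint of 5 million stored values (cores, gradients, buffers).
fits_fp32 = fits_on_chip(5_000_000, 4)      # 20 MB at FP32
fits_fp16 = fits_on_chip(5_000_000, 2)      # 10 MB at FP16
too_large = fits_on_chip(100_000_000, 4)    # 400 MB, the scale of an uncompressed model
```

Halving the word width from FP32 to FP16 halves the footprint of the same number of values, which is one reason a design may support both formats.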

[0104] Forward propagation engine 730 receives training data 745 from off-chip memory 740 and processes the data using parameters 725 stored in on-chip memory and computation kernels 710. In several embodiments of the invention, forward propagation engines implement tensor network contractions for computing forward propagation without reconstructing dense weight matrices during matrix-vector multiplication operations. Forward propagation engines may utilize bi-directional tensor-train contraction algorithms that reduce computational complexity and memory requirements compared to conventional sequential contraction approaches. Forward propagation engines in accordance with many embodiments of the invention may process query, key, and value transformations through TT-format linear layers, compute attention mechanisms using compressed representations, and generate intermediate activations through tensorized feedforward networks while maintaining all computations in compressed tensor space.

[0105] The forward propagation engine 730 generates activations / gradients 735, which may be stored both on-chip and in off-chip memory 740 as activations 750. The forward propagation engine 730 produces loss / prediction 760, which is stored in off-chip memory 740 for subsequent processing stages. The forward propagation operations may maintain tensor-compressed representations throughout the computation process, enabling efficient processing of transformer models that would exceed the memory capacity of edge devices in uncompressed form. Forward propagation engines in accordance with several embodiments may execute unified contraction engines that support forward propagation operations designed around bi-directional tensor-train execution characteristics.

[0106] Back propagation engine 715 receives activations / gradients 735 and labels 755 from off-chip memory 740 to compute gradients by backpropagating through the network using stored activations 750 and parameters 725. In many embodiments of the invention, back propagation engines implement backward pass computations for gradient calculation during the training process, working in conjunction with the forward propagation engine to enable complete training cycles. Back propagation engines may compute gradients with respect to activations and tensor cores through tensor network contractions that eliminate intermediate tensor storage requirements during gradient calculation. Back propagation engines in accordance with various embodiments of the invention may utilize fused parallel tensor-train contraction techniques that compute activation gradients and parameter gradients simultaneously while reducing memory overhead compared to conventional backpropagation approaches.

[0107] In several embodiments of the invention, back propagation engines propagate gradients through attention layers, feed-forward networks, and embedding layers using tensor contraction operations that maintain computational efficiency while preserving gradient accuracy for parameter updates. The gradient computation process may utilize tensor contraction operations that maintain the structural relationships between tensor cores while computing derivatives with respect to intermediate activations. Back propagation engines in accordance with a number of embodiments may demonstrate how activation gradients flow through the network without requiring reconstruction of dense weight matrices during the gradient calculation process, where the computed gradients are passed to parameter update kernel 720.

[0108] Parameter update kernel 720 updates parameters 725 based on gradients received from back propagation engine 715, where the updated parameters may be stored in on-chip memory and computation kernels 710 for use in subsequent training iterations. In accordance with various embodiments of the invention, parameter update kernels perform parameter updates directly on TT and TTM cores using gradients computed through tensor network contractions, eliminating computational overhead associated with dense matrix reconstruction during optimization steps. The parameter update process may apply gradient descent operations to individual tensor cores using learning rates and optimization algorithms while maintaining all parameter representations in compressed tensor formats. Parameter update kernel in accordance with several embodiments of the invention may maintain all weight representations in compressed tensor formats throughout the optimization process, eliminating the computational and memory overhead associated with dense matrix reconstruction.

[0109] Parameters 725 represent the tensor-compressed model weights stored within on-chip memory and computation kernels 710, where bidirectional arrows between parameters and both forward propagation engine 730 and back propagation engine 715 indicate that parameters are accessed and updated during both forward and backward propagation stages. In many embodiments of the invention, parameters include tensor-train cores and tensor-train-matrix cores that represent compressed weight matrices and embedding tables respectively. The parameters may be maintained in compressed tensor formats throughout the entire training process, enabling substantial memory reduction compared to conventional dense parameter representations. Parameters in accordance with various embodiments of the invention may be stored entirely on-chip in BRAM and URAM components, enabling efficient access during tensor network operations without requiring off-chip memory transfers.

[0110] Activations / gradients 735 are exchanged between on-chip memory and computation kernels 710 and off-chip memory 740, as indicated by bidirectional arrows that allow intermediate computation results to be stored off-chip when memory constraints require such storage while maintaining efficient access to frequently used data on-chip. In several embodiments of the invention, activations / gradients represent intermediate computation results generated during forward propagation and gradient computation stages. The activation and gradient data may be managed through a hybrid storage approach where frequently accessed data remains on-chip while larger intermediate results are offloaded to off-chip memory as needed. Activations / gradients in accordance with a number of embodiments may enable scalable training operations that can accommodate varying sequence lengths and model sizes while maintaining computational efficiency through optimized memory management strategies.

[0111] Off-chip memory 740 stores training data 745, activations 750, labels 755, and loss / prediction 760, providing external storage capacity for data that exceeds the on-chip memory constraints of the accelerator architecture. In accordance with various embodiments of the invention, off-chip memory separates the storage of training data, intermediate activations, labels, and predictions from the computation-intensive operations performed by forward propagation engine, back propagation engine, and parameter update kernel within on-chip memory and computation kernels. The off-chip memory may utilize external memory technologies to provide larger storage capacity for activation tensors that may be optionally offloaded when on-chip memory resources are constrained. Off-chip memory in accordance with several embodiments may enable the accelerator to handle larger models and longer sequences by providing additional storage capacity while maintaining the computational efficiency achieved through on-chip parameter storage.

[0112] Training data 745 stored in off-chip memory 740 provides input sequences and associated information for training operations, where the data may be transferred to on-chip processing elements as needed during training iterations. In many embodiments of the invention, training data includes tokenized input sequences, positional information, and other input features required for transformer training operations. The training data may be accessed by forward propagation engine during forward pass operations to generate predictions and intermediate activations. Training data in accordance with various embodiments of the invention may be managed through efficient data transfer protocols that minimize the impact of off-chip memory access on overall training performance while enabling the processing of datasets that exceed on-chip storage capacity.

[0113] Activations 750 represent intermediate computation results stored in off-chip memory 740 that are generated during forward propagation and may be retrieved during backward propagation for gradient computation. In several embodiments of the invention, activations include intermediate tensor representations produced by attention layers, feed-forward networks, and other processing stages within the transformer architecture. The activations may be selectively stored off-chip when memory constraints require such storage, while maintaining efficient access patterns for gradient computation during backward propagation. Activations in accordance with a number of embodiments may enable the training of larger transformer models by providing external storage for intermediate results while maintaining the computational efficiency achieved through tensor-compressed parameter representations.

[0114] Labels 755 stored in off-chip memory 740 provide target outputs for supervised training operations, where the labels may be accessed by back propagation engine during gradient computation stages. In accordance with various embodiments of the invention, labels include ground truth classifications, sequence targets, or other supervisory signals required for training transformer models on specific tasks. The labels may be compared with predictions generated by forward propagation engine to compute loss values that drive the gradient computation process. Labels in accordance with several embodiments may be managed through efficient storage and retrieval mechanisms that support various training tasks while maintaining compatibility with the tensor-compressed training framework.

[0115] Loss / prediction 760 represents the output values generated by forward propagation engine 730 and stored in off-chip memory 740, where these values may include model predictions and computed loss values used for gradient computation. In many embodiments of the invention, loss / prediction includes classification probabilities, regression outputs, or other task-specific predictions generated by the transformer model during forward propagation. The loss values may be computed by comparing predictions with labels to generate gradient signals that drive the backward propagation process. Loss / prediction in accordance with various embodiments of the invention may be stored off-chip to accommodate the memory requirements of different training tasks while maintaining efficient access for gradient computation operations.

[0116] Tensorized transformer training accelerators in accordance with several embodiments of the invention may be implemented on an AMD Alveo U50 FPGA with hardware resources including 872k LUTs, 5952 DSPs, 5.08 MB of BRAM, and 22.5 MB of URAM. The FPGA implementation may enable specialized tensor network operations through custom hardware designs optimized for tensor-train and tensor-train-matrix computations. In a number of embodiments, tensorized transformer training accelerators utilize the separation of on-chip and off-chip storage to enable efficient training operations where compressed parameters remain on-chip for fast access while larger intermediate data may be managed through off-chip memory as needed. The architecture may demonstrate how unified contraction engines support forward propagation, backward propagation, and core-gradient updates through optimized tensor computations designed around bi-directional tensor-train execution characteristics.
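
The relationship between compressed parameter size and the stated on-chip capacity can be illustrated with simple arithmetic. The Python sketch below is a hypothetical example only: the 768 x 768 layer, its factorization into modes (12, 8, 8) x (8, 8, 12), the uniform TT rank of 12, 32-bit parameters, and the use of 1 MB = 10^6 bytes are assumptions chosen for illustration and are not taken from the description above. Only the BRAM and URAM capacities come from the text.

```python
def tt_param_count(mode_sizes, rank):
    """Number of parameters in a TT chain with uniform internal rank."""
    ranks = [1] + [rank] * (len(mode_sizes) - 1) + [1]
    return sum(ranks[k] * s * ranks[k + 1] for k, s in enumerate(mode_sizes))

BYTES_PER_PARAM = 4                 # assumption: 32-bit parameters
ON_CHIP_MB = 5.08 + 22.5            # BRAM + URAM capacity stated in the text

dense_params = 768 * 768            # hypothetical dense linear layer
tt_params = tt_param_count([12, 8, 8, 8, 8, 12], rank=12)

dense_mb = dense_params * BYTES_PER_PARAM / 1e6
tt_mb = tt_params * BYTES_PER_PARAM / 1e6

print(f"dense layer: {dense_params} params, {dense_mb:.3f} MB")
print(f"TT layer:    {tt_params} params, {tt_mb:.3f} MB")
print(f"compression: {dense_params / tt_params:.1f}x")
print(f"on-chip budget: {ON_CHIP_MB:.2f} MB")
```

Under these assumptions a single dense layer would occupy a noticeable fraction of the on-chip budget, while its TT cores occupy a small fraction of one percent of it, which is the property that allows the cores to remain on-chip.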

[0117] Various tensorized transformer training accelerators are discussed above with reference to FIG. 7. Alternative tensorized transformer training accelerators can be utilized as appropriate to the requirements of specific applications. These alternative accelerators also provide training acceleration capabilities in accordance with various embodiments of the invention.

[0118] A system diagram of a tensorized transformer training accelerator during backward propagation in accordance with an embodiment of the invention is illustrated in FIG. 8. Tensorized transformer training accelerators in accordance with various embodiments of the invention may utilize unified contraction engines that support forward propagation, backward propagation, and core-gradient updates through optimized tensor computations designed around bi-directional tensor-train execution characteristics.

[0119] On-chip memory and computation kernels in accordance with some embodiments of the invention house specialized processing elements that execute backward propagation operations through tensor network contractions. In many embodiments of the invention, on-chip memory and computation kernels implement a TT core grouping method that concatenates multiple TT cores without data dependency into a single array to improve BRAM utilization efficiency. The TT core grouping approach may increase the array depth by combining tensor cores that can be processed independently, enabling more efficient use of fixed-size BRAM blocks while maintaining parallel access capabilities. On-chip memory and computation kernels in accordance with several embodiments of the invention may implement array partitioning and array reshaping techniques to support parallel data access, where array partitioning uses r BRAM blocks per TT core and array reshaping concatenates elements by increasing bit-width to optimize memory allocation patterns.

[0120] The array partitioning technique may partition large data arrays into multiple smaller arrays that are mapped to separate BRAM blocks, enabling parallel data loading for tensor cores during backward propagation operations. In various embodiments of the invention, array partitioning utilizes r BRAM blocks per TT core to support parallel access patterns required for efficient tensor network contractions. The partitioning approach may enable simultaneous access to multiple tensor core elements during gradient computation, reducing memory access latency and improving computational throughput. On-chip memory and computation kernels in accordance with a number of embodiments may configure BRAM blocks with various width and depth combinations to accommodate different tensor core sizes while maximizing utilization efficiency through the partitioning strategy.

[0121] The array reshaping technique concatenates multiple elements of tensor arrays by increasing the bit-width to support parallel data processing within individual BRAM blocks. In several embodiments of the invention, array reshaping enables more memory-efficient storage compared to array partitioning by utilizing the full width capacity of BRAM blocks. The reshaping approach may combine multiple tensor elements into wider data words that can be processed simultaneously during tensor network operations. On-chip memory and computation kernels in accordance with many embodiments of the invention may utilize array reshaping when the combined bit-width of tensor elements remains within the maximum width capacity of BRAM blocks, enabling efficient parallel processing while minimizing the number of required memory blocks.
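
The trade-off among array partitioning, array reshaping, and TT core grouping can be illustrated with a rough block-count model. The Python sketch below is an idealized estimate and not a synthesis tool's actual mapping: it assumes 36 Kb BRAM blocks with the aspect ratios listed in BRAM36_CONFIGS, 32-bit elements, and a hypothetical TT core of 12 x 8 x 12 = 1152 elements. The function names are illustrative.

```python
from math import ceil

# Assumed (depth, width-in-bits) aspect ratios of one 36 Kb BRAM block.
BRAM36_CONFIGS = [(32768, 1), (16384, 2), (8192, 4), (4096, 9),
                  (2048, 18), (1024, 36), (512, 72)]

def bram_blocks(depth, width_bits):
    """Fewest blocks that can hold a depth x width array in this idealized model."""
    return min(ceil(depth / d) * ceil(width_bits / w) for d, w in BRAM36_CONFIGS)

def partitioned(n_elems, elem_bits, factor):
    """Array partitioning: 'factor' separate arrays, each in its own block(s)."""
    return factor * bram_blocks(ceil(n_elems / factor), elem_bits)

def reshaped(n_elems, elem_bits, factor):
    """Array reshaping: one array whose words are 'factor' elements wide."""
    return bram_blocks(ceil(n_elems / factor), elem_bits * factor)

core_elems, bits, r = 12 * 8 * 12, 32, 12

one_core_partitioned = partitioned(core_elems, bits, r)       # r blocks, each shallow
one_core_reshaped = reshaped(core_elems, bits, r)             # fewer, wider blocks
four_cores_separate = 4 * partitioned(core_elems, bits, r)    # no grouping
four_cores_grouped = partitioned(4 * core_elems, bits, r)     # grouped into one array

print(one_core_partitioned, one_core_reshaped,
      four_cores_separate, four_cores_grouped)
```

In this model, grouping four independent cores deepens each partitioned array from 96 to 384 words, which still fits in the same blocks, so the grouped layout uses one quarter of the blocks of the ungrouped layout.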

[0122] Back propagation engines in accordance with many embodiments execute gradient computation operations through specialized kernels that process tensor network contractions during backward propagation stages. In accordance with various embodiments of the invention, back propagation engines utilize multiple computational kernels including TTM-BP kernels for processing gradients through tensor-train-matrix embedding layers, TT-BP kernels for computing gradients through tensor-train linear layers, and MM kernels for handling matrix multiplication operations in attention mechanisms. The backward propagation process may compute gradients with respect to both activations and tensor cores through fused parallel tensor-train contraction techniques that eliminate intermediate tensor storage requirements during gradient calculation.

[0123] The TTM-BP kernel processes gradient computations for tensor-train-matrix embedding layers during backward propagation operations. In many embodiments of the invention, TTM-BP kernels compute gradients with respect to TTM cores $F_k$ through tensor network contractions that involve input gradients and intermediate activation values stored from forward propagation. The TTM gradient computation may utilize the relationship $F'_k[i_k, j_k] = \sum_{i_1,\ldots,i_d} \left(F_{k-1}[i_{k-1}, j_{k-1}] \cdots F_1[i_1, j_1]\, y'_{i_1,\ldots,i_{k-1},i_{k+1},\ldots,i_d}\right) \left(F_d[i_d, j_d] \cdots F_{k+1}[i_{k+1}, j_{k+1}]\right)$, where $F'_k[i_k, j_k]$ represents a slice of the derivative tensor obtained by fixing the indices $i_k$ and $j_k$. TTM-BP kernels in accordance with several embodiments of the invention may process these gradient computations directly on compressed TTM cores without reconstructing dense embedding matrices during backward propagation.
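
For a two-core TTM embedding, the gradient slices can be written out explicitly. The Python sketch below is a minimal illustration under stated assumptions: the core layouts (F1 of shape (m1, n1, r) and F2 of shape (r, m2, n2), with boundary ranks squeezed), the function names, and the sizes are hypothetical, and a single looked-up token with index pair (i1, i2) is considered. Only the core slices selected by that token receive a nonzero gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, n1, n2, r = 4, 5, 2, 3, 3
F1 = rng.standard_normal((m1, n1, r))    # F1[i1, j1] is a length-r vector
F2 = rng.standard_normal((r, m2, n2))    # F2[:, i2, j2] is a length-r vector

def lookup(F1, F2, i1, i2):
    """Embedding row for token (i1, i2), folded to shape (n1, n2)."""
    return np.einsum('jp,pk->jk', F1[i1], F2[:, i2])

def ttm_core_grads(F1, F2, i1, i2, y_grad):
    """Gradients of <y_grad, lookup> with respect to both cores."""
    g1 = np.zeros_like(F1)
    g2 = np.zeros_like(F2)
    # Remove F1 from the network: contract y' with the remaining core F2.
    g1[i1] = np.einsum('jk,pk->jp', y_grad, F2[:, i2])
    # Remove F2 from the network: contract the remaining core F1 with y'.
    g2[:, i2] = np.einsum('jp,jk->pk', F1[i1], y_grad)
    return g1, g2

i1, i2 = 2, 3
y_grad = rng.standard_normal((n1, n2))   # upstream gradient y'
g1, g2 = ttm_core_grads(F1, F2, i1, i2, y_grad)
```

Because the lookup is linear in each core separately, the result can be checked exactly by perturbing one core and comparing the change in the inner product with y_grad.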

[0124] The TT-BP kernel computes gradients for tensor-train linear layers through bi-directional tensor network contractions that process parameter gradients and activation gradients simultaneously. In various embodiments of the invention, TT-BP kernels utilize fused parallel contraction techniques that divide gradient computation operations into fine-grained contractions, enabling efficient memory utilization during backward propagation. The TT gradient computation may process gradients with respect to tensor cores $G_k$ through tensor network operations that eliminate the target core from the network while maintaining connections between remaining components. TT-BP kernels in accordance with a number of embodiments may compute activation gradients $x'$ through tensor network contractions expressed as $x' = G_d[j_d] \cdots G_1[j_1] \sum_{i_1,\ldots,i_d} G_d^{T}[i_d] \cdots G_1^{T}[i_1]\, y'$, where the contraction maintains computational efficiency while preserving gradient accuracy.
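
The idea of eliminating the target core from the network can be illustrated for a small TT linear layer. The Python sketch below assumes a hypothetical four-core layout (two output-side cores G1, G2 and two input-side cores G3, G4, with boundary ranks squeezed) and uses direct einsum contractions. It illustrates which tensors are contracted for each gradient, and it is not the fused parallel contraction schedule described above.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, n1, n2, r = 3, 4, 4, 3, 2
G1 = rng.standard_normal((m1, r))         # output-side core, boundary rank squeezed
G2 = rng.standard_normal((r, m2, r))      # output-side core
G3 = rng.standard_normal((r, n1, r))      # input-side core
G4 = rng.standard_normal((r, n2))         # input-side core, boundary rank squeezed
X = rng.standard_normal((n1, n2))         # input x folded to (n1, n2)
Y_grad = rng.standard_normal((m1, m2))    # upstream gradient y' folded to (m1, m2)

def forward(G1, G2, G3, G4, X):
    """y = W x evaluated as a single tensor network contraction."""
    return np.einsum('ap,pbq,qcs,sd,cd->ab', G1, G2, G3, G4, X)

# Activation gradient x': the same network with y' attached instead of x.
X_grad = np.einsum('ap,pbq,qcs,sd,ab->cd', G1, G2, G3, G4, Y_grad)

# Core gradient for G2: G2 is removed from the network and its three open
# legs (p, b, q) become the indices of the result.
G2_grad = np.einsum('ap,qcs,sd,cd,ab->pbq', G1, G3, G4, X, Y_grad)
```

Both gradients reuse the same cores, input, and upstream gradient, which is why they are natural candidates for being computed together.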

[0125] The MM kernel handles matrix multiplication operations during backward propagation for attention mechanisms and other components that utilize conventional dense matrix representations. In several embodiments of the invention, MM kernels process gradient computations for attention score calculations, softmax operations, and classifier layers that may not be compressed using tensor decomposition techniques. The MM kernel operations may compute gradients through standard matrix multiplication algorithms while interfacing with tensor-compressed components through appropriate data format conversions. MM kernels in accordance with many embodiments of the invention may maintain compatibility with tensor network operations by processing gradients in formats that can be efficiently transferred to and from tensor-compressed processing stages.

[0126] The parameter update kernel 720 receives gradient information from back propagation engine 715 and applies optimization updates directly to tensor cores stored in parameters 725. In accordance with various embodiments of the invention, parameter update kernels perform gradient descent operations on individual tensor cores $G_k$ and $F_k$ using learning rates and optimization algorithms while maintaining all parameter representations in compressed tensor formats. The parameter update process may apply updates as $G_k \leftarrow G_k - \alpha G'_k$ for TT cores and $F_k \leftarrow F_k - \alpha F'_k$ for TTM cores, where $\alpha$ represents the learning rate and the gradient terms are computed through tensor network contractions during backward propagation. Parameter update kernels in accordance with several embodiments may store updated tensor cores in on-chip memory resources, enabling subsequent training iterations to access compressed parameters without off-chip memory transfers.
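
The update rule amounts to an element-wise operation on each stored core. A minimal Python sketch, assuming plain gradient descent with a single learning rate and using the hypothetical function name sgd_update_cores, is:

```python
import numpy as np

def sgd_update_cores(cores, core_grads, lr):
    """In-place update core <- core - lr * grad for every TT or TTM core."""
    for core, grad in zip(cores, core_grads):
        core -= lr * grad          # overwrites the stored core buffer directly
    return cores

# Two hypothetical cores with constant values, so the result is easy to read.
cores = [np.ones((1, 4, 3)), np.ones((3, 4, 1))]
grads = [np.full((1, 4, 3), 2.0), np.full((3, 4, 1), -1.0)]
sgd_update_cores(cores, grads, lr=0.1)
```

No dense matrix appears at any point: the update touches exactly as many values as there are core parameters.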

[0127] The activations / gradients 735 flow between on-chip processing elements and off-chip memory 740 during backward propagation operations, where gradient information may be computed on-chip and intermediate results may be stored off-chip when memory constraints require such management. In many embodiments of the invention, activations / gradients represent both forward propagation activations that are retrieved for gradient computation and backward propagation gradients that are computed during the backward pass. The gradient flow may utilize efficient data transfer protocols that minimize the impact of off-chip memory access while enabling the processing of larger models that exceed on-chip storage capacity. Activations / gradients in accordance with various embodiments of the invention may be managed through hybrid storage approaches where frequently accessed gradient data remains on-chip while larger intermediate gradient tensors are offloaded to off-chip memory as needed.

[0128] The system characteristics during backward propagation demonstrate how tensorized transformer training accelerators maintain TT cores entirely on-chip in BRAM and URAM components while optionally offloading activation tensors to DRAM when memory constraints require external storage. In several embodiments of the invention, the on-chip storage of tensor cores enables efficient access during gradient computation operations without requiring off-chip memory transfers that would increase latency and energy consumption. The activation tensor offloading capability may provide scalability for larger models and longer sequences while maintaining the computational efficiency achieved through compressed parameter representations. Tensorized transformer training accelerators in accordance with a number of embodiments may utilize this hybrid memory management approach to balance on-chip processing efficiency with the storage capacity requirements of varying model sizes and training configurations.

[0129] The unified contraction engines support forward propagation, backward propagation, and core-gradient updates through specialized tensor network operations designed around bi-directional tensor-train execution characteristics. In accordance with various embodiments of the invention, unified contraction engines enable efficient processing of tensor network operations during all stages of the training process while maintaining compressed parameter representations throughout the computation flow. The bi-directional execution characteristics may enable parallel processing of tensor contractions from multiple directions, reducing sequential dependencies and improving computational throughput during backward propagation operations. Unified contraction engines in accordance with several embodiments may demonstrate how the system architecture optimizes tensor network computations for resource-constrained hardware platforms while maintaining training accuracy comparable to uncompressed implementations.

[0130] Various back propagation processes using tensorized transformer training accelerators are discussed above with reference to FIG. 8. Alternative back propagation processes using tensorized transformer training accelerators can be utilized as appropriate to the requirements of specific applications. These alternative representations also provide training acceleration capabilities in accordance with various embodiments of the invention.

[0131] Although specific examples of system components are described above with reference to FIG. 8, alternative implementations are possible that are appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

[0132] A system diagram of a tensorized transformer training accelerator during forward propagation in accordance with an embodiment of the invention is illustrated in FIG. 9. The tensorized transformer training accelerator demonstrates detailed interactions between computational kernels during forward propagation operations, where specialized engines process tensor network contractions through multiple processing stages while maintaining compressed parameter representations throughout the training process. Tensorized transformer training accelerators in accordance with various embodiments of the invention may utilize unified contraction engines that support forward propagation, backward propagation, and core-gradient updates through optimized tensor computations designed around bi-directional tensor-train execution characteristics.

[0133] The forward propagation process begins with training data flowing into a TTM-FP kernel that processes tensor-train-matrix embedding operations during the initial stages of forward propagation. In many embodiments of the invention, TTM-FP kernels execute embedding lookup operations through tensor network contractions that convert discrete input tokens into dense vector representations while maintaining substantial memory reduction compared to conventional dense embedding approaches. The TTM-FP kernel may process input sequences through compressed embedding tables stored as TTM cores, where the lookup operations are performed directly on tensor-compressed representations without reconstructing dense embedding matrices. TTM-FP kernels in accordance with several embodiments of the invention may generate tensorized input representations that serve as compressed feature vectors for subsequent processing stages in the forward propagation pipeline.
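
An embedding lookup performed directly on TTM cores can be sketched as follows. The Python example is illustrative only: it assumes TTM cores of shape (r_prev, m_k, n_k, r_next) with boundary ranks of 1, a row-major mapping from the token index to the multi-index (i_1, ..., i_d), and the hypothetical function names ttm_lookup and ttm_to_dense. The dense table is built only to check the lookup.

```python
import numpy as np

def ttm_lookup(cores, token_id):
    """Return one embedding row (length N) without forming the dense table."""
    m_dims = [c.shape[1] for c in cores]
    idx = np.unravel_index(token_id, m_dims)       # token -> (i_1, ..., i_d)
    t = cores[0][0, idx[0]]                        # (n_1, r_1)
    for core, i_k in zip(cores[1:], idx[1:]):
        # Select the slice for i_k, then chain it onto the running product.
        t = np.tensordot(t, core[:, i_k], axes=([t.ndim - 1], [0]))
    return t.reshape(-1)                           # trailing rank 1 squeezed

def ttm_to_dense(cores):
    """Dense (M, N) table, built here ONLY to verify the sketch."""
    t = cores[0]
    for core in cores[1:]:
        t = np.tensordot(t, core, axes=([t.ndim - 1], [0]))
    d = len(cores)
    t = t.reshape(t.shape[1:-1])                   # (m_1, n_1, m_2, n_2, ...)
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    t = t.transpose(perm)                          # (m_1..m_d, n_1..n_d)
    m = int(np.prod(t.shape[:d]))
    return t.reshape(m, -1)

# Hypothetical table: M = 4*5*6 = 120 tokens, N = 2*3*4 = 24 features, rank 3.
rng = np.random.default_rng(0)
shapes = [(1, 4, 2, 3), (3, 5, 3, 3), (3, 6, 4, 1)]
cores = [rng.standard_normal(s) for s in shapes]
row = ttm_lookup(cores, token_id=57)
```

Each lookup reads only one slice per core, so its cost depends on the ranks and the feature modes and not on the vocabulary size.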

[0134] The output from the TTM-FP kernel flows through a layer normalization (LN) operation that normalizes the embedded representations before feeding into subsequent tensor-compressed processing stages. In various embodiments of the invention, layer normalization operations maintain compatibility with tensor-compressed data formats while providing the stabilization effects needed for effective transformer training. The layer normalization process may operate on tensorized representations produced by the TTM-FP kernel, ensuring that the normalized outputs maintain the compressed format required for efficient processing by downstream tensor network operations. Layer normalization operations in accordance with a number of embodiments may preserve the mathematical properties of the embedded representations while enabling stable gradient flow during backward propagation stages.

[0135] Parameters stored in on-chip memory provide tensor-compressed model parameters to multiple processing kernels throughout the forward propagation process. In several embodiments of the invention, parameters include tensor-train cores and tensor-train-matrix cores that represent compressed weight matrices and embedding tables respectively, where all parameters are maintained in compressed tensor formats throughout the entire training process. The parameters may be stored entirely on-chip in BRAM and URAM components, enabling efficient access during tensor network operations without requiring off-chip memory transfers that would increase latency and energy consumption. Parameters in accordance with many embodiments of the invention may be accessed simultaneously by multiple processing kernels during forward propagation, where the on-chip storage enables parallel access patterns required for efficient tensor network contractions.

[0136] The TT-FP kernel receives inputs from both the parameters and the layer normalization output, processing tensor-train linear layer operations through bi-directional tensor network contractions. In accordance with various embodiments of the invention, TT-FP kernels implement forward propagation computations for tensor-compressed linear layers without reconstructing dense weight matrices during matrix-vector multiplication operations. The TT-FP kernel may utilize bi-directional tensor-train contraction algorithms that reduce computational complexity and memory requirements compared to conventional sequential contraction approaches. TT-FP kernels in accordance with several embodiments may process query, key, and value transformations through TT-format linear layers while maintaining all computations in compressed tensor space, where the bi-directional contraction enables parallel processing of tensor network operations during forward propagation.
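
One possible reading of a bi-directional contraction is that the output-side cores and the input-side cores are merged independently, from the two ends of the chain toward the middle, and the two merged factors are then applied to the input in sequence. The Python sketch below illustrates that reading under stated assumptions. The core layout, the function names merge_chain and btt_forward, and the sizes are hypothetical, and the sketch is not asserted to match the accelerator's kernels or their scheduling.

```python
import numpy as np

def merge_chain(cores):
    """Contract a chain of 3-way cores into (r_first, s_1, ..., s_k, r_last)."""
    t = cores[0]
    for core in cores[1:]:
        t = np.tensordot(t, core, axes=([t.ndim - 1], [0]))
    return t

def btt_forward(out_cores, in_cores, x_batch):
    """y = Z1 (Z2 x); the two merges share no data and could run in parallel."""
    z1 = merge_chain(out_cores)                  # (1, m_1..m_d, r_mid)
    z1 = z1.reshape(-1, z1.shape[-1])            # (M, r_mid)
    z2 = merge_chain(in_cores)                   # (r_mid, n_1..n_d, 1)
    z2 = z2.reshape(z2.shape[0], -1)             # (r_mid, N)
    return z1 @ (z2 @ x_batch), z1, z2           # the M x N matrix is never formed

# Hypothetical layer with M = 3*4 = 12, N = 4*3 = 12, rank 2, batch of 5 inputs.
rng = np.random.default_rng(0)
m1, m2, n1, n2, r, batch = 3, 4, 4, 3, 2, 5
out_cores = [rng.standard_normal((1, m1, r)), rng.standard_normal((r, m2, r))]
in_cores = [rng.standard_normal((r, n1, r)), rng.standard_normal((r, n2, 1))]
x_batch = rng.standard_normal((n1 * n2, batch))
y, z1, z2 = btt_forward(out_cores, in_cores, x_batch)
```

In this sketch the merged factors hold M x r and r x N values respectively, and the intermediate product z2 @ x_batch has only r rows, so both storage and arithmetic scale with the rank and not with the product M x N.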

[0137] The output from the TT-FP kernel flows through a GELU activation function that applies non-linear transformations to the tensor-compressed representations. In many embodiments of the invention, GELU activation functions maintain compatibility with tensor-compressed data formats while providing the non-linear processing capabilities required for effective transformer training. The GELU activation may operate directly on tensorized representations produced by TT-FP kernels, preserving the compressed format while applying the mathematical transformations needed for feed-forward network processing. GELU activation functions in accordance with various embodiments of the invention may enable efficient processing of tensor-compressed activations while maintaining the mathematical properties required for accurate gradient computation during backward propagation stages.

[0138] The MM kernel performs matrix multiplication operations for attention mechanisms and other components that utilize conventional dense matrix representations during forward propagation. In several embodiments of the invention, MM kernels handle attention score calculations, softmax operations, and other processing stages that may not be compressed using tensor decomposition techniques. The MM kernel operations may interface with tensor-compressed components through appropriate data format conversions, enabling seamless integration between compressed and uncompressed processing stages. MM kernels in accordance with a number of embodiments may maintain compatibility with tensor network operations by processing data in formats that can be efficiently transferred to and from tensor-compressed processing stages during forward propagation.

[0139] The MM kernel applies a Tanh activation function before producing intermediate results that flow to subsequent processing stages in the forward propagation pipeline. In accordance with various embodiments of the invention, Tanh activation functions provide non-linear transformations for classification and other output processing stages while maintaining compatibility with the overall tensor-compressed training framework. The Tanh activation may operate on outputs from matrix multiplication operations, providing the final non-linear processing needed for task-specific predictions. Tanh activation functions in accordance with several embodiments may enable effective classification performance while interfacing efficiently with the tensor-compressed components that dominate the computational processing throughout the transformer architecture.

[0140] The system characteristics during forward propagation demonstrate how tensorized transformer training accelerators maintain TT cores entirely on-chip in BRAM and URAM components while optionally offloading activation tensors to DRAM when memory constraints require external storage. In many embodiments of the invention, the on-chip storage of tensor cores enables efficient access during forward propagation operations without requiring off-chip memory transfers that would increase latency and energy consumption. The activation tensor offloading capability may provide scalability for larger models and longer sequences while maintaining the computational efficiency achieved through compressed parameter representations. Tensorized transformer training accelerators in accordance with various embodiments may utilize this hybrid memory management approach to balance on-chip processing efficiency with the storage capacity requirements of varying model sizes and training configurations.

[0141] The unified contraction engines support forward propagation, backward propagation, and core-gradient updates through specialized tensor network operations designed around bi-directional tensor-train execution characteristics. In several embodiments of the invention, unified contraction engines enable efficient processing of tensor network operations during forward propagation stages while maintaining compressed parameter representations throughout the computation flow. The bi-directional execution characteristics may enable parallel processing of tensor contractions from multiple directions, reducing sequential dependencies and improving computational throughput during forward propagation operations. Unified contraction engines in accordance with a number of embodiments may demonstrate how the system architecture optimizes tensor network computations for resource-constrained hardware platforms while maintaining training accuracy comparable to uncompressed implementations.

[0142] The forward propagation flow demonstrates bidirectional data flow between various components, with computational paths indicating the direction of processing during forward pass operations. In accordance with various embodiments of the invention, the bidirectional data flow enables efficient transfer of tensorized representations between processing stages while maintaining the compressed format throughout the forward propagation pipeline. The computational paths may show how tensor-compressed layers integrate with standard neural network operations to enable memory-efficient training on resource-constrained hardware platforms. The forward propagation architecture in accordance with several embodiments may illustrate how specialized kernels process different types of tensor operations while maintaining compatibility with the overall transformer training framework through optimized data flow patterns.

[0143] Various forward propagation processes using tensorized transformer training accelerators are discussed above with reference to FIG. 9. Alternative forward propagation processes using tensorized transformer training accelerators can be utilized as appropriate to the requirements of specific applications. These alternative representations also provide training acceleration capabilities in accordance with various embodiments of the invention.

[0144] A comparison of task scheduling for BTT-format forward propagation before and after optimization in accordance with an embodiment of the invention is illustrated in FIG. 10. The task scheduling optimization demonstrates how tensorized transformer training accelerators may implement rescheduling techniques to optimize parallel BTT computation while maintaining computational accuracy and reducing hardware resource requirements. Task scheduling optimization systems in accordance with various embodiments of the invention may enable more efficient utilization of computational resources during tensor network operations by reorganizing the temporal execution of tensor contraction operations without increasing total processing latency.

[0145] The left side of FIG. 10 shows a schematic representation where tensor cores and intermediate activations, and input and output of the TT-linear layer are represented. In many embodiments of the invention, the schematic representation illustrates the computational dependencies between tensor cores and intermediate results during BTT-format forward propagation operations. The tensor cores may store compressed weight parameters that participate in bi-directional tensor network contractions, while intermediate activations represent the temporary results generated during the contraction process. Task scheduling optimization systems in accordance with several embodiments may utilize these schematic representations to identify opportunities for parallel execution of tensor operations that do not have direct computational dependencies.
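As an illustration of the dependencies described above, the following is a minimal NumPy sketch of a TT-format linear layer evaluated by bi-directional contraction. It assumes a layer with 2d tensor cores, where the first d cores carry the output dimensions and the last d cores carry the input dimensions. The function names (merge_left, merge_right, tt_linear_forward), the core shapes, and the rank value are illustrative assumptions and are not taken from FIG. 10. The two merges do not depend on each other, so they could run in parallel, and the dense weight matrix is never formed.

```python
import numpy as np

def merge_left(cores):
    """Contract output-side cores of shape (r_prev, m_k, r_next) into an (M, r_d) matrix."""
    z = cores[0].reshape(cores[0].shape[1], cores[0].shape[2])  # leading rank is 1
    for g in cores[1:]:
        # (M_so_far, r) contracted with (r, m, r_next) -> (M_so_far * m, r_next)
        z = np.einsum('ar,rmk->amk', z, g).reshape(-1, g.shape[2])
    return z

def merge_right(cores):
    """Contract input-side cores of shape (r_prev, n_k, r_next) into an (r_d, N) matrix."""
    z = cores[-1].reshape(cores[-1].shape[0], cores[-1].shape[1])  # trailing rank is 1
    for g in reversed(cores[:-1]):
        # (r, n, r_next) contracted with (r_next, N_so_far) -> (r, n * N_so_far)
        z = np.einsum('rnk,kb->rnb', g, z).reshape(g.shape[0], -1)
    return z

def tt_linear_forward(cores, x):
    """Compute y = W x from the cores alone; x has shape (N, batch)."""
    d = len(cores) // 2
    left = merge_left(cores[:d])    # independent of `right`
    right = merge_right(cores[d:])  # independent of `left`
    return left @ (right @ x)       # meet in the middle, no (M, N) matrix

# Hypothetical example: M = 2*3, N = 4*2, all internal ranks equal to 3
rng = np.random.default_rng(0)
m, n, r = (2, 3), (4, 2), 3
shapes = [(1, m[0], r), (r, m[1], r), (r, n[0], r), (r, n[1], 1)]
cores = [rng.standard_normal(s) for s in shapes]
x = rng.standard_normal((n[0] * n[1], 5))
y = tt_linear_forward(cores, x)
```

In this sketch the largest objects held at any time are the two merged factors and the activations, whose sizes scale with the rank rather than with the product of the input and output dimensions.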

[0146] The right side of FIG. 10 displays timeline diagrams illustrating different task execution patterns, where different colors represent different computational tasks during the BTT forward propagation process. In various embodiments of the invention, the timeline diagrams demonstrate how tensor contraction operations can be reorganized temporally to achieve better resource utilization while maintaining the mathematical correctness of the computation results. The timeline visualization may show the execution sequence of MUL0, MUL1, and MUL2 kernels that perform different stages of the bi-directional tensor contraction process. Task scheduling optimization systems in accordance with a number of embodiments may utilize timeline analysis to identify computational bottlenecks and opportunities for parallel execution of independent tensor operations.

[0147] The "Before" timeline shows resource allocation with six MUL0 kernels executing simultaneously during the initial stages of BTT forward propagation. In several embodiments of the invention, the simultaneous execution of multiple MUL0 kernels represents a naive parallelization approach where all independent tensor contraction operations are launched concurrently without considering hardware resource constraints. The concurrent execution may require substantial hardware resources to support six parallel MUL0 operations, leading to increased area overhead and power consumption on resource-constrained platforms. Task scheduling optimization systems in accordance with many embodiments may demonstrate how the simultaneous approach can lead to resource contention and suboptimal hardware utilization when computational resources are limited.

[0148] The MUL0 kernels in the simultaneous execution approach perform tensor contractions between tensor cores G1 and G3 to produce intermediate results Z1, as well as contractions between tensor cores G1 and G2 to produce intermediate results Z3. In accordance with various embodiments of the invention, the MUL0 operations represent the initial stages of bidirectional tensor contraction where independent tensor cores can be processed in parallel without computational dependencies. The simultaneous execution may enable faster completion of the initial contraction stages but requires multiple dedicated hardware resources to support parallel MUL0 operations. Task scheduling optimization systems in accordance with several embodiments may analyze the resource requirements of simultaneous MUL0 execution to identify opportunities for hardware resource reduction through temporal rescheduling.

[0149] The "After" timeline demonstrates a rescheduled approach that reduces hardware requirements to two reusable MUL0 kernels by moving non-urgent operations to later time steps without increasing total latency. In many embodiments of the invention, the rescheduled approach identifies tensor contraction operations that can be delayed without affecting the critical path of the overall computation, enabling hardware resource sharing between different computational tasks. The rescheduling process may move non-critical MUL0 operations to time slots where hardware resources become available after completing other tensor contraction stages. Task scheduling optimization systems in accordance with various embodiments may achieve the same computational throughput as the simultaneous approach while requiring significantly fewer dedicated hardware resources through intelligent temporal reorganization of tensor operations.

[0150] The tensorized transformer training accelerator implements task rescheduling to optimize parallel BTT computation by moving non-urgent operations to later time steps and running them with other multipliers in parallel without increasing total latency. In several embodiments of the invention, the task rescheduling process analyzes the computational dependencies between tensor operations to identify which operations can be delayed without affecting the overall completion time of the BTT forward propagation. The rescheduling algorithm may prioritize operations on the critical path while deferring non-critical operations to time slots where computational resources become available. Task scheduling optimization systems in accordance with a number of embodiments may enable efficient resource utilization by sharing hardware multipliers between different types of tensor operations, reducing the total number of required computational units while maintaining processing throughput.

[0151] The optimization process enables hardware resource sharing where the same MUL0 kernels can be reused for different tensor contraction operations at different time steps during the BTT forward propagation process. In accordance with various embodiments of the invention, the resource sharing approach reduces the total number of required multiplier units from six to two while maintaining the same computational throughput through temporal multiplexing of hardware resources. The reusable MUL0 kernels may process different tensor core pairs sequentially, where the scheduling ensures that all required tensor contractions are completed within the same total time as the simultaneous approach. Task scheduling optimization systems in accordance with several embodiments may demonstrate how intelligent scheduling can achieve substantial hardware resource reduction without compromising computational performance or accuracy.
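The rescheduling idea can be sketched with a small list scheduler. The task graph below is a hypothetical toy example (six unit-time MUL0 tasks feeding a chain of three MUL1 tasks) and is not the graph of FIG. 10. The Task type, the function names, and the slack-based priority rule are all assumptions made for illustration. The sketch compares an as-soon-as-possible schedule, which launches every independent MUL0 task at once, with a schedule restricted to two MUL0 kernels in which tasks with later deadlines are deferred.

```python
from collections import namedtuple

Task = namedtuple("Task", "name kernel duration deps")

# Hypothetical toy graph, listed in topological order
TASKS = [Task(f"t{i}", "MUL0", 1, ()) for i in range(6)] + [
    Task("c0", "MUL1", 1, ("t0", "t1")),
    Task("c1", "MUL1", 1, ("c0", "t2", "t3")),
    Task("c2", "MUL1", 1, ("c1", "t4", "t5")),
]

def asap(tasks):
    """'Before': start every task as soon as its inputs exist (unlimited kernels)."""
    dur = {t.name: t.duration for t in tasks}
    start = {}
    for t in tasks:
        start[t.name] = max((start[d] + dur[d] for d in t.deps), default=0)
    return start

def makespan(tasks, start):
    return max(start[t.name] + t.duration for t in tasks)

def peak_mul0(tasks, start):
    """Largest number of MUL0 tasks running in the same time step."""
    return max(
        sum(1 for t in tasks if t.kernel == "MUL0"
            and start[t.name] <= now < start[t.name] + t.duration)
        for now in range(makespan(tasks, start)))

def latest_starts(tasks, deadline):
    """Latest start time of each task that still meets the overall deadline."""
    latest = {}
    for t in reversed(tasks):
        succ = [latest[s.name] for s in tasks if t.name in s.deps]
        latest[t.name] = min(succ, default=deadline) - t.duration
    return latest

def reschedule(tasks, mul0_kernels):
    """'After': at most `mul0_kernels` MUL0 tasks per step, most urgent first."""
    dur = {t.name: t.duration for t in tasks}
    latest = latest_starts(tasks, makespan(tasks, asap(tasks)))
    start, now = {}, 0
    while len(start) < len(tasks):
        busy = sum(1 for t in tasks if t.kernel == "MUL0" and t.name in start
                   and start[t.name] <= now < start[t.name] + t.duration)
        ready = [t for t in tasks if t.name not in start
                 and all(d in start and start[d] + dur[d] <= now for d in t.deps)]
        for t in sorted(ready, key=lambda t: latest[t.name]):
            if t.kernel == "MUL0":
                if busy == mul0_kernels:
                    continue  # non-urgent work waits for a free kernel
                busy += 1
            start[t.name] = now
        now += 1
    return start

before = asap(TASKS)
after = reschedule(TASKS, mul0_kernels=2)
```

For this toy graph the unconstrained schedule occupies six MUL0 kernels in the first time step, while the rescheduled version never uses more than two and finishes at the same time. A single MUL0 kernel would lengthen the schedule, so two is the smallest count that preserves latency here.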

[0152] The rescheduled execution maintains the mathematical correctness of BTT forward propagation while achieving better hardware resource utilization compared to the simultaneous execution approach. In many embodiments of the invention, the rescheduling process preserves all computational dependencies between tensor operations, ensuring that intermediate results are available when needed for subsequent contraction stages. The optimized scheduling may enable more efficient utilization of computational resources while reducing the area overhead and power consumption associated with parallel tensor processing hardware. Task scheduling optimization systems in accordance with various embodiments may achieve improved energy efficiency and reduced hardware complexity through temporal optimization of tensor network operations without sacrificing computational accuracy or processing speed.

[0153] The task scheduling optimization enables tensorized transformer training accelerators to achieve better resource utilization efficiency while maintaining the computational benefits of bi-directional tensor contraction algorithms. In several embodiments of the invention, the optimization process balances the trade-offs between parallel execution and hardware resource requirements, enabling scalable implementations that can adapt to different resource constraints on edge computing platforms. The rescheduling approach may provide flexibility for implementing BTT forward propagation on various hardware platforms with different computational resource availability. Task scheduling optimization systems in accordance with a number of embodiments may enable efficient deployment of tensor-compressed transformer training on resource-constrained devices by optimizing the temporal execution of tensor network operations while preserving the memory and computational advantages of the BTT approach.

[0154] Although specific examples of task scheduling optimizations are described above with reference to FIG. 10, alternative implementations are possible that are appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

[0155] A sequence diagram representing a training process with fused parallel BTT in accordance with an embodiment of the invention is illustrated in FIG. 11. The sequence diagram depicts interactions between various computational stages and data flows during the training operation, including forward propagation, backward propagation, and parameter update stages. The diagram may show operations involving tensor contractions and matrix multiplications through multiple tensor cores, with intermediate results being generated and passed between computational steps during the training process.

[0156] The sequence diagram illustrates operations labeled as MUL2 and MUL3, representing multiplication operations performed during the training process. In many embodiments of the invention, MUL2 operations contract between output gradients and input intermediate results, while MUL3 operations contract intermediate results with tensor factors to produce parameter gradients. The sequence may show how these operations interact during backward propagation stages, where gradient information flows backward through the network architecture with operations involving tensor cores and intermediate gradient computations.

[0157] The diagram demonstrates a comparison between two computational approaches, labeled as "Before" and "After," showing an optimization in the processing flow for memory efficiency. In several embodiments of the invention, the "Before" section shows a sequence where operations are performed with larger buffer requirements, where entire intra-layer results between contraction steps are stored, creating large buffers with high memory consumption. The conventional approach may require storage of all intermediate results throughout the computation process, leading to memory overhead that scales with tensor dimensions and computational complexity.

[0158] The "After" section demonstrates a modified approach where operations are split into multiple smaller multiplication operations with reduced buffer size requirements. In accordance with various embodiments of the invention, the back propagation engine implements a fused parallel BTT dataflow that eliminates memory overhead by dividing contraction operations into fine-grained contractions with immediate reuse of intermediate results. The fused approach may divide each contraction operation into multiple fine-grained contractions, where MUL2 and MUL3 operations are formulated as smaller computational units that process tensor slices rather than complete tensors.

[0159] The fine-grained contraction approach enables immediate reuse of intermediate results: when one fine-grained contraction produces a small intermediate result, the next fine-grained contraction step uses that result immediately. In many embodiments of the invention, the fused parallel BTT dataflow requires only a small buffer with size O(r) for all fine-grained contraction steps, where r represents the tensor rank. The memory cost of intermediate contraction results may be completely eliminated in the parameter gradient computation process, enabling more efficient utilization of on-chip memory resources during backward propagation operations.
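The difference between the two dataflows can be sketched in NumPy for a simplified case. The sketch assumes an output side with only two cores, G1 of shape (m1, r) and G2 of shape (r, m2, r), an input intermediate result Z of shape (r, batch), and an output gradient dY of shape (m1, m2, batch). These names and shapes, and the function names grad_unfused and grad_fused, are illustrative assumptions rather than the kernels of FIG. 11. The unfused version stores the full MUL2 result before MUL3 consumes it. The fused version produces one length-r slice at a time and consumes it immediately.

```python
import numpy as np

def grad_unfused(dY, Z, G1):
    """'Before': MUL2 writes a full (m1, m2, r) buffer, then MUL3 reads it."""
    T = np.einsum('ijb,kb->ijk', dY, Z)      # MUL2: output gradient x input intermediate
    return np.einsum('ia,ijk->ajk', G1, T)   # MUL3: intermediate x tensor factor

def grad_fused(dY, Z, G1):
    """'After': each fine-grained MUL2 result is used at once by a fine-grained MUL3."""
    m1, m2, _ = dY.shape
    dG2 = np.zeros((G1.shape[1], m2, Z.shape[0]))
    for i in range(m1):
        for j in range(m2):
            t = Z @ dY[i, j]                    # fine-grained MUL2: buffer of length r
            dG2[:, j, :] += np.outer(G1[i], t)  # fine-grained MUL3: accumulate gradient
    return dG2

# Hypothetical sizes: m1 = 4, m2 = 3, rank r = 2, batch = 5
rng = np.random.default_rng(0)
m1, m2, r, batch = 4, 3, 2, 5
G1 = rng.standard_normal((m1, r))
Z = rng.standard_normal((r, batch))
dY = rng.standard_normal((m1, m2, batch))
dG2_before = grad_unfused(dY, Z, G1)
dG2_after = grad_fused(dY, Z, G1)
```

Both functions return the same gradient with respect to G2. The only intermediate storage in the fused loop is the vector t, whose length equals the rank, while the unfused version holds m1 × m2 × r values between the two contractions.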

[0160] The sequence diagram shows how the fused parallel BTT approach maintains computational accuracy while reducing memory requirements compared to conventional tensor contraction methods. In several embodiments of the invention, the fine-grained contraction operations preserve the mathematical relationships between tensor components while eliminating the need to store large intermediate tensors during gradient computation. The fused approach may enable scalable backward propagation operations that can accommodate varying tensor sizes and model configurations while maintaining efficient memory utilization on resource-constrained hardware platforms.

[0161] The temporal relationship between operations in the sequence diagram illustrates how different computational kernels interact during the training process, with arrows indicating the direction of data flow and the progression of computations through time. In accordance with various embodiments of the invention, the sequence demonstrates how data dependencies are managed during fused parallel BTT operations, where intermediate results are generated and consumed within the same computational cycle to minimize memory storage requirements. The temporal coordination may enable efficient pipeline processing of tensor network operations while maintaining the computational benefits of bi-directional tensor contraction algorithms.

[0162] Although specific examples of fused parallel BTT are described above with reference to FIG. 11, alternative implementations are possible that are appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

[0163] A computing device architecture for performing tensor-compressed training in accordance with an embodiment of the invention is illustrated in FIG. 12. Computing device systems in accordance with various embodiments of the invention provide comprehensive computational frameworks that enable systematic processing of tensor-compressed training operations, gradient computations, and parameter updates through integrated hardware and software architectures designed for memory-efficient transformer model training on resource-constrained platforms. A computing device 1200 incorporates multiple interconnected components that work together to manage complex tensor network contraction operations, network communications, and information storage requirements throughout tensor-compressed transformer training operations. The computing device 1200 includes a processor 1205 that provides computational capabilities for executing tensor network contractions, processing bi-directional tensor-train operations, and coordinating forward propagation, backward propagation, and parameter update activities during transformer training phases. The computing device 1200 further incorporates a network interface 1210 that enables communication with external devices, data sources, and distributed computing resources throughout tensor-compressed training operations. The computing device 1200 includes memory 1220 that provides data storage capabilities for maintaining tensor-compressed model parameters, training datasets, and intermediate computation results that support ongoing training operations and facilitate rapid access to tensor cores during forward and backward propagation activities.

[0164] Processors in accordance with various embodiments of the invention provide high-performance computational capabilities that enable systematic execution of tensor network contractions, bi-directional tensor-train computations, and gradient calculation operations throughout tensor-compressed transformer training activities. Processors may incorporate multiple processing cores, specialized instruction sets, and parallel processing capabilities that facilitate efficient execution of tensor decomposition algorithms, fused parallel contraction techniques, and complex gradient computation procedures during forward propagation, backward propagation, and parameter update phases. Processors in accordance with a number of embodiments coordinate with memory components to manage tensor core data, process intermediate contraction results, and generate parameter gradients based on loss computations and activation gradient analysis throughout tensor-compressed training operations. Processors in accordance with many embodiments include processing units such as central processing units (CPUs), graphical processing units (GPUs), field-programmable gate arrays (FPGAs), and dedicated tensor processing acceleration hardware. In certain embodiments, network interfaces provide comprehensive communication capabilities that enable the computing device to maintain connections with external data sources, distributed training resources, and monitoring systems while facilitating data exchange protocols that support coordination between training components. In selected embodiments, network interfaces may incorporate multiple communication protocols, bandwidth management capabilities, and data transfer features that ensure reliable transmission of training data and maintain consistent connectivity with external resources during varying network conditions and operational requirements encountered throughout tensor-compressed transformer training activities.

[0165] Memories in accordance with many embodiments provide comprehensive data storage capabilities that maintain organized repositories of information supporting ongoing tensor-compressed transformer training operations through systematic organization of tensor cores, training datasets, and intermediate computation results. The memory 1220 incorporates a training application 1222 that stores executable instructions for performing tensor-compressed transformer training operations, including algorithms for tensor network contractions, bi-directional tensor-train computations, and parameter update procedures that enable memory-efficient training on resource-constrained platforms. The memory 1220 further includes training data 1224 that maintains comprehensive collections of input sequences, labels, and associated information that may be accessed during training activities based on specific model requirements and training configurations. The coordinated operation of the processor 1205, the network interface 1210, and the memory 1220 enables the computing device 1200 to provide systematic processing capabilities that support tensor-compressed forward propagation, gradient computation, backward propagation, and parameter update operations throughout transformer training while maintaining all model parameters in compressed tensor formats across diverse training scenarios and operational requirements.

[0166] Although specific examples of computing device architectures are described above with reference to FIG. 12, alternative implementations are possible that are appropriate to the requirements of specific applications in accordance with various embodiments of the invention.

[0167] A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.

Claims

1. A method for training a transformer model on a resource-constrained device, comprising: encoding input tokens into hidden representations using tensor-compressed embedding tables stored in tensor-train-matrix (TTM) format; performing forward propagation on the hidden representations through tensor-compressed layers comprising tensor-train (TT) decomposed weight matrices to generate model outputs, wherein the forward propagation maintains all weight parameters in compressed tensor format without reconstructing dense weight matrices; computing gradients of loss from the model outputs; backpropagating the gradients through the tensor-compressed layers to compute gradients with respect to tensor cores of the TTM format embedding tables and the TT format weight matrices using tensor network contractions; and updating the tensor cores of the tensor-compressed embedding tables and weight matrices directly in compressed format using the computed gradients with respect to the tensor cores.

2. The method of claim 1, wherein performing forward propagation comprises utilizing bi-directional tensor-train contraction that performs contractions from both left and right directions toward a middle point in parallel.

3. The method of claims 1 to 2, wherein the bi-directional tensor-train contraction reduces a total number of computation stages from 2d to d+1, where d represents a decomposition depth of the tensor-compressed layers.

4. The method of claims 1 to 3, wherein backpropagating the gradients comprises computing activation gradients and parameter gradients simultaneously through fused parallel tensor-train contraction techniques.

5. The method of claims 1 to 4, further comprising storing all tensor cores representing the tensor-compressed embedding tables and weight matrices in on-chip memory to eliminate off-chip memory access during parameter operations.

6. The method of claims 1 to 5, wherein the on-chip memory comprises block RAM (BRAM) with capacity less than 6MB and ultra RAM (URAM) with capacity of 22.5MB.

7. The method of claims 1 to 6, wherein performing forward propagation through tensor-compressed layers comprises: processing query, key, and value transformations through TT-format linear layers; and computing attention mechanisms using compressed representations without reconstructing dense weight matrices during matrix-vector multiplication operations.

8. The method of claims 1 to 7, wherein backpropagating the gradients comprises computing gradients for each tensor core Gk through tensor network contractions that eliminate a target core from a tensor network while maintaining connections between remaining tensor cores.

9. The method of claims 1 to 8, wherein updating the tensor cores comprises applying gradient descent operations directly to individual tensor cores using learning rates while maintaining all parameter representations in compressed tensor formats throughout optimization steps.

10. The method of claims 1 to 9, wherein the tensor-compressed format is maintained throughout forward propagation, backward propagation, and parameter updates without recovering original uncompressed matrices.

11. A tensorized transformer training accelerator, comprising: on-chip memory and computation kernels configured to store tensor cores representing tensor-compressed model parameters including TTM cores for embedding tables and TT cores for weight matrices; off-chip memory configured to store training data, activations, and labels; a forward propagation engine configured to: receive input tokens from the off-chip memory; encode the input tokens into hidden representations using the tensor-compressed embedding tables; and process the hidden representations through the tensor-compressed weight matrices to generate model outputs while maintaining all parameters in compressed format without reconstructing dense matrices; a back propagation engine configured to: receive gradients of loss computed from the model outputs; and backpropagate the gradients through tensor-compressed layers to compute gradients with respect to the tensor cores through tensor network contractions; and a parameter update kernel configured to update the tensor cores stored in the on-chip memory directly in compressed format using the gradients computed by the back propagation engine.

12. The tensorized transformer training accelerator of claim 11, wherein the forward propagation engine is configured to utilize bi-directional tensor-train contraction that performs contractions from both left and right directions toward a middle point in parallel.

13. The tensorized transformer training accelerator of claims 11 to 12, wherein the bi-directional tensor-train contraction reduces a total number of computation stages from 2d to d+1, where d represents a decomposition depth of the tensor-compressed layers.

14. The tensorized transformer training accelerator of claims 11 to 13, wherein the back propagation engine is configured to utilize fused parallel tensor-train contraction techniques that compute activation gradients and parameter gradients simultaneously while eliminating intermediate tensor storage requirements during gradient calculation.

15. The tensorized transformer training accelerator of claims 11 to 14, wherein the on-chip memory comprises block RAM (BRAM) with capacity less than 6MB and ultra RAM (URAM) with capacity of 22.5MB, wherein all tensor cores are stored entirely on-chip.

16. The tensorized transformer training accelerator of claims 11 to 15, wherein the on-chip memory and computation kernels implement array partitioning that partitions tensor core data into multiple smaller arrays mapped to separate BRAM blocks to enable parallel data access.

17. The tensorized transformer training accelerator of claims 11 to 16, wherein the forward propagation engine comprises: a TTM-FP kernel configured to process tensor-train-matrix embedding operations; a TT-FP kernel configured to process tensor-train linear layer operations through bidirectional tensor network contractions; and an MM kernel configured to perform matrix multiplication operations for attention mechanisms.

18. The tensorized transformer training accelerator of claims 11 to 17, wherein the back propagation engine comprises: a TTM-BP kernel configured to compute gradients for tensor-train-matrix embedding layers; a TT-BP kernel configured to compute gradients for tensor-train linear layers through bidirectional tensor network contractions; and an MM kernel configured to handle matrix multiplication operations during backward propagation for attention mechanisms.

19. The tensorized transformer training accelerator of claims 11 to 18, further comprising task scheduling circuitry configured to optimize parallel tensor operations by moving non-urgent operations to later time steps without increasing total latency and enabling hardware resource sharing through temporal multiplexing.

20. The tensorized transformer training accelerator of claims 11 to 19, wherein the tensor cores remain in compressed format throughout forward propagation, backward propagation, and parameter update operations.