A large language model deployment method and system based on a distributed architecture
By applying the Winograd algorithm and FlashAttention strategy on FPGA, combined with adaptive pipeline scheduling and the Aurora protocol, the resource and communication bottlenecks of large language models are solved, and efficient distributed deployment is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
- Filing Date
- 2025-12-17
- Publication Date
- 2026-06-19
AI Technical Summary
Existing single FPGA resources are insufficient to accommodate large language models, distributed deployment schemes suffer from uneven resource utilization, communication bandwidth and latency bottlenecks lead to low computing efficiency, and there is a lack of optimized development tools and debugging methods.
The Winograd algorithm is used to optimize the self-attention mechanism of the Transformer and is combined with the FlashAttention strategy. Adaptive pipeline scheduling and an intelligent data prefetching algorithm are applied, inter-FPGA communication is carried out over the Aurora protocol, and bandwidth utilization is adjusted dynamically.
It significantly improves computational efficiency and resource utilization, solves resource and communication bottlenecks in traditional solutions, and enables efficient deployment of Transformer models.
Figure CN122240128A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence and hardware acceleration, and more specifically to a method and system for deploying large language models based on a distributed architecture.
Background Technology
[0002] Large Language Models (LLMs), such as the GPT and LLaMA series, have demonstrated powerful natural language understanding and generation capabilities through pre-training on massive amounts of text data. These models typically have billions to hundreds of billions of parameters, placing extremely high demands on computational and storage resources. In 2017, the Transformer architecture was proposed, laying the foundation for modern large language models. In 2018, the BERT model proved the effectiveness of pre-training. In 2019, GPT-2 demonstrated the potential of generative language models. In 2020, GPT-3 scaled the parameter count to 175B. In 2023, models such as LLaMA and ChatGPT propelled the practical application of these models.
[0003] FPGAs, with their reconfigurability, low power consumption, and high parallelism, have been widely used in deep learning acceleration. Their technological advantages include: flexible hardware reconfigurability for optimization of specific algorithms; lower power consumption compared to GPUs; and support for numerical computation of arbitrary precision, facilitating model quantization and optimization. Their development stages are: Stage 1 (2015-2018): Primarily focused on accelerator design for CNN models. Stage 2 (2018-2021): Expanding to sequence models such as RNNs and LSTMs. Stage 3 (2021-present): Beginning exploration of Transformer and large language model acceleration.
[0004] The paper "FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs" by Wang Yu's research group at Tsinghua University proposes that while large language models (LLMs) based on the Transformer architecture have achieved significant results in fields such as natural language processing, their massive number of parameters and computational requirements lead to extremely high computational and memory overhead. Although compression techniques (such as sparsity and quantization) can alleviate this problem, existing GPUs and general-purpose Transformer accelerators struggle to efficiently support compressed models, thus facing three challenges: 1. Low computational efficiency; 2. Insufficient memory bandwidth utilization; 3. High compilation overhead.
[0005] To address the above problems, the paper proposes the framework shown in Figure 1. Core innovation: a complete FPGA mapping flow is proposed, and efficient inference is achieved by exploiting FPGA-specific resources through the following three key technologies: 1. Configurable sparse DSP chain: by stacking and configuring DSP48 units, it supports multiple sparse computing modes (such as block-wise and N:M sparsity), improving computing efficiency by approximately 1.6×.
[0006] 2. Always-on-chip Decode: The activation value is kept on-chip during the decoding stage. Combined with mixed-precision quantization, memory bandwidth utilization is improved from ~35.6% to ~65.9%.
[0007] 3. Adaptive length compilation method: Instructions with different input token lengths are grouped and packaged, which significantly reduces instruction storage overhead by about 500× (from TB level to GB level).
[0008] Experimental results and performance verification: FlightLLM was evaluated on a Xilinx Alveo U280 FPGA with the LLaMA2-7B model and achieved significant improvements: approximately 6× higher energy efficiency and approximately 1.8× better cost efficiency.
[0009] The Vitis AI development environment, whose official homepage is shown in Figure 2, is a tool for AI inference on Xilinx hardware platforms, including edge devices and Alveo accelerator cards. It consists of optimized IP cores, tools, libraries, models, and example designs. Designed with efficiency and ease of use in mind, it aims to fully exploit the potential of AI acceleration on Xilinx FPGAs and ACAPs. By abstracting the complexity of the underlying FPGA and ACAP devices, it allows users without FPGA knowledge to develop deep learning inference applications.
[0010] The paper by Seongmin Hong et al., "DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation," presents the framework shown in Figure 3. This multi-FPGA inference acceleration solution for the GPT-2 model targets the serial computation bottleneck that is common in text generation tasks. The design distributes GPT-2 across four Xilinx Alveo U280 FPGAs using a model-parallel approach, achieving low latency and high throughput through optimized dataflow strategies and hardware-aware scheduling. In end-to-end inference, the system achieves approximately 5.58× higher speed and 3.99× higher energy efficiency than a platform with four NVIDIA V100 GPUs, while also improving cost efficiency by 8.21×.
[0011] In their paper "The Feasibility of Implementing Large-Scale Transformer Multi-FPGA Platforms," Yu Gao et al. explored the feasibility of using multiple FPGAs to accelerate large-scale Transformers. The authors implemented a scalable multi-FPGA architecture and mapping toolchain, prototyping it on six FPGAs using an encoder from the I-BERT Transformer as a base. This research not only helps validate the feasibility of multi-FPGA platforms in model segmentation and scheduling but also emphasizes the crucial importance of building "large model mapping tools" to support future system expansion at the LLM level.
[0012] Application of the Aurora protocol in FPGA interconnection. Aurora is a point-to-point serial communication protocol developed by Xilinx, with the following features: support for transmission rates up to 40 Gbps; a low-latency, low-power design; and flexible data width configuration.
[0013] In summary, the main problems with existing technologies include the following. Problem 1: the contradiction between single-FPGA resource constraints and the parameter scale of large models. Existing single-chip FPGAs have limited logic resources (LUTs), memory resources (BRAM / URAM), and digital signal processing units (DSPs), making it difficult to directly accommodate and run large language models with 7 billion parameters (such as LLaMA2-7B). As a result, model inference cannot be implemented efficiently on a single FPGA platform.
[0014] Problem 2: low efficiency in model deployment and computation. Existing distributed deployment solutions are mainly geared towards GPU clusters and lack optimization for the characteristics of FPGA platforms (such as reconfigurability and customized communication interfaces). This leads to unbalanced model partitioning, low resource utilization, and high power consumption, making it difficult to achieve efficient inference in resource-constrained and low-power scenarios.
[0015] Problem 3: inter-board communication bandwidth and latency bottlenecks. Traditional multi-FPGA systems rely on PCIe or Ethernet interfaces for inter-board communication, which have limited bandwidth and high latency. They cannot meet the real-time transmission requirements of large-scale activation values and intermediate results during large language model inference, so computing units wait frequently and overall system throughput decreases.
[0016] The main drawbacks of existing technologies are as follows. Limitations of single-FPGA solutions. Resource bottlenecks: on-chip storage capacity is severely insufficient and cannot accommodate large models with hundreds of millions of parameters; the DSP and computing resources of a single chip are limited, so parallelism is insufficient; and external memory bandwidth becomes a performance bottleneck that limits inference speed.
[0017] Poor scalability: The model size is constrained by the resources of a single FPGA, and performance cannot be linearly improved by increasing hardware resources. There is a lack of upgrade path when facing larger-scale models.
[0018] Low cost efficiency: High-end FPGA chips are required to achieve good performance, resource utilization is uneven, there are computing hotspots, single point of failure risk, and reliability needs to be improved.
[0019] The shortcomings of existing multi-FPGA solutions: Communication bottlenecks: PCIe and Ethernet have relatively low bandwidth, high latency, large communication protocol overhead, low effective bandwidth utilization, and lack of communication mechanisms optimized for deep learning.
[0020] System complexity: The programming model is complex, development is difficult, load balancing and synchronization mechanisms are difficult to design, and there is a lack of mature development tools and debugging methods.
[0021] Efficiency issues: Data transmission and computation cannot be effectively overlapped, the model partitioning strategy lacks theoretical guidance, and the overall system efficiency is lower than the theoretical value.
Summary of the Invention
[0022] This invention provides a method and system for deploying large language models based on a distributed architecture, in order to solve the problem that traditional Aurora connections use static bandwidth allocation and cannot dynamically adjust according to the computational characteristics and data traffic of each layer of the LLaMA2 model.
[0023] According to an embodiment of the present invention, a method for deploying a large language model based on a distributed architecture is provided, comprising the following steps: The Winograd algorithm is introduced into the self-attention mechanism of the Transformer; through transformation and element-wise multiplication, the number of multiplications in the $QK^T$ matrix product is reduced, and the cross-correlation of multiple sensors is used to detect sensor anomalies. The Winograd-optimized $QK^T$ computation is integrated into the block processing framework of FlashAttention to improve computational efficiency and long-sequence processing capability and to optimize memory usage in multi-head self-attention; a normal distribution and confidence level are used to determine whether the sensor has anomalies. Adaptive pipeline scheduling and intelligent data prefetching algorithms are employed to predict computation time and to optimize communication timing and priority management in FPGA distributed inference environments, reducing pipeline bubbles and improving distributed LLM inference efficiency.
[0024] Furthermore, by introducing the Winograd algorithm into the self-attention mechanism of Transformer, the data is first transformed, simple operations are performed in the transformation space, and then the result is transformed back.
[0025] Furthermore, in introducing the Winograd algorithm into the self-attention mechanism of Transformer, firstly, Q, K, V matrices are generated, where Q, K, and V correspond to query, key, and value in self-attention;
[0026] Where $X \in \mathbb{R}^{L \times D}$ is the input sequence embedding matrix ($L$: sequence length, $D$: hidden dimension); $W_Q, W_K, W_V \in \mathbb{R}^{D \times D_h}$ are the projection weight matrices, with $D_h = D / H$, where $H$ is the number of attention heads. The input is projected into a low-dimensional space to prepare data for each head; $Q$, $K$, $V$ are then split into $H$ heads. Winograd-optimized $QK^T$:
[0027]
[0028]
[0029]
[0030] Among them, $K_{block} \in \mathbb{R}^{m \times m}$ is a block of $K$; $U$ is the transformed block of $K$; $Q_{block} \in \mathbb{R}^{m \times m}$ is a block of $Q$; $V$ is the transformed block of $Q$; $\odot$ denotes element-wise multiplication; $M$ is the intermediate result, which requires far fewer multiplications than the direct $QK^T$. The inverse transformation yields $S_{block} \in \mathbb{R}^{(m-1) \times (m-1)}$. The complete Winograd formula is as follows:
[0031] Integrate the above steps; for the entire matrix S, calculate and concatenate all blocks one by one.
[0032] Furthermore, first prepare the data. Input: $Q, K, V \in \mathbb{R}^{L \times D_h}$, with $D_h = D / H$. Blocking: divide the sequence into row blocks of size $T_r$ and column blocks of size $T_c$. $Q$ is divided into row blocks: $Q_i \in \mathbb{R}^{T_r \times D_h}$, $i = 1$ to $L / T_r$. $K$ and $V$ are divided into column blocks: $K_j, V_j \in \mathbb{R}^{T_c \times D_h}$, $j = 1$ to $L / T_c$. Initialize the statistics vectors: for each row block $i$, set $l_i = -\infty$, $m_i = 0$, $O_i = 0$. Calculate the block score $S_{block}$:
[0033] Compute the temporary statistics within the block:
[0034]
[0035] Update global statistics:
[0036]
[0037] Update temporary output:
[0038] Fused output: concatenate all row blocks to form the output: $O = \mathrm{Concat}(O_1, O_2, \ldots, O_{L/T_r})$ (dimensions $L \times D_h$). Finally, the multiple heads are merged to obtain the output of self-attention.
[0039] Furthermore, with a fixed 40Gbps Aurora bandwidth, performance is improved by optimizing data transmission timing, caching strategies, and computation scheduling.
[0040] Furthermore, computation time modeling is performed, specifically the computation time of the i-th layer:
[0041] Where $B$: batch size; $S$: sequence length; $d_{model}$: model dimension; $f_{FPGA}$: FPGA operating frequency; $\eta$: computational efficiency. Communication time modeling, data transmission time between FPGAs:
[0042] Among them, $D_{ij} = B \cdot S \cdot d_{model} \cdot 4$ bytes, $BW_{eff} = 35$ Gbps, $L_{aurora} = 100$ ns. Pipeline efficiency analysis, defining the bubble time in the pipeline:
[0043] For data transmission from layer i to layer j, the optimal prefetch start time is:
[0044] When multiple data transmission requests occur simultaneously, the prefetch priority is calculated as follows:
[0045] Among them, $T_{slack}(j) = \max(0,\ \cdot\,)$: time margin; $D_{max}$: maximum data transfer volume, used for normalization; a critical path factor; $w_1, w_2, w_3$: weight coefficients, satisfying $\sum_i w_i = 1$. Based on historical execution data, the least squares method is used to optimize $w$:
[0046] The scheduling decision algorithm is used to obtain the scheduling result and then the scheduling is performed.
[0047] Furthermore, the Adaptive Pipeline Scheduling Algorithm (APSA) minimizes pipeline bubbles and improves overall system efficiency by predicting computation completion time and employing intelligent prefetching strategies.
[0048] According to another embodiment of the present invention, a large language model deployment system based on a distributed architecture is provided, comprising: a transformation unit, used to introduce the Winograd algorithm into the self-attention mechanism of the Transformer; through transformation and element-wise multiplication, it reduces the number of multiplications in the $QK^T$ matrix product and uses the cross-correlation of multiple sensors to detect sensor anomalies. An integration unit, used to integrate the Winograd-optimized $QK^T$ computation into the block processing framework of FlashAttention, improving computational efficiency and long-sequence processing capability and optimizing memory usage in multi-head self-attention, and using a normal distribution and confidence level to determine whether the sensor has anomalies. A prediction unit, used to predict computation time and to optimize communication timing and priority management in FPGA distributed inference environments using adaptive pipeline scheduling and intelligent data prefetching algorithms, reducing pipeline bubbles and improving distributed LLM inference efficiency.
[0049] A storage medium storing program files capable of implementing any of the above-mentioned large language model deployment methods based on a distributed architecture.
[0050] A processor for running a program, wherein the program executes any of the above-mentioned methods for deploying large language models based on a distributed architecture.
[0051] The large language model deployment method and system based on a distributed architecture in this invention embodiment is centered on an FPGA-based multi-head self-attention mechanism acceleration system. This invention integrates the Winograd algorithm and FlashAttention optimization strategy to achieve efficient Transformer model deployment. Through hardware-level optimization, this invention significantly reduces memory access overhead and computational complexity. FPGAs communicate via an innovative adaptive bandwidth allocation algorithm and the high-speed Aurora protocol, dynamically optimizing Aurora link bandwidth utilization based on real-time load and model characteristics. This solves the problem that traditional Aurora connections use static bandwidth allocation, which cannot dynamically adjust according to the computational characteristics and data flow of each layer of the LLaMA2 model.
Description of the Drawings
[0052] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings: Figure 1 is a diagram of the FlightLLM framework in the existing technology; Figure 2 is a diagram of Vitis AI in the existing technology; Figure 3 is a structural framework diagram of DFX in the existing technology; Figure 4 is a mathematical demonstration of how Winograd reduces the number of multiplications in this invention; Figure 5 is a pseudocode diagram of the Winograd + FlashAttention fusion in this invention; Figure 6 is a diagram of the scheduling decision algorithm in this invention; Figure 7 is a diagram showing the output results of the present invention.
Detailed Implementation
[0053] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0054] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0055] Example 1
Based on the analysis of the shortcomings of existing technologies, the main objectives of this invention include: overcoming resource constraints and achieving efficient deployment of large models; and realizing a cost-effective edge AI solution.
[0056] The basic content of the technical solution of this invention is as follows: This invention proposes a deployment system for the large language model LLaMA2 based on a three-FPGA distributed architecture. The core is an FPGA-based multi-head self-attention mechanism acceleration system that integrates the Winograd algorithm and FlashAttention optimization strategy to achieve efficient Transformer model deployment. Through hardware-level optimization, the system significantly reduces memory access overhead and computational complexity. An innovative adaptive bandwidth allocation algorithm and the high-speed Aurora protocol are used between FPGAs to dynamically optimize Aurora link bandwidth utilization based on real-time load and model characteristics. This solves the problem that traditional Aurora connections use static bandwidth allocation, which cannot dynamically adjust according to the computational characteristics and data flow of each layer of the LLaMA2 model.
[0057] The detailed description of the technical solution of this invention is as follows:
1. The innovative fusion of the Winograd algorithm and Multi-Head Self-Attention
Winograd is a mathematical technique originally used to accelerate convolutions in image processing (a sliding window computed across an image). Instead of multiplying directly, it first transforms the data (similar to an encoding step), performs simple operations in the transformed space, and then applies an inverse transform to recover the result. Its advantage is that it needs fewer multiplications, which are more expensive in hardware than additions, so computation is faster.
[0058] For example, in Figure 4, the original algorithm shown at the top requires 6 multiplications, as indicated by the arrows. With the Winograd algorithm, shown at the bottom, the number of multiplications decreases from 6 to 4.
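The 6-to-4 reduction described above matches the well-known Winograd minimal filtering identity F(2,3), which produces two outputs of a 3-tap sliding filter. The sketch below assumes that this is the case Figure 4 illustrates, which the text does not confirm; the function names are hypothetical. It shows only the multiplication saving on a sliding-window computation, not how the invention maps the idea onto blocks of $QK^T$.

```python
def direct_f23(d, g):
    """Direct form: 2 outputs x 3 taps = 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [y0, y1]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs with 4 general multiplications."""
    # Filter transform; it can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Data transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]
    # Element-wise products in the transformed space (the 4 multiplications).
    m1, m2, m3, m4 = v0 * u0, v1 * u1, v2 * u2, v3 * u3
    # Inverse transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    data, taps = [1, 2, 3, 4], [1, 2, 3]
    print(direct_f23(data, taps), winograd_f23(data, taps))
```

Both functions return the same two values for any input. The saving comes from moving work into the cheap transform steps, and the filter transform can be shared across every window that reuses the same filter.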
[0059] Technical method: First, generate the $Q$, $K$, $V$ matrices ($Q$, $K$, $V$ correspond to query, key, and value in self-attention).
[0060] $Q = X W_Q$, $K = X W_K$, $V = X W_V$ (1), where $X \in \mathbb{R}^{L \times D}$ is the input sequence embedding matrix ($L$: sequence length, $D$: hidden dimension) and $W_Q, W_K, W_V \in \mathbb{R}^{D \times D_h}$ are the projection weight matrices ($D_h = D / H$, where $H$ is the number of attention heads). This step projects the input into a low-dimensional space, preparing data for each head. $Q$, $K$, $V$ are then split into $H$ heads.
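A minimal pure-Python sketch of the projection and head split in step (1). It assumes one common layout in which a single $D \times D$ weight matrix is applied and the result is sliced into $H$ column groups of width $D_h$; this is equivalent to applying $H$ separate $D \times D_h$ matrices. The helper names `matmul` and `split_heads` are hypothetical.

```python
def matmul(a, b):
    """Plain-Python product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_heads(m, num_heads):
    """Split an L x D matrix into num_heads matrices of shape L x (D / num_heads)."""
    d = len(m[0])
    assert d % num_heads == 0, "hidden dimension must be divisible by the head count"
    d_h = d // num_heads
    return [[row[h * d_h:(h + 1) * d_h] for row in m] for h in range(num_heads)]


if __name__ == "__main__":
    # Toy sizes: L = 2 tokens, D = 4, H = 2 heads, so D_h = 2.
    x = [[1, 2, 3, 4], [5, 6, 7, 8]]
    w_q = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity, for illustration
    q_heads = split_heads(matmul(x, w_q), num_heads=2)
    print(q_heads)
```

With the dimensions the text gives for LLaMA2-7B ($D = 4096$) and, as an illustrative assumption, $H = 32$ heads, each head would have $D_h = 128$.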
[0061] Traditional $QK^T$ calculation method: (2) (3) (4). Winograd-optimized $QK^T$:
[0062]
[0063]
[0064]
[0065] Among them, $K_{block} \in \mathbb{R}^{m \times m}$ is a block of $K$. $U$ is the transformed $K$ block (larger in size, such as 3×3 when $m = 2$). $Q_{block} \in \mathbb{R}^{m \times m}$ is a block of $Q$. $V$ is the transformed $Q$ block (of the same size as $U$, such as 3×3; in this step $V$ does not denote the value matrix). $\odot$ denotes element-wise (Hadamard) multiplication. $M$ is the intermediate result, which requires far fewer multiplications than the direct $QK^T$. The inverse transform yields $S_{block} \in \mathbb{R}^{(m-1) \times (m-1)}$.
[0066] The complete Winograd formula is as follows: (5). Integrate the above steps: for the entire matrix $S$, calculate all blocks one by one and concatenate them. The result is mathematically equivalent to the traditional $QK^T$, but the number of multiplications is greatly reduced (e.g., by about 36% when $m = 4$).
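For orientation, the standard two-dimensional Winograd form from the convolution literature has the shape below. The transform matrices $G$, $B$, and $A$ are assumptions borrowed from that literature: the text above names only $U$, $V$, $M$, and $S_{block}$, so this is a sketch of how the missing formulas could read, not a reproduction of formula (5).

```latex
\begin{aligned}
U &= G\,K_{block}\,G^{T} \\
V &= B^{T}\,Q_{block}\,B \\
M &= U \odot V \\
S_{block} &= A^{T} M A
           = A^{T}\big[(G\,K_{block}\,G^{T}) \odot (B^{T}\,Q_{block}\,B)\big]A
\end{aligned}
```

In this form, general multiplications occur only in the Hadamard product $M = U \odot V$. The transforms by $G$, $B$, and $A$ use small constant matrices and reduce to additions, subtractions, and scalings.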
[0067] 2. Innovative Fusion of Flash-Attention and Multi-Head Self-Attention
For a single attention head (all attention heads are processed in parallel): first, prepare the data. Input: $Q, K, V \in \mathbb{R}^{L \times D_h}$ ($D_h = D / H$).
[0068] Blocking: divide the sequence into row blocks (size $T_r$) and column blocks (size $T_c$). Let $T_r = T_c = 128$ (this parameter is set according to the FPGA's BRAM size).
[0069] $Q$ is divided into row blocks: $Q_i \in \mathbb{R}^{T_r \times D_h}$ ($i = 1$ to $L / T_r$).
[0070] $K$ and $V$ are divided into column blocks: $K_j, V_j \in \mathbb{R}^{T_c \times D_h}$ ($j = 1$ to $L / T_c$).
[0071] Initialize the statistics vectors: for each row block $i$, set $l_i = -\infty$ (row-maximum vector, size $T_r$), $m_i = 0$ (row-sum vector, size $T_r$), and $O_i = 0$ (temporary output, size $T_r \times D_h$).
[0072] Calculate the block score $S_{block}$: (6). Compute the temporary statistics within the block: (7) (8). Update the global statistics: (9) (10). Update the temporary output: (11). Fused output: concatenate all row blocks to form the output: $O = \mathrm{Concat}(O_1, O_2, \ldots, O_{L/T_r})$ (dimensions $L \times D_h$).
[0073] Finally, merging the multiple heads gives the output of self-attention: (12). The pseudocode is shown in Figure 5.
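The block-wise procedure above can be sketched for one head in plain Python. This sketch assumes the usual online-softmax updates from the FlashAttention literature and the usual $1/\sqrt{D_h}$ score scaling, since formulas (6) to (11) are not reproduced in the text. The names `row_max`, `row_sum`, and `acc` are hypothetical stand-ins for the row-maximum vector, the row-sum vector, and the temporary output $O_i$. The Winograd step is not included; the block score is computed directly.

```python
import math


def naive_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, one query row at a time."""
    d = len(q[0])
    out = []
    for row in q:
        s = [sum(a * b for a, b in zip(row, key)) / math.sqrt(d) for key in k]
        peak = max(s)
        w = [math.exp(x - peak) for x in s]
        total = sum(w)
        out.append([sum(wi * val[c] for wi, val in zip(w, v)) / total
                    for c in range(len(v[0]))])
    return out


def blockwise_attention(q, k, v, t_r=2, t_c=2):
    """Tiled attention that keeps a running row maximum and row sum per query row."""
    d, d_v = len(q[0]), len(v[0])
    out = []
    for i0 in range(0, len(q), t_r):                  # row blocks Q_i
        q_blk = q[i0:i0 + t_r]
        row_max = [-math.inf] * len(q_blk)            # running maximum of the scores
        row_sum = [0.0] * len(q_blk)                  # running softmax denominator
        acc = [[0.0] * d_v for _ in q_blk]            # unnormalised output O_i
        for j0 in range(0, len(k), t_c):              # column blocks K_j, V_j
            k_blk, v_blk = k[j0:j0 + t_c], v[j0:j0 + t_c]
            for r, row in enumerate(q_blk):
                s = [sum(a * b for a, b in zip(row, key)) / math.sqrt(d) for key in k_blk]
                new_max = max(row_max[r], max(s))
                scale = math.exp(row_max[r] - new_max)  # rescales the older statistics
                p = [math.exp(x - new_max) for x in s]
                row_sum[r] = row_sum[r] * scale + sum(p)
                acc[r] = [acc[r][c] * scale + sum(pw * val[c] for pw, val in zip(p, v_blk))
                          for c in range(d_v)]
                row_max[r] = new_max
        out.extend([[x / row_sum[r] for x in acc[r]] for r in range(len(q_blk))])
    return out
```

Because each row block only ever holds one $T_r \times T_c$ tile of scores, the full $L \times L$ score matrix never needs to be stored, which is the property that makes the tile size a function of on-chip memory.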
[0074] 3. Innovative Optimization Scheme for LLaMA2 Distributed Inference Based on Aurora
Core innovation: an intelligent pipeline scheduling and data prefetching algorithm based on the Aurora high-speed protocol. This solution proposes an Adaptive Pipeline Scheduling with Intelligent Prefetching (APSIP) algorithm, which improves performance under a fixed 40 Gbps Aurora bandwidth by optimizing data transmission timing, caching strategies, and computation scheduling.
[0075] Computation time modeling, computation time of the $i$-th layer: (13), where $B$: batch size; $S$: sequence length; $d_{model}$: model dimension (4096 for LLaMA2-7B); $f_{FPGA}$: FPGA operating frequency; $\eta$: computational efficiency.
[0076] Communication time modeling, data transmission time between FPGAs: (14), where $D_{ij} = B \cdot S \cdot d_{model} \cdot 4$ bytes (FP32 precision), $BW_{eff} = 35$ Gbps (effective bandwidth considering protocol overhead), and $L_{aurora} = 100$ ns (average Aurora link latency).
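A small numeric sketch of the communication model. Formula (14) is not reproduced in the text, so this assumes the common form "payload divided by effective bandwidth, plus link latency"; the function name and its parameters are hypothetical, while the default values are the ones stated above.

```python
def aurora_transfer_time(batch, seq_len, d_model, bytes_per_value=4,
                         bw_eff_gbps=35.0, latency_ns=100.0):
    """Estimated seconds to move one activation tensor between two FPGAs.

    Assumed model: T = D_ij / BW_eff + L_aurora, with D_ij = B * S * d_model * 4 bytes.
    """
    payload_bits = batch * seq_len * d_model * bytes_per_value * 8
    return payload_bits / (bw_eff_gbps * 1e9) + latency_ns * 1e-9


if __name__ == "__main__":
    # Prefill-like case: 128 tokens of width 4096 in FP32.
    print(f"{aurora_transfer_time(1, 128, 4096) * 1e6:.1f} us")
    # Decode-like case: a single token.
    print(f"{aurora_transfer_time(1, 1, 4096) * 1e6:.2f} us")
```

Under these assumptions the 128-token transfer takes roughly 0.48 ms and the single-token transfer roughly 3.8 microseconds, so the fixed 100 ns latency is a small share of the total and payload size dominates.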
[0077] Pipeline efficiency analysis, defining the bubble time in the pipeline: (15). The Adaptive Pipeline Scheduling Algorithm (APSA) minimizes pipeline bubbles and improves overall system efficiency by predicting computation completion time and employing intelligent prefetching strategies.
[0078] For data transmission from layer $i$ to layer $j$, the optimal prefetch start time is: (16). When multiple data transmission requests occur simultaneously, the prefetch priority is calculated as follows: (17), where $T_{slack}(j) = \max(0,\ \cdot\,)$ is the time margin; $D_{max}$ is the maximum data transfer volume, used for normalization; a critical path factor is also included; and $w_1, w_2, w_3$ are weight coefficients satisfying $\sum_i w_i = 1$.
[0079] Based on historical execution data, the least squares method is used to optimize $w$: (18). The scheduling decision algorithm in Figure 6 is then used to obtain the scheduling result, and scheduling is performed accordingly.
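One way to picture the priority-based ordering is the sketch below. Formulas (15) to (18) and the Figure 6 algorithm are not reproduced in the text, so every formula here is an assumption: the score is a hypothetical weighted sum of an urgency term derived from the time margin, the transfer size normalised by $D_{max}$, and a 0/1 critical-path factor. The weights are fixed placeholders rather than least-squares fits, and the bubble is taken as the consumer's waiting time.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    size_bytes: int          # D_ij
    slack_s: float           # time margin before the consumer would stall
    on_critical_path: bool


def prefetch_priority(t, d_max, w=(0.5, 0.2, 0.3)):
    """Hypothetical weighted score; a larger value means 'send sooner'."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    urgency = 1.0 / (1.0 + max(0.0, t.slack_s) * 1e6)   # slack counted in microseconds
    size_term = t.size_bytes / d_max                     # normalised by D_max
    critical = 1.0 if t.on_critical_path else 0.0
    return w[0] * urgency + w[1] * size_term + w[2] * critical


def schedule(transfers, w=(0.5, 0.2, 0.3)):
    """Order simultaneous transfer requests by descending priority."""
    d_max = max(t.size_bytes for t in transfers)
    return sorted(transfers, key=lambda t: prefetch_priority(t, d_max, w), reverse=True)


def bubble_time(t_compute_done, t_data_arrival):
    """Assumed bubble: the consumer idles only if its data arrives after it is ready."""
    return max(0.0, t_data_arrival - t_compute_done)


if __name__ == "__main__":
    pending = [
        Transfer("act_l21", 16384, 50e-6, False),
        Transfer("kv_l5", 8192, 5e-6, False),
        Transfer("act_l10", 16384, 0.0, True),
    ]
    print([t.name for t in schedule(pending)])
```

In this sketch a transfer with no slack on the critical path is sent first, which mirrors the stated goal of prioritising critical data under a fixed link bandwidth.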
[0080] The key points and areas to be protected in this invention mainly revolve around the following three aspects:
1. The Winograd algorithm is innovatively introduced into the self-attention mechanism of the Transformer, significantly reducing the number of multiplications in the $QK^T$ matrix product through transformation and element-wise multiplication, thereby accelerating computation. A method for detecting sensor anomalies using the cross-correlation of multiple sensors.
2. The Winograd-optimized $QK^T$ computation is integrated into the block-based processing framework of FlashAttention, balancing improved computational efficiency, long-sequence processing capability, and memory optimization in multi-head self-attention. A normal distribution and confidence level are proposed to determine whether the sensor has anomalies.
[0081] 3. For FPGA distributed inference environments, an adaptive pipeline scheduling and intelligent data prefetching algorithm is proposed. By predicting computation time, optimizing communication timing and priority management, pipeline bubbles are minimized and distributed LLM inference efficiency is improved.
[0082] Compared with the prior art, the advantages of the present invention are as follows. First, the Winograd algorithm significantly improves computational efficiency when applied to the $QK^T$ computation in self-attention. In traditional methods, the $QK^T$ matrix product in the self-attention mechanism involves a large number of multiplication operations, whereas Winograd greatly reduces the number of multiplications by converting them into element-wise operations in a transformed space (e.g., a reduction of approximately 36% when $m = 4$, as noted above). This directly speeds up computation while maintaining mathematical equivalence.
[0083] Secondly, the fusion of Winograd and FlashAttention performs well when processing long sequences. FlashAttention itself optimizes memory usage and computational efficiency, and combining it with Winograd further accelerates the most time-consuming $QK^T$ computation within its block-based processing framework. This means that the method can not only handle large models and long contexts efficiently, but also achieve further performance improvements on top of existing FlashAttention, making it particularly suitable for resource-constrained hardware platforms such as FPGAs, where efficiency is the priority.
[0084] Finally, the Intelligent Pipeline Scheduling and Data Prefetching (APSIP) algorithm based on Aurora addresses the communication bottlenecks and inefficiencies in distributed inference. Compared to static or non-optimized scheduling schemes, APSIP minimizes pipeline bubbles (i.e., idle time for devices waiting for data or computation) by predicting computation time, optimizing communication timing, and using intelligent data prefetching. It can dynamically adjust scheduling strategies under fixed bandwidth (e.g., 40Gbps Aurora), prioritizing the transmission of critical data, thereby maximizing the utilization and throughput of computing resources in distributed FPGA clusters, which is crucial for distributed inference of large-scale models such as LLM.
[0085] In summary, this invention provides a full-stack solution from underlying computational optimization to upper-level system scheduling, aiming to achieve faster, more efficient, and more resource-saving AI model inference, especially in dedicated hardware and distributed system environments.
[0086] The feasibility of this invention has been proven through experiments, simulations, and applications.
[0087] The overall experimental architecture used in the experiments and simulations consisted of three Xilinx ZYNQ UltraScale+ XCZU15EG-2FFVB1156I FPGA boards, physically connected via SFP+ and fiber optic cables. First, the quantized and partitioned weight files were placed on the SD card of each board. After the boards were powered on, the weight files were read into DDR memory, and the bitstream was then downloaded to each FPGA. Finally, the main control board code was executed. Given a prompt, the system produced the output shown in Figure 7.
[0088] The technical solution proposed in this invention is not only applicable to the current FPGA model, but also to various other Xilinx (AMD) models, and the supported large language models can also be extended to the GPT series.
[0089] Example 2
According to another embodiment of the present invention, a large language model deployment system based on a distributed architecture is provided, comprising: a transformation unit, used to introduce the Winograd algorithm into the self-attention mechanism of the Transformer; through transformation and element-wise multiplication, it reduces the number of multiplications in the $QK^T$ matrix product and uses the cross-correlation of multiple sensors to detect sensor anomalies. An integration unit, used to integrate the Winograd-optimized $QK^T$ computation into the block processing framework of FlashAttention, improving computational efficiency and long-sequence processing capability and optimizing memory usage in multi-head self-attention, and using a normal distribution and confidence level to determine whether the sensor has anomalies. A prediction unit, used to predict computation time and to optimize communication timing and priority management in FPGA distributed inference environments using adaptive pipeline scheduling and intelligent data prefetching algorithms, reducing pipeline bubbles and improving distributed LLM inference efficiency.
[0090] The large language model deployment system based on a distributed architecture in this invention is fundamentally an FPGA-based multi-head self-attention mechanism acceleration system. This invention integrates the Winograd algorithm and FlashAttention optimization strategy to achieve efficient Transformer model deployment. Through hardware-level optimization, this invention significantly reduces memory access overhead and computational complexity. FPGAs communicate via an innovative adaptive bandwidth allocation algorithm and the high-speed Aurora protocol, dynamically optimizing Aurora link bandwidth utilization based on real-time load and model characteristics. This solves the problem of traditional Aurora connections using static bandwidth allocation, which cannot dynamically adjust according to the computational characteristics and data flow of each layer of the LLaMA2 model.
[0091] Based on the analysis of the shortcomings of existing technologies, the main objectives of this invention include: overcoming resource constraints and achieving efficient deployment of large models; and realizing a cost-effective edge AI solution.
[0092] The basic content of the technical solution of this invention is as follows: This invention proposes a deployment system for the large language model LLaMA2 based on a three-FPGA distributed architecture. The core is an FPGA-based multi-head self-attention mechanism acceleration system that integrates the Winograd algorithm and the FlashAttention optimization strategy to achieve efficient Transformer model deployment. Through hardware-level optimization, the system significantly reduces memory access overhead and computational complexity. An innovative adaptive bandwidth allocation algorithm and the high-speed Aurora protocol are used between FPGAs to dynamically optimize Aurora link bandwidth utilization based on real-time load and model characteristics. This solves the problem that traditional Aurora connections use static bandwidth allocation, which cannot adjust dynamically to the computational characteristics and data flow of each layer of the LLaMA2 model.
[0093] The detailed description of the technical solution of this invention is as follows:
1. The innovative fusion of the Winograd algorithm and Multi-Head Self-Attention
Winograd is a mathematical technique originally used to accelerate convolutions in image processing (such as a sliding window computed over an image). Instead of multiplying directly, it first transforms the data (similar to encoding), performs simple operations in the transformed space, and then applies an inverse transform to recover the result. The advantage is fewer multiplications (multiplication is slow, addition is fast), which yields faster computation.
[0094] For example, as shown in Figure 4, the original algorithm (top) requires 6 multiplications for the matrix multiplication indicated by the arrow, whereas the Winograd algorithm (bottom) reduces this to 4.
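The 6-to-4 reduction can be illustrated with the classic one-dimensional Winograd F(2,3) scheme, which produces 2 sliding-window outputs from 3 filter taps. This is a minimal sketch in Python; whether Figure 4 depicts exactly this scheme is an assumption, and the function names `direct_f23` and `winograd_f23` are hypothetical.

```python
def direct_f23(d, g):
    """Direct sliding-window product: 2 outputs x 3 taps = 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs with 4 data-dependent multiplications."""
    # Filter transform; it can be precomputed once when g is a fixed weight.
    u = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Data transform: additions and subtractions only.
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # Element-wise (Hadamard) product: the only 4 multiplications per tile.
    m = [ui * vi for ui, vi in zip(u, v)]
    # Inverse transform: additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


d = [1.0, 2.0, 3.0, 4.0]   # input tile (illustrative values)
g = [0.5, -1.0, 2.0]       # filter taps (illustrative values)
print(direct_f23(d, g))    # [4.5, 6.0]
print(winograd_f23(d, g))  # [4.5, 6.0]
```

The saving comes from moving work into the transforms, which use only additions once the filter transform has been precomputed.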
[0095] Technical innovation method: First, generate the Q, K, and V matrices (Q, K, and V correspond to the query, key, and value in self-attention):
[0096] $$Q = XW_Q,\quad K = XW_K,\quad V = XW_V \tag{1}$$
where $X \in \mathbb{R}^{L \times D}$ is the input sequence embedding matrix ($L$: sequence length, $D$: hidden dimension), and $W_Q, W_K, W_V \in \mathbb{R}^{D \times D_h}$ are the projection weight matrices ($D_h = D/H$, where $H$ is the number of attention heads). This step projects the input into a low-dimensional space, preparing data for each head. Q, K, and V are then split into $H$ heads.
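As a small illustration of the projection step, the sketch below computes a per-head query matrix, assuming the projection has the usual form $Q = XW_Q$. The toy sizes and matrix values are hypothetical, and `matmul` is a helper written for this example only.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x p) matrix, both given as lists of rows."""
    return [[sum(a_row[t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


L, D, H = 3, 4, 2   # toy sequence length, hidden dimension, head count (assumed)
D_h = D // H        # per-head dimension, D_h = D / H

# Input embeddings X (L x D) and one head's projection weights W_Q (D x D_h).
X = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [2.0, 1.0, 0.0, 1.0]]
W_Q = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0],
       [0.0, 2.0]]

Q = matmul(X, W_Q)  # shape L x D_h: the input projected to the smaller head space
print(Q)            # [[3.0, 4.0], [1.0, 2.0], [2.0, 3.0]]
```

K and V would be obtained the same way from their own weight matrices, giving each head an $L \times D_h$ slice to work on.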
[0097] Traditional $QK^T$ calculation method: (2) (3) (4)
Winograd-optimized $QK^T$:
[0098]
[0099]
[0100]
[0101] where $K_{block} \in \mathbb{R}^{m \times m}$ is a block of $K$; $U$ is the transformed $K$ block (larger in size, e.g. 3×3 when $m = 2$); $Q_{block} \in \mathbb{R}^{m \times m}$ is a block of $Q$; $V$ is the transformed $Q$ block (of the same size as $U$, e.g. 3×3; here $V$ denotes the transformed block, not the value matrix); ⊙ denotes element-wise (Hadamard) multiplication; $M$ is the intermediate result, which requires far fewer multiplications than the direct $QK^T$; and the inverse transform yields $S_{block} \in \mathbb{R}^{(m-1) \times (m-1)}$.
[0102] The complete Winograd formula is as follows: (5)
This formula integrates the above steps. For the entire matrix $S$, all blocks are computed one by one and concatenated. The result is mathematically equivalent to the traditional $QK^T$, but the number of multiplications is greatly reduced (e.g., by about 36% when $m = 4$).
[0103] 2. Innovative fusion of FlashAttention and Multi-Head Self-Attention
For a single attention head (all attention heads are processed in parallel): first, prepare the data. Input: $Q, K, V \in \mathbb{R}^{L \times D_h}$ ($D_h = D/H$).
[0104] Blocking: divide the sequence into row blocks (size $T_r$) and column blocks (size $T_c$). Let $T_r = T_c = 128$ (this parameter is set according to the FPGA's BRAM size).
[0105] $Q$ is divided into row blocks: $Q_i \in \mathbb{R}^{T_r \times D_h}$ ($i = 1$ to $L/T_r$).
[0106] $K$ and $V$ are divided into column blocks: $K_j, V_j \in \mathbb{R}^{T_c \times D_h}$ ($j = 1$ to $L/T_c$).
[0107] Initialize the statistics vectors: for each row block $i$, set $l_i = -\infty$ (row-maximum vector, size $T_r$), $m_i = 0$ (row-sum vector, size $T_r$), and $O_i = 0$ (temporary output, $T_r \times D_h$).
[0108] Calculate the block score $S_{block}$: (6)
Compute the temporary statistics within the block: (7) (8)
Update the global statistics: (9) (10)
Update the temporary output: (11)
Fused output: concatenate all row blocks to form the output: $O = \mathrm{Concat}(O_1, O_2, \ldots, O_{L/T_r})$ (dimensions $L \times D_h$).
[0109] Finally, the heads are merged to obtain the self-attention output: (12)
The pseudocode is shown in Figure 5.
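The block-wise procedure above can be sketched in plain Python for one head. This is an illustrative sketch, not the pseudocode of Figure 5: it follows the standard FlashAttention online-softmax updates, assumes the block score is $Q_i K_j^T / \sqrt{D_h}$ computed directly (no Winograd transform), and uses the hypothetical names `row_max`, `row_sum`, and `acc` for the running row maximum, running row sum, and temporary output.

```python
import math


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(D_h)) V, one full row of scores at a time."""
    d_h = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h) for k in K]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / total
                    for c in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, t_r, t_c):
    """Block-wise attention that never materializes the full L x L score matrix."""
    d_h = len(Q[0])
    out = []
    for i0 in range(0, len(Q), t_r):                  # row blocks Q_i
        q_blk = Q[i0:i0 + t_r]
        row_max = [-math.inf] * len(q_blk)            # running row maximum
        row_sum = [0.0] * len(q_blk)                  # running row sum
        acc = [[0.0] * len(V[0]) for _ in q_blk]      # temporary output O_i
        for j0 in range(0, len(K), t_c):              # column blocks K_j, V_j
            k_blk, v_blk = K[j0:j0 + t_c], V[j0:j0 + t_c]
            for r, q in enumerate(q_blk):
                s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h)
                     for k in k_blk]                  # block scores for this row
                new_max = max(row_max[r], max(s))
                scale = math.exp(row_max[r] - new_max)  # rescales older statistics
                p = [math.exp(x - new_max) for x in s]
                row_sum[r] = row_sum[r] * scale + sum(p)
                acc[r] = [a * scale + sum(pw * v[c] for pw, v in zip(p, v_blk))
                          for c, a in enumerate(acc[r])]
                row_max[r] = new_max
        # Normalize once per row block, then append (concatenate) to the output.
        out.extend([[a / row_sum[r] for a in acc[r]] for r in range(len(q_blk))])
    return out


# Toy inputs (illustrative values): L = 4, D_h = 2, blocks of 2 rows / 2 columns.
Q = [[0.1 * (i + j) for j in range(2)] for i in range(4)]
K = [[0.2 * (i - j) for j in range(2)] for i in range(4)]
V = [[float(i), float(2 * i + 1)] for i in range(4)]
ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V, t_r=2, t_c=2)
max_err = max(abs(a - b) for x, y in zip(ref, tiled) for a, b in zip(x, y))
print(max_err < 1e-9)
```

Only one $T_r \times T_c$ tile of scores exists at a time, which is why the block size can be matched to on-chip memory such as BRAM.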
[0110] 3. Innovative optimization scheme for LLaMA2 distributed inference based on Aurora
Core innovation: an intelligent pipeline scheduling and data prefetching algorithm based on the Aurora high-speed protocol.
This solution proposes an Adaptive Pipeline Scheduling with Intelligent Prefetching (APSIP) algorithm, which improves performance by optimizing data transmission timing, caching strategies, and computation scheduling under a fixed 40 Gbps Aurora bandwidth.
[0111] Computation time modeling. The computation time of the i-th layer is: (13)
where $B$ is the batch size, $S$ is the sequence length, $d_{model}$ is the model dimension (4096 for LLaMA2-7B), $f_{FPGA}$ is the FPGA operating frequency, and $\eta$ is the computational efficiency.
[0112] Communication time modeling. The data transmission time between FPGAs is: (14)
where $D_{ij} = B \cdot S \cdot d_{model} \cdot 4$ bytes (FP32 precision), $BW_{eff} = 35$ Gbps (effective bandwidth after protocol overhead), and $L_{aurora} = 100$ ns (average Aurora link latency).
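To give a feel for the magnitudes involved, the sketch below evaluates a communication-time estimate from the stated parameters. It assumes the transmission time is the payload divided by the effective bandwidth plus the link latency, which is a common form but is not confirmed by the text; the function name `comm_time_s` is hypothetical.

```python
def comm_time_s(batch, seq_len, d_model=4096, bytes_per_elem=4,
                bw_eff_bps=35e9, link_latency_s=100e-9):
    """Assumed form: T_comm = D_ij / BW_eff + L_aurora (D_ij converted to bits)."""
    d_ij_bits = batch * seq_len * d_model * bytes_per_elem * 8
    return d_ij_bits / bw_eff_bps + link_latency_s


# One decoded token (B = 1, S = 1) versus a 512-token prompt (B = 1, S = 512).
print(f"{comm_time_s(1, 1) * 1e6:.2f} us")    # 3.84 us
print(f"{comm_time_s(1, 512) * 1e3:.3f} ms")  # 1.917 ms
```

Under these assumptions the fixed 100 ns latency is small next to the payload time, so most of the cost of a transfer scales with the activation size.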
[0113] Pipeline efficiency analysis. The bubble time in the pipeline is defined as: (15)
The Adaptive Pipeline Scheduling Algorithm (APSA) minimizes pipeline bubbles and improves overall system efficiency by predicting computation completion times and employing intelligent prefetching strategies.
[0114] For data transmission from layer i to layer j, the optimal prefetch start time is: (16)
When multiple data transmission requests occur simultaneously, the prefetch priority is calculated as follows: (17)
where $T_{slack}(j) = \max(0, \ldots)$ is the time margin; $D_{max}$ is the maximum data transfer volume, used for normalization; a critical path factor is also included; and $w_1, w_2, w_3$ are weight coefficients satisfying $\sum_i w_i = 1$.
[0115] Based on historical execution data, the least squares method is used to optimize $w$: (18)
The scheduling decision algorithm shown in Figure 6 is then used to obtain the scheduling result, and the scheduling is performed.
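The priority rule can be sketched as a weighted sum over the three ingredients named above (time margin, normalized data volume, critical path factor). The exact priority formula is not recoverable from the text, so the combination below is a hypothetical stand-in: the `urgency` term, the default weights, the `slack_scale_s` constant, and the `Request`, `priority`, and `schedule` names are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    slack_s: float     # time margin before the data is needed (T_slack)
    size_bytes: int    # data volume of this transfer
    critical: float    # critical path factor, assumed to lie in [0, 1]


def priority(req, d_max, w=(0.5, 0.3, 0.2), slack_scale_s=1e-6):
    """Hypothetical weighted sum; higher means the transfer should start earlier."""
    assert abs(sum(w) - 1.0) < 1e-9                       # weights sum to 1
    urgency = 1.0 / (1.0 + req.slack_s / slack_scale_s)   # little slack -> near 1
    return w[0] * urgency + w[1] * req.size_bytes / d_max + w[2] * req.critical


def schedule(requests):
    """Return request names ordered from highest to lowest priority."""
    d_max = max(r.size_bytes for r in requests)           # normalization constant
    # heapq is a min-heap, so priorities are negated; the index breaks ties.
    heap = [(-priority(r, d_max), i, r.name) for i, r in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


reqs = [Request("layer3->4", 5e-6, 16384, 0.2),
        Request("layer7->8", 0.0, 16384, 1.0),
        Request("layer1->2", 2e-6, 8192, 0.5)]
print(schedule(reqs))  # ['layer7->8', 'layer3->4', 'layer1->2']
```

A transfer with no slack on the critical path is served first. In the scheme described above, the weights would be fitted from historical execution data by least squares rather than fixed by hand.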
[0116] The key points and the matters to be protected in this invention mainly concern the following three aspects:
1. The Winograd algorithm is innovatively introduced into the self-attention mechanism of the Transformer, significantly reducing the number of multiplications in the $QK^T$ matrix multiplication through transformation and element-wise multiplication, thereby accelerating computation, together with a method for detecting sensor anomalies using the cross-correlation of multiple sensors.
2. The Winograd-optimized $QK^T$ computation is integrated into the block-based processing framework of FlashAttention, balancing improved computational efficiency, long-sequence processing capability, and GPU memory optimization in multi-head self-attention. A normal distribution and confidence level are proposed to determine whether the sensor has anomalies.
[0117] 3. For FPGA distributed inference environments, an adaptive pipeline scheduling and intelligent data prefetching algorithm is proposed. By predicting computation time, optimizing communication timing and priority management, pipeline bubbles are minimized and distributed LLM inference efficiency is improved.
[0118] Compared with the prior art, the advantages of the present invention are: First, the Winograd algorithm significantly improves computational efficiency when applied to the $QK^T$ computation in self-attention. In traditional methods, the $QK^T$ matrix multiplication in the self-attention mechanism involves a large number of multiplication operations, while Winograd greatly reduces the number of multiplications by converting them into element-wise operations (e.g., a reduction of approximately 36%, as described above). This directly results in faster computation while maintaining mathematical equivalence.
[0119] Secondly, the fusion of Winograd and FlashAttention demonstrates superior performance when processing long sequences. FlashAttention itself optimizes memory usage and computational efficiency, and combining it with Winograd further accelerates the most time-consuming $QK^T$ computation within its block-based processing framework. This means that the method can not only handle large models and long contexts efficiently, but also achieve further performance improvements on top of existing FlashAttention, making it particularly suitable for resource-constrained hardware platforms such as FPGAs, where efficiency is paramount.
[0120] Finally, the Aurora-based intelligent pipeline scheduling and data prefetching (APSIP) algorithm addresses the communication bottlenecks and inefficiencies in distributed inference. Compared to static or non-optimized scheduling schemes, APSIP minimizes pipeline bubbles (i.e., idle time during which devices wait for data or computation) by predicting computation time, optimizing communication timing, and using intelligent data prefetching. It can dynamically adjust scheduling strategies under a fixed bandwidth (e.g., 40 Gbps Aurora), prioritizing the transmission of critical data and thereby maximizing the utilization and throughput of computing resources in distributed FPGA clusters, which is crucial for distributed inference of large-scale models such as LLMs.
[0121] In summary, this invention provides a full-stack solution from underlying computational optimization to upper-level system scheduling, aiming to achieve faster, more efficient, and more resource-saving AI model inference, especially in dedicated hardware and distributed system environments.
[0122] The feasibility of this invention has been proven through experiments, simulations, and applications.
[0123] The overall experimental architecture used in the experiments and simulations consisted of three Xilinx ZYNQ UltraScale+ XCZU15EG-2FFVB1156I FPGA boards, physically connected via SFP+ and fiber optic cables. First, the quantized and assigned weight files were placed on the SD card of each board. After the boards were powered on, the weight files were read into DDR memory, and the bitstream was then downloaded to each FPGA. Finally, the code on the main control board was executed: given a prompt, the result was output, as shown in Figure 7.
[0124] The technical solution proposed in this invention is not only applicable to the current FPGA model, but also to various other Xilinx (AMD) models, and the supported large language models can also be extended to the GPT series.
[0125] Example 3
A storage medium storing program files capable of implementing any of the above-mentioned large language model deployment methods based on a distributed architecture.
[0126] In an exemplary embodiment, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the steps of deploying a large language model based on a distributed architecture. The computer storage medium can be any available medium or data storage device accessible by a computer, including but not limited to magnetic storage (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MO), etc.), optical storage (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor storage (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND flash), solid-state drives (SSDs)).
[0127] Example 4
A processor for running a program, wherein the program executes any of the above-mentioned methods for deploying large language models based on a distributed architecture.
[0128] In an exemplary embodiment, a computer device is also provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of deploying a large language model based on a distributed architecture. The processor may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0129] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0130] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0131] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection of units or modules may be electrical or other forms.
[0132] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0133] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0134] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0135] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A method for deploying large language models based on a distributed architecture, characterized in that it includes the following steps:
introducing the Winograd algorithm into the self-attention mechanism of the Transformer, reducing the number of multiplications in the $QK^T$ matrix multiplication through transformation and element-wise multiplication, and using the cross-correlation of multiple sensors to detect sensor anomalies;
integrating the Winograd-optimized $QK^T$ computation into the block processing framework of FlashAttention to improve computational efficiency and long-sequence processing capability and to optimize memory in multi-head self-attention, and using a normal distribution and confidence level to determine whether the sensor has anomalies;
employing adaptive pipeline scheduling and intelligent data prefetching algorithms to predict computation time and to optimize communication timing and priority management in FPGA distributed inference environments, reducing pipeline bubbles and improving distributed LLM inference efficiency.
2. The method for deploying a large language model based on a distributed architecture according to claim 1, characterized in that, By introducing the Winograd algorithm into the self-attention mechanism of Transformer, the data is first transformed, simple operations are performed in the transformation space, and then the result is transformed back.
3. The method for deploying a large language model based on a distributed architecture according to claim 2, characterized in that, in introducing the Winograd algorithm into the self-attention mechanism of the Transformer, the Q, K, and V matrices are first generated, where Q, K, and V correspond to the query, key, and value in self-attention;
where $X \in \mathbb{R}^{L \times D}$ is the input sequence embedding matrix, $L$ is the sequence length, and $D$ is the hidden dimension; $W_Q, W_K, W_V \in \mathbb{R}^{D \times D_h}$ are the projection weight matrices, $D_h = D/H$, where $H$ is the number of attention heads;
the input is projected into a low-dimensional space to prepare data for each head; Q, K, and V are then split into $H$ heads;
Winograd-optimized $QK^T$:
where $K_{block} \in \mathbb{R}^{m \times m}$ is a block of $K$; $U$ is the transformed $K$ block; $Q_{block} \in \mathbb{R}^{m \times m}$ is a block of $Q$; $V$ is the transformed $Q$ block; ⊙ denotes element-wise multiplication; $M$ is the intermediate result, which requires far fewer multiplications than the direct $QK^T$; the inverse transform yields $S_{block} \in \mathbb{R}^{(m-1) \times (m-1)}$;
the complete Winograd formula is as follows:
the above steps are integrated; for the entire matrix $S$, all blocks are computed one by one and concatenated.
4. The method for deploying a large language model based on a distributed architecture according to claim 3, characterized in that, for a single attention head, with all attention heads processed in parallel:
first, the data is prepared; input: $Q, K, V \in \mathbb{R}^{L \times D_h}$, $D_h = D/H$;
blocking: the sequence is divided into row blocks of size $T_r$ and column blocks of size $T_c$;
$Q$ is divided into row blocks: $Q_i \in \mathbb{R}^{T_r \times D_h}$, $i = 1$ to $L/T_r$;
$K$ and $V$ are divided into column blocks: $K_j, V_j \in \mathbb{R}^{T_c \times D_h}$, $j = 1$ to $L/T_c$;
the statistics vectors are initialized: for each row block $i$, $l_i = -\infty$, $m_i = 0$, $O_i = 0$;
the block score $S_{block}$ is calculated:
the temporary statistics within the block are computed:
the global statistics are updated:
the temporary output is updated:
fused output: all row blocks are concatenated to form the output: $O = \mathrm{Concat}(O_1, O_2, \ldots, O_{L/T_r})$ (dimensions $L \times D_h$);
finally, the heads are merged to obtain the self-attention output.
5. The method for deploying a large language model based on a distributed architecture according to claim 4, characterized in that performance is improved by optimizing data transmission timing, caching strategies, and computation scheduling under a fixed 40 Gbps Aurora bandwidth.
6. The method for deploying a large language model based on a distributed architecture according to claim 5, characterized in that:
computation time modeling, the computation time of the i-th layer:
where $B$ is the batch size, $S$ is the sequence length, $d_{model}$ is the model dimension, $f_{FPGA}$ is the FPGA operating frequency, and $\eta$ is the computational efficiency;
communication time modeling, the data transmission time between FPGAs:
where $D_{ij} = B \cdot S \cdot d_{model} \cdot 4$ bytes, $BW_{eff} = 35$ Gbps, and $L_{aurora} = 100$ ns;
pipeline efficiency analysis, defining the bubble time in the pipeline:
for data transmission from layer i to layer j, the optimal prefetch start time is:
when multiple data transmission requests occur simultaneously, the prefetch priority is calculated as follows:
where $T_{slack}(j) = \max(0, \ldots)$ is the time margin; $D_{max}$ is the maximum data transfer volume, used for normalization; a critical path factor is also included; $w_1, w_2, w_3$ are weight coefficients satisfying $\sum_i w_i = 1$;
based on historical execution data, the least squares method is used to optimize $w$:
the scheduling decision algorithm is used to obtain the scheduling result, and the scheduling is then performed.
7. The method for deploying a large language model based on a distributed architecture according to claim 6, characterized in that, The Adaptive Pipeline Scheduling Algorithm (APSA) minimizes pipeline bubbles and improves overall system efficiency by predicting computation completion time and employing intelligent prefetching strategies.
8. A large language model deployment system based on a distributed architecture, characterized in that it includes:
a transformation unit, used to introduce the Winograd algorithm into the self-attention mechanism of the Transformer, reducing the number of multiplications in the $QK^T$ matrix multiplication through transformation and element-wise multiplication, and to use the cross-correlation of multiple sensors to detect sensor anomalies;
an integration unit, used to integrate the Winograd-optimized $QK^T$ computation into the block processing framework of FlashAttention, improving computational efficiency and long-sequence processing capability and optimizing memory in multi-head self-attention, and to use a normal distribution and confidence level to determine whether the sensor has anomalies;
a prediction unit, used to predict computation time and to optimize communication timing and priority management in FPGA distributed inference environments using adaptive pipeline scheduling and intelligent data prefetching algorithms, reducing pipeline bubbles and improving distributed LLM inference efficiency.
9. A storage medium, characterized in that, The storage medium stores program files capable of implementing the large language model deployment method based on a distributed architecture as described in any one of claims 1 to 7.
10. A processor, characterized in that, The processor is used to run a program, wherein the program executes the large language model deployment method based on a distributed architecture as described in any one of claims 1 to 7.