Accelerator for generative pre-training model, accelerated inference method and electronic device

By designing an accelerator for generative pre-trained models, and employing in-situ computation and matrix operations to fuse nonlinear computation, the problems of resource waste and performance degradation of generative pre-trained models on FPGAs are solved, achieving efficient inference computation.

CN118228788B · Active · Publication Date: 2026-06-26 · TSINGHUA UNIVERSITY


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2024-04-02
Publication Date: 2026-06-26

Application Information

Patent Timeline

02 Apr 2024: Application
26 Jun 2026: Publication (CN118228788B)

IPC: G06N3/063; G06N3/0464; G06N3/047; G06N3/048

CPC: G06N3/063; G06N3/0464; G06N3/047; G06N3/048; Y02D10/00

AI Tagging

Technology Topics

Algorithm Theoretical computer science

AI Technical Summary

Technical Problem

Inference of generative pre-trained models on FPGAs suffers from nonlinear computational resource waste and performance degradation. In particular, the computational cost of Softmax and LayerNorm increases with the length of input data and the embedding dimension, leading to computational resource waste and performance degradation. Existing technologies have not been able to effectively solve this problem.

Method used

Design an accelerator that includes multiple accelerated computing modules and a normalized computing module. Employ in-situ computing to calculate the mean, variance, or maximum value and exponential function value in real time. Integrate nonlinear calculations through matrix operations, optimize data flow transformation to adapt to the computing needs of different stages, reduce computational complexity, and improve resource utilization.

Benefits of technology

It significantly improves the efficiency of accelerators in performing nonlinear operations, reduces computational complexity, improves hardware resource utilization and computational performance, adapts to the computational characteristics of different stages of generative pre-trained models, and enhances inference efficiency.

✦ Generated by Eureka AI based on patent content.

Abstract

The disclosure relates to an accelerator for a generative pre-training model, an accelerated inference method and an electronic device. The accelerator comprises a plurality of accelerated computing modules and a normalization computing module. Each accelerated computing module comprises a first nonlinear computing unit configured to calculate a mean and a variance corresponding to an nth token according to a currently obtained nth token and the already calculated mean and variance corresponding to an (n-1)th token, wherein 1≤n≤k, the mean and the variance corresponding to the 0th token are 0, and k is the length of a first token sequence for which layer normalization is to be calculated. The normalization computing module is configured to perform normalization calculation on each token in the first token sequence according to the mean and the variance corresponding to the kth token output by the first nonlinear computing unit, to obtain a layer normalization calculation result corresponding to the first token sequence. In this way, the operation efficiency and performance of the accelerator on the nonlinear operators in the generative pre-training model can be improved, thereby improving the inference efficiency of the generative pre-training model.


Description

Technical Field

[0001] This disclosure relates to the field of accelerator technology, and more particularly to an accelerator, accelerated inference method, and electronic device for generative pre-trained models.

Background Technology

[0002] With the rapid development of deep learning technology, generative pre-trained models (such as GPT-2 and ChatGPT) have achieved remarkable results in fields such as natural language processing and computer vision. These models have extremely large parameter scales, typically requiring significant computational resources and time for training and inference. To improve the performance of these models in practical applications, researchers are constantly exploring optimization algorithms and hardware acceleration methods. Dataflow transformation methods and nonlinear operator design are two important directions in this research.

[0003] Dataflow transformation methods primarily focus on how to efficiently organize data flows to achieve high-performance model inference on computing accelerators. Existing dataflow transformation methods typically achieve efficient scheduling of the model computation graph through operations such as window sliding and block partitioning of the input data. However, these methods still suffer from problems such as wasted computational resources and data transmission bottlenecks when dealing with large-scale models.

[0004] Nonlinear operator design aims to improve inference speed by reducing the computational complexity of models through the introduction of scalable nonlinear operations. Existing nonlinear operator design methods mainly focus on how to map complex nonlinear operations to simpler ones, such as accelerating matrix multiplication by utilizing the shared properties of polynomial fitting. However, these methods still have certain limitations in practical applications, such as wasted computational resources and loss of accuracy in nonlinear operations.

[0005] It is known that generative pre-trained models (such as GPT-2) contain three types of nonlinear operators: the activation function GELU, the normalized exponential function Softmax, and layer normalization (hereinafter referred to as LayerNorm). These operators require dedicated computational modules to be implemented in FPGAs for optimization. GELU, as an activation function, can be efficiently implemented using lookup tables or piecewise polynomial function fitting. However, the computational latency of Softmax and LayerNorm is affected by model configuration and load characteristics, which poses challenges to pipeline design.

[0006] The computational complexity of both Softmax and LayerNorm increases exponentially with the total number of elements processed. Specifically, the number of elements in Softmax is equal to the current statement length, while the number of elements in LayerNorm is equal to the embedding dimension of the model. During the hardware design phase, we do not know the features of the model to be deployed or the characteristics of the runtime load. Therefore, nonlinear operators are typically designed based on the maximum statement length and model width, resulting in significant resource waste. Furthermore, the irregular latency and resource consumption of nonlinear operators increase the complexity of exploring the overall system design space.

[0007] In short, nonlinear computations (primarily Softmax and LayerNorm) in inference of generative pre-trained models lead to resource waste and performance degradation. In particular, the computational cost of Softmax and LayerNorm increases exponentially with statement length and embedding dimension. For resource-constrained processors, length-sensitive nonlinear operator design is unacceptable. Furthermore, irregular computational latency increases pipeline design complexity.

[0008] While existing technologies can use graphics processing units (GPUs) to accelerate nonlinear computations in model inference, GPU-based inference schemes exhibit disadvantages in terms of energy consumption and latency as model dimensionality and sentence generation length increase. In contrast, field-programmable gate arrays (FPGAs), with their reconfigurable characteristics, may be more suitable for generative model inference scenarios. On the one hand, FPGAs offer higher on-chip memory capacity than GPUs, which is crucial for generation tasks requiring a low compute-to-memory ratio. On the other hand, FPGAs enable fine-grained pipeline design and flexible operator fusion to handle the complex execution flow and nonlinear operations of Transformers.

[0009] A series of FPGA accelerator studies based on the Transformer architecture have been carried out, achieving excellent performance. Examples include the DFX accelerator (Dynamic Function eXchange, which is an accelerator that utilizes the flexibility of programmable logic devices to allow runtime modifications to the running hardware design) for GPT model acceleration, DFX for BERT model acceleration, and ViA (an accelerator architecture for Visual Transformer (ViT) designed to efficiently execute Transformer applications) for ViT model (a model that applies Transformer to image classification).

[0010] However, the advantages of FPGAs are not fully utilized in current GPT accelerators. Specifically, deploying generative pre-trained model inference on FPGAs presents challenges, particularly due to resource waste and performance degradation caused by nonlinear computations (primarily Softmax and LayerNorm) in generative pre-trained models. The computational cost of Softmax and LayerNorm increases exponentially with statement length and embedding dimension. For FPGAs with limited resources, length-sensitive nonlinear operator design is unacceptable. Therefore, designing an accelerator capable of efficiently performing nonlinear computations in generative pre-trained models is crucial for improving their inference performance.

Summary of the Invention

[0011] In view of this, this disclosure proposes an accelerator, an accelerated inference method, and an electronic device for generative pre-trained models, which can improve the computational efficiency and performance of the accelerator for nonlinear operators in generative pre-trained models, and is beneficial to improving the inference efficiency of generative pre-trained models.

[0012] According to one aspect of this disclosure, an accelerator is provided, comprising: a plurality of accelerated computing modules and a normalization computing module, wherein each accelerated computing module comprises: a first nonlinear computing unit, configured to calculate the mean and variance corresponding to the nth word based on the currently acquired nth word and the mean and variance corresponding to the (n-1)th word already calculated, where 1≤n≤k, the mean and variance corresponding to the 0th word are 0, and k is the length of the first word sequence to be normalized; the normalization computing module is configured to perform normalization calculation on each word in the first word sequence based on the mean and variance corresponding to the kth word output by the first nonlinear computing unit, to obtain the layer normalization calculation result corresponding to the first word sequence.
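
The recurrence in the paragraph above can be sketched in software. The text does not give the update formula, so this minimal sketch assumes a Welford-style running update; the function names `update_stats` and `streaming_layernorm` and the `eps` parameter are hypothetical, chosen for illustration only.

```python
import math

def update_stats(n, x, mean_prev, var_prev):
    """Fold the n-th element (1-based) into the running mean and variance.

    Assumption: a Welford-style update. The source only states that the
    statistics for element n depend on element n and the statistics for
    element n-1, which start at 0.
    """
    delta = x - mean_prev
    mean = mean_prev + delta / n
    var = var_prev + (delta * (x - mean) - var_prev) / n
    return mean, var

def streaming_layernorm(xs, eps=1e-5):
    """Normalize xs using statistics accumulated one element at a time."""
    mean, var = 0.0, 0.0  # statistics for the "0th" element are 0
    for n, x in enumerate(xs, start=1):
        mean, var = update_stats(n, x, mean, var)
    # After the k-th element, mean and var describe the whole sequence,
    # so every element can be normalized in a final pass.
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for x in xs]
```

Only two running values are held while the elements arrive, so the state needed for the statistics does not grow with k.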

[0013] In one possible implementation, the accelerated computing module further includes: a second nonlinear computing unit, used to calculate the maximum value, exponential function value, and exponential function summation value corresponding to the i-th word element based on the currently acquired i-th word element and the already calculated maximum value and exponential function summation value corresponding to the (i-1)-th word element. The maximum value corresponding to the i-th word element is the larger of the maximum value corresponding to the (i-1)-th word element and the i-th word element itself. The maximum value corresponding to the 0th word element is the 1st word element, and the exponential function summation value corresponding to the 0th word element is 0. 1≤i≤h, where h is the length of the second word element sequence for which the normalized exponential function is to be calculated. The normalization computing module is further used to normalize the exponential function values corresponding to each word element in the second word element sequence based on the maximum value and exponential function summation value corresponding to the h-th word element output by the second nonlinear computing unit, and the maximum value corresponding to each word element in the second word element sequence, to obtain the normalized exponential function calculation result corresponding to the second word element sequence.
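
A software sketch of the running-maximum scheme described above follows. It assumes the widely used "online softmax" rescaling, in which the running sum is rescaled whenever the maximum changes; the source does not spell out this formula, and the name `streaming_softmax` is hypothetical.

```python
import math

def streaming_softmax(xs):
    """Softmax with a running maximum and a running sum of exponentials.

    Assumption: when the running maximum changes, the running sum is
    rescaled by exp(old_max - new_max). Requires a non-empty input.
    """
    running_max = xs[0]  # maximum for the "0th" element is the 1st element
    running_sum = 0.0    # summation value for the "0th" element is 0
    exps, maxes = [], []
    for x in xs:
        new_max = max(running_max, x)
        e = math.exp(x - new_max)  # exponential relative to the current max
        running_sum = running_sum * math.exp(running_max - new_max) + e
        running_max = new_max
        exps.append(e)
        maxes.append(new_max)  # per-element max, reused in the final pass
    # Final pass: bring each stored exponential to the final maximum and
    # divide by the final sum.
    return [e * math.exp(m - running_max) / running_sum
            for e, m in zip(exps, maxes)]
```

The stored per-element maxima play the role of "the maximum value corresponding to each word element" in the normalization step: each exponential was computed against the maximum known at that moment and must be corrected to the final one.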

[0014] In one possible implementation, the accelerated computing module further includes: a matrix multiplication and addition unit, comprising a multiplier array and an addition tree; the multiplier array includes multiple multipliers, each multiplier including a multiplexed input terminal and at least two ordinary input terminals, each multiplier being used to perform matrix multiplication operations on a first data stream input from the multiplexed input terminal and at least two second data streams input from the at least two ordinary input terminals respectively; the addition tree includes multiple adders in a tree structure, used to perform matrix addition operations on the matrix multiplication results output by each multiplier in the multiplier array to obtain the matrix multiplication and addition results between the first data stream and the at least two second data streams respectively.
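
As a behavioral illustration of the unit described above, the sketch below models each multiplier as taking one shared (multiplexed) operand and two ordinary operands, and reduces each product stream with a pairwise adder tree. It is a functional model under stated assumptions, not the hardware design; the names `shared_input_multiplier`, `adder_tree`, and `multiply_add` are hypothetical.

```python
def shared_input_multiplier(shared, a, b):
    """One multiplier: the shared operand is multiplied with both ordinary operands."""
    return shared * a, shared * b

def adder_tree(values):
    """Sum values by pairwise reduction, level by level, as a tree of adders would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

def multiply_add(shared_stream, stream_a, stream_b):
    """Dot products of the shared stream with each of the two ordinary streams."""
    products = [shared_input_multiplier(s, a, b)
                for s, a, b in zip(shared_stream, stream_a, stream_b)]
    return (adder_tree(p[0] for p in products),
            adder_tree(p[1] for p in products))
```

For example, `multiply_add([1, 2, 3], [4, 5, 6], [7, 8, 9])` yields the two dot products of the shared stream with each ordinary stream.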

[0015] In one possible implementation, the accelerator is used to perform inference on a generative pre-trained model, the inference of which includes an encoding phase and a decoding phase for a lexical sequence; wherein, in the encoding phase, a first data stream input from the multiplexed input includes weight data, and a second data stream input from each ordinary input includes lexical data to be encoded, the lexical data including a lexical sequence or lexical units or portions thereof; in the decoding phase, a first data stream input from the multiplexed input includes lexical data to be decoded, and a second data stream input from each ordinary input includes weight data.
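
The role swap between the two phases can be illustrated with a small functional sketch. It assumes two ordinary inputs per multiplier and models each phase as a pair of dot products; the names `dual_dot`, `encode_step`, and `decode_step` are hypothetical.

```python
def dual_dot(shared, a, b):
    """Dot products of one shared stream with two ordinary streams."""
    return (sum(s * x for s, x in zip(shared, a)),
            sum(s * x for s, x in zip(shared, b)))

def encode_step(weight_row, token_a, token_b):
    """Encoding phase: one weight row is the shared stream, reused across two tokens."""
    return dual_dot(weight_row, token_a, token_b)

def decode_step(token, weight_row_a, weight_row_b):
    """Decoding phase: the single token is the shared stream, reused across two weight rows."""
    return dual_dot(token, weight_row_a, weight_row_b)
```

In the encoding phase many tokens are available at once, so a weight can be reused across them. In the decoding phase only one new token is available per step, so the token is reused across weights instead.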

[0016] In one possible implementation, the generative pre-trained model comprises multiple layers, each layer including multi-head attention, a feedforward neural network, and residual connections; wherein the multi-head attention and the feedforward neural network each correspond to two matrix multiply-accumulate units; each multiplier in the matrix multiply-accumulate unit includes two ordinary input ports; wherein, for any current layer in the generative pre-trained model, one second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer to obtain a calculation result, and while the calculation result is input into the matrix multiply-accumulate unit corresponding to the feedforward neural network of the current layer, another second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, so that the two matrix multiply-accumulate units corresponding to the multi-head attention and the feedforward neural network of the current layer alternately process the two second data streams; the current layer includes any attention mechanism layer or feedforward network layer; the accelerated computing module further includes a residual computing unit for calculating the residuals corresponding to the residual connections in each layer.
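
The alternation described above can be pictured as a time-slot schedule. The sketch below assumes exactly two data streams, equal-length slots, and one attention unit plus one feedforward unit per layer position; the function name `alternating_schedule` and the stream labels "A" and "B" are hypothetical.

```python
def alternating_schedule(num_layers, streams=("A", "B")):
    """Return, per time slot, which (stream, layer) each unit is processing.

    Assumption: a stream's feedforward step runs in the slot right after
    its attention step, while the other stream occupies the attention unit.
    """
    slots = [{"mha": None, "ffn": None} for _ in range(2 * num_layers + 1)]
    for layer in range(num_layers):
        for offset, stream in enumerate(streams):
            slots[2 * layer + offset]["mha"] = (stream, layer)
            slots[2 * layer + offset + 1]["ffn"] = (stream, layer)
    return slots
```

With `num_layers=2`, the attention unit is idle only in the last slot and the feedforward unit only in the first, so both units are busy in every other slot.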

[0017] According to another aspect of this disclosure, an accelerated inference method for generative pre-trained models is provided. The generative pre-trained model includes multiple layer normalization operators. The method uses an accelerator to execute the layer normalization operators. The accelerator includes multiple accelerated computation modules and a normalization computation module. Each accelerated computation module includes a first nonlinear computation unit. The method includes: for any layer normalization operator, obtaining the nth word to be processed by the layer normalization operator; using the first nonlinear computation unit, calculating the mean and variance corresponding to the nth word based on the currently obtained nth word and the mean and variance corresponding to the (n-1)th word already calculated, where 1 ≤ n ≤ k, the mean and variance corresponding to the 0th word are 0, and k is the length of the first word sequence to be normalized; using the normalization computation module, performing normalization calculations on each word in the first word sequence based on the mean and variance corresponding to the kth word, to obtain the layer normalization calculation result corresponding to the first word sequence.

[0018] In one possible implementation, the generative pre-trained model further includes a normalized exponential function operator, the accelerated computation module includes a second nonlinear computation unit, and the method further includes: obtaining the i-th word to be processed by the normalized exponential function operator; using the second nonlinear computation unit to calculate the maximum value, exponential function value, and exponential function summation value corresponding to the i-th word based on the currently obtained i-th word and the maximum value, exponential function value, and exponential function summation value corresponding to the (i-1)-th word already calculated, wherein the maximum value corresponding to the i-th word includes the maximum value corresponding to the (i-1)-th word and the maximum value in the i-th word, the maximum value and exponential function summation value corresponding to the 0th word are 0, 1≤i≤h, and h is the length of the second word sequence for which the normalized exponential function is to be calculated; using the normalized computation module to normalize the exponential function value corresponding to each word in the second word sequence based on the maximum value and exponential function summation value corresponding to the h-th word and the maximum value corresponding to each word in the second word sequence, to obtain the normalized exponential function calculation result corresponding to the second word sequence.

[0019] In one possible implementation, the accelerated computing module further includes a matrix multiplication and addition unit; the matrix multiplication and addition unit includes a multiplier array and an addition tree; the multiplier array includes multiple multipliers, each multiplier including a multiplexed input terminal and at least two ordinary input terminals, each multiplier being used to perform matrix multiplication operations on a first data stream input from the multiplexed input terminal and at least two second data streams respectively input from the at least two ordinary input ports; the addition tree includes multiple adders in a tree structure, used to perform matrix addition operations on the matrix multiplication results output by each multiplier in the multiplier array to obtain the first... The method further includes: in the encoding phase of the generative pre-trained model, inputting weight data to the multiplexed input terminals of the multiplier array, and inputting lexical data to be encoded to each ordinary input terminal of the multiplier array, the lexical data including a lexical sequence or lexical units or parts of lexical units; in the decoding phase of the generative pre-trained model, inputting lexical data to be decoded to the multiplexed input terminals of the multiplier array, and inputting weight data to each ordinary input terminal of the multiplier array.

[0020] In one possible implementation, the generative pre-trained model includes multiple layers, each layer including multi-head attention, a feedforward neural network, and residual connections; wherein, the multi-head attention and the feedforward neural network each correspond to two matrix multiply-accumulate units; each multiplier in the matrix multiply-accumulate unit includes two ordinary input ports; the accelerated computing module further includes a residual computing unit; wherein, the method further includes: for any current layer in the generative pre-trained model, loading one second data stream input to the current layer into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, obtaining the calculation result and inputting it into the matrix multiply-accumulate unit corresponding to the feedforward neural network of the current layer, while simultaneously loading another second data stream input to the current layer into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, so that the two matrix multiply-accumulate units corresponding to the multi-head attention and the feedforward neural network of the current layer alternately process the two second data streams; and using the residual computing unit to calculate the residuals corresponding to the residual connections in each layer.

[0021] According to another aspect of this disclosure, an electronic device is provided, comprising: the aforementioned accelerator.

[0022] According to various aspects of this disclosure, by utilizing the first nonlinear computing unit to perform in-situ calculations, the mean and variance of the acquired word units are calculated in real time. Then, the normalization calculation module performs normalization calculations on the entire first word unit sequence based on the mean and variance output by the first nonlinear computing unit. This eliminates the need to wait for the entire first word unit sequence to be input before performing layer normalization calculations. This can significantly improve the computational efficiency of the accelerator in performing the nonlinear operation of layer normalization, reduce the computational complexity of layer normalization, and improve the hardware resource utilization and computational performance of the accelerator. Furthermore, by utilizing the second nonlinear computing unit to perform in-situ calculations, the maximum value, exponential function value, and sum of exponential functions of the acquired word units can be calculated in real time. Additionally, the normalization calculation module can be used to normalize the exponential function values of the entire second word sequence based on the maximum value, exponential function value, and sum of exponential functions output by the second nonlinear computing unit. This eliminates the need to wait for the entire second word sequence to be input before performing the normalized exponential function calculation, significantly improving the computational efficiency of the accelerator in performing the nonlinear operation of the normalized exponential function, reducing the computational complexity of the normalized exponential function, and improving the hardware resource utilization and computational performance of the accelerator.

[0023] Other features and aspects of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Attached Figure Description

[0024] The accompanying drawings, which are included in and form part of this specification, illustrate exemplary embodiments, features, and aspects of this disclosure together with the specification and serve to explain the principles of this disclosure.

[0025] Figure 1 shows a schematic diagram of an accelerator according to an embodiment of the present disclosure.

[0026] Figure 2 shows a schematic diagram of the hardware structure of a first nonlinear computing unit according to an embodiment of the present disclosure.

[0027] Figure 3 shows a schematic diagram of an accelerator according to an embodiment of the present disclosure.

[0028] Figure 4 shows a schematic diagram of the hardware structure of a second nonlinear computing unit according to an embodiment of the present disclosure.

[0029] Figure 5 shows a schematic diagram of an accelerator according to an embodiment of the present disclosure.

[0030] Figure 6 shows a schematic diagram of the hardware structure of a matrix multiply-add unit according to an embodiment of the present disclosure.

[0031] Figure 7 shows a schematic diagram of the hardware structure of a multiplier according to an embodiment of the present disclosure.

[0032] Figure 8 shows a schematic diagram of an accelerator according to an embodiment of the present disclosure.

[0033] Figure 9 shows a pipeline design for data stream input according to an embodiment of the present disclosure.

Detailed Implementation

[0034] Various exemplary embodiments, features, and aspects of this disclosure will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.

[0035] The term “exemplary” as used herein means “serving as an example, embodiment, or illustration.” Any embodiment illustrated herein as “exemplary” is not necessarily to be construed as superior to or better than other embodiments.

[0036] It should be understood that the terms "first" and "second" in this disclosure are used only to distinguish different objects, not to describe a specific order, nor should they be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this disclosure, "multiple" means two or more, unless otherwise explicitly specified. Furthermore, the terms "comprising" and "including" indicate the presence of the described feature, whole, step, operation, element, and / or component, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and / or sets thereof.

[0037] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art will understand that this disclosure can be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure.

[0038] As mentioned above, a series of FPGA accelerator studies based on the Transformer architecture have been carried out. For example, the NPE accelerator is an FPGA hardware accelerator for Natural Language Processing (NLP) tasks. It improves computational performance and efficiency by implementing operators customized for NLP algorithms. NPE can be configured for specific NLP tasks, such as word vector generation, machine translation, sentiment analysis, etc. In some scenarios, NPE can significantly reduce the inference time of neural network models and improve deployment speed. Although NPE improves the computational performance of NLP tasks to a certain extent, it still has the following disadvantages: (1) It is only suitable for short sentences. NPE simplifies nonlinear operations by using polynomial function fitting, but this strategy is only suitable for short sentences of a certain length and cannot be used for longer sentence workloads. (2) Waste of resources. NPE is customized for a single task, which may lead to waste of hardware resources. When the length of the task workload sentence is less than the longest designed sentence length, the utilization rate of hardware resources is not high. (3) Insufficient flexibility. NPE is designed for BERT networks, and the sentence length and model width are both designed deterministically. However, the scenarios faced by GPT often cannot predict the sentence length. At this time, the lack of flexibility of NPE is exposed. In other words, although NPE accelerators have high computing performance in specific scenarios, they still have certain limitations.

[0039] As another example, Softermax is an efficient softmax approach proposed for generative pre-trained models (such as Transformers) through hardware / software co-design. Its aim is to improve model performance while reducing computational complexity, memory consumption, and computation time. The main innovation of Softermax lies in utilizing hardware acceleration and software optimization to achieve efficient softmax computation. However, in practical applications, Softermax still lacks sufficient support for the growth of model embedding dimensions. Softermax only simplifies the nonlinear softmax computation, and while it supports changes in statement length well, it does not consider the nonlinear computations related to the growth of model embedding dimensions.

[0040] Furthermore, the ViA accelerator, by analyzing the data flow and computation flow of ViT, uses a partitioning strategy designed to reduce the impact of data locality in image data and improve computational and memory access efficiency. In addition, two accelerators with internal stream processing engines are designed to reduce the path dependence caused by the shortcut mechanism and to fully utilize hardware resources to efficiently execute Transformers. Although ViA outperforms related FPGA-based accelerators, computational and memory access bottlenecks still exist when processing large-scale image data. Moreover, when processing Natural Language Processing (NLP) tasks, ViA may not fully leverage its performance advantages due to differences in data and model structures. This performance degradation stems from ViA's lack of specific optimization for large-scale inputs and wide models, leading to inefficient nonlinear computations.

[0041] In summary, current hardware acceleration methods for accelerating inference in generative pre-trained models mainly suffer from the following problems: Nonlinear computations (such as Softmax and LayerNorm) in the Transformer structure of generative pre-trained models lead to significant waste of computational resources and performance degradation. As the length of input data increases, the computational cost of these nonlinear computations grows exponentially. For FPGAs with limited hardware resources, this design is unacceptable due to the resource waste and performance degradation caused by nonlinear computations. Simultaneously, irregular computational latency also increases the technical complexity of pipeline design.

[0042] Furthermore, generative pre-trained models comprise two distinct task phases: encoding and decoding (i.e., generation). These two phases exhibit significantly different computational load characteristics (e.g., compute-to-memory ratio), and this difference widens as model size increases. Existing techniques often focus on only one phase without specifically designing for the computational load characteristics of both phases. This leads to inefficient use of hardware resources, impacting overall model performance. Generative pre-trained models, with their two-phase processing and variable-length sentence processing, differ significantly from traditional Transformer architectures. While existing solutions (such as DFX) focus on optimizing one phase, they do not adapt to changes in computational load characteristics. Moreover, the nonlinear optimizations (such as piecewise polynomial methods) that approaches such as NPE or ViA apply to BERT or ViT are not suitable for generative pre-trained models with varying token lengths.

[0043] In view of this, embodiments of this disclosure provide an accelerator and an accelerated inference method for generative pre-trained models. For nonlinear computations, the method optimizes data flow transformation to reduce computational complexity and improve resource utilization and performance. For example, more efficient nonlinear operators, such as depthwise separable convolution and exponentially weighted averages, can be used to reduce wasted computational resources. A flexible data flow order adjustment method is also proposed to accommodate the computational needs of different stages, given the differences in computational characteristics between the encoding and decoding stages. Furthermore, dedicated accelerated computation modules can be designed for these two stages to achieve efficient computation with limited resources.

[0044] In other words, this disclosure provides an accelerator supporting data flow transformation for inference of generative pre-trained models. This accelerator can be an FPGA-based accelerator, and it proposes a fusion optimization of nonlinear operators. Regarding the optimization strategy for nonlinear operators, Softmax and LayerNorm are designed for in-situ computation, making the on-chip resource consumption of these Softmax and LayerNorm operators independent of the processed sequence length. Furthermore, matrix operations are used to cover the latency caused by nonlinear computation; that is, nonlinear computation and matrix operations can be fused, and the resulting latency can be covered, simplifying the complexity of design space exploration. Additionally, considering the different computational load characteristics of the encoding and decoding stages in generative pre-trained models, a two-stage data flow transformation method is proposed, and a two-stage alternating input pipeline is designed to reduce the number of off-chip communication accesses and on-chip cache overhead. Therefore, the accelerator proposed in this disclosure can overcome the shortcomings of existing technologies while achieving more efficient and general-purpose acceleration of inference computation for generative pre-trained models, improving the performance of generative pre-trained models in practical applications.

[0045] The accelerator proposed in the embodiments of this disclosure is described in detail below with reference to Figures 1 to 8.

[0046] Figure 1 shows a schematic diagram of the structure of an accelerator 00 according to an embodiment of the present disclosure. As shown in Figure 1, the accelerator includes:

[0047] Multiple accelerated computing modules 11 and a normalization calculation module 22, wherein each accelerated computing module 11 includes:

[0048] The first nonlinear calculation unit 01 is used to calculate the mean and variance corresponding to the nth word based on the currently acquired nth word and the mean and variance corresponding to the (n-1)th word that have been calculated, 1≤n≤k, the mean and variance corresponding to the 0th word are 0, and k is the length of the first word sequence of the normalized layer to be calculated.

[0049] The normalization calculation module 22 is used to perform normalization calculation on each word in the first word sequence based on the mean and variance of the k-th word output by the first nonlinear calculation unit 01, so as to obtain the layer normalization calculation result corresponding to the first word sequence.

[0050] It should be understood that generative pre-trained models typically include multiple LayerNorm operators. The first word sequence can be any word sequence required for computation by any LayerNorm operator during the inference process of the generative pre-trained model; that is, the first word sequence can be any normalized word sequence of the layer to be computed. In practical applications, the first word sequence may be in the form of a vector, matrix, etc., and includes multiple words. A word can be understood as an element in the word sequence. The length k of the first word sequence is also the number of words (i.e., the number of elements) contained in the first word sequence.

[0051] Figure 2 shows a schematic diagram of the hardware structure of the first nonlinear computing unit 01 provided in an embodiment of the present disclosure. As shown in Figure 2, "×" represents a multiplier, "+" represents an adder, and "-" represents a subtractor. $x_n$ represents the nth word, $M_{n-1}$ represents the mean corresponding to the (n-1)th word, $S_{n-1}$ represents the variance corresponding to the (n-1)th word, $M_n$ represents the mean corresponding to the nth word, and $S_n$ represents the variance corresponding to the nth word. Multiplier 011 is used to calculate the product of $x_n$ and $\frac{1}{n}$. Multiplier 012 is used to calculate the product of $M_{n-1}$ and $\frac{n-1}{n}$. Adder 013 is used to add the outputs of the two multipliers 011 and 012 to obtain the mean $M_n$ corresponding to the nth word. Subtractor 014 is used to calculate the difference $(x_n - M_{n-1})$ between $x_n$ and $M_{n-1}$. Multiplier 015 is used to multiply the output $(x_n - M_{n-1})$ of subtractor 014 by itself to obtain the square value $(x_n - M_{n-1})^2$, and then to multiply the square value by $\frac{n-1}{n}$ to obtain the product $\frac{n-1}{n}(x_n - M_{n-1})^2$. Adder 016 is used to add the output of multiplier 015 to $S_{n-1}$ to obtain the variance $S_n$ corresponding to the nth word.

[0052] The process by which the first nonlinear calculation unit 01 shown in Figure 2 above calculates the mean and variance corresponding to the nth word can be expressed as formula (1) and formula (2):

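[0053] $$M_n = \frac{1}{n}\,x_n + \frac{n-1}{n}\,M_{n-1} \quad (1)$$

[0054] $$S_n = S_{n-1} + \frac{n-1}{n}\,(x_n - M_{n-1})^2 \quad (2)$$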

[0055] It should be understood that, for n = 1, 2, 3, ..., k, the first nonlinear calculation unit 01 can iteratively execute the above calculation process for the 1st to the kth word, until the mean and variance corresponding to the kth word are calculated. The above iterative process can be represented by the following pseudocode:

[0056]
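The iterative process can be sketched in Python as follows. This is a minimal illustration, not the pseudocode of the original filing: it uses floating-point arithmetic instead of fixed-point hardware, the function name `running_mean_and_s` is hypothetical, and the weights 1/n and (n-1)/n are an assumption, chosen because they are the ones that make $M_n$ the running mean and $S_n$ the running sum of squared deviations.

```python
def running_mean_and_s(xs):
    """Single pass over xs, keeping only the previous mean and S.

    Returns (M_k, S_k), where S_k is the sum of squared deviations,
    so S_k / k is the variance of the whole sequence.
    """
    mean, s = 0.0, 0.0  # M_0 = 0 and S_0 = 0
    for n, x in enumerate(xs, start=1):
        diff = x - mean                        # subtractor: x_n - M_{n-1}
        new_mean = x / n + mean * (n - 1) / n  # two multipliers and one adder
        s = s + diff * diff * (n - 1) / n      # squared difference, scaled, accumulated
        mean = new_mean
    return mean, s


mean_k, s_k = running_mean_and_s([1, 2, 3, 4])
```

For the sequence [1, 2, 3, 4] this yields a mean of 2.5 and an S value of 5.0, so S/k = 1.25 matches the variance computed over the full sequence.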

[0057] Understandably, the mean and variance corresponding to the k-th word obtained through the above iterative calculation are also the mean and variance of the entire first word sequence. Furthermore, the process in which the normalization calculation module 22 performs normalization calculation on each word in the first word sequence, based on the mean and variance of the k-th word output by the first nonlinear calculation unit 01, to obtain the layer normalization calculation result corresponding to the first word sequence may include: dividing the variance $S_k$ corresponding to the k-th word by k to obtain the quotient $S_k/k$, and taking the square root of $S_k/k$ to obtain $\sqrt{S_k/k}$; then subtracting the mean $M_k$ corresponding to the k-th word from each word $x_n$ in the first word sequence to obtain the difference $(x_n - M_k)$, and dividing the difference by $\sqrt{S_k/k}$ to obtain the layer normalization result of each word in the first word sequence, that is, the layer normalization calculation result corresponding to the first word sequence. The calculation process of the normalization calculation module 22 can be expressed as formula (3):

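[0058] $$y_n = \frac{x_n - M_k}{\sqrt{S_k / k}} \quad (3)$$ where $y_n$ denotes the layer normalization result of the nth word.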

[0059] In practical applications, the normalization calculation module 22 may include a subtractor array and a divider array, and the number of arrays can be consistent with the parallelism of the matrix calculation of the accelerated computing module 11. The embodiments of the present disclosure do not limit the hardware structure of the normalization calculation module 22, as long as it can realize the normalization calculation shown in formula (3) above. Since n = 1, 2, 3, ..., k, the normalization calculation module 22 can perform the above calculation process on each word in the first word sequence in a loop. The loop calculation process can be represented as the following pseudocode:

[0060]
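The loop can be sketched in Python as follows. This is a hedged illustration, not the original pseudocode: the function name `layer_norm` is hypothetical, floating-point values stand in for the fixed-point data, and no epsilon term or learned scale and bias is applied because the text does not mention them.

```python
import math


def layer_norm(xs, mean_k, s_k):
    """Normalize every word with the final running statistics.

    mean_k is the mean of the whole sequence and s_k is the sum of
    squared deviations, so sqrt(s_k / k) is the standard deviation.
    Assumes s_k > 0 (a constant sequence would divide by zero).
    """
    k = len(xs)
    std = math.sqrt(s_k / k)                    # divide by k, then square root
    return [(x - mean_k) / std for x in xs]     # subtract, then divide


ys = layer_norm([1, 2, 3, 4], 2.5, 5.0)
```

This is the only step that touches every word of the sequence again, which corresponds to the single full-sequence access described for the normalization calculation module 22.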

[0061] It should be understood that the embodiments of this disclosure do not limit the bit width, precision, etc., of the above-mentioned words, mean, variance, and the values derived from the variance. For example, the words and the mean can be represented using 6 integer bits plus 2 fractional bits, and the variance and the values derived from it can be represented using 10 integer bits plus 6 fractional bits.

[0062] As is known, the calculation process of the LayerNorm operator mainly includes three calculations: calculating the mean of the word sequence, calculating the variance of the word sequence, and normalizing the word sequence. Traditional accelerators need to perform three full sequence accesses on the entire first word sequence, and the three calculations can only be performed after all words in the first word sequence have been obtained. The calculation results of the mean, variance, and normalization are dependent on each other. However, the accelerator provided by the present disclosure embodiment can reduce the number of full sequence accesses to one, that is, the first word sequence is only accessed in the normalization calculation module 22. The first nonlinear calculation unit 01 uses an in-situ calculation method to calculate the mean and variance of the entire first word sequence without waiting for the complete first word sequence to be calculated. This can improve the calculation efficiency of layer normalization and improve the hardware resource utilization of the accelerator.

[0063] According to the embodiments of this disclosure, by using the first nonlinear computing unit to perform in-situ calculations, the mean and variance of the acquired word units are calculated in real time. Then, the normalization calculation module performs normalization calculations on the entire first word unit sequence based on the mean and variance output by the first nonlinear computing unit. This eliminates the need to wait for the entire first word unit sequence to be input before performing layer normalization calculations, which can significantly improve the computational efficiency of the accelerator in performing the nonlinear operation of layer normalization, reduce the computational complexity of layer normalization, and improve the hardware resource utilization and computational performance of the accelerator.

[0064] As described above, the generative pre-trained model also includes the Softmax operator, which is likewise a nonlinear operator. To improve the computational efficiency of the accelerator for the Softmax operator, embodiments of this disclosure also provide the accelerator shown in Figure 3. As shown in Figure 3, each accelerated computing module 11 in accelerator 00 may further include:

[0065] The second nonlinear calculation unit 02 is used to calculate the maximum value, exponential function value, and exponential function summation value corresponding to the i-th word based on the currently acquired i-th word and the already calculated maximum value and exponential function summation value corresponding to the (i-1)-th word. The maximum value corresponding to the i-th word is the larger of the maximum value corresponding to the (i-1)-th word and the i-th word. The maximum value corresponding to the 0th word is the 1st word, and the exponential function summation value corresponding to the 0th word is 0. 1≤i≤h, where h is the length of the second word sequence for which the normalized exponential function is to be calculated.

[0066] The normalization calculation module 22 is also used to normalize the exponential function values corresponding to each word in the second word sequence based on the maximum value and the sum of the exponential function output by the second nonlinear calculation unit 02, as well as the maximum value corresponding to each word in the second word sequence, to obtain the normalized exponential function calculation result corresponding to the second word sequence.

[0067] It should be understood that a generative pre-trained model may include at least one layer of Softmax operator. The second word sequence can be the word sequence that the Softmax operator needs to calculate during the inference process of the generative pre-trained model. That is, the second word sequence can be any word sequence for which the normalized exponential function is to be calculated. In practical applications, the second word sequence may be in the form of vectors, matrices, etc. The second word sequence includes multiple words, and the length h of the second word sequence is the number of words contained in the second word sequence.

[0068] Figure 4 shows a schematic diagram of the hardware structure of the second nonlinear computing unit 02 provided in an embodiment of the present disclosure. As shown in Figure 4, "IntMax" represents a device used to take the maximum value, "+" represents an adder, "-" represents a subtractor, "f(x) = 2^x" represents a device used to calculate the base-2 exponential function, and ">>" represents a device used to perform right shift operations, where shifting data to the right by a certain number of bits is equivalent to dividing the data by the corresponding power of 2. $x_i$ represents the i-th word, $m_i$ represents the maximum value corresponding to the i-th word, $e_i$ represents the exponential function value corresponding to the i-th word, $\mathrm{sum}_i$ represents the exponential function summation value corresponding to the i-th word, $m_{i-1}$ represents the maximum value corresponding to the (i-1)th word, and $\mathrm{sum}_{i-1}$ represents the exponential function summation value corresponding to the (i-1)th word. Buffer 024 is used to cache the maximum value corresponding to each word, and buffer 025 is used to cache the exponential function value corresponding to each word. IntMax is used to read the maximum value $m_{i-1}$ corresponding to the (i-1)th word from buffer 024, determine the larger of $m_{i-1}$ and the i-th word $x_i$ to obtain the maximum value $m_i$ corresponding to the i-th word, and cache $m_i$ in buffer 024. Subtractor 021 is used to calculate the difference $(x_i - m_i)$ between the i-th word $x_i$ and the maximum value $m_i$ corresponding to the i-th word. "f(x) = 2^x" is used to raise 2 to the power $(x_i - m_i)$ to obtain the exponential function value $e_i$ corresponding to the i-th word, which is cached in buffer 025. Subtractor 022 is used to calculate the difference $(m_i - m_{i-1})$ between the maximum value $m_i$ corresponding to the i-th word and the maximum value $m_{i-1}$ corresponding to the (i-1)th word. ">>" is used to shift the exponential function summation value $\mathrm{sum}_{i-1}$ corresponding to the (i-1)th word to the right by $(m_i - m_{i-1})$ bits to obtain the right-shift result $\mathrm{sum}_{i-1} \gg (m_i - m_{i-1})$. Adder 023 is used to add the right-shift result $\mathrm{sum}_{i-1} \gg (m_i - m_{i-1})$ to the exponential function value $e_i$ corresponding to the i-th word to obtain the exponential function summation value $\mathrm{sum}_i$ corresponding to the i-th word.

[0069] It should be noted that the exponential function "f(x) = 2^x" shown in Figure 4 above is one possible implementation provided by the present disclosure. In fact, those skilled in the art can select any known exponential function to calculate the exponential function value of a word, and the present disclosure does not limit this.

[0070] The process by which the second nonlinear calculation unit 02 shown in Figure 4 above calculates the maximum value, exponential function value, and exponential function summation value corresponding to the i-th word can be expressed as formula (4), formula (5) and formula (6):

[0071] $$m_i = \mathrm{IntMax}(m_{i-1},\, x_i) \quad (4)$$

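[0072] $$e_i = 2^{\,x_i - m_i} \quad (5)$$

[0073] $$\mathrm{sum}_i = \big(\mathrm{sum}_{i-1} \gg (m_i - m_{i-1})\big) + e_i \quad (6)$$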

[0074] Where i = 1, 2, 3, ..., h, the second nonlinear calculation unit 02 can perform the above calculation process iteratively for the 1st to the hth word, until the maximum value, exponential function value, and sum of exponential functions corresponding to the hth word are calculated. The above iterative process can be represented by the following pseudocode:

[0075]
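The iterative process can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original pseudocode: the function name `online_softmax_stats` is hypothetical, the words are assumed to be integers (as the name IntMax suggests) so that every shift amount is a whole number of bits, and the right shift is modeled as an exact division by a power of 2 using `math.ldexp` on floating-point values.

```python
import math


def online_softmax_stats(xs):
    """One pass over integer words xs (non-empty).

    Returns (maxima, exps, m_h, sum_h). The two lists play the role
    of buffers 024 and 025: the running maximum and the exponential
    function value stored for each word.
    """
    m_prev = xs[0]   # maximum value of the 0th word is the 1st word
    sum_prev = 0.0   # summation value of the 0th word is 0
    maxima, exps = [], []
    for x in xs:
        m = max(m_prev, x)                                # running maximum
        e = math.ldexp(1.0, x - m)                        # 2 ** (x - m)
        s = math.ldexp(sum_prev, -(m - m_prev)) + e       # rescale old sum, add e
        maxima.append(m)
        exps.append(e)
        m_prev, sum_prev = m, s
    return maxima, exps, m_prev, sum_prev


maxima, exps, m_h, sum_h = online_softmax_stats([1, 3, 2])
```

For the words [1, 3, 2] the running maxima are [1, 3, 3], the stored exponential values are [1.0, 1.0, 0.5], and the final summation value is 1.75, which equals 2^(1-3) + 2^(3-3) + 2^(2-3).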

[0076] It should be understood that the maximum value corresponding to the h-th word is also the maximum value of the words in the entire second word sequence, and the exponential function summation value corresponding to the h-th word is also the sum of the exponential function values of all words in the entire second word sequence. Furthermore, the process in which the normalization calculation module 22 performs normalization calculation on the exponential function value corresponding to each word in the second word sequence, based on the maximum value and the exponential function summation value corresponding to the h-th word output by the second nonlinear calculation unit 02 and on the maximum value corresponding to each word in the second word sequence, to obtain the normalized exponential function calculation result corresponding to the second word sequence may include: calculating the difference $(m_h - m_i)$ between the maximum value $m_h$ corresponding to the h-th word and the maximum value $m_i$ corresponding to each word; shifting the exponential function value $e_i$ corresponding to each word to the right by $(m_h - m_i)$ bits to obtain the right-shift result $e_i \gg (m_h - m_i)$; and dividing the right-shift result $e_i \gg (m_h - m_i)$ by the exponential function summation value $\mathrm{sum}_h$ corresponding to the h-th word to obtain the normalized exponential function value $y_i$ corresponding to each word in the second word sequence, that is, the normalized exponential function calculation result corresponding to the second word sequence. In this case, the calculation process of the normalization calculation module 22 can be expressed as formula (7):

[0077] $$y_i = \frac{e_i \gg (m_h - m_i)}{\mathrm{sum}_h} \quad (7)$$
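As a hedged check on the normalization step, the following short derivation sketches why shifting each stored value and dividing by the final summation value reproduces a base-2 softmax. It assumes that the stored exponential value is $e_i = 2^{x_i - m_i}$, that the summation value is updated as $\mathrm{sum}_i = (\mathrm{sum}_{i-1} \gg (m_i - m_{i-1})) + e_i$ with $\mathrm{sum}_0 = 0$, and that a right shift by $b$ bits is an exact multiplication by $2^{-b}$ (rounding in fixed-point hardware is ignored).

```latex
\begin{aligned}
e_i \gg (m_h - m_i) &= 2^{x_i - m_i}\cdot 2^{-(m_h - m_i)} = 2^{x_i - m_h},\\
\mathrm{sum}_i &= \sum_{j=1}^{i} 2^{x_j - m_i}
  \quad\text{(by induction on } i\text{)},
  \qquad\text{so}\qquad
  \mathrm{sum}_h = \sum_{j=1}^{h} 2^{x_j - m_h},\\
y_i &= \frac{2^{x_i - m_h}}{\sum_{j=1}^{h} 2^{x_j - m_h}}
     = \frac{2^{x_i}}{\sum_{j=1}^{h} 2^{x_j}} .
\end{aligned}
```

The induction step uses $\mathrm{sum}_{i-1}\cdot 2^{-(m_i - m_{i-1})} = \sum_{j<i} 2^{x_j - m_i}$, which is why only the previous maximum and the previous summation value need to be kept.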

[0078] As described above, the normalization calculation module 22 may include a subtractor array and a divider array. The subtractor array can perform subtraction operations, and the divider array can perform right shift operations and division operations. The number of arrays can be consistent with the parallelism of the matrix calculation of the accelerated computing module 11. The embodiments of the present disclosure do not limit the hardware structure of the normalization calculation module 22, as long as it can realize the normalization calculation shown in formula (7) above. Since i = 1, 2, 3, ..., h, the normalization calculation module 22 can perform the above calculation process cyclically on each word in the second word sequence. The cyclic calculation process implemented by the normalization calculation module 22 in this case can be represented as the following pseudocode:

[0079]
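The loop can be sketched in Python as follows. This is a hedged illustration, not the original pseudocode: the function name `normalize_softmax` is hypothetical, the per-word maxima and exponential values are passed in as plain lists standing in for the buffers, and the right shift is modeled as an exact division by a power of 2.

```python
import math


def normalize_softmax(maxima, exps, m_h, sum_h):
    """Rescale each stored exponential value to the final maximum,
    then divide by the final summation value.

    maxima[i] and exps[i] are the running maximum and exponential
    value recorded when word i was processed; m_h and sum_h are the
    values after the last word.
    """
    result = []
    for m_i, e_i in zip(maxima, exps):
        shifted = math.ldexp(e_i, -(m_h - m_i))  # e_i >> (m_h - m_i)
        result.append(shifted / sum_h)           # divide by sum_h
    return result


# Statistics for the words [1, 3, 2]: maxima [1, 3, 3], values [1, 1, 0.5]
ys = normalize_softmax([1, 3, 3], [1.0, 1.0, 0.5], 3, 1.75)
```

With these inputs the outputs are 1/7, 4/7 and 2/7, which is the base-2 softmax of [1, 3, 2].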

[0080] It should be understood that the embodiments of this disclosure do not limit the bit width, precision, etc., of the aforementioned words, maximum values, exponential function values, exponential function summation values, and normalized exponential function values. For example, words and maximum values can be represented using 6 integer bits plus 2 fractional bits, exponential function summation values can be represented using 10 integer bits plus 6 fractional bits, exponential function values can be represented using 1 integer bit plus 15 fractional bits, and normalized exponential function values can be represented using 1 integer bit plus 7 fractional bits, etc.

[0081] As is known, the calculation process of the Softmax operator mainly includes calculating the maximum value of the word sequence, calculating the exponential function value and the sum of the exponential functions of the word sequence, and normalizing the exponential function value of the word sequence. Traditional accelerators need to perform multiple full sequence accesses on the entire second word sequence. However, the accelerator provided by the present embodiment can reduce the number of full sequence accesses to one, that is, the full sequence access of the second word sequence is only performed in the normalization calculation module 22. The second nonlinear calculation unit 02 uses in-situ calculation to calculate the maximum value, exponential function value, and sum of the exponential functions of the entire second word sequence without waiting for the complete second word sequence to be calculated. This can improve the calculation efficiency of the normalized exponential function and improve the hardware resource utilization of the accelerator.

[0082] According to the embodiments of this disclosure, by utilizing the second nonlinear computing unit to perform in-situ calculations, the maximum value, exponential function value, and sum of exponential functions of the obtained words are calculated in real time. Furthermore, the normalization calculation module performs normalization calculations on the exponential function values of the entire second word sequence based on the maximum value, exponential function value, and sum of exponential functions output by the second nonlinear computing unit. This eliminates the need to wait for the entire second word sequence to be input before performing the normalized exponential function calculation, significantly improving the computational efficiency of the accelerator in performing the nonlinear operation of the normalized exponential function, reducing the computational complexity of the normalized exponential function, and improving the hardware resource utilization and computational performance of the accelerator.

[0083] In summary, the accelerator 00 provided in this embodiment employs in-situ computation to efficiently accelerate the computation of two nonlinear operators (Softmax and LayerNorm). The normalization computation module 22 required for the two nonlinear computations can be implemented by constructing a shared divider array and subtractor array. The above optimizations for nonlinear operators ensure that the computational latency of the nonlinear operators increases linearly with the length of the processed sequence, rather than exponentially. Consequently, Softmax (excluding normalization) can be covered by a single-head attention matrix computation, and the mean and variance in LayerNorm can be covered by a single-layer attention or feedforward neural network. Compared with conventional accelerators executing the above nonlinear operators, the accelerator provided in this embodiment can save approximately 20% of on-chip hardware resources.

[0084] According to embodiments of this disclosure, by in-situ computation of LayerNorm and Softmax, and redesigning their algorithms and hardware implementations, over 20% of accelerator on-chip resources are saved, enabling a nonlinear operator design decoupled from sequence length. Furthermore, by in-situ computation of Softmax and LayerNorm, the resulting latency can be covered by matrix operations, optimizing performance for long statement tasks and potentially delivering up to a 1.62x performance improvement for long output scenarios.

[0085] As can be seen, in addition to the aforementioned nonlinear operators, generative pre-trained models also require matrix multiplication and matrix addition operations. Therefore, embodiments of this disclosure also provide the accelerator 00 shown in Figure 5. As shown in Figure 5, the accelerated computing module 11 may also include a matrix multiply-add unit 03;

[0086] As shown in Figure 6, the matrix multiply-add unit 03 includes a multiplier array and an adder tree;

[0087] The multiplier array includes multiple multipliers (such as the multiple "×" shown in Figure 6). Each multiplier includes one multiplexed input terminal and at least two ordinary input terminals (e.g., the one multiplexed input terminal and two ordinary input terminals shown in Figure 6), and each multiplier is used to perform matrix multiplication on the first data stream input from the multiplexed input terminal and the at least two second data streams input from the at least two ordinary input terminals, respectively;

[0088] The adder tree includes multiple adders arranged in a tree structure (such as the multiple columns of tree-structured "+" shown in Figure 6). The adders are used to perform matrix addition operations on the matrix multiplication results output by each multiplier in the multiplier array to obtain the matrix multiply-add results between the first data stream and the at least two second data streams.

[0089] It should be understood that the present disclosure does not limit the number of multipliers and adders contained in the matrix multiply-add unit 03 or their connection relationship. Any matrix multiply-add unit constructed using an organization of multiplier array and adder tree is within the protection scope of the present disclosure.

[0090] Figure 7 shows a schematic diagram of the hardware structure of a multiplier in the matrix multiply-add unit 03 according to an embodiment of this disclosure. As shown in Figure 7, each multiplier includes a device "<<" for left shift operations, an adder "+", and a multiplier "×". "<<" shifts the second data stream B input from one ordinary input terminal to the left by multiple bits. The number of bits shifted can be the same as the number of bits in the data stream; for example, if the second data stream is 8 bits, it can be shifted 8 bits to the left. The adder then adds the left-shifted second data stream B to the second data stream C input from the other ordinary input terminal, which is equivalent to concatenating the second data stream B and the second data stream C. The multiplier then multiplies the first data stream A input from the multiplexed input terminal by the concatenated result. In this way, the products of the first data stream with the two second data streams (i.e., A×B and A×C) are calculated simultaneously, enabling matrix multiplication of one first data stream with two second data streams and thereby improving the computational efficiency of matrix multiplication operations on the accelerator.

[0091] Based on the multiplier shown in Figure 7 above, the matrix multiply-add unit 03 proposed in this embodiment of the present disclosure adopts an organization of a multiplier array and an adder tree. This is equivalent to decomposing the digital signal processor (DSP) so that one multiplier can perform two 8-bit matrix multiplication operations at the same time while reusing the data stream input from one of its input terminals. For example, with three 8-bit elements A, B and C as input, it can output A×B and A×C.
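The packing idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the circuit of Figure 7: the function name `packed_multiply` is hypothetical, the operands are assumed to be unsigned 8-bit integers, and the sketch shifts B by 16 bits (wider than the 8-bit example in the text) so that the two 16-bit products occupy disjoint bit fields and can be separated without correction logic.

```python
def packed_multiply(a, b, c, shift=16):
    """Compute a*b and a*c with a single wide multiplication.

    a, b, c are assumed to be unsigned 8-bit values (0..255), so
    a*c < 2**16 and never spills into the field that holds a*b.
    """
    packed = (b << shift) + c            # concatenate B and C
    product = a * packed                 # equals ((a*b) << shift) + a*c
    high = product >> shift              # recovers a*b
    low = product & ((1 << shift) - 1)   # recovers a*c
    return high, low


ab, ac = packed_multiply(200, 150, 77)
```

Here A = 200 is the reused operand, and the single product yields 200×150 = 30000 and 200×77 = 15400. Signed operands or a narrower shift would need an extra correction step, which this sketch does not cover.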

[0092] It should be noted that the number of accelerated computing modules 11 shown in Figures 1 to 7 is an exemplary implementation provided by this disclosure. In practice, those skilled in the art can design the number of accelerated computing modules 11 included in the accelerator according to actual needs, and this disclosure does not limit this. Furthermore, those skilled in the art can also design the structure and layout of each unit and module in the accelerator according to actual needs. For example, the first nonlinear computing unit 01, the second nonlinear computing unit 02, and the matrix multiply-add unit 03 can be independent of the accelerated computing module 11. The accelerated computing module 11 may also include other computing units, and other devices may be provided in the accelerator 00. For example, the accelerator shown in Figure 8 may include a global cache 33. The global cache 33 can be used to cache the layer normalization calculation results and the normalized exponential function calculation results output by the normalization calculation module 22. It can also be used to cache the calculation results of each calculation unit in the accelerated computing module 11 (such as the mean and variance output by the first nonlinear calculation unit 01, or the calculation results output by other calculation units, etc.). This embodiment of the present disclosure does not limit this.

[0093] As described above, the accelerator is used to implement inference of the generative pre-trained model. Inference of the generative pre-trained model includes two distinct task stages: encoding and decoding of the word sequence. The computational load characteristics (such as computation-to-memory ratio) of these two stages differ significantly, and this difference widens further as the model size increases, leading to inefficient utilization of hardware resources and consequently affecting the overall performance of the model. The design of the matrix multiply-accumulate unit 03 proposed in this embodiment allows for the reuse of one data path during matrix operations, providing a hardware foundation for data stream transformation during the encoding and decoding stages (i.e., the generation stage).

[0094] Therefore, in one possible implementation, during the encoding stage, the first data stream input to the multiplexed input terminal of the multiplier includes weight data, and the second data stream input to each ordinary input terminal includes the word data to be encoded, where the word data to be encoded includes a word sequence, or one word or some of the words in the word sequence; and during the decoding stage, the first data stream input to the multiplexed input terminal includes the word data to be decoded, and the second data stream input to each ordinary input terminal includes weight data. It should be understood that this disclosure does not limit the bit width of the first and second data streams; for example, the first data stream can be 128 bits, and the two second data streams can total 256 bits.

[0095] The above method can be summarized as follows: in the matrix operations of the encoding stage, a column of weight data is reused across multiple sets of input word data and multiplied simultaneously with multiple rows of input word data; in the matrix operations of the generation stage, the input word data is reused and multiplied simultaneously with multiple columns of weight data.
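The two reuse patterns can be sketched in Python as follows. This is a simplified software analogy, not the hardware data path: the function name `reuse_dot` and the small integer vectors are hypothetical, and two ordinary inputs per multiplier are assumed.

```python
def reuse_dot(shared, stream_b, stream_c):
    """Two dot products that share one operand, mirroring a multiplier
    with one multiplexed (reused) input and two ordinary inputs."""
    acc_b = acc_c = 0
    for a, b, c in zip(shared, stream_b, stream_c):
        acc_b += a * b   # A x B, accumulated as an adder tree would
        acc_c += a * c   # A x C
    return acc_b, acc_c


# Encoding stage: one weight column is reused across two rows of word data.
weight_col = [1, 2, 3]
word_row0, word_row1 = [1, 0, 1], [2, 2, 2]
enc = reuse_dot(weight_col, word_row0, word_row1)

# Generation stage: the single word vector is reused across two weight columns.
word = [1, 0, 1]
weight_col0, weight_col1 = [1, 2, 3], [2, 2, 2]
dec = reuse_dot(word, weight_col0, weight_col1)
```

The same routine serves both stages; only the roles of the operands are swapped. In the encoding stage many words are available at once, so each weight is fetched once and reused across them. In the generation stage there is only one new word per step, so that word is reused while weights stream through.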

[0096] The above approach can be understood as a two-stage data flow transformation, or in other words, using different data flows for the encoding and generation stages to adapt to the different computation-to-memory ratios of the two stages. The motivation for this data flow optimization is that the computational density of the encoding stage is much higher than that of the generation stage. The encoding stage benefits from the parallelism brought about by the simultaneous input of multiple tokens, while the generation stage needs to keep the weight data as streaming as possible, thereby improving the computational performance of the accelerator and thus improving the inference speed of the generative pre-trained model.

[0097] According to embodiments of this disclosure, by optimizing the design of the matrix multiplication and addition unit, different data flows are adopted for the encoding stage and the generation stage to adapt to their different computation-to-memory ratios. That is, by adopting a two-stage data flow transformation strategy adapted to the generative pre-trained model, different data flow reuse strategies are adopted for the different computation-to-memory ratios of the encoding stage and the generation stage, which can bring about a speed improvement of about 1.37 times.

[0098] It is known that the generative pre-trained model includes multiple layers, each of which includes multi-head attention, a feedforward neural network, and residual connections. In practical applications, a LayerNorm layer is also set between the multi-head attention layer and the feedforward neural network layer. It should be understood that the layer normalization calculation in the LayerNorm layer can be performed using the first nonlinear calculation unit and the normalization calculation module in the aforementioned acceleration calculation module 11.

[0099] In the computation of a generative pre-trained model, the residual connections in each layer introduce path dependencies across the whole model computation, which affect the pipeline (i.e., the input of the data stream). Eliminating this effect would normally require more on-chip resources to store the additional data. However, since residual connections follow a regular pattern in both the encoding and generation stages, the effect can be reduced by modifying the model's pipeline mapping and providing a residual computation unit. That is, the accelerated computation module 11 can also be configured with a residual computation unit that calculates the residuals corresponding to the residual connections in each layer.

[0100] Specifically, multi-head attention and the feedforward neural network can be mapped to two matrix multiply-accumulate units respectively; in other words, the operations of multi-head attention and of the feedforward neural network are each implemented by their own matrix multiply-accumulate unit.

[0101] Each multiplier in the matrix multiply-accumulate unit can include two ordinary input ports. Furthermore, for any current layer in the generative pre-trained model, one second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer to obtain the calculation result. At the same time, the calculation result is input into the matrix multiply-accumulate unit corresponding to the feedforward neural network of the current layer, and another second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, so that the two matrix multiply-accumulate units corresponding to the multi-head attention and the feedforward neural network of the current layer alternately process the two second data streams. In addition, the residual calculation unit in the accelerated calculation module can calculate the residuals corresponding to the residual connections in each layer, wherein the residual calculation can be implemented synchronously with the calculation of the multi-head attention.

[0102] For example, Figure 9 illustrates the pipeline design for the data stream input. As shown in Figure 9, MHA denotes multi-head attention and FFN denotes the feedforward neural network, and a two-stage pipeline consisting of MHA and FFN is used. During the data input phase, one second data stream, input_B, is loaded into the MHA for computation. After computation, the output is passed to the FFN for further computation. Simultaneously, the MHA loads another second data stream, input_C, and so on; the MHA and FFN of subsequent layers alternately process these two second data streams. By mapping the MHA and FFN to two matrix multiply-accumulate units respectively and providing a residual calculation unit to calculate the residuals of each layer, in particular by pre-calculating the residuals, the path dependency caused by conventional residual connection calculation can be avoided and on-chip resource overhead can be reduced. This approach is suitable for both the encoding and generation stages.
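
The alternation of the two second data streams can be sketched as a simple schedule. This is a hedged illustration, not the disclosed hardware control logic: the function `schedule`, the tuple layout, and the assumption that MHA and FFN each take exactly one time step are all hypothetical simplifications (the disclosure notes that inter-layer latency varies at run time).

```python
def schedule(num_layers, streams=("input_B", "input_C")):
    """Return a list of (step, mha_job, ffn_job).

    Each job is a (stream, layer) pair, or None when the unit is idle.
    Assumes (hypothetically) that MHA and FFN each take one time step.
    """
    first, second = streams
    steps = []
    for t in range(2 * num_layers + 1):
        layer = t // 2
        if t % 2 == 0:
            # Even step: the first stream enters MHA of `layer`,
            # while the second stream finishes FFN of the previous layer.
            mha = (first, layer) if layer < num_layers else None
            ffn = (second, layer - 1) if layer >= 1 else None
        else:
            # Odd step: the first stream moves on to FFN of `layer`,
            # while the second stream enters MHA of the same layer.
            mha = (second, layer)
            ffn = (first, layer)
        steps.append((t, mha, ffn))
    return steps


plan = schedule(2)
for step, mha_job, ffn_job in plan:
    print(step, "MHA:", mha_job, "FFN:", ffn_job)
```

In this model each stream always finishes the FFN of one layer before entering the MHA of the next, and apart from the first and last step both units are busy at the same time.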

[0103] According to embodiments of this disclosure, by dividing the multi-head attention and feedforward neural network into two stages, a pipeline mapping of two-level alternating input data streams is adopted, which can adapt to the computational flow of generative pre-trained models with inter-layer residual connections and runtime-varying inter-layer computational latency, resulting in a speed improvement of 1.61 times.

[0104] Based on the accelerator proposed in the above embodiments of this disclosure, this disclosure also provides a flowchart of an accelerated inference method for generative pre-trained models, wherein the generative pre-trained model includes multiple layer normalization operators, the method uses an accelerator to execute the layer normalization operators, the accelerator includes multiple accelerated computing modules and a normalization computing module, wherein each accelerated computing module includes: a first nonlinear computing unit, and the accelerated inference method includes:

[0105] Step S101: For any layer normalization operator, obtain the nth word that the layer normalization operator needs to process;

[0106] Step S102: Using the first nonlinear calculation unit, calculate the mean and variance corresponding to the nth word based on the currently acquired nth word and the already calculated mean and variance corresponding to the (n-1)th word, where 1≤n≤k, the mean and variance corresponding to the 0th word are 0, and k is the length of the first word sequence for which layer normalization is to be calculated.

[0107] Step S103: Using the normalization calculation module, normalization calculation is performed on each word in the first word sequence based on the mean and variance corresponding to the k-th word, to obtain the layer normalization calculation result corresponding to the first word sequence.
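
Steps S101 to S103 can be sketched in Python as follows. The disclosure states only that the mean and variance for the nth word are obtained from the nth word and the statistics of the (n-1)th word; the specific recurrence used here (a Welford-style update of the population variance) is an assumption chosen for illustration. The names `update_stats`, `running_stats`, `layer_norm` and the `eps` parameter are hypothetical, each word is treated as a single scalar, and the learned scale and bias of LayerNorm are omitted.

```python
import math


def update_stats(n, x, mean_prev, var_prev):
    """One in-place step: statistics of the first n values from those of the first n-1."""
    delta = x - mean_prev
    mean = mean_prev + delta / n
    # Population-variance recurrence (assumed, Welford-style).
    var = (n - 1) / n * (var_prev + delta * delta / n)
    return mean, var


def running_stats(values):
    """Steps S101/S102: consume values one at a time, starting from mean = var = 0."""
    mean, var = 0.0, 0.0            # statistics of the "0th" word are 0
    for n, x in enumerate(values, start=1):
        mean, var = update_stats(n, x, mean, var)
    return mean, var


def layer_norm(values, eps=1e-5):
    """Step S103: normalize every value with the final (k-th) mean and variance."""
    mean, var = running_stats(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * scale for x in values]


final_mean, final_var = running_stats([1.0, 2.0, 3.0, 4.0])
normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Because the statistics are updated as each value arrives, only the final normalization has to wait for the whole sequence.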

[0108] In one possible implementation, the generative pre-trained model further includes a normalized exponential function (Softmax) operator, the accelerated computation module includes a second nonlinear computation unit, and the method further includes: obtaining the i-th word to be processed by the normalized exponential function operator; using the second nonlinear computation unit to calculate the maximum value, exponential function value, and exponential function summation value corresponding to the i-th word, based on the currently obtained i-th word and the already calculated maximum value, exponential function value, and exponential function summation value corresponding to the (i-1)-th word, wherein the maximum value corresponding to the i-th word includes the maximum value corresponding to the (i-1)-th word and the maximum value in the i-th word, the maximum value and exponential function summation value corresponding to the 0th word are 0, 1≤i≤h, and h is the length of the second word sequence for which the normalized exponential function is to be calculated; and using the normalization calculation module to normalize the exponential function value corresponding to each word in the second word sequence, based on the maximum value and exponential function summation value corresponding to the h-th word and the maximum value corresponding to each word in the second word sequence, to obtain the normalized exponential function calculation result corresponding to the second word sequence.
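
The running maximum, exponential value, and exponential sum described above resemble what is commonly called an online softmax. The following Python sketch is one possible reading, not the disclosed circuit: the rescaling of the running sum whenever the maximum changes is an assumption, the names `online_softmax` and `reference_softmax` are hypothetical, each word is treated as a single scalar, and the input is assumed to be non-empty. The running maximum is seeded with the first element, as claim 1 states, rather than with 0.

```python
import math


def online_softmax(values):
    """Single pass that keeps a running max and a running sum of exponentials."""
    running_max = values[0]          # seed with the first element (see claim 1)
    running_sum = 0.0                # exponential sum of the "0th" word is 0
    exps, maxes = [], []
    for x in values:
        new_max = max(running_max, x)
        # Rescale the old sum to the new maximum, then add the new term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
        exps.append(math.exp(x - new_max))   # exponential relative to the max seen so far
        maxes.append(new_max)                # per-element maximum, needed for the final step
    # Final normalization uses the last max, the last sum, and each element's own max.
    return [e * math.exp(m - running_max) / running_sum for e, m in zip(exps, maxes)]


def reference_softmax(values):
    """Conventional two-pass softmax, used only for comparison."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 3.0, 2.0, 5.0]
probs = online_softmax(scores)
```

The final step needs each element's own running maximum because its stored exponential was computed relative to the maximum known at that time, not the final one.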

[0109] In one possible implementation, the accelerated computing module further includes a matrix multiplication and addition unit; the matrix multiplication and addition unit includes a multiplier array and an addition tree; the multiplier array includes multiple multipliers, each multiplier including a multiplexed input terminal and at least two ordinary input terminals, each multiplier being used to perform matrix multiplication operations on a first data stream input from the multiplexed input terminal and at least two second data streams respectively input from the at least two ordinary input ports; the addition tree includes multiple adders in a tree structure, used to perform matrix addition operations on the matrix multiplication results output by each multiplier in the multiplier array to obtain the first... The method further includes: in the encoding phase of the generative pre-trained model, inputting weight data to the multiplexed input terminals of the multiplier array, and inputting lexical data to be encoded to each ordinary input terminal of the multiplier array, the lexical data including a lexical sequence or lexical units or parts of lexical units; in the decoding phase of the generative pre-trained model, inputting lexical data to be decoded to the multiplexed input terminals of the multiplier array, and inputting weight data to each ordinary input terminal of the multiplier array.
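
The multiplier array and addition tree can be pictured with a small Python sketch. It is an assumption-laden illustration: the names `adder_tree` and `mac_unit` are hypothetical, each multiplier is modeled with exactly two ordinary inputs, and each second data stream is given its own pairwise reduction, which the disclosure does not specify.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, as a binary adder tree would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)          # pad odd levels so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def mac_unit(shared, ordinary_a, ordinary_b):
    """Multiplier k receives shared[k] on its multiplexed input and
    ordinary_a[k], ordinary_b[k] on its two ordinary inputs."""
    products_a = [s * a for s, a in zip(shared, ordinary_a)]
    products_b = [s * b for s, b in zip(shared, ordinary_b)]
    return adder_tree(products_a), adder_tree(products_b)


# Encoding stage (assumed mapping): shared = weight column, ordinary = two token rows.
# Decoding stage (assumed mapping): shared = token vector, ordinary = two weight columns.
result = mac_unit([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

Swapping which operand feeds the multiplexed input is all that changes between the two stages in this model; the arithmetic is identical.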

[0110] In one possible implementation, the generative pre-trained model includes multiple layers, each layer including multi-head attention, a feedforward neural network, and residual connections; wherein, the multi-head attention and the feedforward neural network each correspond to two matrix multiply-accumulate units; each multiplier in the matrix multiply-accumulate unit includes two ordinary input ports; the accelerated computing module further includes a residual computing unit; wherein, the method further includes: for any current layer in the generative pre-trained model, loading one second data stream input to the current layer into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, obtaining the calculation result and inputting it into the matrix multiply-accumulate unit corresponding to the feedforward neural network of the current layer, while simultaneously loading another second data stream input to the current layer into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, so that the two matrix multiply-accumulate units corresponding to the multi-head attention and the feedforward neural network of the current layer alternately process the two second data streams; and using the residual computing unit to calculate the residuals corresponding to the residual connections in each layer.

[0111] In some embodiments, each step of the method provided in this disclosure can be implemented with reference to the description of the accelerator embodiments above; for brevity, the details are not repeated here.

[0112] This disclosure also proposes an electronic device, including: the accelerator described in the above-described embodiments of this disclosure.

[0113] In one possible implementation, the above-mentioned electronic device may further include: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-mentioned accelerated inference method when executing the instructions stored in the memory.

[0114] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code is run in a processor of an electronic device, the processor in the electronic device executes the above-described accelerated inference method.

[0115] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of accelerators and accelerated inference methods according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, unit, segment, or portion of an instruction, which contains one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0116] The various embodiments of this disclosure have been described above. These descriptions are exemplary, not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. An accelerator, characterized in that it comprises: multiple accelerated computing modules and a normalization calculation module, wherein each accelerated computing module includes: a first nonlinear calculation unit, used to calculate the mean and variance corresponding to the nth word based on the currently acquired nth word and the already calculated mean and variance corresponding to the (n-1)th word, where 1≤n≤k, the mean and variance corresponding to the 0th word are 0, and k is the length of the first word sequence for which layer normalization is to be calculated; the normalization calculation module is used to perform normalization calculation on each word in the first word sequence based on the mean and variance corresponding to the kth word output by the first nonlinear calculation unit, so as to obtain the layer normalization calculation result corresponding to the first word sequence; the accelerated computing module further includes: a second nonlinear calculation unit, used to calculate the maximum value, exponential function value, and exponential function summation value of the i-th word based on the currently acquired i-th word and the already calculated maximum value and exponential function summation value of the (i-1)-th word, wherein the maximum value of the i-th word includes the maximum value of the (i-1)-th word and the maximum value in the i-th word, the maximum value of the 0th word is the 1st word, the exponential function summation value of the 0th word is 0, 1≤i≤h, and h is the length of the second word sequence for which the normalized exponential function is to be calculated; the normalization calculation module is further configured to perform normalization calculation on the exponential function values corresponding to each word in the second word sequence based on the maximum value and the exponential function summation value output by the second nonlinear calculation unit, and the maximum value corresponding to each word in the second word sequence, so as to obtain the normalized exponential function calculation result corresponding to the second word sequence.

2. The accelerator according to claim 1, characterized in that, The accelerated computing module also includes: Matrix multiplication and addition unit, including multiplier array and adder tree; The multiplier array includes multiple multipliers, each multiplier including a multiplexed input terminal and at least two ordinary input terminals, and each multiplier is used to perform matrix multiplication on a first data stream input from the multiplexed input terminal and at least two second data streams input from the at least two ordinary input ports respectively; The addition tree includes multiple adders in a tree structure, used to perform matrix addition operations on the matrix multiplication results output by each multiplier in the multiplier array, so as to obtain the matrix multiplication and addition results between the first data stream and the at least two second data streams respectively.

3. The accelerator according to claim 2, characterized in that, The accelerator is used to implement inference of the generative pre-trained model, which includes an encoding stage and a decoding stage for word sequences. In the encoding stage, the first data stream input by the multiplexed input terminal includes weight data, and the second data stream input by each ordinary input terminal includes word data to be encoded, wherein the word data includes a word sequence or a word in the word sequence or a part of a word; In the decoding stage, the first data stream input by the multiplexed input terminal includes the word data to be decoded, and the second data stream input by each ordinary input terminal includes weight data.

4. The accelerator according to claim 3, characterized in that, The generative pre-trained model comprises multiple layers, each layer including multi-head attention, a feedforward neural network, and residual connections; wherein, the multi-head attention and the feedforward neural network each correspond to two matrix multiply-accumulate units; each multiplier in the matrix multiply-accumulate unit includes two ordinary input ports; Specifically, for any current layer in the generative pre-trained model, one second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer to obtain a calculation result. Simultaneously, while the calculation result is input into the matrix multiply-accumulate unit corresponding to the feedforward neural network of the current layer, another second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer. This allows the two matrix multiply-accumulate units corresponding to the multi-head attention and feedforward neural network of the current layer to alternately process the two second data streams. The current layer includes any attention mechanism layer or feedforward network layer. The accelerated computing module also includes a residual calculation unit, which is used to calculate the residuals corresponding to the residual connections in each layer.

5. An accelerated inference method for generative pre-trained models, characterized in that the generative pre-trained model includes multiple layer normalization operators, the method uses an accelerator to execute the layer normalization operators, the accelerator includes multiple accelerated computing modules and a normalization calculation module, wherein each accelerated computing module includes a first nonlinear calculation unit, and the method includes: for any layer normalization operator, obtaining the nth word that the layer normalization operator needs to process; using the first nonlinear calculation unit to calculate the mean and variance corresponding to the nth word based on the currently acquired nth word and the already calculated mean and variance corresponding to the (n-1)th word, where 1≤n≤k, the mean and variance corresponding to the 0th word are 0, and k is the length of the first word sequence for which layer normalization is to be calculated; using the normalization calculation module to perform normalization calculation on each word in the first word sequence based on the mean and variance corresponding to the k-th word, so as to obtain the layer normalization calculation result corresponding to the first word sequence; the generative pre-trained model further includes a normalized exponential function operator, the accelerated computing module includes a second nonlinear calculation unit, and the method further includes: obtaining the i-th word that the normalized exponential function operator needs to process; using the second nonlinear calculation unit to calculate the maximum value, exponential function value, and exponential function summation value corresponding to the i-th word based on the currently acquired i-th word and the already calculated maximum value, exponential function value, and exponential function summation value corresponding to the (i-1)-th word, wherein the maximum value corresponding to the i-th word includes the maximum value corresponding to the (i-1)-th word and the maximum value in the i-th word, the maximum value and exponential function summation value corresponding to the 0th word are 0, 1≤i≤h, and h is the length of the second word sequence for which the normalized exponential function is to be calculated; and using the normalization calculation module to perform normalization calculation on the exponential function value corresponding to each word in the second word sequence, based on the maximum value and the exponential function summation value corresponding to the h-th word and the maximum value corresponding to each word in the second word sequence, thereby obtaining the normalized exponential function calculation result corresponding to the second word sequence.

6. The method according to claim 5, characterized in that, The accelerated computing module further includes a matrix multiplication and addition unit; the matrix multiplication and addition unit includes a multiplier array and an addition tree; the multiplier array includes multiple multipliers, each multiplier including a multiplexed input terminal and at least two ordinary input terminals, each multiplier being used to perform matrix multiplication operations on a first data stream input from the multiplexed input terminal and at least two second data streams input from the at least two ordinary input ports respectively; the addition tree includes multiple adders in a tree structure, used to perform matrix addition operations on the matrix multiplication results output by each multiplier in the multiplier array to obtain the matrix multiplication and addition results between the first data stream and the at least two second data streams respectively; The method further includes: In the encoding phase of the generative pre-trained model, weight data is input to the multiplexed input terminals of the multiplier array, and word data to be encoded is input to each ordinary input terminal of the multiplier array, the word data including a word sequence or a word or a part of a word sequence; In the decoding phase of the generative pre-trained model, the term data to be decoded is input to the multiplexed input terminals of the multiplier array, and weight data is input to each ordinary input terminal of the multiplier array.

7. The method according to claim 6, characterized in that, The generative pre-trained model comprises multiple layers, each layer including multi-head attention, a feedforward neural network, and residual connections; wherein the multi-head attention and the feedforward neural network each correspond to two matrix multiply-accumulate units; each multiplier in the matrix multiply-accumulate unit includes two ordinary input ports; the accelerated computing module further includes a residual computing unit; wherein the method further includes: For any current layer in the generative pre-trained model, one second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer. The calculation result is then input into the matrix multiply-accumulate unit corresponding to the feedforward neural network of the current layer. At the same time, another second data stream input to the current layer is loaded into the matrix multiply-accumulate unit corresponding to the multi-head attention of the current layer, so that the two matrix multiply-accumulate units corresponding to the multi-head attention and the feedforward neural network of the current layer alternately process the two second data streams. Furthermore, the residuals corresponding to the residual connections in each layer are calculated using the residual calculation unit.

8. An electronic device, characterized in that it comprises: the accelerator according to any one of claims 1 to 4.