Data processing method and related device
By sending intermediate results in batches to the matrix multiplication unit and vector operation unit during the attention mechanism computation of the transformer model, the problem of low computational efficiency caused by frequent data transmission is solved, and a more efficient computational pipeline layout and continuity are achieved.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: HUAWEI TECH CO LTD
- Filing Date: 2025-05-07
- Publication Date: 2026-06-18
AI Technical Summary
In the attention mechanism computation of the transformer model, the excessive number of data transfers between the matrix multiplication unit and the vector operation unit leads to low computational efficiency.
By obtaining intermediate results for a batch of data and sending them all at once to the corresponding computation unit, the number of data transfers between the matrix multiplication unit and the vector operation unit is reduced, thereby improving the continuity and parallelism of computation.
It reduces the data transmission frequency, improves computational efficiency, and enhances the pipeline density and computational continuity of the attention mechanism computation.
Description
A data processing method and related equipment
[0001] This application claims priority to Chinese Patent Application No. 202411493245.2, filed on October 23, 2024, entitled "A Data Processing Method and Related Equipment", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of artificial intelligence (AI) technology, and in particular to a data processing method, apparatus, chip system, computing device, computing device cluster, computer-readable storage medium, and computer program product.
Background Technology
[0003] With the continuous development of artificial intelligence (AI) technology, especially machine learning (ML) technology, the transformer model has emerged. The transformer model can process data sequences, such as text sequences, and perform natural language processing tasks.
[0004] Unlike traditional recurrent neural network (RNN) models, the transformer model is an attention-based model. Specifically, the transformer model consists of several linear layers with identical structures, each containing an attention mechanism computation unit used to perform attention mechanism calculations.
[0005] Attention mechanism computation involves matrix multiplication and vector operations, which are typically performed by different computational units. Furthermore, to improve computational speed, attention mechanism computation can usually be decomposed into multiple data blocks for separate computation; that is, for a given data sequence to be processed, attention mechanism computation is performed using a block-based approach.
[0006] In the process of performing attention mechanism computation using a block-based computing approach, the generated intermediate results need to be exchanged multiple times between different computational units, resulting in high data transmission costs and low computational efficiency.
Summary of the Invention
[0007] This application provides a data processing method that reduces the number of data transfers between matrix multiplication units and vector operation units, resulting in a denser pipeline layout for attention mechanism computations and more continuous computation between matrix multiplication units and vector operation units. This application also provides a data processing apparatus, chip system, computing device, computing device cluster, computer-readable storage medium, and computer program product corresponding to this method.
[0008] In a first aspect, this application provides a data processing method. Specifically, it acquires a first intermediate result generated by a matrix multiplication unit in the attention mechanism calculation, and / or acquires a second intermediate result generated by a vector operation unit in the attention mechanism calculation. A first intermediate result is generated by performing a matrix multiplication operation on two block matrices, where the block matrices are obtained by dividing the data sequence to be processed into blocks. A second intermediate result is generated by performing a vector operation. In response to the number of first intermediate results reaching the batch processing quantity, the batch processing quantity of first intermediate results is sent to the vector operation unit, so that the vector operation unit performs vector operations in the attention mechanism calculation based on the batch processing quantity of first intermediate results. In response to the number of second intermediate results reaching the batch processing quantity, the batch processing quantity of second intermediate results is sent to the matrix multiplication unit, so that the matrix multiplication unit performs matrix multiplication operations in the attention mechanism calculation based on the batch processing quantity of second intermediate results.
[0009] In this method, during the attention mechanism computation using a block-based computation approach, after the matrix multiplication unit and / or vector operation unit generate intermediate results corresponding to a batch of data blocks, the intermediate results for the entire batch are then sent together to the vector operation unit and / or matrix multiplication unit. This reduces the number of data transfers between the matrix multiplication unit and the vector operation unit, resulting in a denser pipeline for the attention mechanism computation and more continuous computation between the matrix multiplication unit and the vector operation unit.
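The batching behaviour described above can be sketched as a small buffer that forwards results only once the batch processing quantity is reached. This is a minimal illustration, not the implementation in this application: the class name `BatchedSender`, the `send` callback, and the final `flush()` for a partial batch are all assumptions introduced for the example.

```python
from typing import Callable, List


class BatchedSender:
    """Collects intermediate results and forwards them one batch at a time."""

    def __init__(self, batch_quantity: int, send: Callable[[List[object]], None]):
        if batch_quantity < 1:
            raise ValueError("batch_quantity must be at least 1")
        self.batch_quantity = batch_quantity
        self.send = send  # hypothetical transfer to the other computation unit
        self.pending: List[object] = []
        self.transfers = 0

    def add(self, result: object) -> None:
        """Buffer one intermediate result; transfer when the batch is full."""
        self.pending.append(result)
        if len(self.pending) >= self.batch_quantity:
            self.flush()

    def flush(self) -> None:
        """Send whatever is pending (assumed handling of a final partial batch)."""
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
            self.transfers += 1


# Ten first intermediate results, batch processing quantity of 4.
received: List[List[object]] = []
to_vector_unit = BatchedSender(batch_quantity=4, send=received.append)
for block_index in range(10):
    to_vector_unit.add(f"S_{block_index}")
to_vector_unit.flush()
```

With these numbers the ten results cross between the units in three transfers instead of ten.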
[0010] In some possible implementations, the attention mechanism computation includes a first matrix multiplication task and a second matrix multiplication task, and / or, the attention mechanism computation includes a first vector operation task and a second vector operation task. The matrix multiplication unit generates the first intermediate results by: continuously executing the first matrix multiplication task to generate a batch processing quantity of first intermediate results; and continuously executing the second matrix multiplication task to generate a batch processing quantity of first intermediate results.
[0011] The vector operation unit generates the second intermediate results through the following steps: continuously executing the first vector operation task to generate a batch processing quantity of second intermediate results; and continuously executing the second vector operation task to generate a batch processing quantity of second intermediate results.
[0012] In this method, without changing the original data block partitioning logic, the matrix multiplication unit and the vector operation unit do not need to frequently retrieve different matrices from external storage units (such as external memory or external storage devices). Each operation task is executed N consecutive times, where N is the batch processing quantity, before the next operation task begins, making the calculations of the matrix multiplication unit and the vector operation unit more continuous.
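The effect of running one task several times in a row can be illustrated by comparing two execution orders. The sketch below is an assumption-laden toy: the task names, the function names, and the use of "task switches" as a proxy for continuity are all introduced here for illustration only.

```python
from typing import List, Sequence, Tuple

Step = Tuple[str, int]  # (task name, data block index)


def per_block_schedule(num_blocks: int, tasks: Sequence[str]) -> List[Step]:
    """Run every task once for a block before moving to the next block."""
    return [(task, block) for block in range(num_blocks) for task in tasks]


def batched_schedule(num_blocks: int, tasks: Sequence[str], n: int) -> List[Step]:
    """Run each task on n consecutive blocks before switching tasks."""
    schedule: List[Step] = []
    for start in range(0, num_blocks, n):
        for task in tasks:
            for block in range(start, min(start + n, num_blocks)):
                schedule.append((task, block))
    return schedule


def task_switches(schedule: Sequence[Step]) -> int:
    """Count how often consecutive steps belong to different tasks."""
    return sum(1 for prev, cur in zip(schedule, schedule[1:]) if prev[0] != cur[0])


tasks = ["matmul_task_1", "matmul_task_2"]  # hypothetical task names
baseline = per_block_schedule(8, tasks)
batched = batched_schedule(8, tasks, n=4)
```

Both schedules perform the same work; with eight blocks and n = 4 the batched order switches tasks 3 times, against 15 for the per-block order.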
[0013] In some possible implementations, the method includes a first attention mechanism calculation process and a second attention mechanism calculation process, with the second attention mechanism calculation process following the first attention mechanism calculation process. After the matrix multiplication unit completes part of the matrix multiplication operation in the first attention mechanism calculation process, at least part of the data to be processed in the second attention mechanism calculation process is sent to the matrix multiplication unit, so that the matrix multiplication unit begins to execute the matrix multiplication operation in the second attention mechanism calculation process before the first attention mechanism calculation process is completed.
[0014] In this method, the matrix multiplication unit preprocesses the matrix multiplication operations in the later-order second attention mechanism calculation process, so that the different attention mechanism calculation processes are arranged in an interleaved manner in the entire attention mechanism calculation, which further improves the parallelism of attention mechanism calculation, makes full use of computing resources, and improves computing efficiency.
[0015] In some possible implementations, the attention mechanism computation is an attention mechanism backward computation. In response to the number of first intermediate results reaching the inter-core synchronization quantity, the inter-core synchronization quantity of first intermediate results is sent to the vector operation unit; and / or, in response to the number of second intermediate results reaching the inter-core synchronization quantity, the inter-core synchronization quantity of second intermediate results is sent to the matrix multiplication unit. The inter-core synchronization quantity is less than the batch processing quantity.
[0016] In this method, by setting the inter-core synchronization quantity to be less than the batch processing quantity, the pipeline layout in the backward computation of the attention mechanism is more compact, shortening the idle time of the matrix multiplication unit or vector operation unit and reducing latency. Furthermore, compared to performing inter-core synchronization once for each data block, flexibly setting the inter-core synchronization quantity and actively delaying inter-core synchronization reduces the frequency of inter-core synchronization, lowers inter-core synchronization overhead, and improves computational efficiency.
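The trade-off described above can be made concrete with a toy count. The model below is an assumption for illustration only: it treats every data block as one unit of work and ignores real synchronization cost; the function names are hypothetical.

```python
import math


def sync_count(num_blocks: int, blocks_per_sync: int) -> int:
    """Number of inter-core synchronizations needed to hand over all blocks."""
    return math.ceil(num_blocks / blocks_per_sync)


def blocks_before_first_handoff(num_blocks: int, blocks_per_sync: int) -> int:
    """Blocks the producer finishes before the consumer receives anything."""
    return min(num_blocks, blocks_per_sync)


num_blocks = 32
batch_quantity = 8   # blocks processed per batch
sync_quantity = 2    # blocks per inter-core synchronization (less than the batch)

per_block = (sync_count(num_blocks, 1), blocks_before_first_handoff(num_blocks, 1))
delayed = (sync_count(num_blocks, sync_quantity),
           blocks_before_first_handoff(num_blocks, sync_quantity))
per_batch = (sync_count(num_blocks, batch_quantity),
             blocks_before_first_handoff(num_blocks, batch_quantity))
```

In this toy model, synchronizing every 2 blocks halves the synchronization count relative to per-block synchronization while letting the consumer start after 2 blocks rather than 8.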
[0017] In some possible implementations, the attention mechanism computation is an attention mechanism forward computation. The matrix multiplication unit generates a batch processing quantity of forward computation sub-results based on the batch processing quantity of second intermediate results and generates a forward computation result based on those forward computation sub-results, and the forward computation result is stored in the global storage unit.
[0018] In this method, during the forward computation of the attention mechanism, the data accumulation capability of the matrix multiplication unit is used to accumulate the forward computation results. After accumulating the forward computation results of the batch, they are written to the global storage unit. In this way, by leveraging the data accumulation capability of the matrix multiplication unit, the number of times data is written in the attention mechanism computation is reduced.
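The saving in write operations can be sketched with NumPy. This is a simplified illustration under stated assumptions: the `GlobalStorage` class and both functions are hypothetical, the blocks are random stand-ins for the second intermediate results, and the softmax rescaling needed in a real blocked forward pass is deliberately left out.

```python
import numpy as np


class GlobalStorage:
    """Stand-in for the global storage unit that counts write operations."""

    def __init__(self, shape):
        self.data = np.zeros(shape)
        self.writes = 0

    def accumulate(self, value):
        self.data += value
        self.writes += 1


def write_every_block(p_blocks, v_blocks, storage):
    """Write each forward computation sub-result to global storage."""
    for p, v in zip(p_blocks, v_blocks):
        storage.accumulate(p @ v)


def write_once_per_batch(p_blocks, v_blocks, storage):
    """Accumulate sub-results locally, then write the batch result once."""
    local_acc = np.zeros(storage.data.shape)
    for p, v in zip(p_blocks, v_blocks):
        local_acc += p @ v
    storage.accumulate(local_acc)


rng = np.random.default_rng(0)
batch_quantity, rows, cols, head_dim = 4, 8, 16, 32
p_blocks = [rng.random((rows, cols)) for _ in range(batch_quantity)]
v_blocks = [rng.random((cols, head_dim)) for _ in range(batch_quantity)]

per_block = GlobalStorage((rows, head_dim))
per_batch = GlobalStorage((rows, head_dim))
write_every_block(p_blocks, v_blocks, per_block)
write_once_per_batch(p_blocks, v_blocks, per_batch)
```

Both variants end with the same stored values, but the batched variant issues one write instead of four.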
[0019] In some possible implementations, the attention mechanism computation is an attention mechanism backward computation. The matrix multiplication unit generates a batch processing quantity of value vector gradient submatrices and a batch processing quantity of key vector gradient submatrices based on the batch processing quantity of second intermediate results. Based on these value vector gradient submatrices and key vector gradient submatrices, it generates the value vector gradient matrix and the key vector gradient matrix, and the value vector gradient matrix and the key vector gradient matrix are stored in the global storage unit.
[0020] In this method, during the backward computation of the attention mechanism, the data accumulation capability of the matrix multiplication unit is used to accumulate the gradient matrices of the value vector and the gradient matrices of the key vector. After accumulating the gradient matrices of the value vector or the key vector in batches, they are written to the global storage unit together. In this way, by leveraging the data accumulation capability of the matrix multiplication unit, the overhead of writing data is reduced, and the trailing problem caused by the large overhead of writing data is reduced.
[0021] In a second aspect, this application provides a data processing apparatus, the apparatus comprising:
[0022] The acquisition module is used to acquire a first intermediate result generated by the matrix multiplication unit in the attention mechanism calculation, and / or acquire a second intermediate result generated by the vector operation unit in the attention mechanism calculation; wherein, a first intermediate result is generated by performing a matrix multiplication operation on two block matrices, the block matrices being obtained by dividing the matrix corresponding to the data sequence to be processed into blocks, and a second intermediate result is generated by performing a vector operation;
[0023] A communication module is configured to, in response to the number of the first intermediate results reaching the batch processing quantity, send the batch processing quantity of the first intermediate results to the vector operation unit, so that the vector operation unit performs vector operations in the attention mechanism calculation based on the batch processing quantity of the first intermediate results; and / or, in response to the number of the second intermediate results reaching the batch processing quantity, send the batch processing quantity of the second intermediate results to the matrix multiplication unit, so that the matrix multiplication unit performs matrix multiplication operations in the attention mechanism calculation based on the batch processing quantity of the second intermediate results.
[0024] In some possible implementations, the attention mechanism computation includes a first matrix multiplication task and a second matrix multiplication task, and / or, the attention mechanism computation includes a first vector operation task and a second vector operation task;
[0025] The matrix multiplication unit generates the first intermediate result through the following steps:
[0026] continuously executing the first matrix multiplication task to generate a batch processing quantity of first intermediate results;
[0027] continuously executing the second matrix multiplication task to generate a batch processing quantity of first intermediate results;
[0028] The vector operation unit generates the second intermediate result through the following steps:
[0029] continuously executing the first vector operation task to generate a batch processing quantity of second intermediate results;
[0030] continuously executing the second vector operation task to generate a batch processing quantity of second intermediate results.
[0031] In some possible implementations, the method includes a first attention mechanism calculation process and a second attention mechanism calculation process, wherein the second attention mechanism calculation process is performed after the first attention mechanism calculation process; the communication module is further configured to:
[0032] After the matrix multiplication unit completes part of the matrix multiplication operation in the first attention mechanism calculation process, at least part of the data to be processed in the second attention mechanism calculation process is sent to the matrix multiplication unit so that the matrix multiplication unit can start executing the matrix multiplication operation in the second attention mechanism calculation process before the first attention mechanism calculation process is completed.
[0033] In some possible implementations, the attention mechanism computation is an attention mechanism backward computation, and the communication module is further used for:
[0034] In response to the number of the first intermediate results reaching the inter-core synchronization quantity, sending the inter-core synchronization quantity of first intermediate results to the vector operation unit; and / or,
[0035] In response to the number of the second intermediate results reaching the inter-core synchronization quantity, sending the inter-core synchronization quantity of second intermediate results to the matrix multiplication unit; wherein the inter-core synchronization quantity is less than the batch processing quantity.
[0036] In some possible implementations, the attention mechanism computation is an attention mechanism forward computation, and the matrix multiplication unit performing matrix multiplication operations in the attention mechanism computation based on the batch processing quantity of second intermediate results includes:
[0037] The matrix multiplication unit generates a batch processing quantity of forward computation sub-results based on the batch processing quantity of second intermediate results, and generates the forward computation result based on those forward computation sub-results.
[0038] The communication module is also used for:
[0039] The forward calculation results are stored in a global storage unit.
[0040] In some possible implementations, the attention mechanism computation is an attention mechanism backward computation, and the matrix multiplication unit performing matrix multiplication operations in the attention mechanism computation based on the batch processing quantity of second intermediate results includes:
[0041] The matrix multiplication unit generates a batch processing quantity of value vector gradient submatrices and a batch processing quantity of key vector gradient submatrices based on the batch processing quantity of second intermediate results, and generates the value vector gradient matrix and the key vector gradient matrix based on those value vector gradient submatrices and key vector gradient submatrices.
[0042] The communication module is also used for:
[0043] The value vector gradient matrix and the key vector gradient matrix are stored in a global storage unit.
[0044] In a third aspect, this application provides a chip system including a processor and a power supply circuit, the power supply circuit being used to supply power to the processor, the processor being used to execute the data processing method as described in the first aspect or any implementation thereof.
[0045] In a fourth aspect, this application provides a computing device, the computing device including a chip system, the chip system including a processor and a power supply circuit, the power supply circuit being used to supply power to the processor, the processor being used to execute the data processing method as described in the first aspect or any implementation thereof.
[0046] In a fifth aspect, this application provides a computing device cluster, the computing device cluster including at least one computing device, the at least one computing device including a chip system, the chip system including a processor and a power supply circuit, the power supply circuit being used to supply power to the processor, the processor being used to execute the data processing method as described in the first aspect or any implementation thereof.
[0047] In a sixth aspect, this application provides a computer-readable storage medium including computer-readable instructions for implementing the data processing method as described in the first aspect or any implementation thereof.
[0048] In a seventh aspect, this application provides a computer program product comprising computer-readable instructions for implementing the data processing method as described in the first aspect or any implementation thereof.
[0049] The implementations provided in the above aspects may be further combined to provide additional implementations.
Brief Description of the Drawings
[0050] To more clearly illustrate the technical method of this application, the accompanying drawings used will be briefly described below.
[0051] Figure 1 is a schematic diagram of the architecture of a shared memory AI chip provided in this application;
[0052] Figure 2 is a schematic diagram of forward computation of an attention mechanism provided in this application;
[0053] Figure 3 is a schematic diagram of backward computation of an attention mechanism provided in this application;
[0054] Figure 4 is a schematic diagram of the architecture of a data processing device provided in this application;
[0055] Figure 5 is a schematic diagram of the architecture of a non-shared memory AI chip provided in this application;
[0056] Figure 6 is a schematic diagram of a software stack architecture provided in this application;
[0057] Figure 7 is a flowchart illustrating a data processing method provided in this application;
[0058] Figures 8A and 8B are schematic diagrams of a forward computation of an attention mechanism provided in this application;
[0059] Figures 9A and 9B are schematic diagrams of backward computation of an attention mechanism provided in this application;
[0060] Figures 10A and 10B are schematic diagrams of forward computation of an attention mechanism provided in this application;
[0061] Figure 11 is a schematic diagram of backward computation of an attention mechanism provided in this application;
[0062] Figure 12 is a schematic diagram of backward computation of an attention mechanism provided in this application;
[0063] Figure 13 is a schematic diagram of the structure of a computing device provided in this application;
[0064] Figure 14 is a schematic diagram of the structure of a computing device cluster provided in this application;
[0065] Figure 15 is a schematic diagram of another computing device cluster provided in this application.
Detailed Implementation
[0066] The terms "first" and "second" in this application are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, the features defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
[0067] First, let's introduce some of the technical terms used in this application.
[0068] Machine learning (ML) is a core field of artificial intelligence (AI). It enables computing devices to learn from data and make intelligent decisions. The transformer model is one such machine learning model. It can process data sequences; for example, when the data sequence is text, it can process it to perform natural language processing tasks. Similarly, when the data sequence is an image sequence, it can process it to perform tasks such as video classification and action recognition.
[0069] The transformer model is an attention-based model. It consists of several linear layers with identical structures, each containing an attention mechanism computation unit.
[0070] The following describes the process of attention mechanism computation in the attention mechanism computation unit. Attention mechanism computation is divided into forward computation (also known as forward propagation) and backward computation (also known as backward propagation). Forward computation refers to the process of processing the data sequence to generate the forward computation result, while backward computation refers to the process of processing the forward computation result to generate the parameter gradient.
[0071] In the forward computation of the attention mechanism, the data sequence is first embedded to obtain a vector matrix, and the vector matrix is then multiplied with three weight matrices (the W_Q matrix, the W_K matrix, and the W_V matrix) to obtain the query vector matrix (Q matrix), the key vector matrix (K matrix), and the value vector matrix (V matrix). The Q, K, and V matrices are all high-dimensional tensor matrices, with dimensions related to the batch size (b), the number of attention heads (n), the data sequence length (s), and the attention head dimension (d).
[0072] Matrix multiplication of the Q matrix and the K matrix yields the attention score matrix (S matrix), where each element represents the correlation between elements in the Q matrix and elements in the K matrix. Next, the S matrix is normalized (softmax) to obtain the probability matrix (P matrix). Finally, matrix multiplication of the P matrix and the V matrix yields the forward computation result matrix (O matrix).
[0073] In the forward computation of the attention mechanism, the matrix dimension changes as follows: After multiplying Q(b,n,s,d) and K(b,n,d,s), we get S(b,n,s,s). After performing softmax processing on matrix S, we get P(b,n,s,s). After multiplying P(b,n,s,s) and V(b,n,s,d), we get O(b,n,s,d).
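The dimension changes listed above can be checked with a short NumPy sketch. It follows the steps as written in this description, so no scaling factor is applied to S before softmax; the function names and the small example sizes are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention_forward(q, k, v):
    """Return S, P and O for Q, K, V of shape (b, n, s, d)."""
    s = q @ np.swapaxes(k, -1, -2)  # (b,n,s,d) x (b,n,d,s) -> (b,n,s,s)
    p = softmax(s, axis=-1)         # (b,n,s,s)
    o = p @ v                       # (b,n,s,s) x (b,n,s,d) -> (b,n,s,d)
    return s, p, o


b, n, seq_len, d = 2, 3, 5, 4
rng = np.random.default_rng(0)
q = rng.standard_normal((b, n, seq_len, d))
k = rng.standard_normal((b, n, seq_len, d))
v = rng.standard_normal((b, n, seq_len, d))
s_mat, p_mat, o_mat = attention_forward(q, k, v)
```

The S and P matrices grow with the square of the sequence length, which is where the quadratic cost of the attention computation comes from.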
[0074] In the backward computation of the attention mechanism, for the value vector gradient matrix (dV), the Q matrix and the K matrix are first multiplied to obtain the S matrix, and the S matrix is then subjected to softmax processing to obtain the P matrix. Finally, the P matrix is multiplied with the gradient matrix of the forward computation result (dO) to obtain the dV matrix. For the query vector gradient matrix (dQ) and the key vector gradient matrix (dK), the dO matrix and the V matrix are first multiplied to obtain the attention weight gradient matrix (dP). Then, the P matrix is multiplied element-wise with (dP - D) to obtain the attention score gradient matrix (dS), that is, dS = P ∘ (dP - D), where ∘ denotes element-wise matrix multiplication: the elements at the same position in two matrices are multiplied to obtain the element at that position in the product matrix. D is computed with the rowsum() function, which calculates the sum of the elements in each row of a matrix; in the standard softmax gradient, D = rowsum(dO ∘ O). Then, the dS matrix and the K matrix are multiplied to obtain the dQ matrix, and the dS matrix and the Q matrix are multiplied to obtain the dK matrix.
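The gradient steps above can be written out for a single attention head and checked against numerical differentiation. The sketch assumes D = rowsum(dO ∘ O), which is the standard softmax gradient term; the transposes (P^T dO, dO V^T, dS^T Q) are the ones needed for the shapes to agree, and all function names are hypothetical.

```python
import numpy as np


def softmax_rows(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def attention_backward(q, k, v, d_o):
    """Return dQ, dK, dV for one head with Q, K, V, dO of shape (s, d)."""
    p = softmax_rows(q @ k.T)                        # S = Q K^T, P = softmax(S)
    o = p @ v
    d_v = p.T @ d_o                                  # dV = P^T dO
    d_p = d_o @ v.T                                  # dP = dO V^T
    d_row = np.sum(d_o * o, axis=1, keepdims=True)   # assumed D = rowsum(dO ∘ O)
    d_s = p * (d_p - d_row)                          # dS = P ∘ (dP - D)
    return d_s @ k, d_s.T @ q, d_v                   # dQ = dS K, dK = dS^T Q


def loss(q, k, v, d_o):
    """Scalar whose gradient with respect to O equals dO."""
    return float(np.sum((softmax_rows(q @ k.T) @ v) * d_o))


def numerical_grad(loss_fn, x, eps=1e-6):
    """Central-difference gradient of loss_fn with respect to x (in place)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        original = x[idx]
        x[idx] = original + eps
        upper = loss_fn()
        x[idx] = original - eps
        lower = loss_fn()
        x[idx] = original
        grad[idx] = (upper - lower) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
q, k, v, d_o = (rng.standard_normal((6, 4)) for _ in range(4))
d_q, d_k, d_v = attention_backward(q, k, v, d_o)
check = lambda: loss(q, k, v, d_o)
num_dq = numerical_grad(check, q)
num_dk = numerical_grad(check, k)
num_dv = numerical_grad(check, v)
```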
[0075] Whether it's forward or backward computation, the computational complexity and memory access complexity are quadratic with the length of the data sequence. Therefore, the attention mechanism computation has become a performance bottleneck in the training and inference scenarios of the transformer model, occupying a significant amount of computation time.
[0076] In attention mechanism computation, matrix multiplication and vector operations are involved, and these operations are typically performed by different computational units. That is, in attention mechanism computation, the matrix multiplication unit performs matrix multiplication operations, and the vector operation unit performs vector operations.
[0077] In practical applications, attention mechanism computations can be performed by AI chips. AI chips, also known as AI accelerators or AI computing cards, can be understood as modules used to process AI-related computations. AI chips can perform AI-related computations based on different architectures. For example, AI chips can perform AI-related computations based on general-purpose chips such as graphics processing units (GPUs), data processing units (DPUs), and neural processing units (NPUs). Alternatively, AI chips can perform AI-related computations based on application-specific integrated circuits (ASICs). Still another example is the use of field-programmable gate arrays (FPGAs).
[0078] The following explanation uses a specific AI chip as an example. Referring to Figure 1, which illustrates the architecture of a shared-memory AI chip, the AI chip 10 has independent computing units and multi-level storage units. The matrix multiplication unit and vector operation unit are directly connected through the first-level cache unit; that is, the matrix multiplication unit and vector operation unit share the storage space of the first-level cache unit. Furthermore, the storage space of the global storage unit is larger than that of the second-level cache unit, and the storage space of the second-level cache unit is larger than that of the first-level cache unit. The bandwidth of the global storage unit is smaller than that of the second-level cache unit, and the bandwidth of the second-level cache unit is smaller than that of the first-level cache unit. In other words, the closer to the matrix multiplication unit and vector operation unit, the smaller the storage space, but the larger the bandwidth.
[0079] In related technologies, to improve computational speed, attention mechanism computation can be decomposed into multiple data blocks for separate computation. That is, for a data sequence to be processed, attention mechanism computation is performed using a block-based computation approach. The forward computation process and the backward computation process of the attention mechanism are described below.
[0080] Referring to Figure 2, which illustrates a forward computation of an attention mechanism, the Q matrix and K matrix are partitioned to obtain Q_i and K_j. Q_i and K_j undergo matrix multiplication to obtain S_ij, and softmax processing is then performed on S_ij to obtain P_ij. P_ij and V_j then undergo matrix multiplication to obtain O_i. Next, using an iterative update algorithm, the O_i generated in this block calculation is iteratively updated with the previous O_i. The process is repeated iteratively to obtain the O matrix.
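A blocked forward pass of this kind can be sketched for one attention head as follows. The description does not spell out the iterative update algorithm, so the sketch assumes the running-maximum and running-sum rescaling commonly used for blocked softmax; the function names and block sizes are illustrative only.

```python
import numpy as np


def reference_attention(q, k, v):
    """Unblocked attention for comparison."""
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def blocked_attention(q, k, v, block_q, block_k):
    """Compute O block by block, updating each O_i iteratively over j."""
    seq_q, head_dim = q.shape
    out = np.zeros((seq_q, head_dim))
    for i in range(0, seq_q, block_q):
        q_i = q[i:i + block_q]
        o_i = np.zeros((q_i.shape[0], head_dim))
        row_max = np.full((q_i.shape[0], 1), -np.inf)
        row_sum = np.zeros((q_i.shape[0], 1))
        for j in range(0, k.shape[0], block_k):
            k_j, v_j = k[j:j + block_k], v[j:j + block_k]
            s_ij = q_i @ k_j.T                       # matrix multiplication
            new_max = np.maximum(row_max, s_ij.max(axis=1, keepdims=True))
            p_ij = np.exp(s_ij - new_max)            # vector operation (unnormalized)
            scale = np.exp(row_max - new_max)        # rescales earlier partial results
            row_sum = scale * row_sum + p_ij.sum(axis=1, keepdims=True)
            o_i = scale * o_i + p_ij @ v_j           # matrix multiplication + update
            row_max = new_max
        out[i:i + block_q] = o_i / row_sum           # final normalization
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 4)) for _ in range(3))
blocked = blocked_attention(q, k, v, block_q=4, block_k=3)
expected = reference_attention(q, k, v)
```

Each pass through the inner loop alternates between a matrix multiplication and a vector operation, which is why S_ij and P_ij move between the two units once per data block when no batching is applied.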
[0081] During the forward computation of the attention mechanism performed using the AI chip 10 shown in Figure 1, the matrix multiplication unit and the vector operation unit transmit intermediate results (i.e., S_ij and P_ij) through the first-level cache unit. In other words, after S_ij is generated in the matrix multiplication unit, S_ij is transmitted to the first-level cache unit and then to the vector operation unit, which performs vector operations to generate P_ij. After P_ij is generated in the vector operation unit, P_ij is transmitted to the first-level cache unit and then to the matrix multiplication unit so that the matrix multiplication unit can perform matrix multiplication operations.
[0082] Referring to Figure 3, a schematic diagram of a backward computation of an attention mechanism is shown. The Q matrix, K matrix, V matrix, dO matrix, and O matrix are partitioned to obtain Q_i, K_j, V_j, dO_i, and O_i. For the dV matrix, Q_i and K_j first undergo matrix multiplication to obtain S_ij, and softmax processing is then performed on S_ij to obtain P_ij. dO_i and P_ij undergo matrix multiplication, and the product is accumulated with the previous dV_j to obtain the dV matrix. For the dQ and dK matrices, dO_i and V_j first undergo matrix multiplication to generate dP_ij. Then P_ij is multiplied element-wise with (dP_ij - D_i) to generate dS_ij. dS_ij and K_j undergo matrix multiplication, and the product is accumulated with the previous dQ_i to finally obtain the dQ matrix; dS_ij and Q_i undergo matrix multiplication, and the product is accumulated with the previous dK_j to obtain the dK matrix.
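The blocked backward computation can be sketched for one attention head in the same style. Several points are assumptions rather than details from this description: D_i is taken as rowsum(dO_i ∘ O_i), P_ij is rebuilt from per-row log-sum-exp statistics, and for brevity those statistics and O are recomputed here instead of being carried over from a forward pass.

```python
import numpy as np


def softmax_rows(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def reference_backward(q, k, v, d_o):
    """Unblocked gradients for comparison."""
    p = softmax_rows(q @ k.T)
    d_s = p * (d_o @ v.T - np.sum(d_o * (p @ v), axis=1, keepdims=True))
    return d_s @ k, d_s.T @ q, p.T @ d_o


def blocked_backward(q, k, v, d_o, block_q, block_k):
    """Accumulate dQ_i, dK_j and dV_j block by block."""
    s = q @ k.T
    row_max = s.max(axis=1, keepdims=True)
    # Assumed to be saved by the forward pass in practice; recomputed for brevity.
    lse = row_max + np.log(np.exp(s - row_max).sum(axis=1, keepdims=True))
    o = softmax_rows(s) @ v
    d_row = np.sum(d_o * o, axis=1, keepdims=True)   # assumed D = rowsum(dO ∘ O)

    d_q, d_k, d_v = np.zeros_like(q), np.zeros_like(k), np.zeros_like(v)
    for j in range(0, k.shape[0], block_k):
        k_j, v_j = k[j:j + block_k], v[j:j + block_k]
        for i in range(0, q.shape[0], block_q):
            q_i, do_i = q[i:i + block_q], d_o[i:i + block_q]
            s_ij = q_i @ k_j.T                           # matrix multiplication
            p_ij = np.exp(s_ij - lse[i:i + block_q])     # vector operation
            d_v[j:j + block_k] += p_ij.T @ do_i          # accumulate dV_j
            dp_ij = do_i @ v_j.T                         # matrix multiplication
            ds_ij = p_ij * (dp_ij - d_row[i:i + block_q])  # vector operation
            d_q[i:i + block_q] += ds_ij @ k_j            # accumulate dQ_i
            d_k[j:j + block_k] += ds_ij.T @ q_i          # accumulate dK_j
    return d_q, d_k, d_v


rng = np.random.default_rng(0)
q, k, v, d_o = (rng.standard_normal((10, 4)) for _ in range(4))
blocked = blocked_backward(q, k, v, d_o, block_q=4, block_k=3)
expected = reference_backward(q, k, v, d_o)
```

Within one (i, j) block the work alternates between matrix multiplications and vector operations several times, so S_ij, P_ij, dP_ij, and dS_ij all cross between the two units.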
[0083] Similar to the forward computation of the attention mechanism, during the backward computation of the attention mechanism using the AI chip 10 shown in Figure 1, the matrix multiplication unit and the vector operation unit transmit intermediate results (i.e., S_ij, P_ij, dS_ij, and dP_ij) through the first-level cache unit. Furthermore, compared to the attention mechanism's forward computation, which uses s as global information, the attention mechanism's backward computation uses d as global information. The computation between each data block is relatively independent, and there is no need to use an iterative update algorithm for iterative updates.
[0084] As described above, when using a block-based computation approach to compute the attention mechanism, the intermediate results generated during the forward or backward computation of the attention mechanism for each data block need to be exchanged multiple times between the matrix multiplication unit and the vector operation unit. Therefore, completing the attention mechanism computation for each data block requires significant data transmission costs, resulting in low computational efficiency.
[0085] In view of this, this application provides a data processing method. Specifically, it obtains a first intermediate result generated by a matrix multiplication unit in the attention mechanism calculation, and / or obtains a second intermediate result generated by a vector operation unit in the attention mechanism calculation. A first intermediate result is generated by performing a matrix multiplication operation on two block matrices, where the block matrices are obtained by dividing the data sequence to be processed into blocks. A second intermediate result is generated by performing a vector operation. In response to the number of first intermediate results reaching the batch processing quantity, the batch processing quantity of first intermediate results is sent to the vector operation unit, so that the vector operation unit performs vector operations in the attention mechanism calculation based on the batch processing quantity of first intermediate results. In response to the number of second intermediate results reaching the batch processing quantity, the batch processing quantity of second intermediate results is sent to the matrix multiplication unit, so that the matrix multiplication unit performs matrix multiplication operations in the attention mechanism calculation based on the batch processing quantity of second intermediate results.
[0086] In this method, during the attention mechanism computation using a block-based computation approach, after the matrix multiplication unit and / or vector operation unit generate intermediate results corresponding to a batch of data blocks, the intermediate results for the entire batch are then sent together to the vector operation unit and / or matrix multiplication unit. This reduces the number of data transfers between the matrix multiplication unit and the vector operation unit, resulting in a denser pipeline for the attention mechanism computation and more continuous computation between the matrix multiplication unit and the vector operation unit.
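The effect of forwarding intermediate results in batches can be sketched as a small buffer that hands results to the receiving unit only once N of them have accumulated. This is a toy model under stated assumptions: the class `BatchedForwarder`, the helper `count_transfers`, and the final `flush()` for a remainder smaller than one batch are hypothetical and do not appear in this application.

```python
class BatchedForwarder:
    """Collects intermediate results and forwards them batch_size at a time."""

    def __init__(self, batch_size, consumer):
        self.batch_size = batch_size
        self.consumer = consumer   # callable standing in for the receiving unit
        self.pending = []
        self.transfers = 0

    def push(self, result):
        """Store one intermediate result; forward once a full batch is present."""
        self.pending.append(result)
        if len(self.pending) == self.batch_size:
            self.flush()

    def flush(self):
        """Forward everything currently pending as a single transfer."""
        if self.pending:
            self.consumer(list(self.pending))
            self.transfers += 1
            self.pending.clear()


def count_transfers(num_blocks, batch_size):
    """Return (number of transfers, list of forwarded batches)."""
    received = []
    forwarder = BatchedForwarder(batch_size, received.append)
    for block_id in range(num_blocks):
        forwarder.push(f"S_{block_id}")
    forwarder.flush()  # assumption: a trailing partial batch is still forwarded
    return forwarder.transfers, received
```

With 8 data blocks, `count_transfers(8, 1)` models per-block forwarding and performs 8 transfers, while `count_transfers(8, 4)` performs 2.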
[0087] To make the technical solution of this application clearer and easier to understand, the system architecture of this application will be described below with reference to the accompanying drawings.
[0088] Referring to the architecture diagram of the data processing device shown in Figure 4, the data processing device 40 includes an acquisition module 401 and a communication module 402. Specifically, the acquisition module 401 is used to acquire the first intermediate result generated by the matrix multiplication unit 41 in the attention mechanism calculation, and / or to acquire the second intermediate result generated by the vector operation unit 42 in the attention mechanism calculation.
[0089] In this process, a first intermediate result is generated by performing a matrix multiplication operation on two block matrices, where the block matrices are obtained by dividing the data sequence to be processed into blocks. A second intermediate result is generated by performing a vector operation. In other words, the first and second intermediate results can be intermediate results generated during the attention mechanism computation using a block-based computation approach. For example, the first intermediate result may include S_ij generated in the forward computation of the attention mechanism, and S_ij and dP_ij generated in the backward computation of the attention mechanism. The second intermediate result may include P_ij generated in the forward computation of the attention mechanism, and P_ij and dS_ij generated in the backward computation of the attention mechanism.
[0090] In response to the number of first intermediate results reaching the batch size, the communication module 402 sends the batch size of the first intermediate results to the vector operation unit 42, so that the vector operation unit 42 performs vector operations in the attention mechanism calculation based on the batch size of the first intermediate results, and / or, in response to the number of second intermediate results reaching the batch size, sends the batch size of the second intermediate results to the matrix multiplication unit 41, so that the matrix multiplication unit 41 performs matrix multiplication operations in the attention mechanism calculation based on the batch size of the second intermediate results.
[0091] In other words, in this embodiment, the first intermediate results are sent to the vector operation unit 42 as one batch only after the matrix multiplication unit 41 has performed the attention mechanism calculation on a batch of data blocks and generated the corresponding batch of first intermediate results. Similarly, the second intermediate results are sent to the matrix multiplication unit 41 as one batch only after the vector operation unit 42 has performed the attention mechanism calculation on a batch of data blocks and generated the corresponding batch of second intermediate results.
[0092] Thus, the vector operation unit 42 can perform subsequent vector operations on the first intermediate result of the batch, and the matrix multiplication unit 41 can perform subsequent matrix multiplication operations on the second intermediate result of the batch, effectively reducing the number of interactions between the matrix multiplication unit 41 and the vector operation unit 42. In the process of using block computing to perform attention mechanism calculations, the data transmission frequency between the matrix multiplication unit 41 and the vector operation unit 42 is reduced, and the calculations of the matrix multiplication unit 41 and the vector operation unit 42 are more continuous.
[0093] In some possible implementations, the matrix multiplication unit 41 and the vector operation unit 42 can be deployed in a non-shared memory AI chip. Referring to Figure 5, which shows a schematic diagram of a non-shared memory AI chip architecture, AI chip 50 is a non-shared memory AI chip, also known as a discrete architecture AI chip. The matrix multiplication unit 41 is connected to the second-level cache unit via the 0th-level cache unit and the 1st-level cache unit, while the vector operation unit 42 is connected to the second-level cache unit via the unified buffer (UB) unit. That is, the matrix multiplication unit 41 and the vector operation unit 42 do not directly share the second-level cache unit, but indirectly share it through other cache units.
[0094] When the AI chip 50 executes the data processing method provided in this application embodiment, the first intermediate result and the second intermediate result can be stored in the second-level cache unit. That is, the second-level cache unit allocates storage space corresponding to the batch number of first intermediate results for the matrix multiplication unit 41. After the matrix multiplication unit 41 generates the first intermediate result, the first intermediate result is transmitted to the second-level cache unit for storage through the 0th-level cache unit and the 1st-level cache unit. When the number of first intermediate results stored in the second-level cache unit reaches the batch number, the data processing device 40 sends the batch number of first intermediate results to the vector operation unit 42.
[0095] Similarly, the second-level cache unit allocates storage space corresponding to the batch number of second intermediate results for the vector operation unit 42. After the vector operation unit 42 generates a second intermediate result, the second intermediate result is transmitted through the unified buffer unit to the second-level cache unit for storage. When the number of second intermediate results stored in the second-level cache unit reaches the batch number, the data processing device 40 sends the batch number of second intermediate results to the matrix multiplication unit 41.
[0096] Compared to the AI chip 10 with shared memory shown in Figure 1, the second-level cache unit in the AI chip 50 typically has a larger storage space than the first-level cache unit in the AI chip 10. For example, the storage space of the first-level cache unit in the AI chip 10 can be 164 kilobytes (KB), while the storage space of the second-level cache unit in the AI chip 50 can be 192 megabytes (MB).
[0097] Based on the above characteristics of the AI chip 50, the data processing method provided in this application embodiment can exploit the large capacity of the second-level cache unit in the AI chip 50: the first intermediate results and / or the second intermediate results are stored in the second-level cache unit, accumulated until the batch processing quantity is reached, and then transmitted together to the vector operation unit and / or the matrix multiplication unit. This greatly reduces the number of times intermediate results are transmitted between the matrix multiplication unit and the vector operation unit.
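As a rough illustration of why the larger cache matters for batching, the following arithmetic compares how many intermediate-result blocks fit in each cache. The block shape (128 x 128) and the element size (2 bytes) are hypothetical values chosen only for illustration, as is the reading of KB and MB as powers of 1024; only the 164 KB and 192 MB capacities come from the text above.

```python
KB = 1024
MB = 1024 * KB

def blocks_that_fit(cache_bytes, rows, cols, bytes_per_element):
    """Whole intermediate-result blocks that fit in a cache of the given size."""
    return cache_bytes // (rows * cols * bytes_per_element)

# Hypothetical 128 x 128 block of 2-byte elements (32 KB per block).
first_level = blocks_that_fit(164 * KB, 128, 128, 2)    # first-level cache, AI chip 10
second_level = blocks_that_fit(192 * MB, 128, 128, 2)   # second-level cache, AI chip 50
```

Under these assumptions, `first_level` is 5 and `second_level` is 6144, which suggests how much more room the second-level cache leaves for accumulating a batch before forwarding it.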
[0098] The data processing device 40 provided in this embodiment can be provided to the user in the form of an operator. Referring to Figure 6, which shows a schematic diagram of a software stack architecture, the software stack includes an operator platform. Specifically, the operator platform includes a basic operator library and a custom operator library. The data processing device 40 provided in this embodiment can be provided as a custom operator in the custom operator library to other platforms or computing devices in the software stack. In some embodiments, the data processing device 40 can be provided as a custom operator to an upper-layer model training platform or model inference acceleration library to improve the computational speed during the training or inference process of a transformer model (e.g., a large language model). In other embodiments, the data processing device 40 can be provided as a custom operator to a lower-layer AI chip, such as a discrete architecture AI chip, to accelerate high-performance computing and other tasks.
[0099] Based on the data processing device 40 shown in Figure 4, this application also provides a data processing method. The data processing method of this application will be described below with reference to embodiments.
[0100] Referring to the flowchart of the data processing method shown in Figure 7, the method includes the following steps:
[0101] S701: The data processing device 40 acquires the first intermediate result generated by the matrix multiplication unit in the attention mechanism calculation, and / or acquires the second intermediate result generated by the vector operation unit in the attention mechanism calculation.
[0102] In this embodiment, the matrix multiplication unit performs matrix multiplication operations in the attention mechanism computation, and the vector operation unit performs vector operations in the attention mechanism. The first intermediate result can be understood as an intermediate result generated by the matrix multiplication unit in the attention mechanism computation that is not the final output result, and the second intermediate result can be understood as an intermediate result generated by the vector operation unit in the attention mechanism computation that is not the final output result. That is, in the forward computation of the attention mechanism, the first or second intermediate result is not the forward computation result, and in the backward computation of the attention mechanism, the first or second intermediate result is not the parameter gradient.
[0103] In this embodiment, a first intermediate result is generated by performing a matrix multiplication operation on two block matrices. The block matrices are obtained by dividing the data sequence to be processed into blocks. A second intermediate result is generated by performing a vector operation. In other words, the attention mechanism calculation in this application uses a block-based computation method. The first intermediate result can be an intermediate result generated after performing a matrix multiplication operation in a block computation, and the second intermediate result can be an intermediate result generated after performing a vector operation in a block computation.
[0104] The following describes the forward and backward computation processes of the attention mechanism separately. In the forward computation, the block matrices can be the matrices obtained by partitioning the Q matrix and the K matrix (i.e., Q_i and K_j), the first intermediate result can be S_ij, and the second intermediate result can be P_ij, generated by applying softmax processing to S_ij.
[0105] In the backward computation of the attention mechanism, the block matrices can be the matrices obtained by partitioning the Q matrix and the K matrix (i.e., Q_i and K_j), in which case the first intermediate result can be S_ij. The block matrices can also be the matrices obtained by partitioning the dO matrix and the V matrix (i.e., dO_i and V_j), in which case the first intermediate result can be dP_ij. The second intermediate result can be P_ij, generated by applying softmax processing to S_ij, or dS_ij, generated by element-wise multiplication of P_ij and D_i.
[0106] Typically, after the matrix multiplication unit generates a first intermediate result or the vector operation unit generates a second intermediate result, the first or second intermediate result is transferred to a storage unit connected to both the matrix multiplication unit and the vector operation unit for storage. For example, when the matrix multiplication unit and the vector operation unit are deployed in an AI chip with non-shared memory as shown in Figure 5, the first intermediate result will be transferred to the second-level cache unit after passing through the 0th-level cache unit and the 1st-level cache unit, and the second intermediate result will be transferred to the second-level cache unit after passing through a unified cache unit. The data processing device 40 can obtain the first intermediate result and / or the second intermediate result from the second-level cache unit.
[0107] S702: In response to the number of first intermediate results reaching the batch processing quantity, the data processing device 40 sends the batch processing quantity of the first intermediate results to the vector operation unit, so that the vector operation unit performs vector operations in the attention mechanism calculation based on the batch processing quantity of the first intermediate results.
[0108] S703: In response to the number of second intermediate results reaching the batch quantity, the data processing device 40 sends the batch quantity of second intermediate results to the matrix multiplication unit so that the matrix multiplication unit performs matrix multiplication operations in the attention mechanism calculation based on the batch quantity of second intermediate results.
[0109] In this embodiment, the batch number can be understood as the number of consecutive executions of the same computational task. For ease of description, in this embodiment, the batch number is denoted as N, where N is an integer greater than 1. The same computational task can be understood as a computational task involving different block matrices of the same matrix. For example, the matrix multiplication operations involving different block matrices (Q_i and K_j) of the same Q matrix and K matrix belong to the same operation task, and the vector operations involving different block matrices (S_ij) of the same S matrix belong to the same operation task.
[0110] In other words, the matrix multiplication unit executes the same matrix multiplication operation task N times consecutively, generating N first intermediate results, and the vector operation unit executes the same vector operation task N times consecutively, generating N second intermediate results.
[0111] In the attention mechanism computation using a block-based computation approach, executing one computation task can be understood as executing the computation task corresponding to one data block. For example, in the forward computation of the attention mechanism, the matrix multiplication unit performing one matrix multiplication of Q_i and K_j counts as executing one operation task, and the vector operation unit performing softmax processing on S_ij once counts as executing one operation task. In the backward computation of the attention mechanism, the matrix multiplication unit performing one matrix multiplication of Q_i and K_j, or one matrix multiplication of dO_i and V_j, counts as executing one operation task; the matrix multiplication of Q_i and K_j and the matrix multiplication of dO_i and V_j belong to different operation tasks. Likewise, the vector operation unit performing softmax processing on S_ij once, or one element-wise multiplication of P_ij and D_i, counts as executing one operation task; softmax processing on S_ij and element-wise multiplication of P_ij and D_i belong to different operation tasks.
[0112] In some embodiments, the attention mechanism computation includes a first matrix multiplication task and a second matrix multiplication task, and / or, the attention mechanism computation includes a first vector operation task and a second vector operation task. In this case, the matrix multiplication unit generates a first intermediate result by continuously executing the first matrix multiplication task to generate a batch number of the first intermediate results, and continuously executing the second matrix multiplication task to generate a batch number of the first intermediate results. Similarly, the vector operation unit generates a second intermediate result by continuously executing the first vector operation task to generate a batch number of the second intermediate results, and continuously executing the second vector operation task to generate a batch number of the second intermediate results.
[0113] In other words, unlike traditional attention-based computation, where all computational tasks for one data block are completed before moving on to the next, in this embodiment, the matrix multiplication unit continuously executes the first matrix multiplication task for N data blocks (i.e., executes the first matrix multiplication task N times) before continuously executing the second matrix multiplication task for N data blocks. Similarly, the vector operation unit continuously executes the first vector operation task for N data blocks (i.e., executes the first vector operation task N times) before continuously executing the second vector operation task for N data blocks.
[0114] For example, in the backward computation of the attention mechanism, the matrix multiplication unit performs the matrix multiplication of Q_i and K_j N times consecutively (e.g., Q_0 to Q_{N-1} with K_0 to K_{N-1}), and then performs the matrix multiplication of dO_i and V_j N times consecutively (e.g., dO_0 to dO_{N-1} with V_0 to V_{N-1}). It does not, for a single data block, perform the matrix multiplication of Q_i and K_j (e.g., Q_0 and K_0) and then immediately perform the matrix multiplication of dO_i and V_j for that same data block (e.g., dO_0 and V_0).
[0115] Thus, without changing the original data block partitioning logic, the matrix multiplication unit and vector operation unit do not need to frequently retrieve different matrices from external storage units (such as external memory or external storage devices). For a single operation task, the next operation task will only be executed after N consecutive executions, making the calculations of the matrix multiplication unit and vector operation unit more continuous.
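The difference between the traditional per-block ordering and the batched ordering described above can be sketched by listing the order in which the matrix multiplication unit would issue its tasks. The function names and the task labels are hypothetical and serve only to make the two orderings concrete.

```python
def traditional_order(num_blocks, tasks):
    """All tasks of one data block run before the next data block starts."""
    return [(task, b) for b in range(num_blocks) for task in tasks]

def batched_order(num_blocks, tasks, batch_size):
    """Each task runs on a whole batch of data blocks before the next task starts."""
    order = []
    for start in range(0, num_blocks, batch_size):
        batch = range(start, min(start + batch_size, num_blocks))
        for task in tasks:
            order.extend((task, b) for b in batch)
    return order

def task_switches(order):
    """Count how often consecutive entries belong to different tasks."""
    return sum(1 for prev, cur in zip(order, order[1:]) if prev[0] != cur[0])

# The two matrix multiplication tasks of the backward computation.
MATMUL_TASKS = ["Q_i x K_j", "dO_i x V_j"]
```

For 4 data blocks, `traditional_order(4, MATMUL_TASKS)` switches between the two tasks 7 times, whereas `batched_order(4, MATMUL_TASKS, 4)` switches once.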
[0116] It should be noted that this application does not limit the number of batches. Different transformer models and different AI chips can be configured with different batch numbers. Furthermore, the forward computation and backward computation of the attention mechanism can be configured with different batch numbers.
[0117] In this embodiment, after the matrix multiplication unit performs matrix multiplication N times consecutively and generates N first intermediate results, the data processing device 40 sends all N first intermediate results to the vector operation unit. The vector operation unit can then perform subsequent vector operations based on the N first intermediate results. Similarly, after the vector operation unit performs the vector operation N times consecutively and generates N second intermediate results, the data processing device 40 sends all N second intermediate results to the matrix multiplication unit. The matrix multiplication unit can then perform subsequent matrix multiplication operations based on the N second intermediate results.
[0118] In this way, the number of data interactions between the matrix multiplication unit and the vector operation unit is significantly reduced, thereby reducing the data transfer overhead between the matrix multiplication unit and the vector operation unit.
[0119] The following describes the specific forward and backward computation processes of the attention mechanism. Refer to Figures 8A and 8B for schematic diagrams of the forward computation of the attention mechanism. Figure 8A illustrates the computational logic of the traditional forward computation of the attention mechanism. The matrix multiplication unit performs matrix multiplication on Q_i and K_0 to generate S_i0. S_i0 is then transmitted to the vector operation unit for softmax processing to generate P_i0. P_i0 is then transmitted to the matrix multiplication unit, which performs matrix multiplication on P_i0 and V_0 to generate O_i0. The same calculation logic is executed for the remaining K_j, and each O_ij is then iteratively updated to obtain the final forward computation result O.
[0120] Figure 8B illustrates the computational logic of the forward computation of the attention mechanism after applying the data processing method provided in the embodiments of this application. The matrix multiplication unit executes the matrix multiplication task of Q_i and K_j N times consecutively, generating N first intermediate results (i.e., S_i0 to S_i(N-1)). The data processing device 40 transmits the matrix formed by S_i0 to S_i(N-1) to the vector operation unit as a whole, and the vector operation unit applies softmax processing to that matrix to generate N second intermediate results (i.e., the matrix formed by P_i0 to P_i(N-1)). The data processing device 40 partitions the matrix formed by P_i0 to P_i(N-1) into blocks and transmits P_i0 to P_i(N-1) to the matrix multiplication unit, which executes the matrix multiplication task of P_ij and V_j N times consecutively to generate O_i0 to O_i(N-1). O_i0 to O_i(N-1) are then cumulatively updated to obtain the O_i corresponding to the N data blocks.
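The batched forward computation of Figure 8B can be sketched in NumPy for one Q block and one batch of N key/value blocks. This is a minimal sketch under assumptions: the function name `batched_forward` is hypothetical, the softmax subtracts the row-wise maximum for numerical stability, and the "cumulative update" of O_i0 to O_i(N-1) is taken to be a plain sum, which holds when the softmax is normalized over the whole batch.

```python
import numpy as np

def batched_forward(Q_i, K_blocks, V_blocks):
    """Forward attention for one Q block over one batch of N (K_j, V_j) blocks."""
    # Matrix multiplication unit: N consecutive Q_i x K_j^T tasks.
    S_blocks = [Q_i @ K_j.T for K_j in K_blocks]

    # One transfer: S_i0..S_i(N-1) go to the vector unit as a single matrix.
    S = np.concatenate(S_blocks, axis=1)
    m = S.max(axis=1, keepdims=True)   # row-wise maximum
    E = np.exp(S - m)
    l = E.sum(axis=1, keepdims=True)   # row-wise sum
    P = E / l

    # One transfer back: P is split into P_i0..P_i(N-1) for the matmul unit.
    widths = [K_j.shape[0] for K_j in K_blocks]
    P_blocks = np.split(P, np.cumsum(widths)[:-1], axis=1)

    # N consecutive P_ij x V_j tasks, accumulated into O_i.
    O_i = sum(P_ij @ V_j for P_ij, V_j in zip(P_blocks, V_blocks))
    return O_i, m, l
```

The returned `m` and `l` are the per-row statistics a caller would need in order to combine the results of several batches.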
[0121] In some possible implementations, the data sequence comprises 2N data blocks. After the forward computation of the attention mechanism is completed for the data sequence, O_0 and O_1 are generated. The final forward computation result O can be generated based on the following formula: m = max(m_0, m_1); O = O' / l, where m_i is rowmax(P_i0), l_i is rowsum(P_i0), and the rowmax() function computes the maximum value of each row of a matrix.
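One way two per-batch results O_0 and O_1 can be merged into the final O is sketched below. Only m = max(m_0, m_1) and O = O' / l are given in the text; the expressions used here for l and O' follow the standard online-softmax rescaling and are an assumption, as is taking the row statistics from the pre-softmax scores. The function names are hypothetical.

```python
import numpy as np

def batch_attention(Q_i, K, V):
    """Attention of Q_i over one batch; returns (O, row max m, row sum l)."""
    S = Q_i @ K.T
    m = S.max(axis=1, keepdims=True)
    E = np.exp(S - m)
    l = E.sum(axis=1, keepdims=True)
    return (E / l) @ V, m, l

def merge_two_batches(O0, m0, l0, O1, m1, l1):
    """Combine two per-batch results into the result over both batches."""
    m = np.maximum(m0, m1)            # m = max(m_0, m_1)
    w0 = np.exp(m0 - m) * l0          # assumed rescaled weight of batch 0
    w1 = np.exp(m1 - m) * l1          # assumed rescaled weight of batch 1
    l = w0 + w1                       # assumed combined row sum
    O_prime = w0 * O0 + w1 * O1       # assumed unnormalized combined output
    return O_prime / l                # O = O' / l
```

Because each batch is rescaled to the shared maximum before being summed, the merged result equals attention computed over all 2N data blocks at once.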
[0122] Referring to Figures 9A and 9B, which illustrate the schematic diagrams of attention mechanism reverse computation, both Figures 9A and 9B describe the computational logic of attention mechanism reverse computation after applying the data processing method provided in the embodiments of this application. Attention mechanism reverse computation can be executed by an AI chip with non-shared memory as shown in Figure 5. Each row in Figures 9A and 9B represents a different data transmission or data processing process. Specifically, the first row indicates that data is transmitted from an external storage unit or a second-level cache unit to a first-level cache unit; the second row indicates that data is transmitted from a first-level cache unit to a zero-level cache unit; the third row indicates that data is processed by a matrix multiplication unit and the processing result is transmitted to a second-level cache unit for storage; and the fourth row indicates that data is processed by a vector operation unit and the processing result is transmitted to a second-level cache unit for storage.
[0123] In Figures 9A and 9B, each square represents the data to be processed or the processing result corresponding to a data block. Squares with the same label correspond to the same data block. Taking Figure 9A as an example, the batch size in Figure 9A is 2. The block matrix of the Q matrix and the block matrix of the K matrix corresponding to the first data block (e.g., Q_i and K_0) are transferred from the external storage unit to the first-level cache unit, and then from the first-level cache unit to the 0th-level cache unit, where the matrix multiplication unit performs matrix multiplication to generate S_i0, and S_i0 is stored in the second-level cache unit. The matrix multiplication unit executes this matrix multiplication task consecutively: the block matrix of the Q matrix and the block matrix of the K matrix corresponding to the second data block (e.g., Q_i and K_1) are transferred from the external storage unit to the first-level cache unit, and then from the first-level cache unit to the 0th-level cache unit, where the matrix multiplication unit performs matrix multiplication to generate S_i1, and S_i1 is stored in the second-level cache unit.
[0124] After the matrix multiplication unit completes the matrix multiplication tasks of Q_i and K_j corresponding to the first and second data blocks, it executes the matrix multiplication tasks of dO_i and V_j corresponding to the first and second data blocks to generate dP_i0 and dP_i1, and dP_i0 and dP_i1 are stored in the second-level cache unit. The data transmission and processing are similar to those of the Q_i and K_j matrix multiplication task and are not described in detail here.
[0125] The data processing device 40 transmits S_i0 and S_i1 to the vector operation unit, which performs vector operations on S_i0 and S_i1 consecutively to generate P_i0 and P_i1; P_i0 and P_i1 are stored in the second-level cache unit. Next, the data processing device 40 transmits dP_i0 and dP_i1 to the vector operation unit, which consecutively executes the vector operation tasks of P_ij and D_i corresponding to the first and second data blocks to generate dS_i0 and dS_i1; dS_i0 and dS_i1 are stored in the second-level cache unit.
[0126] After the vector operation unit generates P_i0 and P_i1, the data processing device 40 transfers P_i0 and P_i1 from the second-level cache unit to the first-level cache unit, and then from the first-level cache unit to the 0th-level cache unit, where the matrix multiplication unit performs matrix multiplication to generate dV_0 and dV_1. Similarly, after the vector operation unit generates dS_i0 and dS_i1, the data processing device 40 transfers dS_i0 and dS_i1 from the second-level cache unit to the first-level cache unit, and then from the first-level cache unit to the 0th-level cache unit, where the matrix multiplication unit performs matrix multiplication to generate dQ_0, dQ_1, dK_0, and dK_1. In this way, all computational tasks for the first and second data blocks are completed, generating the parameter gradients corresponding to the first and second data blocks.
[0127] In Figure 9B, the batch size is 4. The calculation logic in Figure 9B is similar to that in Figure 9A, and will not be repeated here.
[0128] As described above, in this embodiment of the application, for the attention mechanism calculation process, the first intermediate result of the batch number generated by the matrix multiplication unit is transmitted to the vector operation unit, and the second intermediate result of the batch number generated by the vector operation unit is transmitted to the matrix multiplication unit. In this way, frequent data interaction is eliminated between the matrix multiplication unit and the vector operation unit, improving the computational continuity between them and resulting in a denser pipeline arrangement in the attention mechanism calculation.
[0129] In some embodiments, considering that there is no dependency between the attention mechanism calculation processes corresponding to different data blocks when using block-based computing, that is, when performing attention mechanism calculations for different data blocks, no intermediate or final results from other data blocks are required, the different attention mechanism calculation processes can be cross-arranged to leverage the computing power of the AI chip.
[0130] Specifically, the data processing method may include a first attention mechanism calculation process and a second attention mechanism calculation process, with the second attention mechanism calculation process following the first attention mechanism calculation process. An attention mechanism calculation process can be understood as the process from acquiring the data to be processed to generating the final result. For example, an attention mechanism calculation process may be the process from acquiring a data sequence to generating a forward calculation result, or it may be the process from acquiring the forward calculation result to generating parameter gradients.
[0131] It should be noted that the first attention mechanism calculation process and the second attention mechanism calculation process in the embodiments of this application may both belong to the forward calculation of the attention mechanism, or both belong to the backward calculation of the attention mechanism.
[0132] In this embodiment, an attention mechanism calculation process can be the attention mechanism calculation process corresponding to a batch of data blocks in the forward or backward calculation of the attention mechanism. That is, an attention mechanism calculation process can be understood as the attention mechanism calculation process corresponding to different batches of data blocks. For example, an attention mechanism calculation process can be the process of generating the forward calculation result corresponding to a batch of data blocks in the forward calculation of the attention mechanism, or an attention mechanism calculation process can be the process of generating the parameter gradient corresponding to a batch of data blocks in the backward calculation of the attention mechanism.
[0133] In a specific implementation, after the matrix multiplication unit completes part of the matrix multiplication operation in the first attention mechanism calculation process, the data processing device 40 sends at least part of the data to be processed in the second attention mechanism calculation process to the matrix multiplication unit, so that the matrix multiplication unit can start executing the matrix multiplication operation in the second attention mechanism calculation process before the first attention mechanism calculation process is completed.
[0134] In other words, in this embodiment of the application, the matrix multiplication unit preprocesses the matrix multiplication operation in the second attention mechanism calculation process, which is later in the order, so that the different attention mechanism calculation processes are arranged in an interleaved manner in the entire attention mechanism calculation, further improving the parallelism of attention mechanism calculation, making full use of computing resources, and improving computing efficiency.
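The benefit of interleaving two attention mechanism calculation processes can be sketched with a toy two-unit timeline. The scheduler below is a simplified model, not the scheduling logic of this application: each task starts once its unit is free and its dependencies have finished, and the durations (in arbitrary time units) are hypothetical.

```python
def makespan(tasks):
    """tasks: list of (name, unit, duration, deps) in issue order.

    A task starts once its unit is free and all of its deps have finished.
    Returns (finish time of the last task, {name: (start, end)}).
    """
    unit_free = {}
    times = {}
    for name, unit, duration, deps in tasks:
        start = max([unit_free.get(unit, 0)] + [times[d][1] for d in deps])
        times[name] = (start, start + duration)
        unit_free[unit] = start + duration
    return max(end for _, end in times.values()), times

# Hypothetical durations for one batch of data blocks.
QK, SOFTMAX, PV = 4, 3, 4

# Second process starts only after the first process has finished.
sequential = [
    ("qk_1", "matmul", QK, []),
    ("softmax_1", "vector", SOFTMAX, ["qk_1"]),
    ("pv_1", "matmul", PV, ["softmax_1"]),
    ("qk_2", "matmul", QK, ["pv_1"]),
    ("softmax_2", "vector", SOFTMAX, ["qk_2"]),
    ("pv_2", "matmul", PV, ["softmax_2"]),
]

# Matmul unit starts the second process while the vector unit is busy.
interleaved = [
    ("qk_1", "matmul", QK, []),
    ("qk_2", "matmul", QK, []),
    ("softmax_1", "vector", SOFTMAX, ["qk_1"]),
    ("pv_1", "matmul", PV, ["softmax_1"]),
    ("softmax_2", "vector", SOFTMAX, ["qk_2"]),
    ("pv_2", "matmul", PV, ["softmax_2"]),
]
```

With these hypothetical durations, `makespan(sequential)[0]` is 22 and `makespan(interleaved)[0]` is 16, because the matrix multiplication unit no longer idles while the vector operation unit runs softmax for the first process.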
[0135] The following explanation will focus on the specific forward computation and backward computation processes of the attention mechanism. Referring to Figures 10A and 10B, which illustrate the forward computation of the attention mechanism, similar to Figures 9A and 9B, each row in Figures 10A and 10B represents a different data transmission or data processing process. Each square represents the data to be processed or the processing result corresponding to a data block, and squares with the same label correspond to the same data block.
[0136] Figure 10A illustrates the computational logic when the data processing method provided in the embodiments of this application is applied without interleaving different attention mechanism calculation processes. The batch size in Figure 10A is 4. The matrix multiplication unit executes the matrix multiplication task of $Q_i$ and $K_j$ four times consecutively to generate $S_{i0}$ to $S_{i3}$, and $S_{i0}$ to $S_{i3}$ are stored in the second-level cache unit. The data processing device 40 transmits $S_{i0}$ to $S_{i3}$ to the vector operation unit, which performs softmax processing on $S_{i0}$ to $S_{i3}$ to generate $P_{i0}$ to $P_{i3}$, and $P_{i0}$ to $P_{i3}$ are stored in the second-level cache unit. The data processing device 40 transmits $P_{i0}$ to $P_{i3}$ to the matrix multiplication unit, which executes the matrix multiplication task of $P_{ij}$ and $V_j$ four times consecutively to generate $O_{i0}$ to $O_{i3}$. This completes the first attention mechanism calculation process, which corresponds to the first to fourth data blocks. Then, following similar calculation logic, the second attention mechanism calculation process is completed for the fifth to eighth data blocks.
[0137] Figure 10B illustrates the computational logic when the data processing method provided in the embodiments of this application is applied and different attention mechanism calculation processes are interleaved. The batch size in Figure 10B is 4. The matrix multiplication unit executes the matrix multiplication task of $Q_i$ and $K_j$ four times consecutively to generate $S_{i0}$ to $S_{i3}$, and $S_{i0}$ to $S_{i3}$ are stored in the second-level cache unit. After $S_{i0}$ to $S_{i3}$ are generated, the data to be processed in the second attention mechanism calculation process (i.e., the block matrices of the Q and K matrices corresponding to the fifth to eighth data blocks) is transmitted to the matrix multiplication unit, so that the matrix multiplication unit executes the matrix multiplication task of $Q_i$ and $K_j$ four more times consecutively to generate $S_{i4}$ to $S_{i7}$, and $S_{i4}$ to $S_{i7}$ are stored in the second-level cache unit.
[0138] Meanwhile, after the matrix multiplication unit generates $S_{i0}$ to $S_{i3}$, the data processing device 40 transmits $S_{i0}$ to $S_{i3}$ to the vector operation unit, which performs softmax processing on $S_{i0}$ to $S_{i3}$ to generate $P_{i0}$ to $P_{i3}$, and $P_{i0}$ to $P_{i3}$ are stored in the second-level cache unit. After $P_{i0}$ to $P_{i3}$ are generated, since the matrix multiplication unit has already executed the matrix multiplication task of $Q_i$ and $K_j$ in the second attention mechanism calculation process, the vector operation unit can directly perform softmax processing on $S_{i4}$ to $S_{i7}$ to generate $P_{i4}$ to $P_{i7}$, and $P_{i4}$ to $P_{i7}$ are stored in the second-level cache unit.
[0139] Meanwhile, after the vector operation unit generates $P_{i0}$ to $P_{i3}$, the data processing device 40 transmits $P_{i0}$ to $P_{i3}$ to the matrix multiplication unit, which executes the matrix multiplication task of $P_{ij}$ and $V_j$ four times consecutively to generate $O_{i0}$ to $O_{i3}$. This completes the attention mechanism calculation process corresponding to the first to fourth data blocks, i.e., the first attention mechanism calculation process. After the vector operation unit generates $P_{i4}$ to $P_{i7}$, the data processing device 40 transmits $P_{i4}$ to $P_{i7}$ to the matrix multiplication unit, which executes the matrix multiplication task of $P_{ij}$ and $V_j$ four times consecutively to generate $O_{i4}$ to $O_{i7}$. This completes the attention mechanism calculation process corresponding to the fifth to eighth data blocks, i.e., the second attention mechanism calculation process.
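The difference between the two forward schedules described for Figures 10A and 10B can be illustrated with a small timing model. The Python sketch below is not taken from this application: the task names (`QK_A`, `SM_A`, `PV_A` for the first process and `QK_B`, `SM_B`, `PV_B` for the second), the `makespan` helper, and the assumption that every batch of four tasks costs four equal time units are hypothetical, and data transfer costs are ignored.

```python
def makespan(order_by_unit, duration, deps):
    """Finish time when each unit runs its tasks in the given order.

    A task starts once its unit is free and all of its dependencies are done.
    """
    finish = {}
    free_at = {unit: 0 for unit in order_by_unit}
    pending = {unit: list(tasks) for unit, tasks in order_by_unit.items()}
    while any(pending.values()):
        progressed = False
        for unit, tasks in pending.items():
            if tasks and all(d in finish for d in deps.get(tasks[0], ())):
                task = tasks.pop(0)
                ready = [finish[d] for d in deps.get(task, ())]
                start = max([free_at[unit]] + ready)
                finish[task] = start + duration[task]
                free_at[unit] = finish[task]
                progressed = True
        if not progressed:
            raise ValueError("schedule cannot make progress (circular wait)")
    return max(finish.values())


# Hypothetical cost: one batch of four tasks takes 4 time units on either unit.
duration = {t: 4 for t in ("QK_A", "SM_A", "PV_A", "QK_B", "SM_B", "PV_B")}
# Softmax needs the scores; the P*V product needs the softmax output.
deps = {"SM_A": ["QK_A"], "PV_A": ["SM_A"], "SM_B": ["QK_B"], "PV_B": ["SM_B"]}

# Figure 10A style: finish process A completely before starting process B.
sequential = makespan(
    {"matmul": ["QK_A", "PV_A", "QK_B", "PV_B"], "vector": ["SM_A", "SM_B"]},
    duration, deps)

# Figure 10B style: the matmul unit runs QK_B while the vector unit runs SM_A.
interleaved = makespan(
    {"matmul": ["QK_A", "QK_B", "PV_A", "PV_B"], "vector": ["SM_A", "SM_B"]},
    duration, deps)
```

Under these assumed unit costs, the sequential order finishes at time 24 and the interleaved order at time 16, because the matrix multiplication unit no longer idles while the vector operation unit performs softmax processing.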
[0140] Figure 11 is a schematic diagram of the backward calculation of the attention mechanism. As in Figures 9A and 9B, each row in Figure 11 represents a different data transmission or data processing process, each square represents the data to be processed or the processing result corresponding to a data block, and squares with the same label correspond to the same data block.
[0141] Figure 11 illustrates the computational logic when the data processing method provided in the embodiments of this application is applied and different attention mechanism calculation processes are interleaved. The batch size in Figure 11 is 4. The matrix multiplication unit executes the matrix multiplication task of $Q_i$ and $K_j$ four times consecutively to generate $S_{i0}$ to $S_{i3}$, and $S_{i0}$ to $S_{i3}$ are stored in the second-level cache unit. Then, the matrix multiplication unit executes the matrix multiplication task of $dO_i$ and $V_j$ four times consecutively to generate $dP_{i0}$ to $dP_{i3}$.
[0142] After $dP_{i0}$ to $dP_{i3}$ are generated, the data to be processed in the second attention mechanism calculation process (i.e., the block matrices of the Q and K matrices corresponding to the fifth to eighth data blocks, and the block matrices of the dO and V matrices corresponding to the fifth to eighth data blocks) is transmitted to the matrix multiplication unit. The matrix multiplication unit then executes the matrix multiplication task of $Q_i$ and $K_j$ four more times consecutively to generate $S_{i4}$ to $S_{i7}$, and $S_{i4}$ to $S_{i7}$ are stored in the second-level cache unit; it also executes the matrix multiplication task of $dO_i$ and $V_j$ four times consecutively to generate $dP_{i4}$ to $dP_{i7}$, and $dP_{i4}$ to $dP_{i7}$ are stored in the second-level cache unit.
[0143] Meanwhile, after the matrix multiplication unit generates $S_{i0}$ to $S_{i3}$, the data processing device 40 transmits $S_{i0}$ to $S_{i3}$ to the vector operation unit, which performs softmax processing on $S_{i0}$ to $S_{i3}$ to generate $P_{i0}$ to $P_{i3}$, and $P_{i0}$ to $P_{i3}$ are stored in the second-level cache unit. Next, on the one hand, the data processing device 40 transmits $dP_{i0}$ to $dP_{i3}$ to the vector operation unit, which consecutively executes the vector operation task of $P_{ij}$ and $D_i$ to generate $dS_{i0}$ to $dS_{i3}$, and $dS_{i0}$ to $dS_{i3}$ are stored in the second-level cache unit. On the other hand, the data processing device 40 transmits $P_{i0}$ to $P_{i3}$ to the matrix multiplication unit, which consecutively executes the matrix multiplication task of $P_{ij}$ and $dO_i$ to generate $dV_0$ to $dV_3$.
[0144] After the vector operation unit generates $dS_{i0}$ to $dS_{i3}$, since the matrix multiplication unit has already executed the matrix multiplication task of $Q_i$ and $K_j$ and the matrix multiplication task of $dO_i$ and $V_j$ in the second attention mechanism calculation process, the vector operation unit can directly perform softmax processing on $S_{i4}$ to $S_{i7}$ to generate $P_{i4}$ to $P_{i7}$, and $P_{i4}$ to $P_{i7}$ are stored in the second-level cache unit. The vector operation unit can then directly execute the vector operation task of $P_{ij}$ and $D_i$ to generate $dS_{i4}$ to $dS_{i7}$ (not shown in the figure).
[0145] Meanwhile, after the vector operation unit generates $dS_{i0}$ to $dS_{i3}$, the data processing device 40 transmits $dS_{i0}$ to $dS_{i3}$ to the matrix multiplication unit, which consecutively executes the matrix multiplication task of $dS_{ij}$ and $K_j$ and the matrix multiplication task of $dS_{ij}$ and $Q_i$ to generate $dQ_0$ to $dQ_3$ and $dK_0$ to $dK_3$. This completes the attention mechanism calculation process corresponding to the first to fourth data blocks, i.e., the first attention mechanism calculation process.
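For orientation, the block-level quantities named in the backward walkthrough can be written out as follows. This is a sketch based on the commonly used blockwise attention backward formulation rather than a statement of this application's exact operations: the scaling factor and any blockwise softmax rescaling are omitted, and the definition of $D_i$ (broadcast across the columns of $dP_{ij}$) is an assumption, since this section names $D_i$ without defining it.

```latex
\begin{aligned}
S_{ij}  &= Q_i K_j^{T},                      & P_{ij}  &= \operatorname{softmax}(S_{ij}), \\
dP_{ij} &= dO_i V_j^{T},                     & dS_{ij} &= P_{ij} \circ (dP_{ij} - D_i), \\
dV_j    &= P_{ij}^{T}\, dO_i + dV_j,         & dK_j    &= dS_{ij}^{T}\, Q_i + dK_j, \\
dQ_i    &= dS_{ij}\, K_j + dQ_i,             & D_i     &= \operatorname{rowsum}(dO_i \circ O_i) \quad \text{(assumed)}.
\end{aligned}
```

Read this way, the products with $Q_i$, $K_j$, $V_j$ and $dO_i$ are matrix multiplication tasks, while the softmax and the elementwise step producing $dS_{ij}$ are vector operation tasks, which matches the division of work between the two units described above.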
[0146] Comparing Figures 10A and 10B, it can be seen that in the forward calculation of the attention mechanism, the matrix multiplication task of $P_{ij}$ and $V_j$ in the second attention mechanism calculation process can be executed in advance by the matrix multiplication unit before the first attention mechanism calculation process is completed. Comparing Figures 9B and 11, it can be seen that in the backward calculation of the attention mechanism, the matrix multiplication task of $P_{ij}$ and $V_j$ and the matrix multiplication task of $dO_i$ and $V_j$ in the second attention mechanism calculation process can be executed in advance by the matrix multiplication unit before the first attention mechanism calculation process is completed. In this way, the tasks of the matrix multiplication unit and the vector operation unit overlap further across the entire attention mechanism calculation, forming an interleaved pipeline arrangement. This improves the utilization of computing resources, reduces the time taken by the attention mechanism calculation, and takes full advantage of the parallel operation of the matrix multiplication unit and the vector operation unit.
[0147] Furthermore, in this embodiment, different inter-core synchronization numbers can be set. The inter-core synchronization number is the number of intermediate results exchanged each time the matrix multiplication unit and the vector operation unit synchronize intermediate results. In other words, after the matrix multiplication unit has generated the inter-core synchronization number of first intermediate results, or after the vector operation unit has generated the inter-core synchronization number of second intermediate results, the two units perform inter-core synchronization: the first intermediate results are transmitted to the vector operation unit, or the second intermediate results are transmitted to the matrix multiplication unit.
[0148] For ease of description, in this embodiment the inter-core synchronization number is denoted as M, where M is an integer greater than 1. In the forward calculation of the attention mechanism, there is a strict sequential order among the matrix multiplication unit generating $S_{ij}$, the vector operation unit generating $P_{ij}$, and the matrix multiplication unit generating $O_{ij}$. Therefore, in the forward calculation of the attention mechanism, M = N, meaning that the inter-core synchronization number equals the batch processing quantity. In the backward calculation of the attention mechanism, the calculations for different data blocks are more independent of each other. Therefore, in the backward calculation of the attention mechanism, M ≤ N, meaning that the inter-core synchronization number is less than or equal to the batch processing quantity.
[0149] In other words, the second-level cache unit of the AI chip allocates storage space for the batch processing quantity of first intermediate results generated by the matrix multiplication unit and for the batch processing quantity of second intermediate results generated by the vector operation unit. In the actual backward calculation of the attention mechanism, however, inter-core synchronization can be performed as soon as the second-level cache unit holds the inter-core synchronization number of first intermediate results, or the inter-core synchronization number of second intermediate results.
[0150] Specifically, in response to the number of first intermediate results reaching the inter-core synchronization quantity, the data processing device 40 sends the first intermediate results of the inter-core synchronization quantity to the vector operation unit, and / or, in response to the number of second intermediate results reaching the inter-core synchronization quantity, the data processing device 40 sends the second intermediate results of the inter-core synchronization quantity to the matrix multiplication unit.
[0151] Figure 12 illustrates the backward calculation of the attention mechanism with an inter-core synchronization number of 2. As in Figures 9A and 9B, each row in Figure 12 represents a different data transmission or data processing process, each square represents the data to be processed or the processing result corresponding to a data block, and squares with the same label correspond to the same data block.
[0152] The matrix multiplication unit executes the matrix multiplication task of $Q_i$ and $K_j$ four times consecutively to generate $S_{i0}$ to $S_{i3}$. After $S_{i1}$ is generated, the data processing device 40 performs inter-core synchronization and transmits $S_{i0}$ and $S_{i1}$ to the vector operation unit, which can then perform softmax processing on $S_{i0}$ and $S_{i1}$ to generate $P_{i0}$ and $P_{i1}$. After $S_{i3}$ is generated, the data processing device 40 performs inter-core synchronization again and transmits $S_{i2}$ and $S_{i3}$ to the vector operation unit, which can then perform softmax processing on $S_{i2}$ and $S_{i3}$ to generate $P_{i2}$ and $P_{i3}$.
[0153] Similarly, after the vector operation unit generates $P_{i1}$, the data processing device 40 performs inter-core synchronization and transmits $P_{i0}$ and $P_{i1}$ to the matrix multiplication unit, and the vector operation task of $P_{ij}$ and $D_i$ is executed to generate $dS_{i0}$ and $dS_{i1}$. After $dS_{i3}$ is generated, the data processing device 40 performs inter-core synchronization again and transmits $dS_{i2}$ and $dS_{i3}$ to the vector operation unit, where the vector operation task of $P_{ij}$ and $D_i$ generates $dS_{i2}$ and $dS_{i3}$. The remaining calculation logic is similar to that in Figures 9A and 9B and is not repeated here.
[0154] Comparing Figures 9B and 12, it can be seen that by setting the number of inter-core synchronizations to be less than the batch size, the pipeline layout in the backward computation of the attention mechanism is more compact, shortening the idle time of matrix multiplication units or vector operation units and reducing latency. Furthermore, compared to traditional attention mechanism computation (i.e., inter-core synchronization is performed once for each data block), by flexibly setting the number of inter-core synchronizations and actively delaying inter-core synchronization, the frequency of inter-core synchronization is reduced, inter-core synchronization overhead is decreased, and computational efficiency is improved.
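The trade-off described above, where a smaller inter-core synchronization number lets the consuming unit start earlier at the cost of more synchronizations, can be sketched with a toy producer/consumer model. Everything in this sketch is an assumption made for illustration: the function name `stage_finish_time`, the unit costs `t_producer`, `t_consumer` and `t_sync`, and the simplification that synchronization delays only the hand-over and never blocks the producing unit.

```python
def stage_finish_time(n_blocks, sync_every, t_producer=1.0, t_consumer=1.0,
                      t_sync=0.0):
    """Return (finish_time, sync_count) for one producer/consumer stage.

    The producer generates n_blocks results back to back. After every
    `sync_every` results, the group is handed to the consumer, which then
    processes the group result by result.
    """
    if sync_every < 1 or n_blocks % sync_every != 0:
        raise ValueError("sync_every must be positive and divide n_blocks")
    consumer_free = 0.0
    sync_count = 0
    for group in range(n_blocks // sync_every):
        # Time at which the last result of this group has been produced.
        produced_at = (group + 1) * sync_every * t_producer
        handed_over = produced_at + t_sync
        sync_count += 1
        # The consumer starts when the data has arrived and it is idle.
        start = max(handed_over, consumer_free)
        consumer_free = start + sync_every * t_consumer
    return consumer_free, sync_count


# Batch processing quantity N = 4 with hypothetical unit costs.
full_batch = stage_finish_time(4, sync_every=4)   # synchronize once per batch
half_batch = stage_finish_time(4, sync_every=2)   # synchronization number 2
per_block = stage_finish_time(4, sync_every=1)    # synchronize for every block
```

With these assumed costs, synchronizing every two results finishes the stage at time 6 with two synchronizations, compared with time 8 and one synchronization for the full batch, and time 5 with four synchronizations for per-block hand-over. The model only illustrates the direction of the trade-off; real timings depend on the hardware.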
[0155] Typically, a matrix multiplication unit has data accumulation capabilities, meaning it can perform operations like A×B+C. In this embodiment, leveraging the data accumulation capability of the matrix multiplication unit, the final result of the batch processing is accumulated within the matrix multiplication unit and then written to the global storage unit.
[0156] When the attention mechanism calculation is the forward calculation of the attention mechanism, the accumulation of the batch processing quantity of $O_i$ is executed in the matrix multiplication unit. In a specific implementation, the matrix multiplication unit generates the batch processing quantity of forward calculation sub-results (i.e., $O_{ij}$) based on the batch processing quantity of second intermediate results (i.e., $P_{ij}$), and generates the forward calculation result (i.e., $O_i$) based on the batch processing quantity of forward calculation sub-results. Thus, by using the data accumulation capability of the matrix multiplication unit, the accumulation $O_i = P_{ij} V_j + O_i$ of the final result is completed. After the accumulation is complete, the forward calculation result (i.e., $O_i$) is stored in the global storage unit.
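The accumulate-then-write-once pattern for the forward calculation can be sketched in plain Python. This is an illustration under stated assumptions, not the device's implementation: matrices are nested lists, `matmul_acc` stands in for the A×B+C capability of the matrix multiplication unit, `global_store` is a hypothetical list modelling the global storage unit, and any rescaling between blocks that a blockwise softmax may need is left out.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matmul_acc(a, b, c):
    """Fused multiply-accumulate: returns a x b + c."""
    product = matmul(a, b)
    return [[p + q for p, q in zip(p_row, c_row)]
            for p_row, c_row in zip(product, c)]


def forward_batch(p_blocks, v_blocks, global_store):
    """Accumulate O_i = P_ij V_j + O_i over one batch, then write O_i once."""
    rows, cols = len(p_blocks[0]), len(v_blocks[0][0])
    o_i = [[0.0] * cols for _ in range(rows)]
    for p_ij, v_j in zip(p_blocks, v_blocks):
        o_i = matmul_acc(p_ij, v_j, o_i)   # stays inside the matmul unit
    global_store.append(o_i)               # single write for the whole batch
    return o_i


# Tiny hypothetical batch of two blocks.
global_store = []
p_blocks = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
v_blocks = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
o_i = forward_batch(p_blocks, v_blocks, global_store)
```

In this sketch the two sub-results are summed into `o_i` before anything is written, so `global_store` receives one entry for the batch instead of one entry per block.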
[0157] When the attention mechanism calculation is the backward calculation of the attention mechanism, the accumulation of the batch processing quantity of $dV_j$ and $dK_j$ is executed in the matrix multiplication unit. In a specific implementation, the matrix multiplication unit generates, based on the batch processing quantity of second intermediate results (i.e., $P_{ij}$ and $dS_{ij}$), the batch processing quantity of value vector gradient submatrices (i.e., $P_{ij}^{T} dO_i$) and the batch processing quantity of key vector gradient submatrices (i.e., $dS_{ij}^{T} Q_i$). Based on the batch processing quantity of value vector gradient submatrices and key vector gradient submatrices, it generates the value vector gradient matrix (i.e., $dV_j$) and the key vector gradient matrix (i.e., $dK_j$).
[0158] Thus, by using the data accumulation capability of the matrix multiplication unit, the accumulation $dV_j = P_{ij}^{T} dO_i + dV_j$ and the accumulation $dK_j = dS_{ij}^{T} Q_i + dK_j$ of the final results are completed. After the accumulation is complete, the value vector gradient matrix (i.e., $dV_j$) and the key vector gradient matrix (i.e., $dK_j$) are stored in the global storage unit.
[0159] In this embodiment, during the forward computation of the attention mechanism, the data accumulation capability of the matrix multiplication unit is used to accumulate the forward computation results. During the backward computation of the attention mechanism, the data accumulation capability of the matrix multiplication unit is used to accumulate the gradient matrix of the value vector and the gradient matrix of the key vector. After completing the accumulation of the forward computation results, the gradient matrix of the value vector or the gradient matrix of the key vector for a batch of processing, they are written to the global storage unit together. This reduces the number of times data is written in the attention mechanism computation, reduces the data writing overhead, and reduces the tailing problem caused by the large data writing overhead.
[0160] The effects of the data processing method provided in the embodiments of this application will be described below. When the data processing method provided in the embodiments of this application is applied to attention mechanism calculation, compared with traditional attention mechanism calculation, a speedup of 1.07 to 1.68 times can be obtained in the forward calculation of attention mechanism, and a speedup of 1.6 to 2 times can be obtained in the backward calculation of attention mechanism.
[0161] Furthermore, when the data processing method provided in this application is applied to attention mechanism computation, the longer the data sequence, the greater the savings in data migration and inter-core synchronization costs, the more scheduling space for pipeline layout, and the higher the computational efficiency. Moreover, the data processing method provided in this application performs better in attention mechanism computation in non-inverted triangle scenarios.
[0162] In some embodiments, the transformer model can be a large language model, and the data sequence can be a text sequence. The inference process of the large language model is divided into a prefill stage and a decoding stage. The prefill stage can be understood as the process of generating the first character (token) based on the text sequence, and the decoding stage can be understood as the process of generating the next token based on the text sequence and the generated token. In the decoding stage, the dimensions of the K and V matrices are the same as those in the prefill stage, both being (b,n,s,d), but the dimension of the Q matrix becomes (b,n,1,d). Therefore, the computational overhead of the vector operation unit performing softmax processing and the overhead of moving the second intermediate result are relatively small. The optimization goal of the decoding stage is to improve bandwidth utilization and reduce inter-core synchronization overhead.
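The effect of the Q matrix shrinking to (b,n,1,d) in the decoding stage can be made concrete by counting the elements of the score matrix, which has shape (b, n, q_len, s) for a query length `q_len`. The sizes used below are hypothetical values chosen only for illustration, and `score_elements` is a helper name introduced for this sketch.

```python
def score_elements(b, n, q_len, s):
    """Element count of S (and of P): one q_len x s matrix per batch and head."""
    return b * n * q_len * s


# Hypothetical sizes: batch 1, 8 heads, sequence length 1024, head dim 128.
b, n, s, d = 1, 8, 1024, 128

prefill_scores = score_elements(b, n, s, s)   # Q is (b, n, s, d)
decode_scores = score_elements(b, n, 1, s)    # Q is (b, n, 1, d)
kv_elements = 2 * b * n * s * d               # K and V keep shape (b, n, s, d)
ratio = prefill_scores // decode_scores       # shrinks by a factor of s
```

With these assumed sizes the intermediate results shrink by a factor equal to the sequence length, while K and V stay the same size, which is consistent with the decoding stage being dominated by moving K and V rather than by softmax processing.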
[0163] When the data processing method provided in the embodiments of this application is applied to attention mechanism calculation, the memory requirements of $S_{ij}$ and $P_{ij}$ in the decoding stage are relatively small, so $S_{ij}$ and $P_{ij}$ are cached in the second-level cache unit. The remaining space in the second-level cache unit is allocated to the K and V matrices, which can be partitioned along the text sequence dimension. The number of attention heads is used as the pipeline scheduling dimension in the attention mechanism calculation. Compared with traditional attention mechanism calculation, a speedup of 1.89 to 4.83 times can be achieved in the decoding stage, effectively improving the efficiency of attention mechanism calculation.
[0164] Based on the data processing method described above, this application also provides a data processing apparatus 40 as described above. The data processing apparatus 40 will now be described in conjunction with the accompanying drawings.
[0165] Referring to the structural schematic diagram of the data processing device 40 shown in Figure 4, the data processing device 40 includes:
[0166] The acquisition module 401 is used to acquire a first intermediate result generated by the matrix multiplication unit in the attention mechanism calculation, and / or acquire a second intermediate result generated by the vector operation unit in the attention mechanism calculation; wherein, a first intermediate result is generated by performing a matrix multiplication operation on two block matrices, the block matrices being obtained by dividing the matrix corresponding to the data sequence to be processed into blocks, and a second intermediate result is generated by performing a vector operation;
[0167] The communication module 402 is configured to, in response to the number of the first intermediate results reaching the batch processing quantity, send the batch processing quantity of the first intermediate results to the vector operation unit, so that the vector operation unit performs vector operations in the attention mechanism calculation based on the batch processing quantity of the first intermediate results; and / or, in response to the number of the second intermediate results reaching the batch processing quantity, send the batch processing quantity of the second intermediate results to the matrix multiplication unit, so that the matrix multiplication unit performs matrix multiplication operations in the attention mechanism calculation based on the batch processing quantity of the second intermediate results.
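The "collect until the batch processing quantity is reached, then send everything at once" behaviour of the communication module can be sketched as a small buffer. The class name `BatchForwarder`, its `on_result` method, and the `send` callback are hypothetical names for this illustration; the application does not specify a software interface.

```python
class BatchForwarder:
    """Buffers intermediate results and forwards them in whole batches."""

    def __init__(self, batch_quantity, send):
        if batch_quantity < 1:
            raise ValueError("batch_quantity must be at least 1")
        self.batch_quantity = batch_quantity
        self.send = send          # callback that delivers one whole batch
        self.pending = []
        self.transfers = 0

    def on_result(self, result):
        """Record one intermediate result; flush when the batch is full."""
        self.pending.append(result)
        if len(self.pending) >= self.batch_quantity:
            batch, self.pending = self.pending, []
            self.send(batch)
            self.transfers += 1


# Eight first intermediate results with a batch processing quantity of 4.
received = []
forwarder = BatchForwarder(batch_quantity=4, send=received.append)
for j in range(8):
    forwarder.on_result(f"S_i{j}")
```

In this example the eight results reach the consumer in two transfers of four results each, instead of eight separate transfers.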
[0168] Both the acquisition module 401 and the communication module 402 can be implemented in software or in hardware. The implementation of the acquisition module 401 is described below as an example; the communication module 402 can be implemented in a similar way.
[0169] As an example of a software functional unit, module 401 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, or a container. Further, the aforementioned computing instance may be one or more. For example, module 401 may include code running on multiple hosts / virtual machines / containers. It should be noted that the multiple hosts / virtual machines / containers used to run the code may be distributed within the same region or in different regions. Further, the multiple hosts / virtual machines / containers used to run the code may be distributed within the same availability zone (AZ) or in different AZs, each AZ including one or more geographically proximate data centers. Typically, a region may include multiple AZs.
[0170] Similarly, multiple hosts / virtual machines / containers used to run this code can be distributed within the same Virtual Private Cloud (VPC) or across multiple VPCs. Typically, a VPC is set up within a region. Communication between two VPCs within the same region, as well as between VPCs in different regions, requires a communication gateway to be set up within each VPC to enable interconnection between VPCs.
[0171] As an example of a hardware functional unit, the acquisition module 401 may include at least one computing device, such as a server. Alternatively, the acquisition module 401 may be implemented using a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), a data processing unit (DPU), a neural network processing unit (NPU), a system-on-chip (SoC), an offload card, an accelerator card, or any combination thereof.
[0172] The multiple computing devices included in the acquisition module 401 can be distributed in the same region or in different regions. Similarly, the multiple computing devices included in the acquisition module 401 can be distributed in the same Availability Zone (AZ) or in different AZs. Likewise, the multiple computing devices included in the acquisition module 401 can be distributed in the same Virtual Private Cloud (VPC) or in multiple VPCs. These multiple computing devices can be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, GALs, DPUs, NPUs, SoCs, offloading cards, and accelerator cards.
[0173] It should be noted that, in other embodiments, the acquisition module 401 can be used to execute any step in the data processing method, and the communication module 402 can be used to execute any step in the data processing method. The steps implemented by the acquisition module 401 and the communication module 402 can be specified as needed. By implementing different steps in the data processing method through the acquisition module 401 and the communication module 402, all functions of the data processing device 40 can be realized.
[0174] In some possible implementations, the attention mechanism computation includes a first matrix multiplication task and a second matrix multiplication task, and / or, the attention mechanism computation includes a first vector operation task and a second vector operation task;
[0175] The matrix multiplication unit generates the first intermediate result through the following steps:
[0176] The first matrix multiplication task is executed consecutively to generate the batch processing quantity of first intermediate results;
[0177] The second matrix multiplication task is executed consecutively to generate the batch processing quantity of first intermediate results;
[0178] The vector operation unit generates the second intermediate result through the following steps:
[0179] The first vector operation task is executed consecutively to generate the batch processing quantity of second intermediate results;
[0180] The second vector operation task is executed consecutively to generate the batch processing quantity of second intermediate results.
[0181] In some possible implementations, the method includes a first attention mechanism calculation process and a second attention mechanism calculation process, wherein the second attention mechanism calculation process is performed after the first attention mechanism calculation process; the communication module 402 is further configured to:
[0182] After the matrix multiplication unit completes part of the matrix multiplication operation in the first attention mechanism calculation process, at least part of the data to be processed in the second attention mechanism calculation process is sent to the matrix multiplication unit so that the matrix multiplication unit can start executing the matrix multiplication operation in the second attention mechanism calculation process before the first attention mechanism calculation process is completed.
[0183] In some possible implementations, the attention mechanism calculation is the backward calculation of the attention mechanism, and the communication module 402 is further configured to:
[0184] In response to the number of the first intermediate results reaching the inter-core synchronization number, send the inter-core synchronization number of first intermediate results to the vector operation unit; and / or,
[0185] In response to the number of the second intermediate results reaching the inter-core synchronization number, the second intermediate results of the inter-core synchronization number are sent to the matrix multiplication unit; wherein the inter-core synchronization number is less than the batch processing number.
[0186] In some possible implementations, the attention mechanism computation is an attention mechanism forward computation, and the matrix multiplication unit performs matrix multiplication operations in the attention mechanism computation based on the second intermediate result of the batch size, including:
[0187] The matrix multiplication unit generates the batch processing quantity of forward calculation sub-results based on the batch processing quantity of second intermediate results, and generates the forward calculation result based on the batch processing quantity of forward calculation sub-results.
[0188] The communication module 402 is also used for:
[0189] The forward calculation results are stored in a global storage unit.
[0190] In some possible implementations, the attention mechanism computation is a reverse attention mechanism computation, and the matrix multiplication unit performs matrix multiplication operations in the attention mechanism computation based on the second intermediate result of the batch size, including:
[0191] The matrix multiplication unit generates the batch processing quantity of value vector gradient submatrices and key vector gradient submatrices based on the batch processing quantity of second intermediate results, and generates the value vector gradient matrix and the key vector gradient matrix based on the batch processing quantity of value vector gradient submatrices and key vector gradient submatrices.
[0192] The communication module 402 is also used for:
[0193] The value vector gradient matrix and the key vector gradient matrix are stored in a global storage unit.
[0194] This application also provides a chip system including a processor and a power supply circuit. The power supply circuit supplies power to the processor, which executes the operation steps corresponding to the data processing method. For simplicity, further details are omitted here. The processor can be implemented using a GPU, or it can be implemented using computing devices such as a DPU, NPU, XPU, SoC, offload card, or accelerator card.
[0195] This application also provides a computing device 1300. As shown in Figure 13, the computing device 1300 includes a bus 1302, a processor 1304, a memory 1306, and a communication interface 1308. The processor 1304, the memory 1306, and the communication interface 1308 communicate with each other via the bus 1302. The computing device 1300 may be a server or a terminal device. It should be understood that this application does not limit the number of processors and memories in the computing device 1300.
[0196] Bus 1302 can be a Peripheral Component Interconnect Express (PCIe) bus, an Extended Industry Standard Architecture (EISA) bus, a Unified Bus (Ubus or UB), a Compute Express Link (CXL), a Cache Coherent Interconnect for Accelerators (CCIX), etc. The Unified Bus is also known as the Lingqu Bus. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, only one line is used in Figure 13, but this does not imply that there is only one bus or one type of bus. Bus 1302 can include pathways for transmitting information between various components of computing device 1300 (e.g., memory 1306, processor 1304, communication interface 1308).
[0197] The processor 1304 may include any one or more of the following computing devices: central processing unit (CPU), graphics processing unit (GPU), microprocessor (MP) or digital signal processor (DSP), ASIC, FPGA, CPLD, NPU, SoC, offload card, accelerator card, etc.
[0198] Memory 1306 may include volatile memory, such as random access memory (RAM). Memory 1306 may also include non-volatile memory, such as read-only memory (ROM), flash memory, hard disk drive (HDD), or solid state drive (SSD). Furthermore, memory 1306 may also be implemented using storage class memory (SCM), phase change memory (PCM), or other types of storage media.
[0199] It is worth noting that the same type of storage medium can be configured in the same computing device to realize the function of memory 1306, or two or more types of storage media can be configured to realize the function of memory 1306. This application does not limit this.
[0200] The memory 1306 stores executable program code, and the processor 1304 executes the executable program code to implement the functions of the aforementioned acquisition module 401 and communication module 402, thereby realizing the data processing method. That is, the memory 1306 stores instructions for executing the data processing method.
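As an illustrative, non-limiting example, the executable program code for the two modules might be organized as in the following Python sketch. The class names, the `flush` step for a trailing partial batch, and the use of plain floats as stand-ins for intermediate results are assumptions made for illustration and are not taken from this application. The sketch only shows the batching behavior: results are buffered until the batch size is reached and are then sent in a single transfer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CommunicationModule:
    """Forwards one full batch of intermediate results per transfer."""
    consumer: Callable[[List[float]], None]  # stand-in for the receiving compute unit
    transfers: int = 0

    def send(self, batch: List[float]) -> None:
        self.transfers += 1
        self.consumer(list(batch))  # copy, so the caller may reuse its buffer


@dataclass
class AcquisitionModule:
    """Buffers intermediate results until the batch size is reached."""
    batch_size: int
    comm: CommunicationModule
    pending: List[float] = field(default_factory=list)

    def acquire(self, result: float) -> None:
        self.pending.append(result)
        if len(self.pending) >= self.batch_size:
            self.comm.send(self.pending)
            self.pending.clear()

    def flush(self) -> None:
        # Assumption: a trailing partial batch is still forwarded at the end.
        if self.pending:
            self.comm.send(self.pending)
            self.pending.clear()


def run_demo(num_results: int = 10, batch_size: int = 4):
    """Feeds num_results values through the modules and reports the transfers."""
    received: List[List[float]] = []
    comm = CommunicationModule(consumer=received.append)
    acq = AcquisitionModule(batch_size=batch_size, comm=comm)
    for i in range(num_results):
        acq.acquire(float(i))
    acq.flush()
    return comm.transfers, received
```

In this sketch, ten intermediate results with a batch size of four are delivered in three transfers instead of ten, which is the reduction in data transfers that the method aims for.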
[0201] The communication interface 1308 uses transceiver modules such as, but not limited to, network interface cards and transceivers to enable communication between the computing device 1300 and other devices or communication networks.
[0202] As one possible implementation, the computing device 1300 may also include a chip system, which includes a processor and a power supply circuit. The power supply circuit supplies power to the processor, and the processor executes the operation steps corresponding to the data processing method. For simplicity, further details are omitted here. The processor can be implemented using a GPU, or it can be implemented using computing devices or AI chips such as a DPU, NPU, XPU, SoC, offload card, or accelerator card.
[0203] This application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
[0204] As shown in Figure 14, the computing device cluster includes at least one computing device 1300. The memory 1306 of one or more computing devices 1300 in the computing device cluster may store the same instructions for executing the data processing method.
[0205] In some possible implementations, the memory 1306 of one or more computing devices 1300 in the computing device cluster may each store only part of the instructions for executing the data processing method. In other words, a combination of one or more computing devices 1300 can jointly execute the instructions for executing the data processing method.
[0206] It should be noted that the memory 1306 in different computing devices 1300 within the computing device cluster can store different instructions, each used to execute a portion of the functions of the data processing apparatus. That is, the instructions stored in the memory 1306 of different computing devices 1300 can implement the functions of one or more of the acquisition module 401 and the communication module 402.
[0207] In some possible implementations, one or more computing devices in a computing device cluster can be connected via a network. This network can be a wide area network (WAN) or a local area network (LAN), etc. Figure 15 illustrates one possible implementation. As shown in Figure 15, two computing devices 1300A and 1300B are connected via a network. Specifically, they are connected to the network through communication interfaces in each computing device. In this type of possible implementation, the memory 1306 in computing device 1300A stores instructions for executing the functions of the acquisition module 401. Simultaneously, the memory 1306 in computing device 1300B stores instructions for executing the functions of the communication module 402.
[0208] The connection method between the computing devices shown in Figure 15 may be chosen in view of the fact that the data processing method provided in this application requires both the acquisition and the transmission of intermediate results. Therefore, the function implemented by the acquisition module 401 is assigned to the computing device 1300A, and the function implemented by the communication module 402 is assigned to the computing device 1300B.
[0209] It should be understood that the functions of computing device 1300A shown in Figure 15 can also be performed by multiple computing devices 1300. Similarly, the functions of computing device 1300B can also be performed by multiple computing devices 1300.
[0210] This application also provides a computer program product containing instructions. The computer program product may be a software or program product containing instructions, capable of running on a computing device or stored on any usable medium. When the computer program product is run on at least one computing device, it causes the at least one computing device to perform the data processing method.
[0211] This application also provides a computer-readable storage medium. The computer-readable storage medium can be any available medium that a computing device can access, or a data storage device, such as a data center, containing one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state drive). The computer-readable storage medium includes instructions that instruct the computing device to perform the data processing method.
[0212] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the protection scope of the technical solutions of the embodiments of the present invention.
Claims
1. A data processing method, characterized in that the method comprises: obtaining a first intermediate result generated by a matrix multiplication unit in an attention mechanism calculation, and/or obtaining a second intermediate result generated by a vector operation unit in the attention mechanism calculation; wherein a first intermediate result is generated by performing a matrix multiplication operation on two block matrices, the block matrices being obtained by dividing the matrix corresponding to a data sequence to be processed into blocks, and a second intermediate result is generated by performing a vector operation; in response to the number of first intermediate results reaching a batch size, sending the batch-size number of first intermediate results to the vector operation unit, so that the vector operation unit performs the vector operations in the attention mechanism calculation based on the batch-size number of first intermediate results; and/or, in response to the number of second intermediate results reaching the batch size, sending the batch-size number of second intermediate results to the matrix multiplication unit, so that the matrix multiplication unit performs the matrix multiplication operations in the attention mechanism calculation based on the batch-size number of second intermediate results.
2. The method according to claim 1, characterized in that the attention mechanism calculation includes a first matrix multiplication task and a second matrix multiplication task, and/or the attention mechanism calculation includes a first vector operation task and a second vector operation task; the matrix multiplication unit generates the first intermediate results through the following steps: continuously executing the first matrix multiplication task to generate the batch-size number of first intermediate results; continuously executing the second matrix multiplication task to generate the batch-size number of first intermediate results; the vector operation unit generates the second intermediate results through the following steps: continuously executing the first vector operation task to generate the batch-size number of second intermediate results; continuously executing the second vector operation task to generate the batch-size number of second intermediate results.
3. The method according to claim 1 or 2, characterized in that, The method includes a first attention mechanism calculation process and a second attention mechanism calculation process, wherein the second attention mechanism calculation process is performed after the first attention mechanism calculation process; the method further includes: After the matrix multiplication unit completes part of the matrix multiplication operation in the first attention mechanism calculation process, at least part of the data to be processed in the second attention mechanism calculation process is sent to the matrix multiplication unit so that the matrix multiplication unit can start executing the matrix multiplication operation in the second attention mechanism calculation process before the first attention mechanism calculation process is completed.
4. The method according to any one of claims 1 to 3, characterized in that the attention mechanism calculation is an attention mechanism backward computation, and the method further comprises: in response to the number of first intermediate results reaching an inter-core synchronization number, sending the inter-core-synchronization number of first intermediate results to the vector operation unit; and/or, in response to the number of second intermediate results reaching the inter-core synchronization number, sending the inter-core-synchronization number of second intermediate results to the matrix multiplication unit; wherein the inter-core synchronization number is less than the batch size.
5. The method according to any one of claims 1 to 4, characterized in that the attention mechanism calculation is an attention mechanism forward computation; the matrix multiplication unit performing the matrix multiplication operations in the attention mechanism calculation based on the batch-size number of second intermediate results includes: the matrix multiplication unit generates the batch-size number of forward computation sub-results based on the batch-size number of second intermediate results, and generates a forward computation result based on the batch-size number of forward computation sub-results; the method further comprises: storing the forward computation result in a global storage unit.
6. The method according to any one of claims 1 to 4, characterized in that the attention mechanism calculation is an attention mechanism backward computation; the matrix multiplication unit performing the matrix multiplication operations in the attention mechanism calculation based on the batch-size number of second intermediate results includes: the matrix multiplication unit generates the batch-size number of value vector gradient submatrices and key vector gradient submatrices based on the batch-size number of second intermediate results, and generates a value vector gradient matrix and a key vector gradient matrix based on the batch-size number of value vector gradient submatrices and key vector gradient submatrices; the method further comprises: storing the value vector gradient matrix and the key vector gradient matrix in a global storage unit.
7. A data processing apparatus, characterized in that the apparatus comprises: an acquisition module configured to acquire a first intermediate result generated by a matrix multiplication unit in an attention mechanism calculation, and/or acquire a second intermediate result generated by a vector operation unit in the attention mechanism calculation; wherein a first intermediate result is generated by performing a matrix multiplication operation on two block matrices, the block matrices being obtained by dividing the matrix corresponding to a data sequence to be processed into blocks, and a second intermediate result is generated by performing a vector operation; and a communication module configured to, in response to the number of first intermediate results reaching a batch size, send the batch-size number of first intermediate results to the vector operation unit, so that the vector operation unit performs the vector operations in the attention mechanism calculation based on the batch-size number of first intermediate results; and/or, in response to the number of second intermediate results reaching the batch size, send the batch-size number of second intermediate results to the matrix multiplication unit, so that the matrix multiplication unit performs the matrix multiplication operations in the attention mechanism calculation based on the batch-size number of second intermediate results.
8. The apparatus according to claim 7, characterized in that the attention mechanism calculation includes a first matrix multiplication task and a second matrix multiplication task, and/or the attention mechanism calculation includes a first vector operation task and a second vector operation task; the matrix multiplication unit generates the first intermediate results through the following steps: continuously executing the first matrix multiplication task to generate the batch-size number of first intermediate results; continuously executing the second matrix multiplication task to generate the batch-size number of first intermediate results; the vector operation unit generates the second intermediate results through the following steps: continuously executing the first vector operation task to generate the batch-size number of second intermediate results; continuously executing the second vector operation task to generate the batch-size number of second intermediate results.
9. The apparatus according to claim 7 or 8, characterized in that, The method includes a first attention mechanism calculation process and a second attention mechanism calculation process, wherein the second attention mechanism calculation process is performed after the first attention mechanism calculation process; the communication module is further configured to: After the matrix multiplication unit completes part of the matrix multiplication operation in the first attention mechanism calculation process, at least part of the data to be processed in the second attention mechanism calculation process is sent to the matrix multiplication unit so that the matrix multiplication unit can start executing the matrix multiplication operation in the second attention mechanism calculation process before the first attention mechanism calculation process is completed.
10. The apparatus according to any one of claims 7 to 9, characterized in that the attention mechanism calculation is an attention mechanism backward computation, and the communication module is further configured to: in response to the number of first intermediate results reaching an inter-core synchronization number, send the inter-core-synchronization number of first intermediate results to the vector operation unit; and/or, in response to the number of second intermediate results reaching the inter-core synchronization number, send the inter-core-synchronization number of second intermediate results to the matrix multiplication unit; wherein the inter-core synchronization number is less than the batch size.
11. The apparatus according to any one of claims 7 to 10, characterized in that the attention mechanism calculation is an attention mechanism forward computation; the matrix multiplication unit performing the matrix multiplication operations in the attention mechanism calculation based on the batch-size number of second intermediate results includes: the matrix multiplication unit generates the batch-size number of forward computation sub-results based on the batch-size number of second intermediate results, and generates a forward computation result based on the batch-size number of forward computation sub-results; the communication module is further configured to: store the forward computation result in a global storage unit.
12. The apparatus according to any one of claims 7 to 10, characterized in that the attention mechanism calculation is an attention mechanism backward computation; the matrix multiplication unit performing the matrix multiplication operations in the attention mechanism calculation based on the batch-size number of second intermediate results includes: the matrix multiplication unit generates the batch-size number of value vector gradient submatrices and key vector gradient submatrices based on the batch-size number of second intermediate results, and generates a value vector gradient matrix and a key vector gradient matrix based on the batch-size number of value vector gradient submatrices and key vector gradient submatrices; the communication module is further configured to: store the value vector gradient matrix and the key vector gradient matrix in a global storage unit.
13. A chip system, characterized in that, The chip system includes a processor and a power supply circuit, the power supply circuit being used to supply power to the processor, the processor being used to execute the method of any one of claims 1 to 6.
14. A computing device, characterized in that, The computing device includes a chip system, the chip system including a processor and a power supply circuit, the power supply circuit being used to supply power to the processor, the processor being used to execute the method of any one of claims 1 to 6.
15. A computing device cluster, characterized in that, The computing device cluster includes at least one computing device, the at least one computing device includes a chip system, the chip system includes a processor and a power supply circuit, the power supply circuit is used to supply power to the processor, and the processor is used to execute the method of any one of claims 1 to 6.
16. A computer-readable storage medium, characterized in that, The computer-readable storage medium includes computer-readable instructions for implementing the method according to any one of claims 1 to 6.
17. A computer program product, characterized in that, The computer program product includes computer-readable instructions for implementing the method according to any one of claims 1 to 6.
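As a non-limiting illustration of the forward computation recited in claims 1 and 5, the following Python sketch processes one query row against key and value blocks, handing score blocks to the vector stage one batch at a time. It assumes a FlashAttention-style online softmax for the vector operations, which the claims do not specify. All function and parameter names are hypothetical, and the matrix multiplication unit and vector operation unit are simulated by ordinary Python code rather than by hardware.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Unblocked softmax(q K^T) V for a single query row, used as a check."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def batched_attention(q: Vector, keys: List[Vector], values: List[Vector],
                      block: int, batch_size: int) -> Tuple[Vector, int]:
    """Returns the attention output and the number of matrix-to-vector transfers."""
    blocks = [range(i, min(i + block, len(keys))) for i in range(0, len(keys), block)]
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    transfers = 0
    for start in range(0, len(blocks), batch_size):
        group = blocks[start:start + batch_size]
        # Matrix stage: one score block per key block (first intermediate results).
        score_blocks = [[dot(q, keys[j]) for j in blk] for blk in group]
        transfers += 1  # the whole batch is handed to the vector stage at once
        # Vector stage: rescaled exponentials (second intermediate results).
        new_max = max(running_max, max(s for sb in score_blocks for s in sb))
        scale = math.exp(running_max - new_max)  # 0.0 on the first batch
        prob_blocks = [[math.exp(s - new_max) for s in sb] for sb in score_blocks]
        running_sum = running_sum * scale + sum(p for pb in prob_blocks for p in pb)
        # Matrix stage: accumulate P V sub-results for every block in the batch.
        acc = [a * scale for a in acc]
        for blk, pb in zip(group, prob_blocks):
            for j, p in zip(blk, pb):
                for d in range(dim):
                    acc[d] += p * values[j][d]
        running_max = new_max
    return [a / running_sum for a in acc], transfers
```

With five keys, a block length of two, and a batch size of two, the sketch performs two transfers from the matrix stage to the vector stage, where sending each of the three score blocks separately would take three.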