Computing method and device of neural network model, electronic equipment and storage medium

By generating query matrices whose attention heads share a key-value cache, and by using matrix-matrix multiplication, the method addresses the low computational efficiency of neural network models and makes better use of hardware resources.

CN117273084B · Active · Publication Date: 2026-06-26 · SHANGHAI BIREN TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI BIREN TECH CO LTD
Filing Date: 2023-10-25
Publication Date: 2026-06-26

Application Information

IPC: G06N3/0464; G06N3/08

CPC: G06N3/0464; G06N3/08

AI Technical Summary

Technical Problem

The computational efficiency of existing neural network models is relatively low, especially when calculating attention for multiple queries, as they fail to fully utilize the characteristic that multiple queries share the same key and value, resulting in insignificant computational acceleration.

Method used

By generating a query matrix, attention heads in each query matrix share the same key-value cache, and matrix-matrix multiplication is used instead of matrix-vector multiplication for calculation. Combined with batch thresholds, a dynamic selection optimization strategy is adopted to make full use of hardware resources.

Benefits of technology

It improves the overall computational efficiency of neural network models, especially in cases of different batch sizes, enabling flexible processing of computation results and efficient utilization of hardware resources.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a neural network model calculation method and device, electronic equipment and storage medium, and relates to the technical field of artificial intelligence. The method comprises the following steps: for M data batches in the neural network model, at least one query matrix is generated based on the queries corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share a same key-value cache, and the key-value cache comprises a key matrix and a value matrix; for each query matrix in the M data batches, the calculation result of the neural network model is determined based on the query matrix, the key matrix and the value matrix. Through the above method, since the queries corresponding to each attention head in each query matrix share a same key-value cache, the calculation between matrixes can be completed based on the query matrix, the key matrix and the value matrix for each query matrix in the M data batches, and the overall calculation efficiency of the neural network model is improved.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method, apparatus, electronic device, and storage medium for calculating a neural network model.

Background Technology

[0002] Multi-group query attention is an attention scheme used in large Transformer models in which the queries of multiple attention heads correspond to a single key and a single value, effectively saving storage space in the key-value cache.

[0003] In related technologies, multi-group query attention uses a single key and value for queries from multiple heads. However, each head still uses matrix-vector multiplication to perform attention calculations, which does not accelerate computation and results in low overall computational efficiency.

[0004] Therefore, how to improve the overall computational efficiency of neural network models is an urgent problem to be solved.

Summary of the Invention

[0005] This invention provides a method, apparatus, electronic device, and storage medium for calculating neural network models, thereby addressing the shortcomings of low overall computational efficiency in existing neural network models and improving the overall computational efficiency of neural network models.

[0006] This invention provides a method for calculating a neural network model, comprising:

[0007] For M data batches in a neural network model, at least one query matrix is generated based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix;

[0008] For each query matrix in M data batches, determine the computation result of the neural network model based on the query matrix, key matrix, and value matrix.

[0009] Optionally, for each query matrix in the M data batches, based on the query matrix, key matrix, and value matrix, the computational results of the neural network model are determined, including:

[0010] For each query matrix in M data batches, the computational unit in the neural network model multiplies the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result.

[0011] Optionally, the computational units in the neural network model multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the computational result, including:

[0012] The optimization strategy is determined based on the batch threshold. The batch threshold is determined based on the number of attention heads in each data batch, the number of attention heads corresponding to queries that share the same key-value cache, and the number of computation units.

[0013] Based on the optimization strategy, the computational units in the neural network model are used to multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result.

[0014] Optionally, an optimization strategy is determined based on batch thresholds, including:

[0015] If the number of data batches is greater than or equal to the batch threshold, the optimization strategy is determined to be the first optimization strategy; the first optimization strategy is used to characterize the use of one computing unit to process one query matrix.

[0016] Optionally, based on the first optimization strategy, the computational units in the neural network model multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the computational result, including:

[0017] For each computational unit in the neural network model, the query matrix, the transpose of the key matrix, and the value matrix are multiplied by the computational unit to obtain the computational result.

[0018] Optionally, an optimization strategy is determined based on batch thresholds, including:

[0019] When the number of data batches is less than the batch threshold, the optimization strategy is determined to be the second optimization strategy; the second optimization strategy is used to characterize the use of multiple computing units to process a query matrix.

[0020] Optionally, based on the second optimization strategy, the computational units in the neural network model multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the computational result, including:

[0021] Based on the number of multiple computational units, the transpose of the key matrix is partitioned along the column dimension, and the value matrix is partitioned along the row dimension, resulting in multiple partitioned transpose matrices and multiple partitioned value matrices; each partitioned transpose matrix and partitioned value matrix corresponds to one computational unit among the multiple computational units.

[0022] The calculation result is obtained by using each of the multiple computational units simultaneously to multiply a query matrix, the corresponding partitioned transpose matrix, and the corresponding partitioned value matrix.

[0023] The present invention also provides a computing device for a neural network model, comprising:

[0024] The generation module is used to generate at least one query matrix for M data batches in the neural network model, based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix.

[0025] The determination module is used to determine the computation result of the neural network model for each query matrix in M data batches, based on the query matrix, key matrix, and value matrix.

[0026] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement a calculation method for any of the neural network models described above.

[0027] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a computation method for any of the neural network models described above.

[0028] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements a calculation method for any of the neural network models described above.

[0029] The neural network model calculation method, apparatus, electronic device, and storage medium provided by this invention generate at least one query matrix for each of the M data batches in the neural network model, based on the query corresponding to each attention head in each data batch. Since the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix, for each query matrix in the M data batches, matrix-matrix calculations can be performed based on the query matrix, key matrix, and value matrix, avoiding the use of matrix-vector multiplication, thereby obtaining the calculation result of the neural network model and improving the overall calculation efficiency of the neural network model.

Attached Figure Description

[0030] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0031] Figure 1 is a logic diagram illustrating the computation of multiple heads in related technologies;

[0032] Figure 2 is the first flowchart illustrating the calculation method of the neural network model provided by the present invention;

[0033] Figure 3 is the first logic diagram for performing calculations on multiple heads provided by the present invention;

[0034] Figure 4 is the second logic diagram for performing calculations on multiple heads provided by the present invention;

[0035] Figure 5 is the second flowchart illustrating the calculation method of the neural network model provided by the present invention;

[0036] Figure 6 is a schematic diagram of the structure of the computing device for the neural network model provided by the present invention;

[0037] Figure 7 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0038] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0039] To facilitate a clearer understanding of the various embodiments of this application, some relevant knowledge will be introduced as follows.

[0040] Transformer-type large models often employ multi-head attention computation, which is divided according to heads. Each head corresponds to a query, and each query corresponds to a key and value. By processing one head per processing unit, the hardware processing units can be effectively utilized for parallel processing.

[0041] The difference between multi-group query attention and multi-head attention is that the queries of multiple heads share a single key and value, effectively saving storage space in the key-value cache. During multi-group query attention computation, only one copy each of the key cache (KCache) and value cache (VCache) is kept, and the queries from multiple heads share that single key and value. Each head still performs matrix-vector multiplication to complete the attention computation. Compared to multi-head attention, multi-group query attention only requires storing one copy of the key-value cache, achieving storage savings.

[0042] Figure 1 is a logic diagram illustrating computations performed on multiple heads in related technologies. In Figure 1, the query corresponding to head 0 is Q0, and the query corresponding to head 1 is Q1. Q0 and Q1 are vectors that share the transpose of the same key matrix (K^T). The QK^T calculation therefore requires matrix-vector multiplication.
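The per-head baseline described above can be sketched as follows. This is a minimal illustration, not the related technology's actual implementation: the helper name `matvec` and the toy values of `K_T`, `Q0`, and `Q1` are assumptions chosen only to show that each head issues its own matrix-vector product against the same K^T.

```python
def matvec(vec, mat):
    """Multiply a row vector (length d) by a d x n matrix, returning n scores."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

# Hypothetical toy data: head dimension d = 2, sequence length n = 3.
K_T = [[1.0, 0.0, 2.0],
       [0.0, 1.0, 1.0]]   # transpose of the shared key matrix, d x n
Q0 = [1.0, 2.0]           # query vector of head 0
Q1 = [3.0, -1.0]          # query vector of head 1

# One matrix-vector product per head, even though K_T is identical for both.
scores_head0 = matvec(Q0, K_T)   # [1.0, 2.0, 4.0]
scores_head1 = matvec(Q1, K_T)   # [3.0, -1.0, 5.0]
```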

[0043] However, multi&group query attention does not fully utilize the characteristic of multiple queries sharing the same key and value during computation, and does not achieve computational acceleration. From the perspective of computational latency, it does not improve the overall computational efficiency.

[0044] In summary, to address the aforementioned problems, this invention provides a computational method, apparatus, electronic device, and storage medium for neural network models. By fully utilizing the characteristic that queries from different heads share the same KV cache, and replacing matrix-vector multiplication with matrix-matrix multiplication, the computational efficiency of the neural network model is improved. Furthermore, it flexibly handles different batch sizes, employing different implementation methods for large and small batch sizes, fully utilizing hardware resources to achieve dynamic optimization of multi-group query attention computation in large models.

[0045] The calculation method of the neural network model provided by this invention is described in detail below with reference to Figures 2 to 5. Figure 2 is the first flowchart illustrating the calculation method of the neural network model provided by this invention. As shown in Figure 2, the method includes steps 201-202, wherein:

[0046] Step 201: For M data batches in the neural network model, generate at least one query matrix based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix.

[0047] First, it should be noted that this invention is applicable to scenarios involving multi-group query attention. The method may be executed by any electronic device capable of performing neural network model calculations, such as a smartphone, smartwatch, desktop computer, or laptop.

[0048] In this embodiment of the invention, the neural network model is a Transformer model. Since in a multi-group query attention scenario, queries corresponding to multiple heads share a single key and value, a query matrix (Q matrix) is generated for each data batch in the Transformer model based on queries sharing the same KV Cache. Each Q matrix includes queries corresponding to multiple heads, and all queries share the same KV Cache.

[0049] For example, in a Transformer model, there are two batches, each containing 32 heads, and queries from four heads share the same key-value cache. Therefore, each batch contains eight Q-matrices, and the two batches together contain 16 Q-matrices.
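The counting in this example can be reproduced with a small sketch. The function name `build_query_matrices`, the head dimension of 8, and the choice to group consecutive heads are assumptions made for illustration; the source only states how many heads share a key-value cache.

```python
def build_query_matrices(head_queries, heads_shared):
    """Stack the query vectors of every `heads_shared` consecutive heads into one Q matrix."""
    if len(head_queries) % heads_shared != 0:
        raise ValueError("number of heads must be a multiple of heads_shared")
    return [head_queries[i:i + heads_shared]
            for i in range(0, len(head_queries), heads_shared)]

num_batches, num_heads, heads_shared, head_dim = 2, 32, 4, 8
# Placeholder query vectors: head h is a vector filled with the value h.
batches = [[[float(h)] * head_dim for h in range(num_heads)]
           for _ in range(num_batches)]
q_matrices = [build_query_matrices(batch, heads_shared) for batch in batches]

per_batch = len(q_matrices[0])               # 8 Q matrices per batch
total = sum(len(m) for m in q_matrices)      # 16 Q matrices in total
```

Each resulting Q matrix has 4 rows (one per head in the group) and `head_dim` columns.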

[0050] Step 202: For each query matrix in the M data batches, determine the calculation result of the neural network model based on the query matrix, key matrix, and value matrix.

[0051] In this embodiment of the invention, for each Q matrix, matrix-matrix multiplication calculation is required based on the Q matrix, the key matrix (K matrix), and the value matrix (V matrix).

[0052] It should be noted that each Q matrix includes queries corresponding to multiple heads, and queries corresponding to multiple heads share the same KV Cache, which includes K and V matrices.

[0053] The computation method for the neural network model provided by this invention generates at least one query matrix for each of the M data batches in the neural network model, based on the query corresponding to each attention head in each data batch. Since the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix, for each query matrix in the M data batches, matrix-matrix computation can be completed based on the query matrix, key matrix, and value matrix, avoiding the use of matrix-vector multiplication, thereby obtaining the computation result of the neural network model and improving the overall computational efficiency of the neural network model.

[0054] Optionally, for each query matrix in the M data batches, the calculation result of the neural network model is determined based on the query matrix, key matrix, and value matrix. This can be achieved through the following steps:

[0055] For each query matrix in M data batches, the computational unit in the neural network model multiplies the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result.

[0056] In this embodiment of the invention, the computational unit in the neural network model refers to the General Matrix Multiplication (GEMM) unit in the Transformer model. For each Q matrix in each batch, GEMM is used to multiply the Q matrix, the K^T matrix, and the V matrix to obtain the final result.

[0057] Specifically, for each Q matrix in each batch, the similarity between the Q matrix and the K matrix is first calculated, i.e., QK^T: the Q matrix and the K^T matrix are multiplied matrix-to-matrix, and a softmax operation is then applied to obtain the similarity calculation result S. This result is then used as a weight applied to the V matrix, yielding the final calculation result. In other words, the S matrix and the V matrix are multiplied matrix-to-matrix to obtain the final result.

[0058] In the above implementation, queries from different heads share the same KV cache; that is, each Q matrix corresponds to one K matrix and one V matrix. During the QK^T computation and the SV computation, matrix-matrix multiplication can be performed on the computation unit, avoiding the use of matrix-vector multiplication, thereby improving the overall computational efficiency of the neural network model.
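A minimal sketch of the computation just described, in plain Python, may help make the data flow concrete. The helper names (`matmul`, `softmax_rows`, `grouped_attention`) and the toy matrices are assumptions; the sketch follows the text literally and therefore omits the scaling factor that many attention implementations apply before softmax, since the source does not mention one.

```python
import math

def matmul(a, b):
    """Plain matrix-matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def grouped_attention(q, k_t, v):
    """S = softmax(Q K^T), result = S V, for all heads in the group at once."""
    return matmul(softmax_rows(matmul(q, k_t)), v)

# Hypothetical toy sizes: 2 heads in the group, head dimension 2, sequence length 3.
Q = [[1.0, 2.0], [3.0, -1.0]]               # one row per head
K_T = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # d x n, shared by both heads
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # n x d, shared by both heads

result = grouped_attention(Q, K_T, V)
# Baseline for comparison: one single-row (matrix-vector) call per head.
per_head = [grouped_attention([row], K_T, V)[0] for row in Q]
```

Row i of `result` matches the per-head baseline for head i, so stacking the queries changes how the work is issued to the computing unit, not the values obtained.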

[0059] In this embodiment of the invention, when the computing units perform the matrix-matrix multiplication of the Q matrix, the K^T matrix, and the V matrix, different optimization strategies need to be determined based on the batch threshold to fully utilize hardware resources. For different batch sizes, the optimization strategy with higher hardware resource utilization can be dynamically selected based on the batch threshold to achieve matrix-matrix multiplication with lower latency.

[0060] In an optional implementation provided by this invention, the computational unit in the neural network model multiplies the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result. Specifically, this can be achieved through the following steps 1)-2):

[0061] Step 1) Determine the optimization strategy based on the batch threshold; the batch threshold is determined based on the number of attention heads in each data batch, the number of attention heads corresponding to queries that share the same key-value cache, and the number of computation units.

[0062] Step 2) Based on the optimization strategy, the computational units in the neural network model are used to multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result.

[0063] In this embodiment of the invention, the batch threshold value can be specifically calculated using the following formula (1):

[0064] bs_t = p_n * h_s / h    (1)

[0065] Here, bs_t is the batch threshold; h is the number of heads in each data batch; h_s is the number of heads sharing the same KV Cache; and p_n is the number of computation units.
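Formula (1) can be checked against the numbers used elsewhere in the description. The function and parameter names below are illustrative assumptions. Since h / h_s is the number of Q matrices per batch, the threshold is the number of batches needed to give every computation unit one Q matrix.

```python
def batch_threshold(num_units, heads_shared, num_heads):
    """Formula (1): bs_t = p_n * h_s / h."""
    return num_units * heads_shared / num_heads

# 16 units, 32 heads per batch, 4 heads sharing one KV cache.
bs_t_a = batch_threshold(num_units=16, heads_shared=4, num_heads=32)   # 2.0
# Same hardware, but 8 heads sharing one KV cache.
bs_t_b = batch_threshold(num_units=16, heads_shared=8, num_heads=32)   # 4.0
```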

[0066] After determining different optimization strategies based on the batch threshold, for different batch sizes, the optimization strategy with higher hardware resource utilization can be dynamically selected based on the batch threshold to achieve matrix-matrix multiplication and obtain lower latency.

[0067] Optionally, an optimization strategy is determined based on batch thresholds, including:

[0068] a) When the number of data batches is greater than or equal to the batch threshold, the optimization strategy is determined as the first optimization strategy; the first optimization strategy is used to characterize the use of one computing unit to process a query matrix.

[0069] Specifically, in order to make full use of hardware resources, when the number of data batches (i.e., the batch size) is sufficiently large, that is, when the batch size is greater than or equal to bs_t, the optimization strategy is determined to be the first optimization strategy.

[0070] Under the first optimization strategy, for each computing unit, a single computing unit can process multiple heads, with the number of heads processed equal to the number of heads corresponding to queries sharing the same KV Cache. Other individual computing units process other multiple heads and different batches of the same head, thereby achieving full utilization of hardware resources.

[0071] Here, different batches with the same head means that the same head can correspond to data in different batches; for example, in batch 1, the data corresponding to head 1 is A; in batch 2, the data corresponding to head 1 is B.

[0072] For example, in a practical application there are 16 computing units and 2 batches; each batch contains 32 heads, and queries from 4 heads share the same KV cache; bs_t is 2.

[0073] Therefore, each batch contains 8 Q matrices. When the batch size to be processed is greater than or equal to 2, only 8 computational units are needed to process all the Q matrices in one batch. The remaining 8 computational units can then be used to process the 8 Q matrices in another batch. And so on.

[0074] b) When the number of data batches is less than the batch threshold, the optimization strategy is determined to be the second optimization strategy; the second optimization strategy is used to characterize the use of multiple computing units to process a query matrix.

[0075] Specifically, when the batch size is small, that is, when the batch size is less than bs_t, the first optimization strategy cannot fully utilize all hardware resources. Therefore, multiple computing units are used to process a single Q matrix simultaneously.

[0076] For example, there are 16 computing units and 2 batches; each batch contains 32 heads, and queries from 4 heads share the same KV cache; bs_t is 2.

[0077] Each batch therefore contains 8 Q matrices. If the batch size is 1, processing all Q matrices in the batch under the first optimization strategy requires only 8 computational units, so the remaining 8 computational units are idle and the hardware resources are not fully utilized. Under the second optimization strategy, each Q matrix is instead processed by 2 computational units, and the 16 computational units work together to process all 8 Q matrices, thereby optimizing the utilization of hardware resources.

[0078] In the above implementation, when the batch size is greater than or equal to bs_t, the first optimization strategy is used; when the batch size is less than bs_t, the second optimization strategy is used. In this way, hardware utilization is optimized.
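The selection rule can be sketched as a small function. The enum, the function name, and in particular the integer division used to decide how many units share one Q matrix are assumptions; the source only gives the evenly divisible case (16 units, 8 Q matrices, 2 units each) and does not say how an uneven split would be handled.

```python
from enum import Enum

class Strategy(Enum):
    ONE_UNIT_PER_Q = 1      # first optimization strategy
    MULTI_UNIT_PER_Q = 2    # second optimization strategy

def select_strategy(batch_size, num_units, heads_shared, num_heads):
    """Return the strategy and the number of units assigned to each Q matrix."""
    bs_t = num_units * heads_shared / num_heads          # formula (1)
    if batch_size >= bs_t:
        return Strategy.ONE_UNIT_PER_Q, 1
    q_total = batch_size * (num_heads // heads_shared)   # Q matrices to process
    # Assumption: units are split evenly across the Q matrices.
    return Strategy.MULTI_UNIT_PER_Q, num_units // q_total

large = select_strategy(batch_size=2, num_units=16, heads_shared=4, num_heads=32)
small = select_strategy(batch_size=1, num_units=16, heads_shared=4, num_heads=32)
```

With the example values, `large` selects the first strategy with 1 unit per Q matrix, and `small` selects the second strategy with 2 units per Q matrix.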

[0079] Optionally, when the number of data batches is greater than or equal to the batch threshold, based on the first optimization strategy, the computational units in the neural network model multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result, specifically through the following steps 1)-2):

[0080] Step 1) For each computational unit in the neural network model, multiply the query matrix, the transpose of the key matrix, and the value matrix using the computational unit until the number of remaining data batches is less than the batch threshold.

[0081] Step 2) Use multiple computing units to process a query matrix simultaneously to obtain the calculation result.

[0082] As shown in Figure 3, which is the first logic diagram for performing calculations on multiple heads provided by the present invention, there are 16 computing units and 2 batches; each batch contains 32 heads, and queries from 4 heads share the same key-value cache; bs_t is 2.

[0083] Then, based on the first optimization strategy, the Q matrix, the K^T matrix, and the V matrix are multiplied. Figure 3 shows only the multiplication of the Q matrix and the K^T matrix. The multiplication of S (the result of multiplying the Q matrix by the K^T matrix and applying the softmax operation) by the V matrix is similar and is not shown in the figure.

[0084] One batch contains 8 Q matrices, and each Q matrix includes 4 heads (e.g., head0-head3). Therefore, processing one batch requires 8 computational units. Two batches contain a total of 16 Q matrices, and 16 computational units can simultaneously process the multiplication of all 16 Q matrices with the K^T matrix and the V matrix. In Figure 3, each of the computational units p0-p15 processes one Q matrix.
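One way to picture the first optimization strategy is as a mapping from (batch, Q matrix) pairs to unit indices. The function name and the row-major ordering below are assumptions; the source states only that each of p0-p15 processes one Q matrix, not which one.

```python
def assign_first_strategy(num_batches, q_per_batch, num_units):
    """Map each (batch, Q-matrix index) pair to one computing unit index."""
    assignment = {}
    for b in range(num_batches):
        for g in range(q_per_batch):
            flat = b * q_per_batch + g          # position in row-major order
            assignment[(b, g)] = flat % num_units
    return assignment

# 2 batches x 8 Q matrices each, 16 units: every unit receives exactly one Q matrix.
assignment = assign_first_strategy(num_batches=2, q_per_batch=8, num_units=16)
```

Under this assumed ordering, the first Q matrix of batch 0 goes to unit 0 and the last Q matrix of batch 1 goes to unit 15.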

[0085] In another possible implementation of this invention, there are 16 computing units and 3 batches; each batch contains 32 heads, and queries from 4 heads share the same KV cache; bs_t is 2. Therefore:

[0086] First, based on the first optimization strategy, the Q matrix, the K^T matrix, and the V matrix are multiplied. The 3 batches contain a total of 24 Q matrices, and the 16 computation units can simultaneously process the multiplication of 16 Q matrices with the K^T matrix and the V matrix. This leaves 8 Q matrices, that is, 1 batch.

[0087] Since the batch size is now less than bs_t, the Q matrix, the K^T matrix, and the V matrix in this batch are multiplied based on the second optimization strategy.
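The two-phase handling of 3 batches can be sketched as a simple planner. The function name `plan`, the tuple layout, and the assumption that the threshold is an integer of at least 1 are illustrative choices, not details given in the source.

```python
def plan(num_batches, num_units, heads_shared, num_heads):
    """List (strategy, Q matrices processed, units per Q matrix) for each phase."""
    # Assumption: the threshold from formula (1) is treated as an integer >= 1.
    bs_t = max(1, num_units * heads_shared // num_heads)
    q_per_batch = num_heads // heads_shared
    steps = []
    remaining = num_batches
    while remaining >= bs_t:
        # First strategy: one unit per Q matrix, bs_t batches at a time.
        steps.append(("first", bs_t * q_per_batch, 1))
        remaining -= bs_t
    if remaining > 0:
        # Second strategy: the leftover batches share all units.
        q_left = remaining * q_per_batch
        steps.append(("second", q_left, num_units // q_left))
    return steps

three_batches = plan(num_batches=3, num_units=16, heads_shared=4, num_heads=32)
two_batches = plan(num_batches=2, num_units=16, heads_shared=4, num_heads=32)
```

For 3 batches the plan is 16 Q matrices under the first strategy followed by the remaining 8 under the second strategy with 2 units each; for 2 batches only the first phase is needed.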

[0088] Similarly, if queries from 8 heads share the same KV cache, then there will be 4 Q matrices in one batch. Therefore, processing one batch requires 4 computation units. When the batch size is greater than or equal to 4, the first optimization strategy is used; when it is less than 4, the second optimization strategy is used.

[0089] Optionally, when the number of data batches is less than the batch threshold, based on the second optimization strategy, the computational units in the neural network model multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result. This can be achieved through the following steps 1)-2):

[0090] Step 1) Based on the number of multiple computational units, the transpose of the key matrix is partitioned by column dimension, and the value matrix is partitioned by row dimension to obtain multiple partitioned transpose matrices and multiple partitioned value matrices; each partitioned transpose matrix and partitioned value matrix corresponds to one computational unit among the multiple computational units.

[0091] Step 2) Using each of the multiple computational units simultaneously, multiply a query matrix, the corresponding partitioned transpose matrix, and the corresponding partitioned value matrix to obtain the computation result.

[0092] As shown in Figure 4, which is the second logic diagram for performing calculations on multiple heads provided by this invention, there are 16 computing units and 1 batch; the batch contains 32 heads, and queries from 4 heads share the same key-value cache; bs_t is 2.

[0093] Therefore, with a batch size of 1, using the first optimization strategy would result in half of the computational units being wasted. Thus, multiple computational units are used to process a single Q-matrix simultaneously.

[0094] As shown in Figure 4, each pair of computational units processes one Q matrix. When two computational units process one Q matrix simultaneously, the K^T matrix must be divided along the column (sequence length) dimension into two partitioned K^T matrices (e.g., p0 and p1), and the V matrix must be divided along the row dimension into two partitioned V matrices (e.g., p'0 and p'1).

[0095] During the Q*K^T calculation, each calculation unit processes half of the columns. For example, one calculation unit processes Q*p0 and obtains S1 after a softmax operation; another calculation unit processes Q*p1 and obtains S2 after a softmax operation.

[0096] When performing S*V calculations, each calculation unit processes half of the rows. For example, one calculation unit processes S1*p'0, and another calculation unit processes S2*p'1.

[0097] With this method, each computing unit produces only a partial result, so a final reduction operation (e.g., reduce) is required to obtain the final result.
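The two-unit split of Figure 4 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation described by the invention: the helper names (`matmul`, `softmax_rows`, `attention_full`, `attention_partial`, `reduce_partials`) are hypothetical, no score scaling is applied because the text does not mention one, and the reduction rescales each partial result by its per-row softmax maximum and sum. That rescaling is an assumption about how the reduce step could reproduce the single-unit result; the text only states that a reduction is required.

```python
import math

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(scores):
    """Row-wise softmax; also returns each row's max and exp-sum (assumed reduce metadata)."""
    probs, maxes, sums = [], [], []
    for row in scores:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        probs.append([x / z for x in e])
        maxes.append(m)
        sums.append(z)
    return probs, maxes, sums

def attention_full(q, k, v):
    """First strategy: one computing unit evaluates softmax(Q*K^T)*V."""
    probs, _, _ = softmax_rows(matmul(q, transpose(k)))
    return matmul(probs, v)

def attention_partial(q, k_part, v_part):
    """Second strategy, one unit: S_i = softmax(Q*p_i), then S_i*p'_i."""
    probs, maxes, sums = softmax_rows(matmul(q, transpose(k_part)))
    return matmul(probs, v_part), maxes, sums

def reduce_partials(partials):
    """Merge the per-unit partial results into the final result (assumed scheme)."""
    merged = []
    for r in range(len(partials[0][0])):
        g = max(maxes[r] for _, maxes, _ in partials)
        # Each unit's share of the global softmax denominator for row r.
        weights = [sums[r] * math.exp(maxes[r] - g) for _, maxes, sums in partials]
        total = sum(weights)
        dim = len(partials[0][0][r])
        merged.append([
            sum(w / total * out[r][d] for w, (out, _, _) in zip(weights, partials))
            for d in range(dim)
        ])
    return merged

# Q stacks the queries of 4 heads sharing one key-value cache (4 x 2);
# K and V hold 4 cached positions (seq length 4, dimension 2). Toy values.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

# Splitting K^T by column equals splitting K by row; V is split by row.
half = len(k) // 2
partials = [attention_partial(q, k[:half], v[:half]),
            attention_partial(q, k[half:], v[half:])]
split_result = reduce_partials(partials)
full_result = attention_full(q, k, v)
```

Under these assumptions, `split_result` matches `full_result` to floating-point tolerance, which shows that the partition along seq length does not change the final result as long as the reduction accounts for each unit's local softmax normalization.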

[0098] Figure 5 is the second flowchart of the calculation method of the neural network model provided by the present invention. As shown in Figure 5, the method includes steps 501-506, wherein:

[0099] Step 501: For M data batches in the neural network model, generate at least one query matrix based on the query corresponding to each attention head in each data batch; wherein, the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix.

[0100] Step 502: Determine the optimization strategy based on the batch threshold; wherein, the batch threshold is determined based on the number of attention heads in each data batch, the number of attention heads corresponding to queries that share the same key-value cache, and the number of computing units.

[0101] Step 503: Determine if the number of data batches is less than the batch threshold; if not, proceed to step 504; if yes, proceed to steps 505-506.

[0102] Step 504: For each computational unit in the neural network model, multiply the query matrix, the transpose of the key matrix, and the value matrix using the computational unit to obtain the computational result.

[0103] Step 505: Based on the number of computing units, partition the transpose of the key matrix along the column dimension and the value matrix along the row dimension to obtain multiple partitioned transpose matrices and multiple partitioned value matrices; each partitioned transpose matrix and each partitioned value matrix corresponds to one of the computing units.

[0104] Step 506: Using the multiple computing units simultaneously, each unit multiplies the same query matrix, its partitioned transpose matrix, and its partitioned value matrix to obtain the calculation result.
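Steps 502-506 amount to a dispatch on the batch count followed by an optional partition. The following is a minimal sketch in plain Python. The function names are hypothetical, the batch threshold and the number of units per query matrix are passed in as parameters because this passage does not give a formula for them, and the contiguous, near-equal row blocks are an assumed partitioning scheme.

```python
def select_strategy(num_batches, batch_threshold):
    """Step 503: pick the first strategy unless the batch count is below the threshold."""
    return "first" if num_batches >= batch_threshold else "second"

def partition_rows(matrix, parts):
    """Split a list-of-rows matrix into `parts` contiguous, near-equal row blocks."""
    base, extra = divmod(len(matrix), parts)
    blocks, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        blocks.append(matrix[start:stop])
        start = stop
    return blocks

def plan_work(num_batches, batch_threshold, k, v, units_per_query):
    """Return the (K block, V block) pairs that the computing units will process.

    K is stored row-wise (seq length x dim), so splitting K^T by column
    is the same as splitting K by row.
    """
    if select_strategy(num_batches, batch_threshold) == "first":
        return [(k, v)]  # Step 504: one unit handles the whole K and V
    # Step 505: one (K block, V block) pair per computing unit
    return list(zip(partition_rows(k, units_per_query),
                    partition_rows(v, units_per_query)))

# Toy key and value matrices with seq length 4; the threshold value is illustrative.
k = [[1, 2], [3, 4], [5, 6], [7, 8]]
v = [[9, 8], [7, 6], [5, 4], [3, 2]]
small_batch_plan = plan_work(num_batches=1, batch_threshold=2, k=k, v=v, units_per_query=2)
large_batch_plan = plan_work(num_batches=2, batch_threshold=2, k=k, v=v, units_per_query=2)
```

With these toy inputs, `small_batch_plan` holds two (K block, V block) pairs, one per computing unit, while `large_batch_plan` holds the single unsplit pair.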

[0105] The computing device for the neural network model provided by the present invention is described below; the computing device described below and the calculation method described above may be referred to in correspondence with each other. Figure 6 is a schematic structural diagram of the computing device for the neural network model provided by the present invention. As shown in Figure 6, the computing device 600 for the neural network model includes a generation module 601 and a determination module 602, wherein:

[0106] The generation module 601 is used to generate at least one query matrix for M data batches in the neural network model, based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix.

[0107] The determination module 602 is used to determine the calculation result of the neural network model for each query matrix in M data batches, based on the query matrix, key matrix and value matrix.

[0108] The computational device for the neural network model provided by this invention generates at least one query matrix for each of the M data batches in the neural network model, based on the query corresponding to each attention head in each data batch. Since the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix, for each query matrix in the M data batches, matrix-matrix calculations can be performed based on the query matrix, key matrix, and value matrix, avoiding the use of matrix-vector multiplication, thereby obtaining the computational result of the neural network model and improving the overall computational efficiency of the neural network model.
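To illustrate why grouping queries enables matrix-matrix multiplication, the sketch below stacks the per-head query vectors of one data batch into query matrices. It is a minimal sketch under assumptions: the function name `build_query_matrices` is hypothetical, each head contributes a single query vector, and heads that share a key-value cache are assumed to be adjacent in head order.

```python
def build_query_matrices(head_queries, heads_per_cache):
    """Stack the query vectors of heads that share one key-value cache.

    head_queries: list of per-head query vectors for one data batch.
    heads_per_cache: number of heads sharing the same key-value cache
                     (assumed to be adjacent in head order).
    """
    if len(head_queries) % heads_per_cache != 0:
        raise ValueError("head count must be a multiple of heads_per_cache")
    return [head_queries[i:i + heads_per_cache]
            for i in range(0, len(head_queries), heads_per_cache)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# 16 heads, 4 heads per shared cache, head dimension 2 (toy values).
head_queries = [[float(h), 1.0] for h in range(16)]
query_matrices = build_query_matrices(head_queries, heads_per_cache=4)

# One shared key matrix K (seq length 3 x dim 2); K^T is dim x seq length.
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_t = [list(col) for col in zip(*k)]

# A single matrix-matrix product replaces four matrix-vector products.
scores = matmul(query_matrices[0], k_t)
```

Each of the four resulting query matrices has four rows, so one product against the shared K^T covers all four heads of a group at once.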

[0109] Optionally, the determination module 602 is further used for:

[0110] For each query matrix in M data batches, the computational unit in the neural network model multiplies the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result.

[0111] Optionally, the determination module 602 is further used for:

[0112] The optimization strategy is determined based on the batch threshold. The batch threshold is determined based on the number of attention heads in each data batch, the number of attention heads corresponding to queries that share the same key-value cache, and the number of computation units.

[0113] Based on the optimization strategy, the computational units in the neural network model are used to multiply the query matrix, the transpose of the key matrix, and the value matrix to obtain the calculation result.

[0114] Optionally, the determination module 602 is further used for:

[0115] If the number of data batches is greater than or equal to the batch threshold, the optimization strategy is determined to be the first optimization strategy; the first optimization strategy is used to characterize the use of one computing unit to process one query matrix.

[0116] Optionally, the determination module 602 is further used for:

[0117] For each computational unit in the neural network model, the query matrix, the transpose of the key matrix, and the value matrix are multiplied by the computational unit to obtain the computational result.

[0118] Optionally, the determination module 602 is further used for:

[0119] When the number of data batches is less than the batch threshold, the optimization strategy is determined to be the second optimization strategy; the second optimization strategy is used to characterize the use of multiple computing units to process a query matrix.

[0120] Optionally, the determination module 602 is further used for:

[0121] Based on the number of multiple computational units, the transpose of the key matrix is partitioned along the column dimension, and the value matrix is partitioned along the row dimension, resulting in multiple partitioned transpose matrices and multiple partitioned value matrices; each partitioned transpose matrix and partitioned value matrix corresponds to one computational unit among the multiple computational units.

[0122] The calculation result is obtained by using the multiple computing units simultaneously, with each unit multiplying the same query matrix, its partitioned transpose matrix, and its partitioned value matrix.

[0123] Figure 7 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 7, the electronic device may include a processor 710, a communications interface 720, a memory 730, and a communication bus 740. The processor 710, communications interface 720, and memory 730 communicate with each other via the communication bus 740. The processor 710 can call logical instructions in the memory 730 to execute a calculation method for a neural network model. This method includes: for M data batches in the neural network model, generating at least one query matrix based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix; for each query matrix in the M data batches, determining the calculation result of the neural network model based on the query matrix, the key matrix, and the value matrix.

[0124] Furthermore, the logical instructions in the aforementioned memory 730 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0125] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the calculation method of the neural network model provided by the above methods. The method includes: for M data batches in the neural network model, generating at least one query matrix based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix; for each query matrix in the M data batches, determining the calculation result of the neural network model based on the query matrix, the key matrix, and the value matrix.

[0126] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements a calculation method for the neural network model provided by the above methods. The method includes: for M data batches in the neural network model, generating at least one query matrix based on the query corresponding to each attention head in each data batch; the queries corresponding to each attention head in each query matrix share the same key-value cache, which includes a key matrix and a value matrix; for each query matrix in the M data batches, determining the calculation result of the neural network model based on the query matrix, the key matrix, and the value matrix.

[0127] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0128] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0129] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A calculation method for a neural network model, characterized by comprising: for M data batches in a neural network model, generating at least one query matrix based on the query corresponding to each attention head in each data batch, wherein the queries corresponding to the attention heads in each query matrix share the same key-value cache, which includes a key matrix and a value matrix; and for each query matrix in the M data batches, determining the calculation result of the neural network model based on the query matrix, the key matrix, and the value matrix; wherein determining the calculation result of the neural network model based on the query matrix, the key matrix, and the value matrix comprises: determining an optimization strategy based on a batch threshold, the batch threshold being determined based on the number of attention heads in each data batch, the number of attention heads corresponding to queries that share the same key-value cache, and the number of computing units in the neural network model; and, based on the optimization strategy, multiplying the query matrix, the key matrix, and the value matrix using the computing units in the neural network model to obtain the calculation result; wherein, when the number of data batches is less than the batch threshold, the optimization strategy is to split the key matrix and the value matrix that share the same key-value cache and to use multiple computing units simultaneously to multiply one query matrix with the split key matrix and the split value matrix; and when the number of data batches is greater than or equal to the batch threshold, the optimization strategy is to use one computing unit to multiply the query matrix, the key matrix, and the value matrix.

2. The calculation method for the neural network model according to claim 1, characterized in that multiplying the query matrix, the key matrix, and the value matrix using the computing units in the neural network model based on the optimization strategy to obtain the calculation result comprises: based on the optimization strategy, multiplying the query matrix, the transpose of the key matrix, and the value matrix using the computing units in the neural network model to obtain the calculation result.

3. The calculation method for the neural network model according to claim 2, characterized in that determining the optimization strategy based on the batch threshold comprises: when the number of data batches is greater than or equal to the batch threshold, determining that the optimization strategy is a first optimization strategy, wherein the first optimization strategy characterizes the use of one computing unit to process one query matrix.

4. The calculation method for the neural network model according to claim 3, characterized in that, when the optimization strategy is the first optimization strategy, multiplying the query matrix, the transpose of the key matrix, and the value matrix using the computing units in the neural network model based on the optimization strategy to obtain the calculation result comprises: for each computing unit in the neural network model, multiplying the query matrix, the transpose of the key matrix, and the value matrix using the computing unit to obtain the calculation result.

5. The calculation method for the neural network model according to claim 2, characterized in that determining the optimization strategy based on the batch threshold comprises: when the number of data batches is less than the batch threshold, determining that the optimization strategy is a second optimization strategy, wherein the second optimization strategy characterizes the use of multiple computing units to process one query matrix.

6. The calculation method for the neural network model according to claim 5, characterized in that, when the optimization strategy is the second optimization strategy, multiplying the query matrix, the transpose of the key matrix, and the value matrix using the computing units in the neural network model based on the optimization strategy to obtain the calculation result comprises: based on the number of the plurality of computing units, partitioning the transpose of the key matrix along the column dimension and the value matrix along the row dimension to obtain a plurality of partitioned transpose matrices and a plurality of partitioned value matrices, wherein each partitioned transpose matrix and each partitioned value matrix corresponds to one of the plurality of computing units; and using each of the plurality of computing units simultaneously to multiply the query matrix, the partitioned transpose matrix, and the partitioned value matrix to obtain the calculation result.

7. A computing device for a neural network model, characterized by comprising: a generation module, used to generate, for M data batches in the neural network model, at least one query matrix based on the query corresponding to each attention head in each data batch, wherein the queries corresponding to the attention heads in each query matrix share the same key-value cache, which includes a key matrix and a value matrix; and a determination module, used to determine, for each query matrix in the M data batches, the calculation result of the neural network model based on the query matrix, the key matrix, and the value matrix; wherein the determination module is specifically used to: determine an optimization strategy based on a batch threshold, the batch threshold being determined based on the number of attention heads in each data batch, the number of attention heads corresponding to queries that share the same key-value cache, and the number of computing units in the neural network model; and, based on the optimization strategy, multiply the query matrix, the key matrix, and the value matrix using the computing units in the neural network model to obtain the calculation result; wherein, when the number of data batches is less than the batch threshold, the optimization strategy is to split the key matrix and the value matrix that share the same key-value cache and to use multiple computing units simultaneously to multiply one query matrix with the split key matrix and the split value matrix; and when the number of data batches is greater than or equal to the batch threshold, the optimization strategy is to use one of the computing units to multiply the query matrix, the key matrix, and the value matrix.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the calculation method of the neural network model according to any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the calculation method of the neural network model according to any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the calculation method of the neural network model according to any one of claims 1 to 6.