Data processing method and device, equipment and storage medium

By quantizing low-bit data types and performing similarity processing on the attention module of a large language model, the problem of high computational complexity of the attention module is solved, improving computational efficiency and resource utilization, and expanding the capabilities of LLM in long text processing.

CN122240831A · Pending · Publication Date: 2026-06-19 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date: 2024-12-18
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

The attention module of existing large language models (LLMs) has high computational complexity, which leads to a significant increase in computation time and resource consumption when processing long texts, thus limiting the ability of LLMs to expand in context.

Method used

By extracting features from the target token sequence, the first, second, and third weight matrices are obtained. Attention processing is then performed on all target data types based on these matrices. Low-bit data types such as INT8 are used for quantization and similarity processing to reduce computational complexity and GPU memory usage.

Benefits of technology

It significantly improves the efficiency of attention computation, reduces the time requirement of large models during inference computation, saves GPU memory resources, and enhances the context expansion capability of long text processing.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122240831A_ABST]

Abstract

This disclosure provides data processing methods, apparatus, devices, and storage media, relating to the field of artificial intelligence technology, particularly deep learning and large-scale models. The specific implementation scheme is as follows: A target token sequence to be processed by attention is obtained; based on the target token sequence, a first weight matrix, a second weight matrix, and a third weight matrix required for attention processing are obtained; the first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix representing the similarity between elements in the target token sequence; the third weight matrix represents the value matrix representing the actual content of the elements in the target feature sequence; attention processing of all target data types is performed based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain the attention processing result, wherein the attention processing result can represent the correlation between different elements in the target text.

Description

Technical Field

[0001] This disclosure relates to the field of data processing technology, and in particular to the fields of artificial intelligence, deep learning, and large models.

Background Technology

[0002] Large Language Models (LLMs) have achieved significant breakthroughs across various fields. Attention modules, as the foundation of LLMs, are used to capture the dependencies between different tokens in a sequence of text tokens. However, current attention modules have high computational complexity (quadratic in the length of the input text token sequence), which significantly increases computation time and resource consumption when processing long texts, thus limiting the scalability of LLMs in terms of context.

Summary of the Invention

[0003] This disclosure provides a data processing method, apparatus, device, and storage medium.

[0004] According to one aspect of this disclosure, a data processing method is provided, comprising:

[0005] A target token sequence to be processed by attention is obtained, wherein the target token sequence is the sequence obtained after feature extraction of the target text;

[0006] Based on the target token sequence, the first weight matrix, the second weight matrix, and the third weight matrix required for attention processing are obtained; the first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix of similarity between elements in the target token sequence; and the third weight matrix represents the value matrix of the actual content of the elements in the target feature sequence.

[0007] Attention processing is performed on all target data types based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain the attention processing result, which can characterize the correlation between different elements in the target text.

[0008] According to another aspect of this disclosure, a data processing apparatus is provided, comprising:

[0009] The encoding unit is used to obtain the target token sequence to be processed by attention, wherein the target token sequence is the sequence obtained after feature extraction of the target text;

[0010] The attention processing unit is used to obtain the first weight matrix, the second weight matrix, and the third weight matrix required for attention processing based on the target token sequence. The first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix of the similarity between elements in the target token sequence; and the third weight matrix represents the value matrix of the actual content of the elements in the target feature sequence. Attention processing is performed on all target data types based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain the attention processing result, wherein the attention processing result can represent the correlation between different elements in the target text.

[0011] According to another aspect of this disclosure, an electronic device is provided, comprising:

[0012] At least one processor; and

[0013] A memory communicatively connected to the at least one processor; wherein,

[0014] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform any of the methods described in the present disclosure.

[0015] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform any of the methods according to embodiments of this disclosure.

[0016] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements any of the methods according to embodiments of this disclosure.

[0017] Thus, the disclosed solution can perform attention processing on the first weight matrix, the second weight matrix, and the third weight matrix obtained based on the target token sequence, using the full target data type, to obtain the attention processing result. Since the attention processing is performed on the basis of the full target data type, the amount of computation for attention calculation can be effectively controlled by the target data type, thereby significantly improving the efficiency of attention calculation. At the same time, it provides strong support for reducing the time required for inference calculation of large models in the future.

[0018] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Drawings

[0019] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0020] Figure 1 is an illustrative flow diagram of a data processing method according to an embodiment of this application;

[0021] Figure 2 is an illustrative flow diagram of a data processing method according to an embodiment of this application;

[0022] Figure 3 is an illustrative flow diagram of a data processing method according to an embodiment of this application;

[0023] Figure 4 is a schematic diagram of a 32-bit binary floating-point number according to an embodiment of this application;

[0024] Figure 5 is a flowchart illustrating a data processing method according to an embodiment of this application in a specific example;

[0025] Figure 6 is a schematic diagram of the encoder structure of a target model according to an embodiment of this application;

[0026] Figure 7 is a schematic diagram of the structure of a data processing apparatus according to an embodiment of this application;

[0027] Figure 8 is a block diagram of an electronic device used to implement the data processing method of the embodiments of this disclosure.

Detailed Implementation

[0028] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0029] In this document, the term "and / or" merely describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. The term "at least one" in this document indicates any one of a plurality of elements or any combination of at least two of them. For example, including at least one of A, B, and C can mean including any one or more elements selected from the set consisting of A, B, and C. The terms "first" and "second" in this document are used to distinguish between multiple similar technical terms; they neither restrict the order nor limit the number to two. For example, "first feature" and "second feature" refer to two categories / two features; the first feature can be one or more, and the second feature can also be one or more.

[0030] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art will understand that this disclosure can still be practiced even without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure.

[0031] The following describes the related technologies of the embodiments of this disclosure. The following related technologies are optional solutions and can be combined with the technical solutions of the embodiments of this disclosure in any way, and they all fall within the protection scope of the embodiments of this disclosure.

[0032] Large-scale language models (LLMs) have achieved significant breakthroughs across various fields. Attention modules are fundamental to LLMs, used to capture dependencies between different tokens in a text sequence. However, current attention modules have high computational complexity: the cost is the square of the text sequence length, which can be denoted as $O(L^2)$, where L is the length of the text sequence. This hinders the context expansion of LLMs in long text processing.
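
As a small illustration of the quadratic growth described above, the following sketch counts the pairwise scores in a full L × L attention matrix. The function name and the sequence lengths are hypothetical and chosen only for the example.

```python
def score_count(seq_len: int) -> int:
    """Number of pairwise token scores in a full L x L attention matrix."""
    return seq_len * seq_len

# Doubling the sequence length quadruples the number of scores.
for seq_len in (1_024, 2_048, 4_096):
    print(f"L = {seq_len}: {score_count(seq_len)} scores")
```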

[0033] Based on this, the present disclosure provides a data processing method that can significantly improve the computational efficiency of attention processing by performing attention processing on specified data types, significantly reduce the time required for large models to perform inference calculations, and save the GPU memory resources required for attention processing, laying the foundation for improving the contextual expansion capabilities of large models when processing long texts.

[0034] Specifically, Figure 1 is an illustrative flow diagram of a data processing method according to an embodiment of this application. This method can optionally be applied to electronic devices, such as personal computers, servers, server clusters, and other electronic devices.

[0035] Furthermore, the method includes at least part of the following. As shown in Figure 1, it includes:

[0036] Step S101: Obtain the target token sequence to be processed by attention.

[0037] Here, the target token sequence is the sequence obtained after feature extraction from the target text.

[0038] Step S102: Based on the target token sequence, obtain the first weight matrix, the second weight matrix, and the third weight matrix required for attention processing.

[0039] Here, the first weight matrix (e.g., represented by Q) represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix (e.g., represented by K) represents the key matrix of similarity between elements in the target token sequence; and the third weight matrix (e.g., represented by V) represents the value matrix corresponding to the actual content of the elements in the target feature sequence.

[0040] Step S103: Perform attention processing on all target data types based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain the attention processing result.

[0041] Here, the attention processing results can characterize the correlation between different elements in the target text.

[0042] It should be noted that the target data type can specifically be an integer type. Furthermore, the number of bits in the target data type is less than the number of bits in a half-precision floating-point number or a floating-point number. In other words, compared to a half-precision floating-point number or a floating-point number (both of which are high-bit data types), the target data type is a low-bit data type.

[0043] Furthermore, in a specific example, the "attention processing" described above can employ a fast attention mechanism. Here, the fast attention mechanism uses techniques such as tiling and recomputation, which effectively avoid frequent reads and writes of the attention matrix from and to high-bandwidth memory (HBM), thereby achieving efficient attention computation.
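
The tiling idea can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of this disclosure: the function names, the tile size, and the running-maximum rescaling are choices made for the example, the usual score scaling factor is omitted, and recomputation (which concerns the backward pass) is not shown. The point is only that the keys and values can be visited one tile at a time without ever holding the full N × N score matrix.

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain attention that materializes the full N x N score matrix."""
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=4):
    """Same result, but K and V are processed one tile of `block` rows at a time."""
    n = q.shape[0]
    out = np.zeros((n, v.shape[1]))
    running_max = np.full((n, 1), -np.inf)   # largest score seen so far per query
    running_sum = np.zeros((n, 1))           # softmax denominator accumulated so far
    for start in range(0, k.shape[0], block):
        k_tile = k[start:start + block]
        v_tile = v[start:start + block]
        s = q @ k_tile.T                     # only an N x block slice of the scores
        new_max = np.maximum(running_max, s.max(axis=1, keepdims=True))
        correction = np.exp(running_max - new_max)  # rescales earlier partial results
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=1, keepdims=True)
        out = out * correction + p @ v_tile
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v)))
```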

[0044] Thus, the disclosed solution can perform attention processing on the first weight matrix, the second weight matrix, and the third weight matrix obtained based on the target token sequence, using the full target data type, to obtain the attention processing result. Here, since the attention processing is performed on the basis of the full target data type, the amount of computation for attention calculation can be effectively controlled by the target data type, thereby significantly improving the efficiency of attention calculation. At the same time, it provides strong support for reducing the time required for inference calculation of large models in the future.

[0045] Furthermore, since the attention processing is performed on the basis of all target data types, and the amount of computation for attention calculation can be controlled by the target data type, it can effectively save the GPU memory resources required for attention processing. This effectively avoids the problem of limited context expansion capability when large models process long texts due to excessive GPU memory usage.

[0046] Furthermore, in a specific example, the above-described process of obtaining the first, second, and third weight matrices required for attention processing based on the target token sequence (e.g., step S102) can specifically include: when the target token sequence contains N token vectors, and each token vector has a dimension of d, obtaining a first weight matrix of dimension N×d and a second weight matrix of dimension N×d based on each token vector in the target token sequence; and obtaining a third weight matrix of dimension N×d based on the text content corresponding to each token vector in the target text. This provides data support for subsequent attention calculations.
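
The shapes in this step can be illustrated with a short sketch. The source does not state how the three matrices are computed from the token vectors; the learned projection matrices `w_q`, `w_k`, and `w_v` below are an assumption borrowed from common attention implementations, and the sizes are hypothetical.

```python
import numpy as np

N, d = 5, 8                                # hypothetical sequence length and token dimension
rng = np.random.default_rng(0)
tokens = rng.standard_normal((N, d))       # target token sequence, one row per token vector

# Assumed projection matrices (not specified in the source), each d x d.
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

q = tokens @ w_q   # first weight matrix (query), N x d
k = tokens @ w_k   # second weight matrix (key), N x d
v = tokens @ w_v   # third weight matrix (value), N x d
print(q.shape, k.shape, v.shape)
```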

[0047] It should be noted that in practical applications, when large models perform long-text inference, the data types used in the model's attention processing are usually high-bit data types (such as the half-precision floating-point (FP16) type or the 8-bit floating-point (FP8) type), which makes attention processing computationally expensive and very time-consuming. However, since the data types used in the attention processing of this disclosed solution are all the target data type, i.e., a low-bit data type such as INT8, the computational complexity of attention processing can be greatly reduced, thereby improving computational efficiency.
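
To make the storage side of this comparison concrete, the sketch below computes the memory needed for the three N × d matrices at the standard widths of FP32, FP16, and INT8. The sequence length and dimension are hypothetical values chosen for the example.

```python
# Bytes per element at the standard width of each data type.
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "INT8": 1}

def matrix_bytes(rows: int, cols: int, dtype: str) -> int:
    """Storage needed for a dense rows x cols matrix of the given data type."""
    return rows * cols * BYTES_PER_ELEMENT[dtype]

N, d = 32768, 128  # hypothetical sequence length and per-token dimension
for dtype in ("FP32", "FP16", "INT8"):
    total = 3 * matrix_bytes(N, d, dtype)  # Q, K, and V together
    print(f"{dtype}: {total / 2**20:.0f} MiB")
```

With these assumed sizes the totals are 48, 24, and 12 MiB respectively, so INT8 halves the footprint relative to FP16.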

[0048] In a specific example, the attention processing based on the first weight matrix, the second weight matrix, and the third weight matrix described above for all target data types specifically includes:

[0049] In the attention processing module of the encoder of the target model, attention processing for all target data types is performed based on the first weight matrix, the second weight matrix, and the third weight matrix.

[0050] Here, the target token sequence is the sequence obtained by the encoder's encoding module after feature encoding of the target text.

[0051] In other words, in this example, attention processing for all target data types is performed during the encoding phase. For instance, the target text is feature-encoded in the encoding module of the encoder in the target model to obtain a target token sequence. This target token sequence is then used to obtain the first, second, and third weight matrices. Finally, the attention processing module in the encoder performs attention processing on these three weight matrices for all target data types. In this way, the complexity of attention computation is adjusted by utilizing the target data type, thereby effectively reducing the complexity of attention computation in the target model and improving the model's processing efficiency.

[0052] Here, the target model may be a large model, or other models with text reasoning capabilities and an attention processing module; this disclosure does not limit this.

[0053] In this way, the present invention can perform attention processing on all target data types in the attention processing module of the encoder. Thus, by optimizing the data processing process, the time and resources required for the model to perform attention processing can be significantly reduced, thereby reducing the hardware resource consumption of the encoder's attention processing module.

[0054] In addition, since the attention processing for the above-mentioned full target data type is in the attention processing module of the encoder, rather than the attention processing module of the decoder, the quantization attention scheme of this disclosure is static statistical quantization, without introducing a dynamic quantization process in the inference stage, which effectively avoids the degradation of inference performance and further improves the user's model usage experience.
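
The difference between static and dynamic quantization can be sketched as follows. The source does not specify which statistic is collected, so the maximum absolute value over a calibration set is an assumption, as are the function names and sample values. The key property shown is that the scale is fixed before inference and simply reused, so no statistics are computed at inference time.

```python
import numpy as np

def calibrate_scale(calibration_batches):
    """Offline step: derive one fixed INT8 scale from calibration data."""
    max_abs = max(float(np.abs(batch).max()) for batch in calibration_batches)
    return max_abs / 127.0

def quantize_static(x, scale):
    """Inference step: reuse the precomputed scale; out-of-range values saturate."""
    return np.clip(np.rint(x / scale), -127, 127).astype(np.int8)

# Hypothetical calibration data; the largest magnitude here is 2.0.
calibration = [
    np.array([[0.5, -2.0], [1.0, 0.25]]),
    np.array([[-1.27, 0.3], [0.0, 1.0]]),
]
scale = calibrate_scale(calibration)

# A new input at inference time; -4.0 lies outside the calibrated range.
x_int8 = quantize_static(np.array([[2.0, -4.0], [0.0, 0.5]]), scale)
print(x_int8)
```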

[0055] Figure 2 is an illustrative flow diagram of a data processing method according to an embodiment of this application. This method can optionally be applied to electronic devices, such as personal computers, servers, and server clusters. It is understood that the related content of the method shown in Figure 1 above can also be applied to this example and will not be elaborated further here.

[0056] Furthermore, the method includes at least part of the following. As shown in Figure 2, it includes:

[0057] Step S201: Obtain the target token sequence to be processed by attention.

[0058] Here, the target token sequence is the sequence obtained after feature extraction from the target text.

[0059] Step S202: Based on the target token sequence, obtain the first weight matrix, the second weight matrix, and the third weight matrix required for attention processing.

[0060] Here, the first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix of similarity between elements in the target token sequence; and the third weight matrix represents the value matrix of the actual content of the elements in the target feature sequence.

[0061] Step S203: During the attention processing, the attention matrix represented by the target data type is subjected to similarity processing to reduce the GPU memory resources occupied and obtain the attention processing result.

[0062] For example, during the attention processing process of the attention processing module of the encoder of the target model, the attention matrix represented by the target data type is subjected to similarity processing to reduce the GPU memory resources occupied and obtain the attention processing result.

[0063] Here, the attention processing results can characterize the correlation between different elements in the target text.

[0064] Furthermore, the attention matrix represented by the target data type is obtained based on the first weight matrix, the second weight matrix, or the third weight matrix.

[0065] In other words, this example provides a refined scheme for "attention processing of all target data types", that is, similarity processing is performed on the attention matrix represented by the target data type. In other words, the data type of the attention matrix for similarity processing is the target data type. In this way, the amount of computation for attention calculation can be effectively controlled by the target data type, thereby significantly improving the efficiency of attention calculation. At the same time, it provides strong support for reducing the time required for inference calculation of large models in the future.

[0066] For example, in one scenario, after obtaining the first, second, and third weight matrices, attention processing is performed using an attention mechanism (or a fast attention mechanism). In practical scenarios, the data type used for attention processing might be half-precision floating-point or floating-point. This type of data consumes significant GPU memory resources and can hinder context expansion when processing long texts. Therefore, before performing attention processing using the attention mechanism (or fast attention mechanism), the data type can be converted to the target data type. This reduces the GPU memory resources consumed by computation and effectively solves the problem of hindered context expansion in long texts.

[0067] In this way, the proposed solution can perform similarity processing using attention matrices (e.g., two attention matrices) represented by the target data type during the attention processing process, and obtain the attention processing result. This effectively reduces the computational complexity of attention processing, thereby reducing the GPU memory resources occupied and significantly improving the efficiency of attention computation. This lays the foundation for improving the contextual expansion capability of large models when processing long texts.

[0068] Furthermore, in one example, the target data type is integer data, and more specifically, the target data type can be 8-bit integer (INT8). In this way, using low-bit data type for attention processing can significantly improve computational efficiency while saving the GPU memory resources required for computation, providing technical support for improving the inference performance and efficiency of the model in long texts.

[0069] Furthermore, since the present disclosure performs attention processing on all target data types during the attention processing process (in other words, the data type of each attention matrix that needs attention processing is the target data type, e.g., INT8), compared to existing attention calculation schemes, the present disclosure reduces the complexity of attention calculation from $O(L^2)$ to $O(L)$, which significantly reduces the dependence on and occupation of hardware resources (such as computing power and video memory), providing strong support for the rational allocation and efficient utilization of resources.

[0070] Furthermore, in a specific example, the attention processing result can be obtained in the following manner; specifically, that is, the attention matrix represented by the target data type is subjected to similarity processing as described above to reduce the memory resources occupied and to obtain the attention processing result (e.g., step S203), which can specifically include:

[0071] Step S203-1: Quantize the first weight matrix, the second weight matrix, and the third weight matrix to obtain the first weight matrix, the second weight matrix, and the third weight matrix, all represented by the target data type.

[0072] In other words, the first weight matrix, the second weight matrix, and the third weight matrix are quantized respectively to obtain the first weight matrix, the second weight matrix, and the third weight matrix represented by the target data type.

[0073] Step S203-2: Perform similarity processing based on the first weight matrix, the second weight matrix, and the third weight matrix, all of which are represented by the target data type, to reduce the GPU memory resources used and obtain the attention processing result.

[0074] In other words, during the attention processing, the first, second, and third weight matrices are quantized to convert the current data type (e.g., high-bit data type) of each weight matrix into the target data type (e.g., low-bit data type), thus obtaining the first, second, and third weight matrices, all represented by the target data type. Then, similarity processing is performed on the first, second, and third weight matrices, all represented by the target data type, to obtain the attention processing result.

[0075] In this way, the present invention can quantize the first weight matrix, the second weight matrix, and the third weight matrix to obtain each weight matrix represented by the target data type. Then, during the attention processing, similarity processing is performed on each matrix represented by the target data type to obtain the attention processing result. In other words, the present invention can first use model quantization technology to quantize any data type that needs to participate in attention processing into the required target data type before performing subsequent similarity processing. This effectively reduces the computational complexity of attention processing and thus greatly improves the efficiency of attention computation, thereby laying the foundation for improving the inference efficiency of the model in long texts.

[0076] Furthermore, in a specific example, the quantization processing of the first weight matrix, the second weight matrix, and the third weight matrix described above to obtain the first weight matrix, the second weight matrix, and the third weight matrix, all represented by the target data type (e.g., step S203-1), can specifically include:

[0077] Convert the first weight matrix from floating-point data (or half-precision floating-point data) to the target data type;

[0078] Convert the second weight matrix from floating-point data (or half-precision floating-point data) to the target data type;

[0079] Convert the third weight matrix from floating-point data (or half-precision floating-point data) to the target data type.

[0080] In this way, the proposed solution can quantize each weight matrix from floating-point data (or half-precision floating-point data) into the target data type, and then use the weight matrices represented by the target data type to perform similarity processing. This provides strong support for reducing the computational complexity of attention processing and improving the efficiency of attention computation.
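
The conversion from floating-point data to INT8 can be sketched with symmetric per-tensor quantization. The source does not specify the quantization formula, so the scheme below (a single scale equal to the largest magnitude divided by 127, computed here from the tensor itself for brevity) and the helper names are assumptions made for illustration.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: float matrix -> (INT8 matrix, scale)."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    x_int8 = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return x_int8, scale

def dequantize(x_int8, scale):
    """Approximate recovery of the original floating-point values."""
    return x_int8.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)  # stand-in for a weight matrix
x_int8, scale = quantize_int8(x)
x_hat = dequantize(x_int8, scale)
print(x_int8.dtype, float(np.abs(x - x_hat).max()))
```

The round trip loses at most about half a quantization step per element, which is the accuracy cost paid for the smaller data type.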

[0081] Figure 3 is an illustrative flow diagram of a data processing method according to an embodiment of this application. This method can optionally be applied to electronic devices, such as personal computers, servers, and server clusters. It is understood that the related content of the methods shown in Figure 1 and Figure 2 above can also be applied to this example and will not be elaborated further here.

[0082] Furthermore, the method includes at least part of the following. As shown in Figure 3, it includes:

[0083] Step S301: Obtain the target token sequence to be processed by attention.

[0084] Here, the target token sequence is the sequence obtained after feature extraction from the target text.

[0085] Step S302: Based on the target token sequence, obtain the first weight matrix, the second weight matrix, and the third weight matrix required for attention processing.

[0086] Here, the first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix of similarity between elements in the target token sequence; and the third weight matrix represents the value matrix of the actual content of the elements in the target feature sequence.

[0087] Step S303: Quantize the first weight matrix, the second weight matrix, and the third weight matrix to obtain the first weight matrix, the second weight matrix, and the third weight matrix, all represented by the target data type.

[0088] Step S304: Perform a dot product between the first weight matrix represented by the target data type and the second weight matrix represented by the target data type to obtain the attention weights.

[0089] For example, in one example, let Q be the first weight matrix represented by the target data type, K be the second weight matrix represented by the target data type, and S be the attention weight obtained by performing a dot product of Q and K. Then the expression for calculating the attention weight S is:

[0090] $S = QK^{\top}$

[0091] Here, $\top$ is the matrix transpose operator.
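
The dot product of step S304 can be sketched for the INT8 case as follows. The random inputs and variable names are hypothetical. The operands are widened to INT32 before multiplication because a product of two INT8 values, and even more so a sum of such products, does not fit in 8 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical already-quantized Q and K, 4 tokens with dimension 8.
q_int8 = rng.integers(-127, 128, size=(4, 8)).astype(np.int8)
k_int8 = rng.integers(-127, 128, size=(4, 8)).astype(np.int8)

# Widen to INT32 so that the products and their sums cannot overflow.
s_int32 = q_int8.astype(np.int32) @ k_int8.astype(np.int32).T

print(s_int32.dtype, s_int32.shape)
```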

[0092] Step S305: Obtain the attention processing result based on the attention weights and the third weight matrix represented by the target data type.

[0093] Here, the attention processing results can characterize the correlation between different elements in the target text.

[0094] Thus, this disclosed solution provides a specific method for obtaining the attention processing result. Specifically, the first weight matrix, represented by the target data type, is first multiplied by the second weight matrix, also represented by the target data type. Then, the attention processing result is obtained by combining the result of the dot product with the third weight matrix, also represented by the target data type. This effectively reduces the computational complexity of attention processing, thereby reducing the GPU memory resources required for computation and significantly improving the efficiency of attention computation. This lays the foundation for improving the inference efficiency of the model in long texts.

[0095] In a specific example, the attention processing result can be obtained in the following way; specifically, that is, the attention processing result obtained based on the attention weights and the third weight matrix represented by the target data type as described above (e.g., step S305) can specifically include:

[0096] Step S305-1: Perform inverse quantization on the attention weights to convert them into floating-point data.

[0097] For example, when the target data type is INT8, the attention weight S obtained after performing a dot product of Q and K is a 32-bit integer (INT32). In this case, the attention weight S is lower bit-level data compared to floating-point data. Furthermore, to enable further processing at higher precision, such as activation processing, the attention weight S needs to be dequantized to convert it into floating-point data (e.g., FP16). This improves the efficiency of attention calculation while ensuring the reliability and accuracy of the data results.
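The INT8 dot product and the dequantization of step S305-1 can be illustrated with a minimal NumPy sketch. The helper `quantize_int8`, the symmetric per-tensor scaling, and the variable names are assumptions made for illustration; the source does not specify how the quantization scales are chosen.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization (an assumed scheme): int8 values plus a float scale."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero input
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8)).astype(np.float32)
K = rng.standard_normal((4, 8)).astype(np.float32)

Q_q, q_scale = quantize_int8(Q)
K_q, k_scale = quantize_int8(K)

# int8 x int8 products are accumulated in int32, so S is an INT32 matrix.
S_int32 = Q_q.astype(np.int32) @ K_q.astype(np.int32).T

# Step S305-1: dequantize the INT32 attention weights into floating point (FP16 here).
S_fp16 = (S_int32 * (q_scale * k_scale)).astype(np.float16)
```

The dequantized `S_fp16` approximates the floating-point product `Q @ K.T`, with the quantization error depending on the chosen scales.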

[0098] Step S305-2: Based on the attention weights represented in floating-point data type and the dequantization parameters, obtain the activation feature matrix (i.e., the probability matrix).

[0099] Step S305-3: Based on the activation feature matrix and the third weight matrix represented by the target data type, obtain the attention processing result.

[0100] In this way, the proposed solution can first dequantize the attention weights involved in attention processing to obtain attention weights represented by floating-point data types. Then, based on the attention weights represented by floating-point data types and the dequantization parameters, an activation feature matrix is obtained. Finally, based on the activation feature matrix and the third weight matrix represented by the target data type, the attention processing result is obtained. This effectively reduces the computation time and memory resources required for attention processing, improves attention computation efficiency, and lays the foundation for improving the inference efficiency of the model in long texts.

[0101] Furthermore, in a specific example, before calculating the attention processing result, in order to ensure efficient computation, the activation feature matrix can be quantized to obtain an activation feature matrix represented in the target data type.

[0102] Furthermore, the attention processing result obtained based on the activation feature matrix and the third weight matrix represented by the target data type (e.g., step S305-3) can specifically include:

[0103] The attention processing result is obtained by performing matrix multiplication on the activation feature matrix represented in the target data type and the third weight matrix represented in the target data type.

[0104] In other words, since the activation feature matrix obtained is of a floating-point data type, it can be quantized before the attention processing result is computed, converting the activation feature matrix represented in the floating-point data type into one represented in the target data type. The activation feature matrix represented in the target data type and the third weight matrix represented in the target data type are then multiplied together to obtain the attention processing result.

[0105] In this way, the present invention enables the data in each computational stage of the attention processing to be represented by the target data type. In other words, the present invention can use data represented by low-bit data types for attention processing, which greatly reduces the time and memory resources required, improves the efficiency of attention computation, and lays the foundation for improving the inference efficiency of the model in long texts.
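The flow of steps S303 to S305-3, with every matrix multiplication carried out on INT8 operands, might look like the following sketch. The function names, the per-tensor scales, the 1/sqrt(d) factor folded into the dequantization, and the final conversion back to floating point are all assumptions made so that the result can be compared with ordinary attention.

```python
import numpy as np

def quantize_int8(x, scale):
    """Quantize x to int8 with a given scale (hypothetical helper)."""
    return np.clip(np.rint(x / scale), -127, 127).astype(np.int8)

def int8_attention(Q, K, V):
    d = Q.shape[1]
    # S303: quantize Q, K, V to the target data type (INT8).
    q_s, k_s, v_s = (float(np.abs(m).max()) / 127.0 for m in (Q, K, V))
    Qq, Kq, Vq = quantize_int8(Q, q_s), quantize_int8(K, k_s), quantize_int8(V, v_s)
    # S304: INT8 dot product, accumulated in INT32.
    S = Qq.astype(np.int32) @ Kq.astype(np.int32).T
    # S305-1: dequantize to floating point (scaling by 1/sqrt(d) is an assumption).
    S_f = S.astype(np.float32) * (q_s * k_s / np.sqrt(d))
    # S305-2: activation (softmax) gives the probability matrix.
    P = np.exp(S_f - S_f.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True)
    # Quantize the probability matrix (values in [0, 1]) to INT8.
    Pq = quantize_int8(P, 1.0 / 127.0)
    # S305-3: INT8 matrix multiplication with V, accumulated in INT32.
    O = Pq.astype(np.int32) @ Vq.astype(np.int32)
    # Converted back to floating point only for comparison with a reference.
    return O * (v_s / 127.0)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 8)).astype(np.float32) for _ in range(3))
out = int8_attention(Q, K, V)
```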

[0106] The disclosed solution is explained in detail below with a specific example. This example provides an attention processing scheme that operates entirely in the target data type, with the aim of representing all parameters involved in attention processing in the target data type (e.g., INT8). Specifically, the data types of the query matrix (denoted Q, corresponding to the first weight matrix above), the key matrix (denoted K, corresponding to the second weight matrix above), the probability matrix (denoted p, corresponding to the activation feature matrix above), and the value matrix (denoted V, corresponding to the third weight matrix above) in the attention calculation are all converted to INT8 (corresponding to the target data type above), and the inverse quantization process and the activation process (the Softmax calculation) are integrated together. In this way, compared with existing attention processing based on floating-point data, the disclosed solution significantly reduces the time and resources required for attention calculation, providing strong support for long-text inference.

[0107] This disclosure includes three parts: the first part introduces the standard attention processing flow; the second part introduces the quantization processing of this disclosure; and the third part introduces the specific algorithm for attention processing in this disclosure.

[0108] (I) Standard attention processing

[0109] In one example, assuming the target token sequence contains N token vectors, each with dimension d, the standard attention calculation formula is as follows:

[0110] $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ Formula (1)

[0111] Here, Q represents the query matrix of dimension N×d; K represents the key matrix of dimension N×d; $K^{T}$ represents the transpose of the key matrix K; V represents the value matrix of dimension N×d; $\sqrt{d_k}$ is the scaling factor, used to prevent vanishing gradients caused by excessively large dot products of Q and K at higher dimensions; softmax(·) represents the activation function used for normalization.

[0112] Furthermore, the attention matrix (also called the attention weights) S can be denoted as $S = QK^{T}/\sqrt{d_k}$, in which case the probability matrix is p = softmax(S).
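For reference, standard scaled dot-product attention in floating point can be sketched as follows. The function names are illustrative, and the row-maximum subtraction inside `softmax` is a common numerical-stability measure rather than something stated in this passage.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Returns softmax(Q K^T / sqrt(d_k)) V together with the probability matrix p."""
    d_k = K.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)   # attention weights, shape N x N
    p = softmax(S)               # probability matrix, each row sums to 1
    return p @ V, p

rng = np.random.default_rng(0)
N, d = 5, 4
V = rng.standard_normal((N, d))
# With an all-zero query every key scores equally, so each output row is the mean of V.
out, p = attention(np.zeros((N, d)), rng.standard_normal((N, d)), V)
```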

[0113] (II) Quantization

[0114] Any binary floating-point number Y can be represented as:

[0115] $Y = (-1)^{S} \times (1 + M) \times 2^{X}$ Formula (2)

[0116] where

[0117] $X = E - \text{bias}$

[0118] Here, S stands for Sign, which indicates whether the floating-point number Y is positive or negative. It occupies one bit (the sign bit); for example, 0 represents a positive number and 1 represents a negative number.

[0119] E stands for Exponent, which determines the order of magnitude of the floating-point number. The exponent is stored in the exponent bits.

[0120] bias is the exponent bias; for example, for a 32-bit floating-point number, the bias is 127.

[0121] X is the actual exponent, that is, the stored exponent E minus the bias.

[0122] M stands for Mantissa, which determines the precision of the floating-point number. The mantissa is stored in the mantissa bits.

[0123] For example, take a 32-bit binary floating-point number (i.e., a single-precision floating-point number). As shown in Figure 4, bits 1 to 23 are the mantissa bits, bits 24 to 31 are the exponent bits, and bit 32 (the most significant bit) is the sign bit. Based on the value of each bit and Formula (2) above, the 32-bit binary floating-point number shown in Figure 4 is equal to 0.15635.
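The bit layout and Formula (2) can be checked with a short sketch that splits a single-precision value into its sign, exponent, and mantissa fields and rebuilds it. The function names and the sample value -6.5 are chosen for illustration only, and the sketch covers normal numbers only (not zero, subnormals, infinities, or NaN).

```python
import struct

def decode_float32(y):
    """Split a float32 into (sign bit, stored exponent E, mantissa fraction M)."""
    bits = struct.unpack(">I", struct.pack(">f", y))[0]
    sign = bits >> 31                      # most significant bit
    exponent = (bits >> 23) & 0xFF         # 8 exponent bits
    mantissa = (bits & 0x7FFFFF) / 2**23   # 23 mantissa bits as a fraction in [0, 1)
    return sign, exponent, mantissa

def rebuild(sign, exponent, mantissa, bias=127):
    """Formula (2): Y = (-1)^S * (1 + M) * 2^(E - bias)."""
    return (-1) ** sign * (1 + mantissa) * 2.0 ** (exponent - bias)

s, e, m = decode_float32(-6.5)   # -6.5 = -(1 + 0.625) * 2^2
y = rebuild(s, e, m)
```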

[0124] Furthermore, in one example, if S = 0 and M = 0, then $Y = 2^{X} = 2^{E - \text{bias}}$. To calculate $2^{X}$, it is therefore sufficient to place X + bias in the exponent bits of the floating-point memory. For example, shifting X + bias left by offset bits gives:

[0125] $2^{X} = 2^{\text{offset}} \cdot (X + \text{bias})$

[0126] Here, the right-hand side is the bit pattern of the floating-point number read as an integer. Since $2^{X} = e^{\ln 2^{X}}$, let $\ln 2^{X} = \delta$; then $X \ln 2 = \delta$ and $X = \delta / \ln 2$, which further gives $e^{\delta} = 2^{\delta / \ln 2}$. In this way, a power of e is converted into a power of 2, which facilitates subsequent calculations.

[0127] It should be noted that the above calculation may contain an error term. For example, if δ / ln2 is an integer, δ / ln2 + bias can be written to the exponent bits (i.e., the location of the exponent E) without any loss of precision. If δ / ln2 is not an integer, however, precision may be lost. In this case, to avoid the loss, the fractional part of δ / ln2 can be shifted left by offset bits and written to the mantissa bits (i.e., the location of the mantissa M). The actual floating-point number is then:

[0128] $e^{\delta} = 2^{\text{offset}} \cdot \left(\left\lfloor \frac{\delta}{\ln 2} \right\rfloor + \text{bias}\right) + 2^{\text{offset}} \cdot \left(\frac{\delta}{\ln 2} - \left\lfloor \frac{\delta}{\ln 2} \right\rfloor\right)$ Formula (3)

[0129] Here, $\lfloor \cdot \rfloor$ is the floor function operator.

[0130] Furthermore, based on Formula (3) above, a constant C can be subtracted to correct the error term. Formula (3) can then be rewritten as:

[0131] $e^{\delta} = 2^{\text{offset}} \cdot \left(\frac{\delta}{\ln 2} + \text{bias}\right) - C$ Formula (4)

[0132] Here, C can be obtained using the least squares method according to the distribution of the actual data.

[0133] Continuing with the example of a 32-bit binary floating-point number (single-precision floating-point number), offset = 23, bias = 127. In this case, the above formula (4) can be rewritten as:

[0134] $e^{\delta} = 2^{23} \cdot (\text{scale} \cdot \delta + \text{bias})$ Formula (5)
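The idea behind Formula (5) can be tried out numerically: the integer 2^23 · (scale · δ + bias) is computed and its bits are reinterpreted as a single-precision number. In the sketch below, `fast_exp` is a hypothetical name, scale = 1 / ln 2 is an assumption (the passage does not define scale explicitly), and the correction constant `c` defaults to 0 because its fitted value is not given.

```python
import numpy as np

OFFSET = 23   # number of mantissa bits in float32
BIAS = 127    # exponent bias of float32

def fast_exp(delta, c=0.0):
    """Approximate exp(delta) by writing 2^23 * (delta / ln2 + bias) - c into float32 bits."""
    delta = np.atleast_1d(np.asarray(delta, dtype=np.float64))
    bits = (2 ** OFFSET) * (delta / np.log(2.0) + BIAS) - c
    # Truncate to a 32-bit integer, then reinterpret the same bits as float32.
    return bits.astype(np.int32).view(np.float32)

x = np.linspace(-10.0, 10.0, 201)
approx = fast_exp(x)
rel_err = np.abs(approx / np.exp(x) - 1.0)
```

With `c = 0` the approximation interpolates linearly between consecutive powers of 2, so it never undershoots by more than rounding and overshoots by at most about 6 percent; a fitted constant C, as described above, shifts this error band to reduce the average error.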

[0135] Furthermore, in this example, both the query matrix Q and the key matrix K are quantized into matrices represented in INT8, and the attention weights S are computed from them with a low-bit General Matrix-Matrix Multiplication (GEMM), so S is a matrix represented in INT32. To obtain the probability matrix p, the attention weights S therefore need to be dequantized to obtain the attention weights S′ represented in a floating-point data type. The specific expression is as follows:

[0136]

[0137] $S' = (S - \text{rowmax}(S)) \cdot \text{scale}_{\text{dequant}}$ Formula (6)

[0138] Here, $\text{scale}_{\text{dequant}}$ is the dequantization factor, and rowmax is the operator for finding the maximum value in each row of the matrix.

[0139] Furthermore, the probability matrix p is calculated from the attention weights S′ represented in the floating-point data type. The probability matrix p can in turn be quantized to obtain the probability matrix $p^{*}$ represented in INT8, as shown below:

[0140] $p^{*} = p \cdot 127 = \exp(S') \cdot 127$ Formula (7)
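Formulas (6) and (7) can be illustrated on a small INT32 matrix. The sketch below uses the exact `np.exp` rather than a bit-level approximation, and the function name, the rounding mode, and the sample values are assumptions made for illustration.

```python
import numpy as np

def dequantize_and_quantize_probs(S_int32, scale_dequant):
    """Formula (6): S' = (S - rowmax(S)) * scale_dequant; Formula (7): p* = exp(S') * 127."""
    S_shift = S_int32 - S_int32.max(axis=1, keepdims=True)   # every entry is <= 0
    S_prime = S_shift.astype(np.float32) * scale_dequant     # dequantized weights
    p = np.exp(S_prime)                                      # values in (0, 1]
    p_star = np.rint(p * 127).astype(np.int8)                # fits INT8 because p <= 1
    return S_prime, p_star

S = np.array([[10, 0], [5, 5]], dtype=np.int32)
S_prime, p_star = dequantize_and_quantize_probs(S, scale_dequant=0.1)
# Row 0: S' = [0, -1], so p* = [127, round(127 / e)] = [127, 47].
```

Because the row maximum is subtracted first, the largest entry of each row always maps to 127, which uses the full positive INT8 range.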

[0141] Furthermore, by rearranging formula (7) according to formulas (5) and (6), we can obtain:

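[0142] $p^{*} = 127 \cdot \exp(S') = \exp(\ln 127 \cdot S') = 2^{23} \cdot (\ln 127 \cdot S' \cdot \text{scale} + \text{bias})$

[0143] $= 2^{23} \cdot (\ln 127 \cdot (S - \text{rowmax}(S)) \cdot \text{scale}_{\text{dequant}} \cdot \text{scale} + \text{bias})$

[0144] $= (S - \text{rowmax}(S)) \cdot (\ln 127 \cdot \text{scale}_{\text{dequant}} \cdot 2^{23}) + (2^{23} \cdot \text{bias})$ Formula (8)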

[0145] Here, to facilitate the use of subsequent algorithms, the above formula (8) is rewritten to obtain:

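[0146] $p = \exp(S')$

[0147] $= \left[(S - \text{rowmax}(S)) \cdot (\ln 127 \cdot \text{scale}_{\text{dequant}} \cdot 2^{23}) + (2^{23} \cdot \text{bias})\right] / 127$

[0148] $= \text{bitoffset}(S \cdot \text{scale} + \text{bias})$ Formula (9)

[0149] $p^{*} = p \cdot 127$ Formula (10)

[0150] It should be noted that in Formula (9) above, "$\ln 127 \cdot \text{scale}_{\text{dequant}} \cdot 2^{23}$" and "$2^{23} \cdot \text{bias}$" are known quantities, and both scale and bias are of the INT8 data type.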

[0151] This completes the inverse quantization of the GEMM results of the query matrix Q and the key matrix K, as well as the quantization of the Softmax results into the INT data type.

[0152] (III) The algorithm of this disclosure—Low-bit quantization Attention

[0153] Specifically, the inputs to the disclosed scheme are: the query matrix Q, the value matrix V, and the key matrix K, with $Q, K, V \in \text{int8}^{N \times d}$; the number of blocks $P_r$ of Q; the number of blocks $P_c$ of K and V; and the inverse quantization parameters $\text{scale}_{\text{int}}$ and $\text{bias}_{\text{int}}$ of Q and K. Here, the query matrix Q, the value matrix V, and the key matrix K are all derived from the token sequence extracted from the target text, and all are matrices represented in the int8 data type. The output of the disclosed scheme is $O \in \text{int32}^{N \times d}$. Here, the quantization result O is the attention processing result obtained by performing attention processing based on the query matrix Q, the value matrix V, and the key matrix K, i.e., O = Attention(Q, K, V).

[0154] Furthermore, as shown in Figure 5, the algorithm steps include:

[0155] Step S501: Partition (e.g., horizontally) the query matrix Q into $P_r$ matrix blocks $Q_1, Q_2, \ldots, Q_{P_r}$, where each $Q_i$ ($i = 1, \ldots, P_r$) has dimension $B_r \times d$; partition (e.g., horizontally) the key matrix K into $P_c$ matrix blocks $K_1, K_2, \ldots, K_{P_c}$, where each $K_j$ ($j = 1, \ldots, P_c$) has dimension $B_c \times d$; and partition (e.g., horizontally) the value matrix V into $P_c$ matrix blocks $V_1, V_2, \ldots, V_{P_c}$, where each $V_j$ ($j = 1, \ldots, P_c$) has dimension $B_c \times d$.

[0156] Step S502: Load $Q_i$ ($i = 1, \ldots, P_r$) from High Bandwidth Memory (HBM) into Static Random-Access Memory (SRAM).

[0157] Step S503: Allocate three register variables: two are initialized to 0, and the third is initialized to the minimum value of int32.

[0158] Step S504: Load $K_j$ and $V_j$ ($j = 1, \ldots, P_c$) from HBM into SRAM.

[0159] Step S505: Calculate the attention weights for the current blocks, namely $Q_i K_j^{T}$.

[0160] Step S506: Calculate the row-wise maximum of the attention weights (the rowmax term used in Formula (6)).

[0161] Step S507: Based on Formula (9) and the attention weights, obtain the probability matrix, thereby completing both the inverse quantization of the attention weights and the calculation of exp (i.e., softmax), namely:

[0162]

[0163] Step S508: Quantize the result of step S507 to obtain its representation in the int8 data type, namely:

[0164]

[0165] Step S509: Calculate the row-sum term, namely:

[0166] Here, rowsum is the operator for summing the elements in each row of a matrix.

[0167] Step S510: Calculate the attention processing result based on $Q_i$, $K_j$, and $V_j$, namely:

[0168]

[0169] Here, diag(·) is the operator for constructing a diagonal matrix.

[0170] Step S511: Determine whether the current value of j is less than $P_c$. If so, set j to j+1 and return to step S504; otherwise, proceed to step S512.

[0171] Step S512: Based on the values obtained in the preceding steps, calculate $O_i$, namely:

[0172]

[0173] Then save $O_i$ to HBM.

[0174] Step S513: Determine whether the current value of i is less than $P_r$. If so, set i to i+1, initialize j to 0, and return to step S502; otherwise, proceed to step S514.

[0175] Step S514: Based on $O_1, O_2, \ldots, O_{P_r}$ stored in HBM, obtain the attention processing result based on the query matrix Q, the value matrix V, and the key matrix K, i.e., $O \in \text{int32}^{N \times d}$.
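The block-wise loop of steps S501 to S514 can be approximated by the sketch below. Several parts are assumptions rather than the disclosed algorithm: the running row-maximum and row-sum updates follow the online-softmax scheme known from Flash-Attention, `np.exp` stands in for the bit-level computation of Formula (9), the result is returned in floating point instead of int32, and all function and variable names are hypothetical.

```python
import numpy as np

def tiled_int8_attention(Qq, Kq, Vq, scale_dequant, Br=2, Bc=3):
    """Qq, Kq, Vq: int8 arrays of shape (N, d); scale_dequant: float factor applied to Qq @ Kq.T."""
    N, d = Qq.shape
    O = np.zeros((N, d), dtype=np.float64)
    for r0 in range(0, N, Br):                                # S501/S502: take block Q_i
        Qi = Qq[r0:r0 + Br].astype(np.int32)
        rows = Qi.shape[0]
        acc = np.zeros((rows, d), dtype=np.float64)           # S503: output accumulator (0)
        l = np.zeros((rows, 1), dtype=np.float64)             # S503: row-sum accumulator (0)
        m = np.full((rows, 1), np.iinfo(np.int32).min, dtype=np.int64)  # S503: int32 minimum
        for c0 in range(0, N, Bc):                            # S504: take blocks K_j, V_j
            Kj = Kq[c0:c0 + Bc].astype(np.int32)
            Vj = Vq[c0:c0 + Bc].astype(np.int32)
            S = Qi @ Kj.T                                     # S505: INT32 attention weights
            m_new = np.maximum(m, S.max(axis=1, keepdims=True))   # S506: running row maximum
            P = np.exp((S - m_new) * scale_dequant)           # S507: dequantize + exp (exact exp here)
            Pq = np.rint(P * 127).astype(np.int32)            # S508: values 0..127 fit in int8
            alpha = np.exp((m - m_new) * scale_dequant)       # rescales earlier blocks to the new maximum
            l = alpha * l + Pq.sum(axis=1, keepdims=True)     # S509: row sums
            acc = alpha * acc + Pq @ Vj                       # S510: integer matmul with V_j
            m = m_new                                         # S511: move on to the next j
        O[r0:r0 + Br] = acc / l                               # S512: normalize and store O_i
    return O                                                  # S514: assembled from all O_i

rng = np.random.default_rng(0)
Qq, Kq, Vq = (rng.integers(-127, 127, size=(6, 8), dtype=np.int8) for _ in range(3))
O = tiled_int8_attention(Qq, Kq, Vq, scale_dequant=1e-4)
```

Only one block of Q, K, and V is held at a time, so the full N×N attention matrix is never materialized; this is the property that keeps the working set small enough for on-chip memory.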

[0176] It should be noted that the above scheme can be applied to the encoder. In addition, the quantization in the full INT8 Attention of the disclosed scheme is static statistical quantization, and no dynamic quantization calculation is introduced during the inference stage, which effectively avoids performance degradation during inference. For example, as shown in Figure 6, the encoder of the target model contains multiple Transformer Blocks. Each Transformer Block can use full INT8 Attention (full 8-bit integer attention). Moreover, the processing result of full INT8 Attention in each Transformer Block can be stored in the key-value cache as a 4-bit integer (INT4) type.

[0177] In summary, this disclosed solution provides a highly efficient and innovative low-bit attention quantization inference technique. Through sophisticated optimization algorithms and data processing methods, it significantly reduces the time consumption required for model inference computation, thereby greatly improving the smoothness and user satisfaction of model usage. Simultaneously, this disclosed solution can also significantly reduce the dependence on and consumption of hardware resources (such as computing power and storage space) while ensuring model inference performance, providing strong support for the rational allocation and efficient utilization of resources.

[0178] Furthermore, the Attention described in this disclosure can be specifically Flash-Attention, and further, techniques such as tiling and recomputation of Flash-Attention can be used to further reduce memory usage.

[0179] This disclosure also provides a data processing apparatus which, as shown in Figure 7, includes:

[0180] The encoding unit 701 is used to obtain the target token sequence to be attention processed, wherein the target token sequence is the sequence obtained after feature extraction of the target text;

[0181] Attention processing unit 702 is used to obtain a first weight matrix, a second weight matrix, and a third weight matrix required for attention processing based on the target token sequence; the first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix of similarity between elements in the target token sequence; the third weight matrix represents the value matrix of the actual content of the elements in the target feature sequence; attention processing of all target data types is performed based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain the attention processing result, wherein the attention processing result can represent the correlation between different elements in the target text.

[0182] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0183] In the attention processing module of the encoder of the target model, attention processing of all target data types is performed based on the first weight matrix, the second weight matrix, and the third weight matrix; wherein, the target token sequence is the sequence obtained by the encoding module of the encoder after feature encoding of the target text.

[0184] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0185] During the attention processing, the attention matrix, represented by the target data type, is subjected to similarity processing to reduce the GPU memory usage and obtain the attention processing result.

[0186] The attention matrix, represented by the target data type, is obtained based on the first weight matrix, the second weight matrix, or the third weight matrix.

[0187] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0188] The first weight matrix, the second weight matrix, and the third weight matrix are quantized to obtain the first weight matrix, the second weight matrix, and the third weight matrix, all represented by the target data type.

[0189] Similarity processing is performed based on the first weight matrix, the second weight matrix, and the third weight matrix, all of which are represented by the target data type, in order to reduce the GPU memory resources used and obtain the attention processing results.

[0190] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0191] Convert the first weight matrix from floating-point data to the target data type;

[0192] Convert the second weight matrix from floating-point data to the target data type;

[0193] Convert the third weight matrix from floating-point data to the target data type.

[0194] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0195] The attention weights are obtained by performing a dot product between the first weight matrix, which is represented in the target data type, and the second weight matrix, which is also represented in the target data type.

[0196] The attention processing result is obtained based on the attention weights and the third weight matrix represented by the target data type.

[0197] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0198] The attention weights are dequantized to convert them into floating-point data.

[0199] The activation feature matrix is obtained based on the attention weights represented in floating-point data type and the inverse quantization parameters;

[0200] The attention processing result is obtained by performing matrix multiplication based on the activation feature matrix and the third weight matrix represented by the target data type.

[0201] In a specific example of the disclosed solution, the attention processing unit is specifically used for:

[0202] The activation feature matrix is quantized to obtain the activation feature matrix represented in the target data type;

[0203] The attention processing result is obtained by performing matrix multiplication on the activation feature matrix represented in the target data type and the third weight matrix represented in the target data type.

[0204] In a specific example of the scheme disclosed herein, the target data type is integer data.

[0205] In a specific example of the scheme disclosed herein, the target data type is an 8-bit integer INT8.

[0206] For a description of the specific functions and examples of each unit of the apparatus in this disclosure embodiment, please refer to the relevant descriptions of the corresponding steps in the above method embodiments, which will not be repeated here.

[0207] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0208] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0209] Figure 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0210] As shown in Figure 8, device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. Input / output (I / O) interface 805 is also connected to bus 804.

[0211] Multiple components in device 800 are connected to I / O interface 805, including: input unit 806, such as keyboard, mouse, etc.; output unit 807, such as various types of monitors, speakers, etc.; storage unit 808, such as disk, optical disk, etc.; and communication unit 809, such as network card, modem, wireless transceiver, etc. Communication unit 809 allows device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0212] The computing unit 801 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as data processing methods. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed on device 800 via ROM 802 and / or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform data processing methods by any other suitable means (e.g., by means of firmware).

[0213] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0214] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0215] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0216] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0217] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0218] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0219] It should be understood that the various forms of processes shown above can be used to rearrange, add, or delete steps. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0220] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A data processing method, comprising: The target token sequence to be attention processed is obtained, wherein the target token sequence is the sequence obtained after feature extraction of the target text; Based on the target token sequence, the first weight matrix, the second weight matrix, and the third weight matrix required for attention processing are obtained; the first weight matrix represents the query matrix corresponding to the elements in the target token sequence; the second weight matrix represents the key matrix of similarity between elements in the target token sequence; and the third weight matrix represents the value matrix of the actual content of the elements in the target feature sequence. Attention processing is performed on all target data types based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain the attention processing result, which can characterize the correlation between different elements in the target text.

2. The method according to claim 1, wherein, The attention processing based on the first weight matrix, the second weight matrix, and the third weight matrix for all target data types includes: In the attention processing module of the encoder of the target model, attention processing of all target data types is performed based on the first weight matrix, the second weight matrix, and the third weight matrix; wherein, the target token sequence is the sequence obtained by the encoding module of the encoder after feature encoding of the target text.

3. The method according to claim 1 or 2, wherein, The attention processing based on the first weight matrix, the second weight matrix, and the third weight matrix is performed on all target data types to obtain the attention processing result, including: During the attention processing, the attention matrix, represented by the target data type, is subjected to similarity processing to reduce the GPU memory usage and obtain the attention processing result. The attention matrix, represented by the target data type, is obtained based on the first weight matrix, the second weight matrix, or the third weight matrix.

4. The method according to claim 3, wherein performing similarity processing on the attention matrix represented in the target data type, so as to reduce GPU memory usage and obtain the attention processing result, includes: quantizing the first weight matrix, the second weight matrix, and the third weight matrix to obtain the first weight matrix, the second weight matrix, and the third weight matrix each represented in the target data type; and performing similarity processing based on the first weight matrix, the second weight matrix, and the third weight matrix each represented in the target data type, so as to reduce GPU memory usage and obtain the attention processing result.

5. The method according to claim 4, wherein quantizing the first weight matrix, the second weight matrix, and the third weight matrix to obtain the first weight matrix, the second weight matrix, and the third weight matrix each represented in the target data type includes: converting the first weight matrix from a floating-point data type to the target data type; converting the second weight matrix from a floating-point data type to the target data type; and converting the third weight matrix from a floating-point data type to the target data type.
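
For illustration only, the floating-point to target-data-type conversion of claim 5 can be sketched with INT8 as the target data type. The claim does not specify a quantization scheme; symmetric per-tensor quantization with a single scale factor is assumed here, and the names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(matrix):
    """Symmetric per-tensor quantization (an assumed scheme): float -> INT8 plus one scale."""
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in matrix]
    return q, scale


def dequantize(q, scale):
    """Map INT8 values back to approximate floating-point values."""
    return [[v * scale for v in row] for row in q]


# Hypothetical floating-point query matrix.
Q = [[0.5, -1.0], [0.25, 0.75]]
Q_int8, q_scale = quantize_int8(Q)
Q_restored = dequantize(Q_int8, q_scale)
```

The same function would be applied separately to the key and value matrices, each with its own scale. The round trip through `dequantize` loses at most half a quantization step per element.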

6. The method according to claim 4, wherein performing similarity processing based on the first weight matrix, the second weight matrix, and the third weight matrix each represented in the target data type, so as to reduce GPU memory usage and obtain the attention processing result, includes: obtaining attention weights by performing a dot product between the first weight matrix represented in the target data type and the second weight matrix represented in the target data type; and obtaining the attention processing result based on the attention weights and the third weight matrix represented in the target data type.
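
For illustration only, the dot product of claim 6 between the quantized query and key matrices can be sketched with integer arithmetic. The input values are hypothetical, already-quantized INT8 matrices, and the helper name `int_matmul_qkt` is an assumption. Python integers do not overflow; on real hardware the products of INT8 values are normally accumulated in a wider integer type such as INT32.

```python
def int_matmul_qkt(q_int8, k_int8):
    """Integer dot product of every query row with every key row (Q times K transposed)."""
    return [
        [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_int8]
        for q_row in q_int8
    ]


# Hypothetical INT8 query and key matrices (2 tokens, head dimension 2).
Q_int8 = [[127, 0], [64, -64]]
K_int8 = [[127, 127], [0, -127]]

scores_int = int_matmul_qkt(Q_int8, K_int8)
print(scores_int)  # [[16129, 0], [0, 8128]]
```

The resulting integers exceed the INT8 range, which is one reason a separate dequantization step is needed before any further floating-point processing.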

7. The method according to claim 6, wherein obtaining the attention processing result based on the attention weights and the third weight matrix represented in the target data type includes: dequantizing the attention weights to convert them into a floating-point data type; obtaining an activation feature matrix based on the attention weights represented in the floating-point data type and the dequantization parameters; and obtaining the attention processing result based on the activation feature matrix and the third weight matrix represented in the target data type.
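
For illustration only, the dequantization and activation steps of claim 7 can be sketched as follows. The claim does not name the activation function; a row-wise softmax with the usual division by the square root of the head dimension is assumed, as in standard scaled dot-product attention. The dequantization parameter is assumed to be the product of the query and key scales, and all names and values are hypothetical.

```python
import math


def dequantize_scores(scores_int, q_scale, k_scale):
    """Convert integer attention weights to floating point using the two input scales."""
    factor = q_scale * k_scale
    return [[s * factor for s in row] for row in scores_int]


def softmax_rows(scores, head_dim):
    """Row-wise softmax over scaled scores (the assumed activation)."""
    out = []
    for row in scores:
        scaled = [s / math.sqrt(head_dim) for s in row]
        row_max = max(scaled)  # subtract the maximum for numerical stability
        exps = [math.exp(s - row_max) for s in scaled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Hypothetical integer attention weights and per-tensor scales.
scores_int = [[16129, 0], [0, 8128]]
q_scale = 1.0 / 127.0
k_scale = 1.0 / 127.0

scores_fp = dequantize_scores(scores_int, q_scale, k_scale)
activation = softmax_rows(scores_fp, head_dim=2)
```

Each row of `activation` sums to one, so its entries lie in a known range, which makes the matrix straightforward to quantize again if a later step requires it.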

8. The method according to claim 7, further comprising: quantizing the activation feature matrix to obtain the activation feature matrix represented in the target data type; wherein obtaining the attention processing result by performing matrix multiplication based on the activation feature matrix and the third weight matrix represented in the target data type includes: obtaining the attention processing result by performing matrix multiplication on the activation feature matrix represented in the target data type and the third weight matrix represented in the target data type.
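
For illustration only, the final step of claim 8 can be sketched as an integer matrix multiplication between the quantized activation feature matrix and the quantized value matrix. Symmetric per-tensor INT8 quantization is assumed, and the rescaling of the integer result back to floating point is an added assumption that the claim does not recite. All names and values are hypothetical.

```python
def quantize_int8(matrix):
    """Symmetric per-tensor quantization (an assumed scheme): float -> INT8 plus one scale."""
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in matrix]
    return q, scale


def int_matmul(a, b):
    """Integer matrix multiplication on lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Hypothetical activation feature matrix (rows sum to 1) and value matrix.
activation = [[0.75, 0.25], [0.5, 0.5]]
V = [[1.0, -2.0], [3.0, 4.0]]

p_int8, p_scale = quantize_int8(activation)
v_int8, v_scale = quantize_int8(V)

acc = int_matmul(p_int8, v_int8)  # integer accumulators
# Assumed post-processing: rescale the accumulators to floating point.
output = [[v * p_scale * v_scale for v in row] for row in acc]
```

With these inputs the exact floating-point product is `[[1.5, -0.5], [2.0, 1.0]]`, and `output` approximates it to within the quantization error of the two inputs.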

9. The method according to any one of claims 1-8, wherein the target data type is an integer type.

10. The method according to claim 9, wherein the target data type is the 8-bit integer type INT8.

11. A data processing apparatus, comprising: an encoding unit configured to obtain a target token sequence to be subjected to attention processing, wherein the target token sequence is the sequence obtained by performing feature extraction on a target text; and an attention processing unit configured to obtain, based on the target token sequence, a first weight matrix, a second weight matrix, and a third weight matrix required for the attention processing, wherein the first weight matrix represents the query matrix corresponding to the elements in the target token sequence, the second weight matrix represents the key matrix representing the similarity between elements in the target token sequence, and the third weight matrix represents the value matrix representing the actual content of the elements in the target feature sequence; and to perform attention processing entirely in a target data type based on the first weight matrix, the second weight matrix, and the third weight matrix to obtain an attention processing result, wherein the attention processing result can represent the correlation between different elements in the target text.

12. The apparatus according to claim 11, wherein the attention processing unit is specifically configured to: perform, in the attention processing module of the encoder of a target model, the attention processing entirely in the target data type based on the first weight matrix, the second weight matrix, and the third weight matrix; wherein the target token sequence is the sequence obtained by the encoding module of the encoder after feature encoding of the target text.

13. The apparatus according to claim 11 or 12, wherein the attention processing unit is specifically configured to: during the attention processing, perform similarity processing on an attention matrix represented in the target data type, so as to reduce GPU memory usage and obtain the attention processing result, wherein the attention matrix represented in the target data type is obtained based on the first weight matrix, the second weight matrix, or the third weight matrix.

14. The apparatus according to claim 13, wherein the attention processing unit is specifically configured to: quantize the first weight matrix, the second weight matrix, and the third weight matrix to obtain the first weight matrix, the second weight matrix, and the third weight matrix each represented in the target data type; and perform similarity processing based on the first weight matrix, the second weight matrix, and the third weight matrix each represented in the target data type, so as to reduce GPU memory usage and obtain the attention processing result.

15. The apparatus according to claim 14, wherein the attention processing unit is specifically configured to: convert the first weight matrix from a floating-point data type to the target data type; convert the second weight matrix from a floating-point data type to the target data type; and convert the third weight matrix from a floating-point data type to the target data type.

16. The apparatus according to claim 14, wherein the attention processing unit is specifically configured to: obtain attention weights by performing a dot product between the first weight matrix represented in the target data type and the second weight matrix represented in the target data type; and obtain the attention processing result based on the attention weights and the third weight matrix represented in the target data type.

17. The apparatus according to claim 16, wherein the attention processing unit is specifically configured to: dequantize the attention weights to convert them into a floating-point data type; obtain an activation feature matrix based on the attention weights represented in the floating-point data type and the dequantization parameters; and obtain the attention processing result by performing matrix multiplication based on the activation feature matrix and the third weight matrix represented in the target data type.

18. The apparatus according to claim 17, wherein the attention processing unit is specifically configured to: quantize the activation feature matrix to obtain the activation feature matrix represented in the target data type; and obtain the attention processing result by performing matrix multiplication on the activation feature matrix represented in the target data type and the third weight matrix represented in the target data type.

19. The apparatus according to any one of claims 11-18, wherein the target data type is an integer type.

20. The apparatus according to claim 19, wherein the target data type is the 8-bit integer type INT8.

21. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-10.

22. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.

23. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-10.