Quantization-based data processing method and apparatus, and device and medium
The parameters of a Mixture of Experts (MoE) model are quantized using multiple quantization bit methods, and the best-performing method is selected for quantization. This addresses the low accuracy of quantization processing and thereby improves the stability and performance of the model.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2025-07-07
- Publication Date
- 2026-06-18
Abstract
Description
Quantization-based data processing methods, devices, equipment, and media
[0001] This application claims priority to Chinese Patent Application No. 202411010046.1, filed on July 26, 2024, entitled “Data Processing Method, Apparatus, Device and Medium Based on Quantization”, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of computer technology, and more specifically, to a quantization-based data processing method, a quantization-based data processing device, an electronic device, and a computer-readable medium.
Background Technology
[0003] In related technologies, two schemes are typically used to determine the quantization bit scheme for different functional networks in a Mixture of Experts (MoE) model. In Scheme 1, the user selects the quantization bit scheme for each functional network based on human experience. In Scheme 2, the quantization bit scheme is selected based on the task type of the functional network. For example, a higher bit quantization bit scheme (e.g., fp16 or int8) is selected for functional networks requiring higher accuracy for their task type, while a lower bit quantization bit scheme (e.g., int4 or int2) is selected for functional networks requiring lower accuracy for their task type.
[0004] In Scheme 1, insufficient human experience may lead to an incorrect choice of quantization bit scheme for a functional network, reducing the accuracy of quantization and thus degrading model performance. In Scheme 2, the sensitivity of a functional network can vary even for the same task type, because of different models or different training stages. Selecting the quantization bit scheme based on the task type of the functional network is therefore relatively rigid and can also lead to an incorrect choice, reducing the accuracy of quantization and thus degrading model performance.
[0005] Therefore, improving the accuracy of quantization to enhance model performance is an urgent problem to be solved.
Summary of the Invention
[0006] The embodiments of this application provide a quantization-based data processing method, apparatus, device, and medium, which improves the accuracy of quantization processing and improves model performance.
[0007] In a first aspect, embodiments of this application provide a quantization-based data processing method, the method comprising: quantizing the original model parameters corresponding to a Model Functional Network (MFN) using multiple quantization bit methods to obtain quantized model parameters corresponding to each quantization bit method; obtaining a first output result of the MFN under the original model parameters, and obtaining a second output result of the MFN under each quantized model parameter; and selecting a target quantization bit method from the multiple quantization bit methods based on the first output result and the multiple second output results, wherein the target quantization bit method is used to quantize the original model parameters during the inference phase of the MFN.
[0008] In a second aspect, embodiments of this application provide a quantization-based data processing apparatus, the apparatus comprising: a quantization processing module configured to quantize the original model parameters corresponding to a Model Functional Network (MFN) using multiple quantization bit methods to obtain quantized model parameters corresponding to each quantization bit method; an acquisition module configured to acquire a first output result of the MFN under the original model parameters, and to acquire a second output result of the MFN under each quantized model parameter; and a selection module configured to select a target quantization bit method from the multiple quantization bit methods based on the first output result and the multiple second output results, wherein the target quantization bit method is used to quantize the original model parameters during the inference phase of the MFN.
[0009] In a third aspect, embodiments of this application provide an electronic device, including one or more processors; and a memory for storing one or more computer programs, which, when executed by the one or more processors, cause the electronic device to implement the quantization-based data processing method described above.
[0010] In a fourth aspect, embodiments of this application provide a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the quantization-based data processing method described above.
[0011] In a fifth aspect, embodiments of this application provide a computer program product, including computer instructions, which, when executed by a processor, implement the quantization-based data processing method described above.
[0012] In the technical solution provided by the embodiments of this application, the original model parameters corresponding to the model are quantized using multiple quantization bit methods to obtain quantized model parameters corresponding to each quantization bit method. The target quantization bit method is selected from the multiple quantization bit methods by utilizing the first output result of the Model Functional Network (MFN) under the original model parameters and the second output result of the MFN under each quantized model parameter. This achieves accurate selection of the quantization bit method for the MFN, resulting in high accuracy in quantization processing. High accuracy in quantization processing helps improve the model's generalization ability beyond training data, enabling it to better adapt to new data distributions and scenarios. It also enhances the model's stability and robustness, maintaining good performance under complex conditions. Furthermore, it allows for faster inference, reduces unnecessary resource consumption, and improves resource utilization. Therefore, it greatly improves model performance, enabling tasks in various application scenarios to be executed effectively through the model.
[0013] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.
[0014] Brief description of the attached figures
[0015] Figure 1 is a schematic diagram of an exemplary implementation environment in which the technical solutions of the embodiments of this application can be applied.
[0016] Figure 2 is a flowchart illustrating a quantization-based data processing method in an exemplary embodiment of this application.
[0017] Figure 3a is a schematic diagram of a data channel including channel values in an exemplary embodiment of this application.
[0018] Figure 3b is a schematic diagram of a data channel including channel values in another exemplary embodiment of this application.
[0019] Figure 4 is a flowchart illustrating a quantization-based data processing method in another exemplary embodiment of this application.
[0020] Figure 5 is a schematic diagram illustrating the generation of a target quantization scaling factor in an exemplary embodiment of this application.
[0021] Figure 6 is a flowchart illustrating a quantization-based data processing method in another exemplary embodiment of this application.
[0022] Figure 7 is a schematic diagram illustrating a method for generating target quantized bits in an exemplary embodiment of this application.
[0023] Figure 8 is a block diagram illustrating a quantization-based data processing apparatus according to an exemplary embodiment of this application.
[0024] Figure 9 is a schematic diagram of the structure of a computer system suitable for implementing the electronic device of the present application.
Detailed Implementation
[0025] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0026] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.
[0027] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0028] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.
[0029] It should be noted that "multiple" in the embodiments of this application refers to two or more. "And / or" describes the relationship between associated objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following associated objects have an "or" relationship.
[0030] Before introducing the technical solutions of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained first. The nouns and terms involved in the embodiments of this application are subject to the following interpretations.
[0031] A Mixture of Experts (MoE) model is a neural network architecture for solving complex tasks. It divides an artificial intelligence (AI) model into separate sub-networks (or expert networks), each specializing in processing a subset of the input data to collectively perform the task. A MoE model comprises multiple expert networks and a gating network. Each expert network is responsible for handling a different aspect or sub-task of the input data; for example, in image processing, one expert network might handle edge detection while another handles texture recognition. The gating network dynamically selects one of the multiple expert networks to process the current input data. By integrating gating networks and multiple expert networks, MoE models effectively improve the model's ability to handle complex tasks and enhance its performance.
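The routing described above can be illustrated with a minimal sketch. It assumes a top-1 gate that scores each expert with a dot product and forwards the input to the highest-scoring expert; the names `gate`, `moe_forward`, `gate_weights`, and the two toy experts are hypothetical and are not taken from this application.

```python
def gate(x, gate_weights):
    """Return the index of the expert with the highest gating score (top-1 routing)."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def moe_forward(x, gate_weights, experts):
    """Route the input to the selected expert and return (expert index, expert output)."""
    idx = gate(x, gate_weights)
    return idx, experts[idx](x)


# Two toy experts (assumed for illustration only).
experts = [
    lambda x: [2.0 * v for v in x],   # expert 0 doubles the input
    lambda x: [-v for v in x],        # expert 1 negates the input
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # one gating row per expert

chosen, output = moe_forward([0.2, 0.9], gate_weights, experts)
```

Here the second gating score is larger, so the input is handled by expert 1 only; the other expert does no work for this input.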
[0032] Hybrid bit quantization refers to the mixed processing of analog and digital signals. By segmenting the input signal into multiple levels and compressing each segment using different quantization methods, a digital signal composed of binary codes of varying bit lengths is obtained. Hybrid bit quantization can effectively improve data compression ratio and signal quality. It is also a method for optimizing model inference, applicable to Mixture of Experts models. It allows different quantization precisions to be used within the same model to balance computational efficiency and accuracy requirements. Specifically, during model inference, different expert networks (also called functional networks) can employ different quantization precisions. For example, some functional networks may use int8 (an 8-bit integer representation), some int4 (a 4-bit integer representation), some int2 (a 2-bit integer representation), and some fp16 (a 16-bit floating-point representation, also known as half precision), etc.
[0033] In related technologies, the two schemes described above are used to select quantization bit methods for functional networks, but the accuracy of quantization processing is not high, which affects model performance. Therefore, in order to improve the accuracy of quantization processing and thus improve model performance, this application provides a quantization-based data processing scheme. Please refer to Figure 1, which is a schematic diagram of an implementation environment provided by this application embodiment. The implementation environment includes a terminal device 101 and a server 102.
[0034] Terminal devices 101 include, but are not limited to, smartphones, computers (tablets, laptops, desktop computers, etc.), smart home devices (televisions, refrigerators, air conditioners, washing machines, robot vacuums, etc.), and smart wearable devices (wristbands, watches, etc.).
[0035] Server 102 can be a standalone physical server, or a server cluster or distributed system consisting of multiple physical servers. The server cluster or distributed system includes cloud servers used to provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
[0036] In this embodiment, a communication connection is established between the terminal device 101 and the server 102 via a wired or wireless network. Exemplarily, the wired or wireless network uses standard communication technologies and / or protocols. The network is typically the Internet, but can also be any other network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, a virtual private network, or any combination thereof.
[0037] In the embodiments of this application, the quantization-based data processing method can be executed by server 102. Specifically, server 102 quantizes the original model parameters corresponding to the Model Functional Network (MFN) using multiple quantization bit methods to obtain quantized model parameters corresponding to each quantization bit method; then, it obtains the first output result of the MFN under the original model parameters and the second output result of the MFN under each quantized model parameter; then, based on the first output result and multiple second output results, it selects a target quantization bit method from the multiple quantization bit methods. The target quantization bit method is used to quantize the original model parameters during the inference phase of the MFN.
[0038] In the embodiments of this application, the quantization-based data processing method can be executed by the terminal device 101 alone or by the terminal device 101 and the server 102 together. In practical applications, the executing entity of the quantization-based data processing method can be flexibly adjusted according to the specific application scenario.
[0039] The number of terminal devices 101 and servers 102 shown in Figure 1 is merely illustrative; any number of terminal devices 101 and servers 102 can be used as needed.
[0040] It should be noted that in the specific implementation of this application, user-related data is involved. When the embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0041] The following describes in detail the various implementation details of the technical solutions in the embodiments of this application.
[0042] Please refer to Figure 2, which is a flowchart illustrating a quantization-based data processing method according to an embodiment of this application. As shown in Figure 2, the quantization-based data processing method includes at least steps S201 to S203, which are described in detail below.
[0043] S201, quantize the original model parameters corresponding to the Model Functional Network using multiple quantization bit methods to obtain the quantized model parameters corresponding to each quantization bit method.
[0044] In the embodiments of this application, the quantization bit method includes, but is not limited to, at least one of the following: integer representation of arbitrary bits, floating-point representation of arbitrary bits, etc., wherein arbitrary bits can be 2, 3, 4, 6, 8, 16, 32, 64, 128, etc.
[0045] In this embodiment, the original model parameters refer to the model parameters of a functional network of the Mixture of Experts model (hereinafter referred to as the model functional network) before quantization with a quantization bit method. Correspondingly, the quantized model parameters refer to the model parameters of the model functional network after quantization with a quantization bit method. It is understood that model parameters include, but are not limited to, at least one of weight values, activation input values, and key-value cache values. In this embodiment, taking weight values as an example of model parameters, the original model parameters are the original weight values, the quantized model parameters are the quantized weight values, and so on.
[0046] In this embodiment, multiple quantization bit methods are used to quantize the original model parameters respectively, thereby obtaining quantized model parameters corresponding to the multiple quantization bit methods.
[0047] For example, suppose the quantization bit methods are int2, int4, and int8, w0 denotes the original weight value, and winti denotes the quantized weight value corresponding to the quantization bit method inti. Then the quantized weight value corresponding to int2 is wint2 = int2(w0), the quantized weight value corresponding to int4 is wint4 = int4(w0), and the quantized weight value corresponding to int8 is wint8 = int8(w0).
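As a minimal sketch of this step, the snippet below produces one quantized copy of the weights per quantization bit method. The application does not specify how int2(·), int4(·), or int8(·) are computed; the symmetric, per-tensor quantize-then-dequantize rule used here, and the names `fake_quantize` and `candidates`, are assumptions made for illustration.

```python
def fake_quantize(weights, bits):
    """Symmetric uniform quantize-dequantize of a flat list of floats (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 1 for int2, 7 for int4, 127 for int8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = []
    for w in weights:
        q = round(w / scale)                        # nearest integer level
        q = max(-qmax - 1, min(qmax, q))            # clamp to the signed integer range
        quantized.append(q * scale)                 # map back to the float domain
    return quantized


w0 = [0.9, -0.35, 0.1, -0.02]                       # toy original weight values
candidates = {f"int{bits}": fake_quantize(w0, bits) for bits in (2, 4, 8)}
```

Each entry of `candidates` plays the role of wint2, wint4, or wint8 above. Lower bit widths leave fewer representable levels, so their quantized weights deviate more from w0.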
[0048] S202, obtain the first output result of the Model Functional Network under the original model parameters, and obtain the second output result of the Model Functional Network under each quantized model parameter.
[0049] In this embodiment of the application, after the quantized model parameters corresponding to each quantization bit method are obtained, the first output result of the model functional network under the original model parameters and the second output result of the model functional network under each quantized model parameter can be obtained.
[0050] In this embodiment of the application, the first output result refers to the result output by the model functional network under the original model parameters, and the second output result refers to the result output by the model functional network under the quantized model parameters.
[0051] In the embodiments of this application, the original model parameters include original weight values, and the quantized model parameters include quantized weight values; correspondingly, the process of obtaining the first output result of the model functional network under the original model parameters and the second output result of the model functional network under each quantized model parameter in S202 may include:
[0052] Obtain the activation input values of the model's functional network;
[0053] The activation input value and the original weight value are multiplied to obtain the first output of the model functional network, and the activation input value and each quantized weight value are multiplied to obtain multiple second outputs of the model functional network.
[0054] For example, continuing from the previous example, let a represent the activation input value, out0 represent the first output result, and outinti represent the second output result corresponding to the quantization bit method inti. Then the first output result is out0 = a × w0, the second output result corresponding to int2 is outint2 = a × wint2, the second output result corresponding to int4 is outint4 = a × wint4, and the second output result corresponding to int8 is outint8 = a × wint8.
[0055] Through this example, the first output result and multiple second output results can be obtained easily and accurately.
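The multiplications in the example above can be sketched as plain matrix products. The helper `matmul`, the toy shapes, and the values in `w_quant` are hypothetical; they only illustrate that the same activation input is multiplied once by the original weights and once by each set of quantized weights.

```python
def matmul(a, w):
    """Multiply an [N, M] matrix by an [M, K] matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in a]


a = [[1.0, 2.0], [0.5, -1.0]]                  # activation input, shape [N=2, M=2]
w0 = [[0.9, -0.35], [0.1, -0.02]]              # original weights, shape [M=2, K=2]
w_quant = {                                    # assumed quantized weights per method
    "int2": [[0.9, 0.0], [0.0, 0.0]],
    "int8": [[0.9, -0.35], [0.1, -0.02]],
}

out0 = matmul(a, w0)                                           # first output result
out_quant = {name: matmul(a, w) for name, w in w_quant.items()}  # second output results
```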
[0056] In the embodiments of this application, the Model Functional Network corresponds to multiple data channels, and each data channel corresponds to one or more channel values. In specific implementations, each data channel typically corresponds to multiple channel values, for example, at least three. Channel values refer to key parameters used to distinguish different types of channel data when reading channel data.
[0057] In this embodiment, the process of obtaining the activation input values of the model functional network may include:
[0058] Detect whether there are outlier channel values in each of the multiple data channels; where an outlier channel value refers to at least one channel value that differs significantly from the other channel values, for example, the difference between the channel values and the other channel values is greater than a preset threshold.
[0059] If an outlier channel value is detected, all detected outlier channel values are compared, and the data channel corresponding to the largest outlier channel value is selected as the target data channel. The channel value corresponding to the target data channel is then used as the activation input value of the Model Functional Network (MFN), and the original weight value and quantized weight value corresponding to the target data channel are obtained. For example, continuing from the previous example, let a represent the activation input value, where a is a matrix of shape [N, M], N is a parameter related to the input of the MFN, and M is a preset size of the hidden layer of the MFN. In this embodiment, N and M are integers greater than 1. N can be the length of the text sequence input to the MFN, determined based on the channel value corresponding to the target data channel. For example, if the text sequence length is 4096 characters, then N is equal to 4096. M can be a preset value, such as 2048, 4096, etc.
[0060] Accordingly, the process of multiplying the activation input value and the original weight value to obtain the first output result of the model functional network, and multiplying the activation input value and each quantized weight value to obtain multiple second output results of the model functional network, may include:
[0061] The first output of the model functional network is obtained by multiplying the channel value corresponding to the target data channel with the original weight value, and multiple second outputs of the model functional network are obtained by multiplying the channel value corresponding to the target data channel with each quantized weight value.
[0062] In this embodiment, outlier channel values are determined per data channel. For example, please refer to Figure 3a, which shows data channel c1 with multiple channel values. As shown in Figure 3a, a-c1-301 and a-c1-302 are both outlier channel values. Please refer to Figure 3b, which shows data channel c2 with multiple channel values. As shown in Figure 3b, a-c2-301, a-c2-302, and a-c2-303 are all outlier channel values.
[0063] In this embodiment, the largest channel value is selected from the multiple outlier channel values, and the data channel corresponding to that largest channel value is taken as the target data channel. For example, continuing from the previous example, assuming that a-c1-301 is the largest among a-c1-301, a-c1-302, a-c2-301, a-c2-302, and a-c2-303, then data channel c1 is the target data channel.
[0064] In this embodiment, there may be only one outlier channel value, and the data channel corresponding to that outlier channel value is the target data channel.
[0065] In this embodiment, the channel value corresponding to the target data channel is used as the activation input value of the model functional network; for example, continuing from the previous example, as shown in Figure 3a, all the channel values included in data channel c1 are the activation input values of the model functional network.
[0066] In this embodiment, the channel value corresponding to the target data channel and the original weight value corresponding to the target data channel are multiplied to obtain the first output result of the target data channel. At this time, the first output result of the target data channel is the first output result of the model functional network. In addition, the channel value corresponding to the target data channel and each quantization weight value corresponding to the target data channel are multiplied to obtain multiple second output results of the target data channel. At this time, the multiple second output results of the target data channel are the multiple second output results of the model functional network.
[0067] For example, taking the aforementioned data channel c1 as the target data channel, let a-c1 represent the multiple channel values corresponding to the target data channel c1 (i.e., the multiple points in Figure 3a). If the size of data channel c1 is 2048, then c1 has 2048 channel values, and a-c1 is a matrix of shape [N, M] determined according to the size of data channel c1, where N is 2048 and M is the preset size of the hidden layer of the model functional network. Therefore, the first output result of the target data channel c1 is out0-c1 = (a-c1) × w0; the second output result using int2 under the target data channel c1 is outint2-c1 = (a-c1) × wint2; the second output result using int4 under the target data channel c1 is outint4-c1 = (a-c1) × wint4; and the second output result using int8 under the target data channel c1 is outint8-c1 = (a-c1) × wint8.
[0068] Thus, in this embodiment, because the largest outlier channel value reflects the sensitivity of the model functional network to a certain extent, using the data channel that contains the largest outlier channel value as the target data channel makes the calculated first output result and multiple second output results more accurate.
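The detection and selection steps in paragraphs [0058] to [0064] can be sketched as follows. The application only says that an outlier differs from the other channel values by more than a preset threshold; measuring that difference against the channel median is an assumption of this sketch, as are the names `find_outliers` and `select_target_channel` and the toy channel values.

```python
from statistics import median


def find_outliers(values, threshold):
    """Return channel values whose distance from the channel median exceeds the threshold."""
    center = median(values)
    return [v for v in values if abs(v - center) > threshold]


def select_target_channel(channels, threshold):
    """Return the name of the channel holding the largest outlier value, or None if there is none."""
    best_name, best_value = None, None
    for name, values in channels.items():
        for v in find_outliers(values, threshold):
            if best_value is None or v > best_value:
                best_name, best_value = name, v
    return best_name


channels = {
    "c1": [0.1, 0.2, 9.0, 0.15, 7.5],   # two outliers: 9.0 and 7.5
    "c2": [0.3, 6.0, 0.25, 0.2, 0.35],  # one outlier: 6.0
}
target = select_target_channel(channels, threshold=3.0)
```

With these values the largest outlier is 9.0, so c1 becomes the target data channel and all of its channel values serve as the activation input. Comparing by raw value rather than by magnitude is also an assumption.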
[0069] In embodiments of this application, the process of using the channel value corresponding to the target data channel as the activation input value of the model functional network may include:
[0070] Calculate the similarity between the target data channel and its adjacent data channels;
[0071] The outlier channel values in the target data channel are reduced based on similarity to obtain the target channel value;
[0072] The target channel value and the channel values in the target data channel, excluding outlier channel values, are used as the activation input values for the model's functional network.
[0073] That is, in the embodiment, the similarity between the target data channel and the adjacent data channels corresponding to the target data channel is used to reduce the outlier channel values in the target data channel, thereby obtaining the target channel value. At this time, the target channel value and the channel values in the target data channel other than the outlier channel values (i.e., other channel values) are the activation input values of the model functional network.
[0074] For example, taking the aforementioned data channel c1 as the target data channel, let c1' represent the adjacent data channels of the target data channel c1. First, calculate the similarity s between the target data channel c1 and the adjacent data channel c1'. Then, use the similarity s to reduce the outlier channel values a-c1-301 and a-c1-302 in the target data channel c1, respectively, to obtain the target channel value a-c1-301' corresponding to the outlier channel value a-c1-301, and the target channel value a-c1-302' corresponding to the outlier channel value a-c1-302. At this time, the target channel values a-c1-301', the target channel values a-c1-302', and the channel values in a-c1 corresponding to the target data channel c1 other than the outlier channel values a-c1-301 and a-c1-302 are the activation input values of the model functional network.
[0075] In this way, through the implementation example, the outlier channel values in the target data channel are reduced to represent the target data channel with smaller values, thereby obtaining the activation input values of the model functional network. This helps to select a lower precision quantization bit method for the model functional network in the future.
[0076] In an embodiment of the present application, the process of calculating the similarity between a target data channel and an adjacent data channel corresponding to the target data channel may include:
[0077] Performing a distance calculation based on the channel values of the target data channel and the channel values of the adjacent data channels corresponding to the target data channel to obtain a distance value, where the distance value is greater than 0 and less than 1, and using the distance value as the similarity between the target data channel and the adjacent data channel;
[0078] Correspondingly, performing a reduction process on the outlier channel values in the target data channel based on the similarity to obtain target channel values, including:
[0079] Performing a multiplication operation on the distance value and the outlier channel values in the target data channel to obtain the target channel values.
[0080] That is, in the embodiment, the channel values of the adjacent data channel corresponding to the target data channel are determined first. Usually, the adjacent data channel has multiple channel values. A distance calculation is then performed between each channel value of the target data channel and the channel values of the adjacent data channel to obtain distance values, and all the obtained distance values are averaged to obtain an average distance value. This average distance value is the similarity between the target data channel and the adjacent data channel. After that, the average distance value is used to reduce the outlier channel values in the target data channel, thereby obtaining the target channel values.
[0081] In the embodiment, a distance algorithm is used to calculate the distance value between the target data channel and the adjacent data channel. The distance algorithm includes at least one of, but is not limited to, cosine similarity, Euclidean distance, Manhattan distance, Chebyshev distance, and Minkowski distance, etc.
[0082] For example, continuing with the previous example, assuming the distance algorithm is cosine similarity, the distance value between the target data channel c1 and the adjacent data channel c1' is cosθ = (c1 · c1') / (‖c1‖ × ‖c1'‖), where c1 and c1' are treated as vectors of channel values.
[0083] In the embodiment, after obtaining the distance value between the target data channel and the adjacent data channel, a multiplication operation can be performed on the distance value and the outlier channel values in the target data channel, thereby obtaining the target channel values.
[0084] For example, continuing with the previous example, the outlier channel values are a-c1-301 and a-c1-302; then the target channel value a-c1-301' = cosθ × (a-c1-301), and the target channel value a-c1-302' = cosθ × (a-c1-302). It can be understood that since 0 < cosθ < 1, the target channel value a-c1-301' is less than a-c1-301, and the target channel value a-c1-302' is less than a-c1-302.
[0085] This example demonstrates how to easily and accurately obtain the target channel value.
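The reduction step can be illustrated with a minimal sketch, assuming cosine similarity is the chosen distance algorithm and each data channel is a flat list of floats. The function names, the sample values, and the choice of which index counts as an outlier are illustrative assumptions, not taken from the application.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def shrink_outliers(target, neighbours, outlier_idx):
    """Scale the outlier values of `target` by its average similarity to `neighbours`."""
    sims = [cosine_similarity(target, n) for n in neighbours]
    similarity = sum(sims) / len(sims)  # average distance value
    reduced = list(target)
    for i in outlier_idx:
        reduced[i] = similarity * target[i]  # multiply outlier by the similarity
    return similarity, reduced

# Hypothetical channel values; index 2 of c1 is treated as the outlier.
c1 = [1.0, 2.0, 40.0, 3.0]
adjacent = [[1.0, 2.0, 3.0, 3.0], [2.0, 1.0, 4.0, 2.0]]
similarity, c1_reduced = shrink_outliers(c1, adjacent, outlier_idx=[2])
```

Because the similarity here lies strictly between 0 and 1, the outlier is pulled toward the range of the other channel values while non-outlier values are left unchanged.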
[0086] S203, based on the first output result and multiple second output results, select a target quantization bit method from multiple quantization bit methods. The target quantization bit method is used to quantize the original model parameters during the inference phase of the model functional network.
[0087] In this embodiment of the application, after obtaining the first output result and multiple second output results, the first output result and multiple second output results can be used to select a target quantization bit method from multiple quantization bit methods, wherein the target quantization bit method is used to quantize the original model parameters during the inference stage of the model functional network.
[0088] In embodiments of this application, the process in S203 of selecting a target quantization bit method from multiple quantization bit methods based on the first output result and the multiple second output results may include:
[0089] Calculating the difference value between the first output result and each second output result;
[0090] Selecting, from the multiple quantization bit methods, the quantization bit method corresponding to the smallest difference value, and using the selected quantization bit method as the target quantization bit method.
[0091] In other words, in this embodiment, the difference between the first output result and each second output result is first calculated. Then, the quantization bit method corresponding to the smallest difference value is selected from multiple quantization bit methods. The selected quantization bit method is the target quantization bit method. It can be understood that the smaller the difference value, the smaller the impact of the quantization bit method on the output result. Therefore, in this embodiment, the quantization bit method corresponding to the smallest difference value is used as the target quantization bit method.
[0092] For example, continuing from the previous example, suppose the difference value between the first output result out0 and the second output result outint2 is Lout0-outint2, the difference value between out0 and the second output result outint4 is Lout0-outint4, and the difference value between out0 and the second output result outint8 is Lout0-outint8. If Lout0-outint2 is the smallest of the three, then int2, which corresponds to Lout0-outint2, is selected from the quantization bit methods int2, int4, and int8 as the target quantization bit method.
[0093] Thus, through this embodiment, the target quantization bit method can be easily and accurately selected from multiple quantization bit methods by calculating the difference values.
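As a minimal sketch of this selection, assume the difference values have already been computed and are held in a dictionary keyed by quantization bit method. The numeric values below are made up for illustration and merely mirror the example in which int2 yields the smallest difference value.

```python
def select_bit_method(differences):
    """Return the quantization bit method with the smallest difference value."""
    return min(differences, key=differences.get)

# Hypothetical difference values between out0 and each quantized output.
differences = {"int2": 0.0004, "int4": 0.0007, "int8": 0.0012}
target_bit_method = select_bit_method(differences)
```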
[0094] In embodiments of this application, the process of calculating the difference between the first output result and each second output result may include:
[0095] For each second output result, the difference between the first and second output results is calculated, and the difference is then raised to a specified power to obtain the difference value between the first and second output results.
[0096] In this embodiment, the specified power can be the second power (i.e., squaring), in which case the difference value between the first output result and each second output result is calculated using the Mean Squared Error (MSE) algorithm.
[0097] For example, continuing from the previous example, $L_{\text{out0-outint2}} = (\text{out0} - \text{outint2})^2$, $L_{\text{out0-outint4}} = (\text{out0} - \text{outint4})^2$, and $L_{\text{out0-outint8}} = (\text{out0} - \text{outint8})^2$.
[0098] In this way, through the implementation examples, the difference between the first output result and each second output result can be obtained easily and accurately.
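A minimal sketch of the power-based difference value follows, assuming the output results are flat lists of floats and that the raised differences are averaged over all elements (the averaging is an assumption consistent with MSE; the application only states the squared difference). The sample outputs are illustrative.

```python
def mse_difference(first, second, power=2):
    """Mean of the element-wise differences raised to `power` (MSE when power == 2)."""
    return sum((a - b) ** power for a, b in zip(first, second)) / len(first)

# Hypothetical output results of one model functional network.
out0 = [1.0, 2.0, 3.0]      # original parameters
outint2 = [1.0, 2.5, 2.0]   # int2-quantized parameters
l_int2 = mse_difference(out0, outint2)
```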
[0099] In embodiments of this application, the process of calculating the difference between the first output result and each second output result may include:
[0100] For each second output result, a signal power value is calculated based on the second output result, a noise power value is calculated based on the first output result and the second output result, a signal-to-noise ratio is calculated based on the signal power value and the noise power value, and a difference value between the first output result and the second output result is calculated based on the signal-to-noise ratio.
[0101] That is, in the embodiment, the difference between the first output result and each second output result is calculated using the signal-to-noise ratio (SNR) algorithm.
[0102] For example, continuing from the previous example, take the calculation of the difference value Lout0-outint2 between the first output result out0 and the second output result outint2. Let Psignal denote the signal power value, which is calculated from the second output result outint2. Let Pnoise denote the noise power value, which is calculated from the first output result out0 and the second output result outint2. Let SNR denote the signal-to-noise ratio, which is calculated from Psignal and Pnoise. Correspondingly, Lout0-outint2 = -SNR. The other difference values follow the same pattern and are not elaborated here.
[0103] In this way, through the implementation examples, the difference between the first output result and each second output result can be obtained easily and accurately.
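The SNR-based difference value can be sketched as follows. The exact formulas are not reproduced in the text, so this sketch assumes a common convention: signal power is the mean square of the second output result, noise power is the mean square of the error between the two output results, and the ratio is expressed in decibels. All three choices, and the function name, are assumptions.

```python
import math

def snr_difference(first, second):
    """Difference value defined as -SNR (assumed dB convention)."""
    p_signal = sum(q * q for q in second) / len(second)                   # assumed Psignal
    p_noise = sum((a - q) ** 2 for a, q in zip(first, second)) / len(first)  # assumed Pnoise
    if p_noise == 0:
        return float("-inf")  # identical outputs: unbounded SNR
    if p_signal == 0:
        return float("inf")   # no signal power: SNR unbounded below
    snr = 10 * math.log10(p_signal / p_noise)
    return -snr

out0 = [1.0, 2.0, 3.0]
close_output = [1.0, 2.0, 2.9]  # hypothetical, small quantization error
far_output = [1.0, 2.0, 2.0]    # hypothetical, larger quantization error
l_close = snr_difference(out0, close_output)
l_far = snr_difference(out0, far_output)
```

Negating the SNR keeps the convention used elsewhere in the text: a smaller difference value means the quantized output is closer to the original output.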
[0104] In embodiments of this application, the process of selecting, from multiple quantization bit methods, the quantization bit method corresponding to the smallest difference value may include:
[0105] Selecting, from the multiple difference values, the difference values that are less than a preset difference threshold;
[0106] If multiple difference values are selected, selecting, from the multiple quantization bit methods, the quantization bit method corresponding to the smallest of the selected difference values.
[0107] That is, in the embodiment, the difference values less than a preset difference threshold are first selected from the multiple difference values as candidate difference values, and when there are multiple candidate difference values, the quantization bit method corresponding to the smallest candidate difference value is selected from the multiple quantization bit methods.
[0108] For example, continuing from the previous example, suppose the preset difference threshold is 0.001, and Lout0-outint2 and Lout0-outint4 are both less than 0.001 among Lout0-outint2, Lout0-outint4, and Lout0-outint8. In this case, Lout0-outint2 and Lout0-outint4 are candidate difference values. Furthermore, assuming Lout0-outint2 is smaller than Lout0-outint4, then the int2 corresponding to Lout0-outint2 is selected as the target quantization bit method from multiple quantization bit methods int2, int4, and int8. In practical applications, the preset difference threshold can be flexibly set according to the specific application scenario.
[0109] In this way, through the implementation example, a preset difference threshold is first used for initial screening, and then a second screening is performed by comparing the difference values, which can accurately obtain the target quantization bit method.
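A minimal sketch of this threshold-first screening, assuming the difference values are held in a dictionary keyed by quantization bit method. Returning `None` when no difference value passes the threshold is an assumption, since the text does not describe that case; the numeric values are illustrative.

```python
def select_threshold_then_min(differences, threshold):
    """Keep difference values below `threshold`, then pick the smallest candidate."""
    candidates = {m: d for m, d in differences.items() if d < threshold}
    if not candidates:
        return None  # assumed behaviour: no method qualifies
    return min(candidates, key=candidates.get)

# Hypothetical difference values with the 0.001 threshold from the example.
differences = {"int2": 0.0004, "int4": 0.0007, "int8": 0.0012}
chosen = select_threshold_then_min(differences, threshold=0.001)
```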
[0110] In embodiments of this application, the process of selecting, from multiple quantization bit methods, the quantization bit method corresponding to the smallest difference value may include:
[0111] Selecting the smallest difference value from the multiple difference values;
[0112] If the selected difference value is less than the preset difference threshold, selecting, from the multiple quantization bit methods, the quantization bit method corresponding to the selected difference value.
[0113] That is, in the embodiment, the smallest difference value is first selected from multiple difference values as a candidate difference value, and when the candidate difference value is less than a preset difference threshold, the quantization bit method corresponding to the candidate difference value is selected from multiple quantization bit methods.
[0114] For example, continuing from the previous example, suppose that among Lout0-outint2, Lout0-outint4, and Lout0-outint8, Lout0-outint2 is the smallest. In this case, Lout0-outint2 is the candidate difference value. Also, suppose the preset difference threshold is 0.001, and Lout0-outint2 is less than 0.001. Then, the int2 corresponding to Lout0-outint2 is selected as the target quantization bit method from the multiple quantization bit methods int2, int4, and int8. In practical applications, the preset difference threshold can be flexibly set according to the specific application scenario.
[0115] In this way, through the implementation example, the target quantization bit method can be accurately obtained by first screening by comparing the difference values and then screening by using a preset difference threshold.
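The minimum-first variant can be sketched in the same style, again assuming a dictionary of difference values keyed by quantization bit method. The fallback of returning `None` when the smallest difference value does not pass the threshold is an assumption, and the numbers are illustrative.

```python
def select_min_then_threshold(differences, threshold):
    """Pick the smallest difference value, then accept it only if below `threshold`."""
    best = min(differences, key=differences.get)
    return best if differences[best] < threshold else None

# Hypothetical difference values with the 0.001 threshold from the example.
differences = {"int2": 0.0004, "int4": 0.0007, "int8": 0.0012}
chosen = select_min_then_threshold(differences, threshold=0.001)
```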
[0116] In the embodiment, the number of model functional networks involved in S201 to S203 shown in Figure 2 can be one or more. When there is only one model functional network, the final result is the target quantization bit method corresponding to it. When there are multiple model functional networks, the final result is the target quantization bit method corresponding to each of them. The target quantization bit methods corresponding to different model functional networks can be the same or different.
[0117] Please refer to Table 1, which provides examples of target quantization bit methods for various model functional networks.
[0118] Table 1
[0119] In this embodiment, the original model parameters are quantized using multiple quantization bit methods to obtain quantized model parameters for each quantization bit method. The target quantization bit method is selected from these multiple methods using the first output of the Model Functional Network (MFN) under the original model parameters and the second output of the MFN under each quantized model parameter. This achieves accurate selection of the quantization bit method for the MFN, resulting in high accuracy in quantization processing. This improves the model's generalization ability beyond the training data, enabling it to better adapt to new data distributions and scenarios. It also enhances the model's stability and robustness, maintaining good performance under complex conditions. Furthermore, it allows for faster inference, reduces unnecessary resource consumption, and improves resource utilization. Therefore, it significantly improves model performance, enabling tasks in various application scenarios to be executed effectively through the model.
[0120] In the embodiments of this application, another quantization-based data processing method is provided. As shown in Figure 4, this quantization-based data processing method may further include steps S401 to S403 after S203.
[0121] Detailed introductions of S401 to S403 are as follows:
[0122] S401, generate the target quantization scaling factor based on the target quantization bit method.
[0123] In this embodiment, quantizing the model functional network with a quantization bit method involves a quantization scaling factor. The scaling factor is primarily used to map a higher-precision data type (e.g., int8) into the representation range of a lower-precision type (e.g., int2). Therefore, after the target quantization bit method is obtained, it can be used to generate the target quantization scaling factor corresponding to the model functional network.
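To make the role of the scaling factor concrete, here is a minimal sketch of symmetric integer quantization. The application does not give the mapping formula, so the absolute-maximum scale, the rounding, and the clamping used below are assumptions drawn from common practice, and the function names are illustrative.

```python
def quantize(values, bits):
    """Map floats onto a signed `bits`-wide integer grid using one scaling factor."""
    qmax = 2 ** (bits - 1) - 1                             # e.g. 127 for int8, 1 for int2
    scale = (max(abs(v) for v in values) / qmax) or 1.0    # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from integers and the scaling factor."""
    return [x * scale for x in q]

weights = [-1.0, 0.5, 2.0]  # hypothetical parameter values
q8, scale8 = quantize(weights, bits=8)
restored = dequantize(q8, scale8)
```

With fewer bits the integer grid is coarser, so the same scaling factor recipe produces a larger reconstruction error.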
[0124] In the embodiments of this application, there are multiple model functional networks; for example, as shown in Table 1, the model functional networks are E1, E2, and E3.
[0125] Accordingly, the process in S401 of generating the target quantization scaling factor based on the target quantization bit method can include at least two cases:
[0126] Case 1: If there is a first model functional network among multiple model functional networks whose quantization precision of the target quantization bit method is greater than the preset precision threshold, then generate the target quantization scaling factor corresponding to each data channel in the first model functional network.
[0127] In this embodiment, the first model function network refers to the model function network among multiple model function networks whose quantization accuracy of the target quantization bit method is greater than a preset accuracy threshold.
[0128] For example, continuing from the previous example, assuming the preset precision threshold is int4, then referring to Table 1, we can see that the quantization precision int8 of the model function network E3 is greater than the preset precision threshold int4. In this case, the model function network E3 is the first model function network.
[0129] In this embodiment, after obtaining the first model functional network, a target quantization scaling factor corresponding to each data channel in the first model functional network can be generated.
[0130] For example, taking the aforementioned Model Functional Network E3 as the first Model Functional Network, assuming that Model Functional Network E3 includes 64 data channels, corresponding target quantization scaling factors are generated for each of the 64 data channels.
[0131] Case 2: If there is a second model function network in multiple model function networks whose quantization precision of the target quantization bit method is less than or equal to the preset precision threshold, then the channel values of each data channel in the second model function network are grouped and a target quantization scaling factor corresponding to each combination is generated.
[0132] In this embodiment, the second model function network refers to the model function network among multiple model function networks whose quantization accuracy of the target quantization bit method is less than or equal to a preset accuracy threshold.
[0133] For example, continuing from the previous example, assuming the preset precision threshold is int4, then referring to Table 1, we can see that the quantization precision int2 of model function network E1 is less than the preset precision threshold int4, and the quantization precision int4 of model function network E2 is equal to the preset precision threshold int4. In this case, both model function networks E1 and E2 are second model function networks.
[0134] In this embodiment, after obtaining the second model functional network, the channel values of each data channel in the second model functional network can be grouped, and a target quantization scaling factor corresponding to each combination can be generated.
[0135] For example, take the aforementioned model functional networks E1 and E2 as the second model functional networks.
[0136] For model functional network E2, assume it comprises 64 data channels, each with 128 channel values, and that the channel values are grouped 64 to a combination, so that each data channel has two combinations. There are then 64 × 2 = 128 combinations in total, and a target quantization scaling factor is generated for each of these 128 combinations.
[0137] For model functional network E1, assume it comprises 128 data channels, each with 128 channel values, again grouped 64 to a combination, so that each data channel has two combinations. There are then 128 × 2 = 256 combinations in total, and a target quantization scaling factor is generated for each of these 256 combinations.
[0138] In the embodiments, the number of data channels included in different model functional networks may be the same or different, and the channel values included in each combination can be flexibly adjusted according to the specific application scenario when grouping.
[0139] In this way, through the implementation example, when the quantization accuracy of the target quantization bit method corresponding to the model functional network is high, the target quantization scaling factor is calculated based on the coarser-grained data channel, thereby saving computing resources and time; when the quantization accuracy of the target quantization bit method corresponding to the model functional network is low, the target quantization scaling factor is calculated based on the finer-grained combination, thereby maximizing the quantization accuracy and ensuring the accuracy of quantization processing.
[0140] For ease of understanding, please refer to Figure 5. For the first model functional network where the quantization accuracy of the target quantization bit method is greater than the preset accuracy threshold, a target quantization scaling factor (i.e., channel-by-channel quantization) is generated for each data channel in the first model functional network. For the second model functional network where the quantization accuracy of the target quantization bit method is less than or equal to the preset accuracy threshold, the channel values of each data channel in the second model functional network are grouped, and a target quantization scaling factor (i.e., grouped quantization) is generated for each group.
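The two granularities can be sketched as follows, assuming each model functional network is a list of data channels and each data channel is a list of floats. The absolute-maximum scale used here is only a placeholder for whatever scaling factor computation is actually applied, and the threshold of 4 bits and group size of 64 follow the examples above.

```python
def absmax_scale(values, bits):
    """Placeholder scaling factor: largest magnitude divided by the integer maximum."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax

def scaling_factors(channels, bits, threshold_bits=4, group_size=64):
    """Per-channel factors above the threshold, per-group factors otherwise."""
    if bits > threshold_bits:
        # Channel-by-channel quantization: one factor per data channel.
        return [absmax_scale(ch, bits) for ch in channels]
    # Grouped quantization: one factor per group of `group_size` channel values.
    return [absmax_scale(ch[i:i + group_size], bits)
            for ch in channels
            for i in range(0, len(ch), group_size)]

def make_network(num_channels, values_per_channel=128):
    """Hypothetical network with dummy channel values."""
    return [[float(j + 1) for j in range(values_per_channel)] for _ in range(num_channels)]

e3_factors = scaling_factors(make_network(64), bits=8)    # 64 factors
e2_factors = scaling_factors(make_network(64), bits=4)    # 128 factors
e1_factors = scaling_factors(make_network(128), bits=2)   # 256 factors
```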
[0141] In embodiments of this application, the process of generating the target quantization scaling factor corresponding to each data channel in the first model function network may include:
[0142] Obtain the target quantization bit method corresponding to the first model functional network;
[0143] For each data channel in the first model functional network, calculations are performed based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and each quantization scaling factor to obtain the calculation result corresponding to each quantization scaling factor. The quantization scaling factor corresponding to the calculation result with the largest absolute value among multiple calculation results is taken as the target quantization scaling factor of the data channel.
[0144] In other words, in this embodiment, for each data channel in the first model functional network, calculations are performed using the target quantization bit method corresponding to the first model functional network, the activation input value corresponding to the data channel, and each quantization scaling factor. This yields the calculation result corresponding to each quantization scaling factor. The quantization scaling factor corresponding to the calculation result with the largest absolute value among multiple calculation results is then the target quantization scaling factor for the data channel. It is understood that the larger the absolute value of the calculation result, the more significant the changes in the data channel are reflected, thus better reflecting the sensitivity of the model functional network.
[0145] In this way, the target quantization scaling factor for each data channel in the first model functional network can be obtained easily and accurately through the embodiment.
[0146] In the embodiments of this application, the process of performing calculations based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and each quantization scaling factor to obtain the calculation result corresponding to each quantization scaling factor may include:
[0147] Traversing to the first quantization scaling factor in the first scaling factor list, and performing the calculation based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor;
[0148] Updating the activation input value corresponding to the data channel, traversing to the quantization scaling factor adjacent to and following the first quantization scaling factor, and performing the calculation based on the obtained target quantization bit method, the updated activation input value, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor; and so on, until the first scaling factor list has been fully traversed and the calculation result corresponding to each quantization scaling factor in the first scaling factor list has been obtained.
[0149] In other words, the embodiment obtains the calculation result corresponding to each quantization scaling factor by traversing the calculation. Specifically, firstly, the first quantization scaling factor in the first scaling factor list is traversed, and the target quantization bit method corresponding to the first model function network, the activation input value corresponding to the data channel, and the first quantization scaling factor are used to perform calculations to obtain the calculation result corresponding to the first quantization scaling factor; then, the activation input value of the data channel is updated, and the next adjacent quantization scaling factor of the first quantization scaling factor (i.e., the second quantization scaling factor in the first scaling factor list) is traversed, and the target quantization bit method corresponding to the first model function network, the updated activation input value, and the second quantization scaling factor are used to perform calculations to obtain the calculation result corresponding to the second quantization scaling factor. This process is repeated until the first scaling factor list is completely traversed, and the calculation result corresponding to each quantization scaling factor in the first scaling factor list is obtained.
[0150] In this embodiment, updating the activation input value of the data channel during each iteration can be achieved by taking the maximum absolute value of the activation input value of the data channel sampled in the current iteration and the activation input value sampled in the previous iteration. If the absolute value of the activation input value of the data channel sampled in the current iteration is less than the absolute value of the activation input value of the data channel sampled in the previous iteration, then the activation input value of the data channel sampled in the current iteration is updated to the activation input value of the data channel sampled in the previous iteration. If the absolute value of the activation input value of the data channel sampled in the current iteration is greater than or equal to the absolute value of the activation input value of the data channel sampled in the previous iteration, then the activation input value of the data channel sampled in the current iteration is maintained. The activation input value of the sampled data channel refers to the activation input value obtained by performing model prediction on the model's input data.
[0151] Through this embodiment, updating the activation input value of the data channel during the traversal improves the accuracy of the calculation result corresponding to each quantization scaling factor.
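The traversal can be sketched as below. The text does not specify the calculation performed for each quantization scaling factor, so it is passed in as a caller-supplied function `calc`; the toy `calc` in the usage lines is purely hypothetical. The sketch also assumes one sampled activation input value per traversal step.

```python
def pick_scaling_factor(bits, sampled_activations, factors, calc):
    """Traverse `factors`, keeping a running max-|value| activation, and
    return the factor whose calculation result has the largest absolute value."""
    results = []
    running = None
    for factor, sample in zip(factors, sampled_activations):
        # Keep the current sample only if its magnitude is at least the previous one.
        if running is None or abs(sample) >= abs(running):
            running = sample
        results.append((factor, calc(bits, running, factor)))
    best_factor = max(results, key=lambda r: abs(r[1]))[0]
    return best_factor, results

# Hypothetical calculation; the real one is not given in the text.
toy_calc = lambda bits, activation, factor: activation * factor / (2 ** (bits - 1) - 1)

factor_list = [0.5, 0.8, 1.0]   # assumed first scaling factor list
samples = [2.0, -3.0, 1.0]      # assumed sampled activation input values
best, results = pick_scaling_factor(8, samples, factor_list, toy_calc)
```

In the third step the sampled value 1.0 has a smaller magnitude than the previous -3.0, so the previous value is carried forward, as the update rule describes.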
[0152] In embodiments of this application, the process of generating the target quantization scaling factor corresponding to each combination may include:
[0153] Obtain the target quantization bit method corresponding to the second model functional network;
[0154] For each combination corresponding to each data channel in the second model functional network, calculations are performed based on the obtained target quantization bit method, the activation input value corresponding to the combination, and each quantization scaling factor to obtain the calculation result corresponding to each quantization scaling factor. The quantization scaling factor corresponding to the calculation result with the largest absolute value among multiple calculation results is taken as the target quantization scaling factor of the combination.
[0155] In other words, in this embodiment, for each combination corresponding to each data channel in the second model functional network, calculations are performed using the target quantization bit method corresponding to the second model functional network, the activation input value corresponding to the combination, and each quantization scaling factor. The calculation result for each quantization scaling factor is then obtained. The quantization scaling factor corresponding to the calculation result with the largest absolute value among the multiple calculation results is the target quantization scaling factor for the combination. It is understood that a larger absolute value of the calculation result better reflects the significant changes in the combination, thus better capturing the sensitivity of the model functional network.
[0156] Through this embodiment, the target quantization scaling factor for each combination of data channels in the second model functional network can be obtained easily and accurately.
[0157] In the embodiments of this application, the process of performing calculations based on the obtained target quantization bit method, the activation input value corresponding to the combination, and each quantization scaling factor to obtain the calculation result corresponding to each quantization scaling factor may include:
[0158] Traversing to the first quantization scaling factor in the second scaling factor list, and performing the calculation based on the obtained target quantization bit method, the activation input value corresponding to the combination, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor;
[0159] Updating the activation input value corresponding to the combination, traversing to the quantization scaling factor adjacent to and following the first quantization scaling factor, and performing the calculation based on the obtained target quantization bit method, the updated activation input value, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor; and so on, until the second scaling factor list has been fully traversed and the calculation result corresponding to each quantization scaling factor in the second scaling factor list has been obtained.
[0160] In other words, the embodiment obtains the calculation result corresponding to each quantization scaling factor by traversal. Specifically, the first quantization scaling factor in the second scaling factor list is traversed first, and the target quantization bit method corresponding to the second model functional network, the activation input value corresponding to the combination, and the first quantization scaling factor are used to perform the calculation and obtain the calculation result corresponding to the first quantization scaling factor. Then the activation input value of the combination is updated, the next adjacent quantization scaling factor (i.e., the second quantization scaling factor in the second scaling factor list) is traversed, and the target quantization bit method corresponding to the second model functional network, the updated activation input value, and the second quantization scaling factor are used to perform the calculation and obtain the calculation result corresponding to the second quantization scaling factor. This process is repeated until the second scaling factor list has been completely traversed and the calculation result corresponding to each quantization scaling factor in the second scaling factor list has been obtained.
[0161] In this embodiment, updating the activation input value of the combination in each iteration can be achieved by taking the maximum absolute value of the activation input value of the combination sampled in the current iteration and the activation input value of the combination sampled in the previous iteration. If the absolute value of the activation input value of the combination sampled in the current iteration is less than the absolute value of the activation input value of the combination sampled in the previous iteration, then the activation input value of the combination sampled in the current iteration is updated to the activation input value of the combination sampled in the previous iteration. If the absolute value of the activation input value of the combination sampled in the current iteration is greater than or equal to the absolute value of the activation input value of the combination sampled in the previous iteration, then the activation input value of the combination sampled in the current iteration is maintained.
[0162] Through this embodiment, updating the activation input value of the combination during the traversal improves the accuracy of the calculation result corresponding to each quantization scaling factor.
[0163] In embodiments of this application, the process in step S401 of generating the target quantization scaling factor based on the target quantization bit method may include:
[0164] Obtaining, through multiple devices on which a hybrid bit quantization model is deployed, the quantization scaling factors corresponding to the target quantization bit method;
[0165] Selecting the largest of the multiple quantization scaling factors as the target quantization scaling factor.
[0166] That is, in the embodiment, considering that device resources are limited, the quantization scaling factors corresponding to the target quantization bit method are obtained through multiple devices on which a hybrid bit quantization model is deployed, and the largest of these quantization scaling factors is selected as the target quantization scaling factor.
[0167] A hybrid bit quantization model refers to a model whose model functional networks employ different quantization bit methods, including int2, int4, int8, and bf16. For example, if a hybrid bit quantization model is deployed on three devices, each device can use the same quantization bit method. Taking model functional network E1 as an example, each of the three devices obtains a quantization scaling factor corresponding to E1, giving three quantization scaling factors in total. The largest of these three is then selected as the target quantization scaling factor for E1.
[0168] In this way, through the implementation examples, the target quantization scaling factor of the same model functional network is obtained by deploying the hybrid bit quantization model on multiple devices, which ensures that each device can perform the task normally and improves the accuracy of the target quantization scaling factor.
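A minimal sketch of the cross-device reduction, assuming each device reports a dictionary that maps a model functional network identifier to the quantization scaling factor it obtained. The device names and factor values are hypothetical.

```python
def merge_device_factors(per_device):
    """Take, for each model functional network, the largest factor across devices."""
    merged = {}
    for factors in per_device.values():
        for network_id, factor in factors.items():
            merged[network_id] = max(factor, merged.get(network_id, factor))
    return merged

# Hypothetical scaling factors reported by three devices.
per_device = {
    "device0": {"E1": 0.12, "E2": 0.30},
    "device1": {"E1": 0.15, "E2": 0.28},
    "device2": {"E1": 0.11, "E2": 0.31},
}
target_factors = merge_device_factors(per_device)
```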
[0169] S402, the target quantization bit method, target quantization scaling factor, and identification information corresponding to the model function network are associated and stored in the specified area.
[0170] In this embodiment of the application, after generating the target quantization scaling factor using the target quantization bit method, the target quantization bit method, the target quantization scaling factor, and the identification information corresponding to the model functional network (used to uniquely identify the model functional network, including but not limited to name, number, etc.) can be associated and stored in a designated area.
[0171] For example, following the previous example, please refer to Table 2, which is an example table for related storage.
[0172] Table 2
[0173] S403, if it is determined that the model functional network is in the inference stage, the target quantization bit method and the target quantization scaling factor are obtained from the specified area to quantize the original model parameters.
[0174] In this embodiment, when it is determined that the Model Functional Network (MFN) is in the inference stage, the target quantization bit mode and target quantization scaling factor corresponding to the MFN can be obtained from the designated storage area, and the original model parameters corresponding to the MFN can be quantized. In this embodiment, it can be determined whether the MFN is an inference model, i.e., whether it is in the inference stage, based on the file format of the MFN.
[0175] For a detailed description of S201 to S203 shown in Figure 4, please refer to Figure 2 for S201 to S203, which will not be repeated here.
[0176] In this embodiment, the target quantization bit method, the target quantization scaling factor, and the identification information corresponding to the model functional network are stored together. They can be directly used during model inference, which improves the efficiency and reliability of quantization processing, thereby further improving model performance.
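The associated storage and later lookup can be sketched as below. This is only an illustration: a JSON file in a temporary directory stands in for the designated area, and the helper names `store_quant_config` and `load_quant_config`, the network IDs, and the values are assumptions rather than part of this application.

```python
import json
import os
import tempfile
from typing import Dict, Tuple


def store_quant_config(path: str, config: Dict[str, Tuple[str, float]]) -> None:
    """Persist {network ID: (bit method, scaling factor)} as one JSON document."""
    serializable = {
        network_id: {"bits": bits, "scale": scale}
        for network_id, (bits, scale) in config.items()
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serializable, f)


def load_quant_config(path: str, network_id: str) -> Tuple[str, float]:
    """Look up the stored bit method and scaling factor by network ID."""
    with open(path, "r", encoding="utf-8") as f:
        record = json.load(f)[network_id]
    return record["bits"], record["scale"]


# Calibration stage: store the results once.
path = os.path.join(tempfile.mkdtemp(), "quant_config.json")
store_quant_config(path, {"E1": ("int8", 0.025), "E2": ("int4", 0.110)})

# Inference stage: read them back by identification information.
bits, scale = load_quant_config(path, "E2")
print(bits, scale)
```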
[0177] The specific scenarios of the embodiments of this application will be described in detail below.
[0178] Please refer to Figure 6, which is a flowchart illustrating a quantization-based data processing method according to an embodiment of this application. As shown in Figure 6, the quantization-based data processing method includes at least steps S601 to S607, which are described in detail below:
[0179] S601, obtain the validation set and input the validation set into the hybrid expert model.
[0180] In this application, the hybrid expert model includes, but is not limited to, models used for text processing, video processing, and speech processing. The validation set is a collection obtained by sampling a small portion of the dataset used for model training. This set is used as input to the model for calibration sampling to obtain the quantization scaling factor for each layer.
[0181] S602, check whether the number of validation rounds has reached the preset threshold. If not, proceed to S603; if so, proceed to S604.
[0182] S603, record the activation input value corresponding to each model functional network, and return to S602.
[0183] S604, perform sensitivity analysis on each model functional network (MFN) to generate the target quantization bit method and target quantization scaling factor for each MFN.
[0184] In this embodiment, sensitivity analysis can be achieved by calculating cosine similarity as described above to generate the target quantization bit method for each model functional network. For an example of generating the target quantization bit method for each model functional network, refer to Figure 7. As shown in Figure 7, the hybrid expert model includes four expert networks (i.e., functional networks): Expert1, Expert2, Expert3, and Expert4. Sensitivity analysis yields the target quantization bit method INT8 for Expert1, INT4 for Expert2, INT4 for Expert3, and INT2 for Expert4.
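A minimal sketch of a cosine-similarity sensitivity analysis for one expert network follows. It rests on stated assumptions: the symmetric per-tensor quantizer `fake_quantize`, the toy activation and weights, and the rule of keeping the candidate whose output is most similar to the original output are illustrative choices, not necessarily the exact procedure of this application.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def fake_quantize(weights: Matrix, bits: int) -> Matrix:
    """Symmetric per-tensor quantize-then-dequantize (illustrative scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
            for row in weights]


def forward(activation: List[float], weights: Matrix) -> List[float]:
    """Output vector y = activation @ weights (rows indexed by input feature)."""
    return [sum(a * row[j] for a, row in zip(activation, weights))
            for j in range(len(weights[0]))]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


# Toy activation input and original weights for one expert (hypothetical data).
activation = [1.0, -2.0, 0.5]
weights = [[0.9, -0.3, 0.1], [0.2, 0.8, -0.5], [-0.7, 0.4, 1.0]]

first_output = forward(activation, weights)
similarities: Dict[str, float] = {}
for name, bits in {"int8": 8, "int4": 4, "int2": 2}.items():
    second_output = forward(activation, fake_quantize(weights, bits))
    similarities[name] = cosine_similarity(first_output, second_output)

# Smallest difference corresponds to highest similarity.
best = max(similarities, key=similarities.get)
print(similarities, best)
```

In practice a lower bit width may be preferred whenever its similarity stays acceptably high; how that trade-off is made is outside what this sketch assumes.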
[0185] For example, generating the target quantization scaling factor for each model functional network (MFN) may include: if a first MFN exists among the multiple MFNs whose target quantization bit method has a quantization precision greater than a preset precision threshold, then the target quantization scaling factor corresponding to each data channel in the first MFN is generated; if a second MFN exists among the multiple MFNs whose target quantization bit method has a quantization precision less than or equal to the preset precision threshold, then the channel values of each data channel in the second MFN are grouped, and the target quantization scaling factor corresponding to each group is generated. That is, when the quantization precision of the target quantization bit method corresponding to the MFN is high, the target quantization scaling factor is calculated at the coarser granularity of a whole data channel; when the quantization precision of the target quantization bit method corresponding to the MFN is low, the target quantization scaling factor is calculated at the finer granularity of a group within a data channel.
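The difference between the two granularities can be illustrated with a short sketch. The assumptions are loud ones: the precision threshold is taken to be 4 bits, the group size is 2, and each scaling factor is computed with a simple absolute-maximum rule; none of these specifics come from this application, and `absmax_scale` and `channel_scales` are hypothetical helpers.

```python
from typing import List


def absmax_scale(values: List[float], bits: int) -> float:
    """Illustrative scale: largest magnitude divided by the largest signed level."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def channel_scales(channel: List[float], bits: int,
                   precision_threshold_bits: int = 4,
                   group_size: int = 2) -> List[float]:
    """One scale per channel for high precision, one scale per group otherwise."""
    if bits > precision_threshold_bits:
        return [absmax_scale(channel, bits)]
    groups = [channel[i:i + group_size] for i in range(0, len(channel), group_size)]
    return [absmax_scale(group, bits) for group in groups]


channel = [0.5, -1.27, 0.03, 0.07]          # hypothetical channel values
int8_scales = channel_scales(channel, bits=8)  # whole channel shares one scale
int4_scales = channel_scales(channel, bits=4)  # each group of 2 gets its own scale
print(int8_scales, int4_scales)
```

With per-group scales, the small values 0.03 and 0.07 are no longer forced to share a scale dominated by -1.27, which is why finer granularity helps at low bit widths.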
[0186] S605, associate and store the target quantization bit method and target quantization scaling factor corresponding to each model functional network, as well as the identification information of each model functional network, in a designated area.
[0187] In this embodiment of the application, the target quantization bit method and target quantization scaling factor corresponding to each model functional network, as well as the identification information of each model functional network, can be stored in a specified area locally or in the cloud in vector form through a storage interface (e.g., torch.save).
[0188] S606, if it is detected that the hybrid expert model is in the inference stage, obtain the target quantization bit method and target quantization scaling factor corresponding to each model functional network from the designated area.
[0189] S607, quantize the original weight values of each model functional network using the target quantization bit method and target quantization scaling factor corresponding to each model functional network.
[0190] In this embodiment, a mapping table between preset bit methods and computation kernels can be searched to obtain the target computation kernel corresponding to each model functional network. Then, during the computation process of each target computation kernel, the corresponding target quantization scaling factor is passed in to quantize the original weight values to the corresponding target quantization bit number, thereby realizing the hybrid bit quantization inference of the hybrid expert model and obtaining the corresponding task processing results.
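The table lookup and per-network dispatch can be sketched as follows. Real computation kernels would be device code; here they are simulated by plain Python functions, and `make_kernel`, `KERNEL_TABLE`, the expert names, scaling factors, and weights are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Kernel = Callable[[List[float], float], List[int]]


def make_kernel(bits: int) -> Kernel:
    """Build a stand-in kernel that quantizes weights to a signed `bits`-wide range."""
    qmax = 2 ** (bits - 1) - 1

    def kernel(weights: List[float], scale: float) -> List[int]:
        return [max(-qmax, min(qmax, round(w / scale))) for w in weights]

    return kernel


# Mapping table between preset bit methods and (simulated) computation kernels.
KERNEL_TABLE: Dict[str, Kernel] = {
    "int8": make_kernel(8),
    "int4": make_kernel(4),
    "int2": make_kernel(2),
}

# Stored per-network results: (target bit method, target scaling factor).
expert_config: Dict[str, Tuple[str, float]] = {
    "Expert1": ("int8", 0.01),
    "Expert4": ("int2", 1.0),
}
original_weights = {
    "Expert1": [0.5, -1.27, 0.03],
    "Expert4": [0.9, -0.2, -1.4],
}

quantized = {
    name: KERNEL_TABLE[bits](original_weights[name], scale)
    for name, (bits, scale) in expert_config.items()
}
print(quantized)
```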
[0191] In the embodiments, when the hybrid expert model is applied to text processing application scenarios, it includes, but is not limited to:
[0192] (1) Text classification (e.g., email classification, news classification, etc.). In this case, a hybrid expert model can be used as a text classification model. Specifically, the text to be classified is input into the text classification model. The target quantization bit method and target quantization scaling factor for each functional network in the text classification model are obtained using the aforementioned method. Quantization is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used for text classification, outputting text classification information. Text classification models have good performance, thereby improving the accuracy and efficiency of text classification.
[0193] (2) Contextual understanding and automatic response (e.g., automated customer service, chatbots, etc.): In this case, a hybrid expert model can serve as the response model. Specifically, the text to be responded to is input into the response model. The target quantization bit method and target quantization scaling factor for each functional network in the response model are obtained using the aforementioned method. Quantization is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used to understand the user's intent and output the response information. The response model has good performance, thereby improving the accuracy and efficiency of contextual understanding and automatic response.
[0194] (3) Multilingual text processing (e.g., text translation, language recognition, etc.), where the hybrid expert model can serve as a multilingual processing model. Specifically, the text to be processed is input into the multilingual processing model. The target quantization bit method and target quantization scaling factor for each functional network in the multilingual processing model are obtained using the aforementioned method. Quantization is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used for text parsing, outputting the parsed text information. The multilingual processing model exhibits good performance, thereby improving the accuracy and efficiency of multilingual text processing.
[0195] In the embodiments, when the hybrid expert model is applied to video processing application scenarios, it includes, but is not limited to:
[0196] (1) Video detection (e.g., identifying different actions, scenes, objects, etc. in a video). In this case, a hybrid expert model can be used as a video detection model. Specifically, the video to be detected is input into the video detection model. The target quantization bit method and target quantization scaling factor corresponding to each functional network in the video detection model are obtained using the aforementioned method. Quantization processing is then performed using the target quantization bit method and target quantization scaling factor corresponding to each functional network. Finally, the quantized data is used for video detection, and video detection information is output. The video detection model has good performance, thereby improving the accuracy and efficiency of video detection.
[0197] (2) Video compression and transmission (e.g., online video playback, video conferencing, etc.). In this case, the hybrid expert model can be a video transmission model. Specifically, the video to be transmitted is input into the video transmission model. The target quantization bit method and target quantization scaling factor corresponding to each functional network in the video transmission model are obtained using the aforementioned method. Quantization processing is then performed using the target quantization bit method and target quantization scaling factor corresponding to each functional network. Finally, the quantized data is used for video compression, achieving efficient video transmission and playback. The video transmission model has good performance, thereby improving the accuracy and efficiency of video compression and transmission.
[0198] (3) Video effects and enhancements (e.g., film and television production, virtual reality, augmented reality, etc.). In this case, the hybrid expert model can serve as a video processing model. Specifically, the video to be processed is input into the video processing model. The target quantization bit method and target quantization scaling factor for each functional network in the video processing model are obtained using the aforementioned method. Quantization processing is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used for effects and enhancement processing, achieving efficient video rendering. The high performance of the video processing model improves the accuracy and efficiency of video effects and enhancements.
[0199] In the embodiments, when the hybrid expert model is applied to speech processing application scenarios, it includes, but is not limited to:
[0200] (1) Speech recognition and transcription (e.g., voice assistants, voice search, caption generation, etc.), where the hybrid expert model can be a speech recognition model. Specifically, the speech to be recognized is input into the speech recognition model. The target quantization bit method and target quantization scaling factor for each functional network in the speech recognition model are obtained using the aforementioned method. Quantization is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used for speech recognition and transcription, outputting speech recognition information. The speech recognition model has good performance, thereby improving the accuracy and efficiency of speech recognition and transcription.
[0201] (2) Speech activity detection and keyword recognition (e.g., wake word detection, voice control, etc.). In this case, a hybrid expert model can be used as a speech detection model. Specifically, the speech to be detected is input into the speech detection model. The target quantization bit method and target quantization scaling factor for each functional network in the speech detection model are obtained using the aforementioned method. Quantization is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used for speech activity detection and keyword recognition, outputting speech detection information. The speech detection model has good performance, thereby improving the accuracy and efficiency of speech activity detection and keyword recognition.
[0202] (3) Speech quality enhancement and noise reduction (e.g., teleconferences, voice calls, recording applications, etc.). In this case, a hybrid expert model can be used as a speech enhancement model. Specifically, the speech to be enhanced is input into the speech enhancement model. The target quantization bit method and target quantization scaling factor for each functional network in the speech enhancement model are obtained using the aforementioned method. Quantization is then performed using the target quantization bit method and target quantization scaling factor for each functional network. Finally, the quantized data is used for speech quality enhancement and noise reduction, outputting speech enhancement information. The speech enhancement model has good performance, thereby improving the accuracy and efficiency of speech quality enhancement and noise reduction.
[0203] It should be noted that for a detailed explanation of the determination of the target quantization bit method and target quantization scaling factor for each functional network in the aforementioned application scenarios, as well as the quantization processing using the target quantization bit method and target quantization scaling factor for each functional network, please refer to the aforementioned embodiments, which will not be repeated here.
[0204] In this embodiment, models of two different sizes, 7B-MoE and 70B-MoE, were tested on the ptb-en validation set. The test results are shown in Table 3.
[0205] Table 3
[0206] As shown in Table 3, the obtained values are all perplexity (PPL) values. A smaller PPL value indicates less loss. For 7B-MoE, the PPL values, arranged from smallest to largest, are 12.209 < 12.323 < 12.350. For 70B-MoE, the PPL values, arranged from smallest to largest, are 8.858 < 8.903 < 8.925. The values for mixed bit quantization are not significantly different from those for bf16 and int8. Therefore, the quantization accuracy of the embodiments in this application is close to lossless, and the model performance is good.
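To make "not significantly different" concrete, the relative spread between the smallest and largest PPL values quoted above can be computed directly. The sketch uses only the six numbers given in the text and does not assume which value belongs to which quantization method; `relative_spread` is a hypothetical helper.

```python
from typing import Dict, List

# PPL values quoted in the text, per model size.
ppl_values: Dict[str, List[float]] = {
    "7B-MoE": [12.209, 12.323, 12.350],
    "70B-MoE": [8.858, 8.903, 8.925],
}


def relative_spread(values: List[float]) -> float:
    """Gap between the worst and best PPL, relative to the best PPL."""
    return (max(values) - min(values)) / min(values)


spreads = {name: relative_spread(values) for name, values in ppl_values.items()}
for name, spread in spreads.items():
    print(f"{name}: {spread:.2%}")
```

For both model sizes the spread comes out at roughly one percent of the best PPL value.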
[0207] Figure 8 is a block diagram illustrating a quantization-based data processing apparatus according to an embodiment of this application. As shown in Figure 8, the apparatus includes:
[0208] The quantization processing module 801 is configured to quantize the original model parameters corresponding to the model functional network using multiple quantization bit methods to obtain the quantized model parameters corresponding to each quantization bit method.
[0209] The acquisition module 802 is configured to acquire the first output result of the model functional network under the original model parameters, and to acquire the second output result of the model functional network under each quantized model parameter;
[0210] The selection module 803 is configured to select a target quantization bit method from the multiple quantization bit methods based on the first output result and multiple second output results. The target quantization bit method is used to quantize the original model parameters during the inference stage corresponding to the model functional network.
[0211] In the embodiments of this application, based on the foregoing scheme, the selection module 803 is specifically configured as follows:
[0212] Calculate the difference value between the first output result and each second output result;
[0213] Select the quantization bit method with the smallest difference value from the plurality of quantization bit methods, and use the selected quantization bit method as the target quantization bit method.
[0214] In the embodiments of this application, based on the foregoing scheme, the selection module 803 is further specifically configured as follows:
[0215] For each second output result, the difference between the first output result and the second output result is calculated, and the difference is raised to a specified power to obtain the difference value between the first output result and the second output result.
[0216] In the embodiments of this application, based on the foregoing scheme, the selection module 803 is further specifically configured as follows:
[0217] For each second output result, a signal power value is calculated based on the second output result, a noise power value is calculated based on the first output result and the second output result, a signal-to-noise ratio is calculated based on the signal power value and the noise power value, and a difference value between the first output result and the second output result is calculated based on the signal-to-noise ratio.
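The two difference measures can be sketched as follows. The power-based measure follows the text directly with the power left as a parameter. For the signal-to-noise measure, the text does not state here how the ratio is turned into a difference value, so the sketch assumes the negative of the ratio in decibels, so that a higher signal-to-noise ratio yields a smaller difference; that mapping, the `eps` guard, and the sample outputs are assumptions.

```python
import math
from typing import List


def power_difference(first: List[float], second: List[float], power: int = 2) -> float:
    """Sum of element-wise differences raised to a specified power."""
    return sum(abs(a - b) ** power for a, b in zip(first, second))


def snr_difference(first: List[float], second: List[float], eps: float = 1e-12) -> float:
    """Difference value derived from the signal-to-noise ratio (assumed: negative dB)."""
    n = len(second)
    signal_power = sum(b * b for b in second) / n                      # from second output
    noise_power = sum((a - b) ** 2 for a, b in zip(first, second)) / n  # from both outputs
    snr_db = 10.0 * math.log10((signal_power + eps) / (noise_power + eps))
    return -snr_db


first_output = [0.15, -1.7, 1.6]    # hypothetical output under original parameters
close_output = [0.16, -1.69, 1.61]  # hypothetical output under a mild quantization
far_output = [0.5, -2.0, 0.5]       # hypothetical output under a harsh quantization

print(power_difference(first_output, close_output), power_difference(first_output, far_output))
print(snr_difference(first_output, close_output), snr_difference(first_output, far_output))
```

Under either measure, the output that stays closer to the original receives the smaller difference value, so both are compatible with selecting the quantization bit method with the smallest difference.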
[0218] In the embodiments of this application, based on the foregoing scheme, the original model parameters include original weight values, and the quantized model parameters include quantized weight values; the acquisition module 802 is specifically configured as follows:
[0219] Obtain the activation input values of the model's functional network;
[0220] The activation input value and the original weight value are multiplied to obtain the first output result of the model functional network, and the activation input value and each quantized weight value are multiplied to obtain multiple second output results of the model functional network.
[0221] In the embodiments of this application, based on the foregoing scheme, the model functional network corresponds to multiple data channels, and each data channel corresponds to a channel value; the acquisition module 802 is further specifically configured as follows:
[0222] If an outlier channel value is detected, the data channel corresponding to the largest outlier channel value is taken as the target data channel, and the channel value corresponding to the target data channel is taken as the activation input value of the model functional network. The original weight value and quantized weight value corresponding to the target data channel are also obtained.
[0223] The first output result of the model functional network is obtained by multiplying the channel value and the original weight value corresponding to the target data channel, and the second output result of the model functional network is obtained by multiplying the channel value and each quantization weight value corresponding to the target data channel.
[0224] In the embodiments of this application, based on the foregoing solution, the acquisition module 802 is further configured as follows:
[0225] Calculate the similarity between the target data channel and its adjacent data channels;
[0226] Based on the similarity, the outlier channel values in the target data channel are reduced to obtain the target channel value;
[0227] The target channel value and the channel values in the target data channel, excluding outlier channel values, are used as the activation input values of the model functional network.
[0228] In the embodiments of this application, based on the foregoing solution, the acquisition module 802 is further configured as follows:
[0229] A distance value is calculated based on the channel value of the target data channel and the channel values of the adjacent data channels corresponding to the target data channel. The distance value is greater than 0 and less than 1, and the distance value is used as the similarity between the target data channel and the adjacent data channels.
[0230] The target channel value is obtained by multiplying the distance value and the outlier channel value in the target data channel.
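The outlier reduction can be sketched as below, with several loudly labeled assumptions: an outlier is taken to be any value whose magnitude exceeds a hypothetical `outlier_threshold`, and the distance value is computed as 1 / (1 + mean absolute difference) over the non-outlier positions of the target channel and one adjacent channel. That formula lies strictly between 0 and 1 only when the two channels differ; the application's actual distance computation is not specified in this passage.

```python
from typing import List


def distance_similarity(target: List[float], neighbor: List[float],
                        outlier_threshold: float) -> float:
    """Hypothetical distance value used as similarity, compared on non-outlier positions."""
    pairs = [(t, n) for t, n in zip(target, neighbor) if abs(t) <= outlier_threshold]
    if not pairs:
        return 1.0
    mean_abs_diff = sum(abs(t - n) for t, n in pairs) / len(pairs)
    return 1.0 / (1.0 + mean_abs_diff)


def reduce_outliers(target: List[float], neighbor: List[float],
                    outlier_threshold: float) -> List[float]:
    """Multiply each outlier value by the similarity; keep other values unchanged."""
    s = distance_similarity(target, neighbor, outlier_threshold)
    return [v * s if abs(v) > outlier_threshold else v for v in target]


target_channel = [0.2, 0.4, 8.0, 0.1]    # hypothetical channel containing one outlier
adjacent_channel = [0.3, 0.2, 0.5, 0.3]  # hypothetical adjacent channel
s = distance_similarity(target_channel, adjacent_channel, outlier_threshold=3.0)
reduced = reduce_outliers(target_channel, adjacent_channel, outlier_threshold=3.0)
print(s, reduced)
```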
[0231] In embodiments of this application, based on the foregoing solution, the apparatus further includes:
[0232] The generation module is configured to generate a target quantization scaling factor based on the target quantization bit method;
[0233] The storage module is configured to associate and store the target quantization bit method, the target quantization scaling factor, and the identification information corresponding to the model functional network in a designated area;
[0234] The inference module is configured to, if it is detected that the model functional network is in the inference stage, obtain the target quantization bit method and the target quantization scaling factor from the designated area and quantize the original model parameters.
[0235] In the embodiments of this application, based on the foregoing scheme, there are multiple model functional networks; the generation module is specifically configured as follows:
[0236] If there is a first model functional network among multiple model functional networks whose quantization precision of the target quantization bit method is greater than a preset precision threshold, then generate the target quantization scaling factor corresponding to each data channel in the first model functional network.
[0237] If there is a second model functional network among the multiple model functional networks whose quantization precision of the target quantization bit method is less than or equal to the preset precision threshold, then the channel values of each data channel in the second model functional network are grouped, and a target quantization scaling factor corresponding to each group is generated.
[0238] In the embodiments of this application, based on the foregoing scheme, the generation module is further specifically configured as follows:
[0239] Obtain the target quantization bit method corresponding to the first model functional network;
[0240] For each data channel in the first model functional network, calculations are performed based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and each quantization scaling factor to obtain the calculation result corresponding to each quantization scaling factor. The quantization scaling factor corresponding to the calculation result with the largest absolute value among multiple calculation results is taken as the target quantization scaling factor of the data channel.
[0241] In the embodiments of this application, based on the foregoing scheme, the generation module is further specifically configured as follows:
[0242] The first quantization scaling factor in the first scaling factor list is traversed, and the calculation is performed based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and the quantization scaling factor traversed, to obtain the calculation result corresponding to the quantization scaling factor traversed.
[0243] The activation input value corresponding to the data channel is updated, and the next adjacent quantization scaling factor of the first quantization scaling factor is traversed. Based on the obtained target quantization bit method, the updated activation input value, and the traversed quantization scaling factors, calculations are performed to obtain the calculation results corresponding to the traversed quantization scaling factors. This process continues until the first scaling factor list is completely traversed, and the calculation results corresponding to each quantization scaling factor in the first scaling factor list are obtained.
[0244] In the embodiments of this application, based on the foregoing scheme, the generation module is further specifically configured as follows:
[0245] Obtain the target quantization bit method corresponding to the second model functional network;
[0246] For each group corresponding to each data channel in the second model functional network, calculations are performed based on the obtained target quantization bit method, the activation input value corresponding to the group, and each quantization scaling factor to obtain the calculation result corresponding to each quantization scaling factor. The quantization scaling factor corresponding to the calculation result with the largest absolute value among the multiple calculation results is taken as the target quantization scaling factor of the group.
[0247] In the embodiments of this application, based on the foregoing scheme, the generation module is further specifically configured as follows:
[0248] The first quantization scaling factor in the second scaling factor list is traversed, and the calculation is performed based on the obtained target quantization bit method, the activation input value corresponding to the group, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor.
[0249] The activation input value corresponding to the group is updated, and the next quantization scaling factor adjacent to the first quantization scaling factor is traversed. The calculation is performed based on the obtained target quantization bit method, the updated activation input value, and the traversed quantization scaling factor to obtain the calculation result corresponding to the traversed quantization scaling factor. This process continues until the second scaling factor list is completely traversed, and the calculation result corresponding to each quantization scaling factor in the second scaling factor list is obtained.
[0250] In the embodiments of this application, based on the foregoing scheme, the generation module is further specifically configured as follows:
[0251] Deploy the hybrid bit quantization model on multiple devices to obtain multiple quantization scaling factors corresponding to the target quantization bit method.
[0252] Select the largest quantization scaling factor from the multiple quantization scaling factors as the target quantization scaling factor.
[0253] It should be noted that the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments belong to the same concept, and the specific way in which each module and unit performs operations has been described in detail in the method embodiments.
[0254] Embodiments of this application also provide an electronic device, including: one or more processors; and a memory for storing one or more computer programs, which, when executed by one or more processors, cause the electronic device to implement the quantization-based data processing method as described in the foregoing embodiments.
[0255] Figure 9 is a schematic diagram of the structure of a computer system suitable for implementing an electronic device (such as the terminal device or server shown in Figure 1) in the embodiments of this application.
[0256] It should be noted that the computer system 900 of the electronic device shown in Figure 9 is only an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.
[0257] As shown in Figure 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can perform various appropriate actions and processes, such as executing the methods described in the above embodiments, based on a computer program stored in Read-Only Memory (ROM) 902 or a computer program loaded from storage section 908 into Random Access Memory (RAM) 903. The RAM 903 also stores various computer programs and data required for system operation. The CPU 901, ROM 902, and RAM 903 are interconnected via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
[0258] The following components are connected to the I/O interface 905: an input section 906 including a keyboard, mouse, etc.; an output section 907 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN (Local Area Network) card, modem, etc. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. Removable media 911, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., are installed on drive 910 as needed so that computer programs read from them can be installed into storage section 908 as needed.
[0259] Specifically, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as a computer program. For example, embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing computer instructions for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 909, and/or installed from removable medium 911. When the computer program is executed by central processing unit (CPU) 901, it performs various functions defined in the system of embodiments of this application.
[0260] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. For example, a computer-readable medium can be an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In the embodiments of this application, a computer-readable medium can be any tangible medium containing or storing a computer program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the embodiments of this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, wherein a computer-readable computer program is carried. Such transmitted data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer program contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.
[0261] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. Each block in a flowchart or block diagram may represent a module, segment, or portion of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0262] The units described in the embodiments of this application can be implemented in software or hardware, and the described units can also be located in a processor. In some cases, the names of these units do not limit the units themselves.
[0263] Embodiments of this application also provide a computer-readable medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned quantization-based data processing method. This computer-readable medium may be included in the electronic device described in the above embodiments, or it may exist independently without being assembled into the electronic device.
[0264] Embodiments of this application also provide a computer program product or computer program, which includes computer instructions stored in a computer-readable medium. A processor of an electronic device reads the computer instructions from the computer-readable medium and executes the computer instructions, causing the electronic device to perform the quantization-based data processing methods provided in the various embodiments described above.
[0265] The above content is merely an exemplary embodiment of this application and is not intended to limit the implementation of this application. Those skilled in the art can easily make corresponding modifications or alterations based on the main concepts and spirit of the embodiments of this application. Therefore, the scope of protection of this application should be determined by the scope of protection claimed in the claims.
Claims
1. A quantization-based data processing method, comprising: quantizing original model parameters corresponding to a model functional network using multiple quantization bit methods, to obtain quantized model parameters corresponding to each quantization bit method; obtaining a first output result of the model functional network under the original model parameters, and obtaining a second output result of the model functional network under each set of quantized model parameters; and selecting a target quantization bit method from the multiple quantization bit methods based on the first output result and the multiple second output results, wherein the target quantization bit method is used to quantize the original model parameters during an inference phase of the model functional network.
2. The method according to claim 1, wherein selecting the target quantization bit method from the multiple quantization bit methods based on the first output result and the multiple second output results comprises: calculating a difference value between the first output result and each second output result; and selecting, from the multiple quantization bit methods, the quantization bit method with the smallest difference value as the target quantization bit method.
3. The method according to claim 2, wherein calculating the difference value between the first output result and each second output result comprises: for each second output result, calculating a difference between the first output result and the second output result, and raising the difference to a specified power to obtain the difference value between the first output result and the second output result.
4. The method according to claim 2, wherein calculating the difference value between the first output result and each second output result comprises: for each second output result, calculating a signal power value based on the second output result, calculating a noise power value based on the first output result and the second output result, calculating a signal-to-noise ratio based on the signal power value and the noise power value, and calculating the difference value between the first output result and the second output result based on the signal-to-noise ratio.
5. The method according to any one of claims 1 to 4, wherein the original model parameters include original weight values, and the quantized model parameters include quantized weight values; and obtaining the first output result of the model functional network under the original model parameters, and obtaining the second output result of the model functional network under each set of quantized model parameters, comprises: obtaining an activation input value of the model functional network; and multiplying the activation input value by the original weight value to obtain the first output result of the model functional network, and multiplying the activation input value by each quantized weight value to obtain the multiple second output results of the model functional network.
6. The method according to claim 5, wherein the model functional network corresponds to multiple data channels, and each data channel corresponds to a channel value; obtaining the activation input value of the model functional network comprises: detecting whether an outlier channel value exists in each of the multiple data channels; and if outlier channel values are detected, comparing all detected outlier channel values, taking the data channel corresponding to the largest outlier channel value as a target data channel, taking the channel value corresponding to the target data channel as the activation input value of the model functional network, and obtaining the original weight value and the quantized weight values corresponding to the target data channel; and multiplying the activation input value by the original weight value to obtain the first output result of the model functional network, and multiplying the activation input value by each quantized weight value to obtain the second output results of the model functional network, comprises: multiplying the channel value by the original weight value corresponding to the target data channel to obtain the first output result of the model functional network, and multiplying the channel value by each quantized weight value corresponding to the target data channel to obtain the second output results of the model functional network.
7. The method according to claim 6, wherein taking the channel value corresponding to the target data channel as the activation input value of the model functional network comprises: calculating a similarity between the target data channel and its adjacent data channels; reducing the outlier channel values in the target data channel based on the similarity to obtain a target channel value; and using the target channel value, together with the channel values in the target data channel other than the outlier channel values, as the activation input values of the model functional network.
8. The method according to claim 7, wherein calculating the similarity between the target data channel and its adjacent data channels comprises: calculating a distance value based on the channel value of the target data channel and the channel values of the adjacent data channels corresponding to the target data channel, wherein the distance value is greater than 0 and less than 1, and using the distance value as the similarity between the target data channel and the adjacent data channels; and reducing the outlier channel values in the target data channel based on the similarity to obtain the target channel value comprises: multiplying the distance value by the outlier channel value in the target data channel to obtain the target channel value.
9. The method according to any one of claims 1 to 8, wherein, after selecting the target quantization bit method from the multiple quantization bit methods based on the first output result and the multiple second output results, the method further comprises: generating a target quantization scaling factor based on the target quantization bit method; storing the target quantization bit method, the target quantization scaling factor, and identification information corresponding to the model functional network in association with one another in a designated area; and if it is detected that the model functional network is in the inference phase, obtaining the target quantization bit method and the target quantization scaling factor from the designated area to quantize the original model parameters.
10. The method according to claim 9, wherein the model functional network includes multiple model functional networks; and generating the target quantization scaling factor based on the target quantization bit method comprises: if the multiple model functional networks include a first model functional network for which the quantization precision of the target quantization bit method is greater than a preset precision threshold, generating a target quantization scaling factor corresponding to each data channel in the first model functional network; and if the multiple model functional networks include a second model functional network for which the quantization precision of the target quantization bit method is less than or equal to the preset precision threshold, grouping the channel values of each data channel in the second model functional network, and generating a target quantization scaling factor corresponding to each group.
11. The method according to claim 10, wherein generating the target quantization scaling factor corresponding to each data channel in the first model functional network comprises: obtaining the target quantization bit method corresponding to the first model functional network; for each data channel in the first model functional network, performing calculations based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and each of multiple quantization scaling factors, to obtain a calculation result corresponding to each quantization scaling factor; and taking the quantization scaling factor corresponding to the calculation result with the largest absolute value among the multiple calculation results as the target quantization scaling factor of the data channel.
12. The method according to claim 11, wherein performing calculations based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and each quantization scaling factor, to obtain the calculation result corresponding to each quantization scaling factor, comprises: traversing to the first quantization scaling factor in a first scaling factor list, and performing a calculation based on the obtained target quantization bit method, the activation input value corresponding to the data channel, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor; and updating the activation input value corresponding to the data channel, traversing to the next quantization scaling factor adjacent to the first quantization scaling factor, and performing a calculation based on the obtained target quantization bit method, the updated activation input value, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor, until the first scaling factor list has been completely traversed and the calculation result corresponding to each quantization scaling factor in the first scaling factor list has been obtained.
13. The method according to any one of claims 10 to 12, wherein generating the target quantization scaling factor corresponding to each group comprises: obtaining the target quantization bit method corresponding to the second model functional network; for each group corresponding to each data channel in the second model functional network, performing calculations based on the obtained target quantization bit method, the activation input value corresponding to the group, and each of multiple quantization scaling factors, to obtain a calculation result corresponding to each quantization scaling factor; and taking the quantization scaling factor corresponding to the calculation result with the largest absolute value among the multiple calculation results as the target quantization scaling factor of the group.
14. The method according to claim 13, wherein performing calculations based on the obtained target quantization bit method, the activation input value corresponding to the group, and each quantization scaling factor, to obtain the calculation result corresponding to each quantization scaling factor, comprises: traversing to the first quantization scaling factor in a second scaling factor list, and performing a calculation based on the obtained target quantization bit method, the activation input value corresponding to the group, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor; and updating the activation input value corresponding to the group, traversing to the next quantization scaling factor adjacent to the first quantization scaling factor, and performing a calculation based on the obtained target quantization bit method, the updated activation input value, and the traversed quantization scaling factor, to obtain the calculation result corresponding to the traversed quantization scaling factor, until the second scaling factor list has been completely traversed and the calculation result corresponding to each quantization scaling factor in the second scaling factor list has been obtained.
15. The method according to claim 9, wherein generating the target quantization scaling factor based on the target quantization bit method comprises: obtaining the quantization scaling factors corresponding to the target quantization bit method from multiple devices on which a hybrid bit quantization model is deployed; and selecting the largest quantization scaling factor among the multiple quantization scaling factors as the target quantization scaling factor.
16. A quantization-based data processing apparatus, comprising: a quantization processing module, configured to quantize original model parameters corresponding to a model functional network using multiple quantization bit methods, to obtain quantized model parameters corresponding to each quantization bit method; an acquisition module, configured to obtain a first output result of the model functional network under the original model parameters, and to obtain a second output result of the model functional network under each set of quantized model parameters; and a selection module, configured to select a target quantization bit method from the multiple quantization bit methods based on the first output result and the multiple second output results, wherein the target quantization bit method is used to quantize the original model parameters during an inference phase of the model functional network.
17. An electronic device, comprising: one or more processors; and a memory for storing one or more programs that, when executed by the electronic device, cause the electronic device to implement the quantization-based data processing method according to any one of claims 1 to 15.
18. A computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the quantization-based data processing method according to any one of claims 1 to 15.
19. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the quantization-based data processing method according to any one of claims 1 to 15.
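The selection procedure recited in claims 1 to 5 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the applicant's implementation: the function names (`fake_quantize`, `network_output`, `power_difference`, `snr_difference`, `select_target_bits`) are hypothetical, the quantizer is assumed to be symmetric uniform integer quantization, the model functional network is reduced to a single dot product of activation input values and weight values, and the mapping from signal-to-noise ratio to a difference value (negative SNR in decibels) is an assumption, since the claims do not specify it.

```python
import math


def fake_quantize(weights, bits):
    """Quantize to a signed `bits`-bit grid and map back to real values.

    Assumption: symmetric uniform quantization; the claims do not fix a quantizer.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0.0 for _ in weights]
    result = []
    for w in weights:
        q = round(w * qmax / max_abs)          # integer code
        q = max(-qmax, min(qmax, q))           # clamp to the representable range
        result.append(q * max_abs / qmax)      # dequantized weight value
    return result


def network_output(activations, weights):
    """Claim 5: multiply activation input values by weight values (here, a dot product)."""
    return sum(a * w for a, w in zip(activations, weights))


def power_difference(first, second, power=2):
    """Claim 3: the difference raised to a specified power (power=2 is an assumed default)."""
    return abs(first - second) ** power


def snr_difference(first, second):
    """Claim 4: difference value derived from a signal-to-noise ratio.

    Assumption: the difference value is the negative SNR in dB, so that a
    higher SNR yields a smaller difference value.
    """
    signal_power = second ** 2
    noise_power = (first - second) ** 2
    if noise_power == 0:
        return float("-inf")   # identical outputs: smallest possible difference
    if signal_power == 0:
        return float("inf")
    return -10.0 * math.log10(signal_power / noise_power)


def select_target_bits(activations, weights, candidate_bits, difference=power_difference):
    """Claims 1 and 2: pick the candidate whose output is closest to the original output."""
    first = network_output(activations, weights)
    scores = {}
    for bits in candidate_bits:
        second = network_output(activations, fake_quantize(weights, bits))
        scores[bits] = difference(first, second)
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]        # hypothetical activation input values
    w = [0.5, -0.25, 0.125, 1.0]    # hypothetical original weight values
    print(select_target_bits(x, w, [2, 4, 8]))
    print(select_target_bits(x, w, [2, 4, 8], difference=snr_difference))
```

Passing the difference measure as a parameter mirrors the way claims 3 and 4 present two alternative ways of computing the difference value used in claim 2.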