Large language model quantization method based on a theoretically optimal smoothing function

By using a quantization method based on the theoretically optimal smoothing function, the problems of large quantization errors and high computational complexity caused by activation outliers in large language models are solved, achieving efficient quantization under different hardware platforms and task scenarios while maintaining model accuracy.

CN122242596A · Pending · Publication Date: 2026-06-19 · SHANDONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANDONG UNIV
Filing Date: 2026-05-22
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing large language model quantization methods suffer from large quantization errors, high computational complexity, and a lack of theoretical basis when dealing with outliers in the activation distribution, leading to a decrease in model accuracy, especially when activation outliers have a significant impact in specific layers or channels.

Method used

A quantization method based on the theoretically optimal smoothing function is adopted. By obtaining calibration and evaluation datasets, statistically scaling model parameters, designing a smoothing quantization strategy, obtaining the optimal smoothing coefficient, and constructing an INT8 inference model through joint reparameterization, the quantization error is reduced and the quantization effect is improved.

Benefits of technology

While reducing computational overhead, it significantly reduces quantization error, improves quantization performance on different hardware platforms and task scenarios, and maintains model performance.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of large model quantization technology, and in particular provides a large language model quantization method based on a theoretically optimal smoothing function. The method includes acquiring calibration and evaluation datasets; processing the large language model to be quantized and statistically scaling the model parameters; designing a smoothing quantization strategy to obtain the optimal smoothing coefficients for each input channel; joint reparameterization; statistically scaling the static quantization scale; and constructing an INT8 inference model. This method reduces computational overhead while lowering quantization errors and improves quantization performance across different hardware platforms and task scenarios.

Description

Technical Field

[0001] This invention relates to the field of large model quantization technology, and in particular to a large language model quantization method based on a theoretically optimal smoothing function.

Background Technology

[0002] In recent years, with the introduction of Transformer and Mixture-of-Experts (MoE) architectures, deep learning models have readily exceeded hundreds of millions of parameters and continue to grow. While increasing model size has significantly improved expressive power and task performance, it has also brought high storage overhead, memory bandwidth pressure, long inference latency, and high deployment costs. Reducing deployment cost and improving inference efficiency while preserving model accuracy as far as possible has therefore become a key issue in the practical application of large models.

[0003] Quantization is one of the important means of addressing these problems. Among existing quantization methods, post-training quantization (PTQ) has become an important technical route for compressing and accelerating large language models because it requires no model retraining, has a low implementation cost, and is convenient to deploy. However, PTQ usually causes some loss of accuracy, especially when the model contains many outliers or irregular values, in which case the quantization error increases significantly and degrades overall model performance. Generally speaking, the weight distribution in large models is relatively concentrated, while the activation distribution is more prone to outliers, especially in certain layers or channels. These few but extremely large activation values significantly widen the overall quantization interval, so that most normal values are mapped onto a coarser quantization grid, ultimately reducing quantization accuracy. If the quantization interval is narrowed instead, the outliers are truncated or compressed, and key information is lost. The activation outlier problem has therefore become one of the core bottlenecks limiting the performance of post-training quantization for large models.

[0004] Although existing methods have alleviated the outlier problem in large model quantization to some extent, they still generally suffer from the following shortcomings. First, the setting of quantization parameters or smoothing factors relies heavily on experience or additional searches and lacks a rigorous theoretical basis. Second, some methods have high computational complexity, which increases quantization and deployment costs. Third, existing methods have not fully analyzed, from the perspective of the error mechanism, the phenomenon that activation outliers concentrate in a few specific layers or channels, making it difficult to achieve better error control while preserving computational efficiency. Weight quantization is usually not the primary issue; rather, the few outlier layers and channels in the activations are the key factor behind the loss of accuracy in post-training quantization.

Summary of the Invention

[0005] In view of this, the present invention provides a large language model quantization method based on the theoretically optimal smoothing function, which reduces quantization error while reducing computational overhead and improves quantization performance on different hardware platforms and in different task scenarios.

[0006] In a first aspect, the present invention provides a large language model quantization method based on a theoretically optimal smoothing function, the method comprising:

[0007] Step 1: Obtain the calibration dataset and the evaluation dataset; Step 2: Based on Step 1, process the large language model to be quantized and collect statistics on the model parameter scales; Step 3: Based on Step 2, design a smoothing quantization strategy to obtain the optimal smoothing coefficient for each input channel; Step 4: Based on Step 3, perform joint reparameterization, collect static quantization scale statistics, and construct the INT8 inference model.

[0008] Optionally, step 1 includes:
Step 11: Calibration dataset. The calibration dataset is used to collect the activation information of each layer during forward propagation of the model and to compute the quantization scale parameters of the model.
a. Data selection: The Pile validation set is used as calibration data, and 512 text samples are randomly selected to balance statistical sufficiency and computational efficiency.
b. Data preprocessing: All text is encoded into token ID sequences by a tokenizer, and every sequence is brought to a fixed length of 512: longer sequences are truncated and shorter sequences are padded, yielding the model input tensor.
c. Forward-propagation sampling: The calibration data is fed into the model and a forward pass is run to collect the activation information of each layer.
Step 12: Evaluation datasets. To comprehensively evaluate the quantized model on language modeling, knowledge reasoning, and downstream tasks, benchmark datasets are selected for evaluation, including the LAMBADA dataset and the ARC-Easy and ARC-Challenge datasets. LAMBADA tests long-range contextual dependency and word prediction; its key metric is last-word prediction accuracy, and each sample is evaluated by checking whether the model correctly predicts the last word of the sentence. ARC-Easy and ARC-Challenge test scientific question answering and reasoning; their key metric is multiple-choice accuracy, and evaluation is standardized through a unified evaluation framework to ensure that results are comparable.

[0009] Optionally, step 2 includes: First, the structure of the large language model to be quantized is analyzed to extract its linear layers. Then, based on the calibration data, the input activations of each linear layer are measured to obtain the activation scales required by the subsequent smoothing operation, and the outlier characteristics of the activation distribution are analyzed.
Step 21: Linear-layer processing and representation of the large language model. A large language model with a Transformer architecture is selected as the model to be quantized. Its structure is analyzed to extract the linear layers, and each linear layer serves as the basic object of the subsequent smoothing quantization. All nn.Linear layers in the model are traversed and a forward hook is registered on each one; during forward propagation, the hook function automatically intercepts the input activation tensor of the linear layer.
Step 22: Activation scale statistics.
Activation tensor reshaping: The input activation tensor of each linear layer usually has shape (B, L, C). It is reshaped into a two-dimensional activation matrix of shape (B×L, C), so that every token is treated as a sample and statistics are computed along the channel dimension, where B is the batch size, L is the sequence length, and C is the channel dimension.
Channel scale calculation: For the reshaped activation matrix, the maximum absolute value is computed for each channel of the last dimension (the hidden, or channel, dimension), giving the activation scale of each input channel.
Cross-sample accumulation: For the same linear layer, this process is repeated over multiple input samples with a channel-wise maximum update: if a channel's scale in the current sample exceeds the historical scale, it is updated; otherwise the historical scale is kept. The result is the global per-channel maximum of the layer over all samples, which is used as the activation scale.
Activation scale storage: The results are stored in an activation scale dictionary that maps each layer name to its scale vector.
Step 23: Weight scale statistics. For each group of linear layers participating in smoothing, the maximum absolute value of the weight tensor is first computed for each input channel to obtain the channel-wise weight scale. For multiple linear layers in the same group, a channel-wise maximum aggregation is applied to obtain a unified weight scale vector.

[0010] Optionally, step 3 includes: The TOSQ smoothing quantization strategy is designed with the goal of minimizing quantization error. The mean and variance of the quantization error are analyzed theoretically to derive the optimal smoothing coefficient for each input channel. On this basis, a channel-level invertible smoothing matrix is constructed to adaptively and jointly redistribute the activations and weights along the channel dimension.
Step 31: Equivalent linear transformation and construction of the channel-level smoothing matrix. For any linear layer in the large language model to be quantized, let its input activation matrix be $X \in \mathbb{R}^{T \times C_{in}}$ and its weight matrix be $W \in \mathbb{R}^{C_{in} \times C_{out}}$; the corresponding original output matrix is $Y = XW \in \mathbb{R}^{T \times C_{out}}$, where $T$ is the number of tokens, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. The entries of $X$ and $W$ are written $x_{t,i}$ and $w_{i,j}$ respectively. An invertible smoothing matrix $S$ is introduced into the matrix multiplication. To reduce the complexity of matrix inversion and of deployment, $S$ is set to a diagonal matrix: $S = \mathrm{diag}(k_1, k_2, \dots, k_{C_{in}})$, where $k_i$ is the smoothing coefficient of the $i$-th input channel and $k_i > 0$. The original output matrix is therefore equivalently rewritten as $Y = XW = (X S^{-1})(S W)$. Under this transformation, the objects to be quantized in the original linear layer change from $X$ and $W$ to $\tilde{X} = X S^{-1}$ and $\tilde{W} = S W$. That is, the activation values of the $i$-th input channel are scaled to $x_{t,i} / k_i$, and the corresponding weight row is scaled to $k_i w_{i,j}$. For any element of the output matrix: $y_{t,j} = \sum_{i=1}^{C_{in}} \frac{x_{t,i}}{k_i} \cdot k_i w_{i,j} = \sum_{i=1}^{C_{in}} x_{t,i} w_{i,j}$.
Step 32: Establishing the optimization objective and solving for the smoothing coefficient.
I. Error analysis and calculation of the expected value and variance. Asymmetric quantization is used for the smoothed activations, and symmetric quantization is used for the smoothed weights. For the activation matrix, let its maximum and minimum values be $x_{\max}$ and $x_{\min}$; under $n$-bit quantization, the quantization and dequantization of the activation matrix are expressed as: [equation missing in source]. The activation quantization and dequantization error is obtained as: [equation missing in source]. With round-to-nearest rounding, this error is a periodic function whose values lie within a bounded range, and its magnitude depends on the activation quantization scale. For the weight matrix, let its maximum absolute value be $w_{\max}$; under $n$-bit signed symmetric quantization, the quantization and dequantization of the weight matrix are expressed as: [equation missing in source]. The weight quantization and dequantization error is obtained as: [equation missing in source]. With round-to-nearest rounding, this error is likewise a periodic function whose values lie within a bounded range, and its magnitude depends on the weight quantization scale. Writing the floating-point result of a single multiplication as $xw$ and the result after quantization and dequantization as $x_q w_q$, with $x_q = x + e_x$ and $w_q = w + e_w$, the quantization error of a single multiplication term is expressed as: $x_q w_q - xw = e_x w + x e_w + e_x e_w$. The quantization error thus consists of three parts: the error term formed by the activation quantization error weighted by the weight, the error term formed by the weight quantization error amplified by the activation, and the coupling term formed by the two errors together. Assuming that the entries of the activation matrix X and the weight matrix W are continuous and uniformly distributed, the expected value and variance of the error are derived as follows: [equations missing in source]. Integrating over w and x respectively gives the variance of the error; the derivation is as follows: [equations missing in source]. Given the above form of the variance, a further assumption that one quantity is 0 (its symbol is missing in the source) simplifies the variance formula to: [equation missing in source].
II. Analysis of the condition for minimum error variance. The influence of the diagonal smoothing matrix on the expected value and variance of the error is analyzed. Minimizing the variance is equivalent to minimizing an objective function: [equation missing in source]. With A(k) and B(k) defined as follows, the objective becomes minimizing A(k)B(k): [equations missing in source]. It is then proved that the following condition holds when the variance is minimized: [equation missing in source].
Step 33: Solving for the smoothing coefficient. Following Nagel's definition of the per-channel precision, $p_i = r_i / R$, where $r_i$ is the value range of channel $i$ of the matrix and $R$ is the total value range of the matrix, the function to be solved is: [equation missing in source]. The other parameters are defined as follows: [equations missing in source]. Solving this function is equivalent to solving: [equation missing in source]. Because the condition above holds when the error variance is minimized, the solution is obtained: [equation missing in source].

[0011] Optionally, step 4 includes: Joint reparameterization is performed on the original floating-point model so that smoothing is completed at the whole-model level. The static quantization scales required by each linear layer during INT8 inference are then collected from the calibration data, and the corresponding INT8 inference model is constructed.
Step 41: Joint reparameterization. After the smoothing coefficient of each input channel is obtained, the normalization layer and the linear layers that follow it are jointly reparameterized so that the equivalent transformation is explicitly folded into the model parameters: the parameters of the normalization layer are divided by the smoothing coefficients, and the weights of the subsequent linear layers are multiplied by the same coefficients. For LayerNorm, both the weight and the bias are adjusted; for RMSNorm, only the weight is adjusted. All decoder layers of the model to be quantized are then traversed and this operation is applied to each in turn, completing the smoothing reparameterization of the entire model.
Step 42: Static quantization scale statistics. After the whole model has been smoothed, the static quantization scales required for INT8 inference are determined from the calibration data and used for the subsequent INT8 model construction and integer inference. To meet deployment requirements, the maximum absolute value of each key tensor is measured at whole-tensor granularity to construct scaling factors suitable for static quantization inference. The maximum absolute value of each tensor is mapped onto the signed INT8 representation range $[-127, 127]$, and the corresponding quantization scaling factor is computed accordingly. For the current tensor, the scaling factor is $\mathrm{scale} = \max|x| / 127$, where $\max|x|$ is the global maximum absolute value of the current tensor over the calibration data. In actual quantization, the floating-point tensor $x$ is converted into an INT8 tensor $x_q$ according to the scaling factor, satisfying $x_q = \mathrm{clamp}(\mathrm{round}(x / \mathrm{scale}), -127, 127)$, where $\mathrm{round}(\cdot)$ denotes rounding and $\mathrm{clamp}(\cdot)$ denotes truncation to the INT8 range.
Step 43: INT8 model construction and inference. After the smoothed floating-point model and the static quantization scales of each layer are obtained, the corresponding INT8 inference model is constructed and forward inference is performed with it. The linear layers of the original floating-point model are replaced with corresponding INT8 linear modules, and the previously collected input scales, output scales, and weight quantization parameters are written into each module for quantization, multiply-accumulate, rescaling, and dequantization in the integer domain. During inference, the input activation is first quantized to INT8 using the input scaling factor. The INT8 linear operator is then invoked to perform matrix multiplication or multiply-add on the INT8 activations and INT8 weights in the integer domain, producing an intermediate integer result. This result is rescaled using the input, weight, and output scaling factors. Depending on the requirements of the next module, it is either kept in integer form for further propagation or dequantized to floating point before being passed on. This process is repeated layer by layer in the attention and feed-forward branches of the Transformer to complete INT8 forward inference for the entire large language model.
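The static quantization and integer inference path of Steps 42 and 43 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names are hypothetical, one per-tensor scale is assumed for both activations and weights, and the scales in the usage note are taken from the tensors themselves as a stand-in for calibration statistics.

```python
import numpy as np

QMAX = 127  # symmetric INT8 bound, i.e. the range [-127, 127]


def static_scale(calibration_max_abs: float) -> float:
    """Scaling factor that maps the largest observed magnitude onto 127."""
    return calibration_max_abs / QMAX


def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Round to nearest, clamp to [-127, 127], and store as INT8."""
    return np.clip(np.round(x / scale), -QMAX, QMAX).astype(np.int8)


def int8_matmul(x_q: np.ndarray, w_q: np.ndarray) -> np.ndarray:
    """Integer-domain matrix product with INT32 accumulation."""
    return x_q.astype(np.int32) @ w_q.astype(np.int32)


def dequantize_output(acc: np.ndarray, x_scale: float, w_scale: float) -> np.ndarray:
    """Rescale the integer accumulator back to floating point."""
    return acc * (x_scale * w_scale)


def requantize_output(acc: np.ndarray, x_scale: float, w_scale: float,
                      out_scale: float) -> np.ndarray:
    """Rescale the accumulator to the output scale and keep it as INT8."""
    return quantize(acc * (x_scale * w_scale), out_scale)
```

For example, with `X` of shape (tokens, in_channels) and `W` of shape (in_channels, out_channels), `dequantize_output(int8_matmul(quantize(X, sx), quantize(W, sw)), sx, sw)` approximates `X @ W`, where `sx` and `sw` come from `static_scale`. `requantize_output` corresponds to the case where the next module consumes an integer tensor.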

[0012] Optionally, the proof of the condition that holds when the variance is minimized proceeds by contradiction and includes the following. Assume that there exists an optimal solution $k$ for which the condition does not hold: [condition missing in source]. Consider a local modification of $k$: only the component whose index attains the maximum in $B(k)$ is decreased by a very small positive amount, while all other components remain unchanged, giving a new vector $k'$: [equation missing in source]. As long as the decrease is taken sufficiently small, the following two points hold: ① the index attaining the maximum in the first term is unchanged, that is, $A(k')$ is still determined by the original index; ② the second term $B(k)$ becomes smaller, because the modified component is the one that attains its maximum, and reducing it reduces $B$. For the new vector $k'$ it follows that $A(k')B(k') < A(k)B(k)$, which contradicts the assumption that $k$ is the optimal solution. The assumption is therefore invalid, and the condition must hold: [equation missing in source].

[0013] In a second aspect, embodiments of the present invention provide a computer-readable storage medium comprising a stored program, wherein, when the program is executed, it controls the device where the computer-readable storage medium is located to execute the large language model quantization method based on the theoretically optimal smoothing function in the first aspect or any possible implementation thereof.

[0014] In a third aspect, embodiments of the present invention provide an electronic device, including: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and include instructions that, when executed by the device, cause the device to perform the large language model quantization method based on a theoretically optimal smoothing function in the first aspect or any possible implementation of the first aspect.

[0015] The technical solution provided by this invention includes obtaining a calibration dataset and an evaluation dataset; processing the large language model to be quantized and collecting statistics on the model parameter scales; designing a smoothing quantization strategy to obtain the optimal smoothing coefficient for each input channel; and performing joint reparameterization, collecting static quantization scale statistics, and constructing an INT8 inference model. This method reduces quantization error while reducing computational overhead and improves quantization performance on different hardware platforms and in different task scenarios.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of a large language model quantization method based on a theoretically optimal smoothing function provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the smoothing quantization strategy provided in an embodiment of the present invention; Figure 3 is a schematic diagram of asymmetric quantization error provided in an embodiment of the present invention; Figure 4 is a schematic diagram of symmetric quantization error provided in an embodiment of the present invention; Figure 5 is a schematic diagram of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] The terminology used in the embodiments of this invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms “a,” “an,” and “the” used in the embodiments of this invention are also intended to include the plural forms unless the context clearly indicates otherwise.

[0020] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.

[0021] Depending on the context, the word "if" as used here can be interpreted as "when," "upon," "in response to determination," or "in response to detection." Similarly, depending on the context, the phrase "if determination" or "if detection (of the stated condition or event)" can be interpreted as "when determination," "in response to determination," "when detection (of the stated condition or event)," or "in response to detection (of the stated condition or event)."

[0022] Figure 1 is a flowchart of a large language model quantization method based on a theoretically optimal smoothing function provided in an embodiment of the present invention. As shown in Figure 1, the method includes: Step 1: Obtain the calibration dataset and the evaluation dataset.

[0023] In this embodiment of the invention, a calibration dataset and an evaluation dataset are obtained in order to collect the activation distribution information and quantization scale parameters required during quantization and to measure the performance of the quantized model. Step 1 includes:
Step 11: Calibration dataset. The calibration dataset is used to collect the activation information of each layer during forward propagation of the model and to compute the quantization scale parameters of the model. This process does not involve model training and is used only for parameter estimation before quantization.
a. Data selection: The Pile validation set is used as calibration data. This dataset is a widely used, large-scale, multi-domain English text collection that can provide a representative activation distribution. 512 text samples are randomly selected to balance statistical sufficiency and computational efficiency.
b. Data preprocessing: All text is encoded into token ID sequences by a tokenizer. To meet the model input requirements, every sequence is brought to a fixed length of 512 (the maximum context length of the model): longer sequences are truncated and shorter sequences are padded, yielding the model input tensor.
c. Forward-propagation sampling: The calibration data is fed into the model and a forward pass is run to collect the activation information of each layer.
Step 12: Evaluation datasets. To comprehensively evaluate the quantized model on language modeling, knowledge reasoning, and downstream tasks, benchmark datasets are selected for evaluation, including the LAMBADA dataset and the ARC-Easy and ARC-Challenge datasets. LAMBADA tests long-range contextual dependency and word prediction; its key metric is last-word prediction accuracy, and each sample is evaluated by checking whether the model correctly predicts the last word of the sentence. ARC-Easy and ARC-Challenge test scientific question answering and reasoning; their key metric is multiple-choice accuracy, and evaluation is standardized through a unified evaluation framework (such as lm-eval-harness) to ensure that results are comparable.
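The sample selection and fixed-length preprocessing of Step 11 can be sketched as below. This is an illustrative sketch only: the function names, the fixed random seed, right-side padding, and a padding ID of 0 are assumptions that the text does not specify, and the tokenizer itself is left out.

```python
import random
from typing import List, Sequence

SEQ_LEN = 512      # fixed input length stated in the text
NUM_SAMPLES = 512  # number of calibration samples stated in the text


def select_calibration_samples(texts: Sequence[str], n: int = NUM_SAMPLES,
                               seed: int = 0) -> List[str]:
    """Randomly pick n calibration texts (seeded so the choice is repeatable)."""
    return random.Random(seed).sample(list(texts), n)


def pad_or_truncate(token_ids: List[int], length: int = SEQ_LEN,
                    pad_id: int = 0) -> List[int]:
    """Truncate sequences longer than `length`; right-pad shorter ones."""
    if len(token_ids) >= length:
        return token_ids[:length]
    return token_ids + [pad_id] * (length - len(token_ids))
```

In practice the padding ID should be the one defined by the model's tokenizer, and padded positions would normally be masked out so that they do not contribute to the activation statistics.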

[0024] Step 2: Based on Step 1, process the large language model to be quantized and statistically analyze the model parameter scale.

[0025] In this embodiment of the invention, step 2 includes: First, the structure of the large language model to be quantized (such as an OPT series model) is analyzed to extract its linear layers. Then, based on the calibration data, the input activations of each linear layer are measured to obtain the activation scales required by the subsequent smoothing operation, and the outlier characteristics of the activation distribution are analyzed.
Step 21: Linear-layer processing and representation of the large language model. A large language model with a Transformer architecture is selected as the model to be quantized. Its structure is analyzed to extract the linear layers, and each linear layer serves as the basic object of the subsequent smoothing quantization. All nn.Linear layers in the model are traversed and a forward hook is registered on each one; during forward propagation, the hook function automatically intercepts the input activation tensor of the linear layer.
Step 22: Activation scale statistics.
Activation tensor reshaping: The input activation tensor of each linear layer usually has shape (B, L, C). For statistical purposes, it is reshaped into a two-dimensional activation matrix of shape (B×L, C), so that every token is treated as a sample and statistics are computed along the channel dimension. Here, B is the batch size, L is the sequence length, and C is the channel dimension.
Channel scale calculation: For the reshaped activation matrix, the maximum absolute value is computed for each channel of the last dimension (the hidden, or channel, dimension), giving the activation scale of each input channel.
Cross-sample accumulation: For the same linear layer, this process is repeated over multiple input samples with a channel-wise maximum update: if a channel's scale in the current sample exceeds the historical scale, it is updated; otherwise the historical scale is kept. The result is the global per-channel maximum of the layer over all samples, which is used as the activation scale.
Activation scale storage: The results are stored in an activation scale dictionary that maps each layer name to its scale vector, for example: the q_proj input scale in the layer-0 attention; the k_proj input scale in the layer-0 attention; the v_proj input scale in the layer-0 attention; the fc1 input scale in the layer-0 feed-forward network; and the corresponding scales of all subsequent layers. The results of this step provide the input for the subsequent smoothing function.
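The reshaping, per-channel maximum, and cross-sample accumulation of Step 22 can be sketched in NumPy as follows. This is an illustrative sketch: `update_activation_scale` is a hypothetical helper that, in a real pipeline, would be called from the forward hook registered on each nn.Linear layer, and the layer name used as the dictionary key is only an example.

```python
import numpy as np
from typing import Dict


def update_activation_scale(scales: Dict[str, np.ndarray], layer_name: str,
                            activation: np.ndarray) -> Dict[str, np.ndarray]:
    """Fold one batch of input activations into the running per-channel maximum.

    `activation` has shape (B, L, C); the result stored per layer has shape (C,).
    """
    # Treat every token as a sample: (B, L, C) -> (B*L, C).
    flat = np.abs(activation).reshape(-1, activation.shape[-1])
    batch_max = flat.max(axis=0)  # maximum |x| of each channel in this batch
    if layer_name in scales:
        # Channel-wise update: keep whichever of old and new is larger.
        scales[layer_name] = np.maximum(scales[layer_name], batch_max)
    else:
        scales[layer_name] = batch_max
    return scales
```

Calling the helper once per calibration batch with the same key (for instance a name such as `"layers.0.self_attn.q_proj"`, chosen here for illustration) leaves the global per-channel maximum in the dictionary once all batches have been processed.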

[0026] Step 23: Weight scale statistics. For each group of linear layers participating in smoothing, the maximum absolute value of the weight tensor is first computed for each input channel to obtain the channel-wise weight scale. For multiple linear layers in the same group (such as the Q / K / V projection layers), a channel-wise maximum aggregation is applied to obtain a unified weight scale vector.
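The weight scale aggregation of Step 23 can be sketched as below. The sketch assumes each weight is stored with shape (out_features, in_features), as PyTorch's nn.Linear does, so the input channel is the last axis; `group_weight_scale` is a hypothetical helper name.

```python
import numpy as np
from typing import Sequence


def group_weight_scale(weights: Sequence[np.ndarray]) -> np.ndarray:
    """Unified per-input-channel max |w| for the linear layers of one group.

    Every weight has shape (out_features, in_features) and all weights in the
    group share the same in_features.
    """
    # Per-layer scale: largest magnitude seen by each input channel.
    per_layer = [np.abs(w).max(axis=0) for w in weights]
    # Channel-wise maximum across the layers of the group (e.g. Q, K, V).
    return np.maximum.reduce(per_layer)
```

For a Q / K / V group that reads the same normalized input, the three projection weights are passed together so that a single scale vector, and later a single smoothing coefficient per channel, serves all three layers.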

[0027] Step 3: Based on Step 2, design a smoothing quantization strategy to obtain the optimal smoothing coefficient for each input channel.

[0028] In this embodiment of the invention, step 3 includes the Theory-Optimal SmoothQuant (TOSQ) strategy. Existing smoothing quantization methods mainly rely on empirical ratios or preset hyperparameters to scale and distribute activations and weights. In contrast, this invention takes minimizing the quantization error as its goal: it analyzes the mean and variance of the quantization error theoretically, derives the optimal smoothing coefficient for each input channel, and constructs a channel-level invertible smoothing matrix accordingly to adaptively and jointly redistribute activations and weights along the channel dimension. As shown in Figure 2, the optimal smoothing coefficient is first obtained from the mean and variance analysis of the error, and adaptive smoothing is then applied to each input channel. The smoothing coefficient is determined by the value range of the activation channel and the maximum absolute value of the corresponding weight channel, and characterizes the optimal scaling ratio between the activation side and the weight side for that channel. Applying the coefficient to each channel separately compresses outlier activation channels in a targeted way and transfers part of the quantization difficulty to the weight side, thereby minimizing the overall quantization error.

[0029] Step 31: Equivalent linear transformation and channel-level smoothing matrix construction; For any linear layer in the large language model to be quantized, let its input activation matrix be $X \in \mathbb{R}^{T \times C_{in}}$, its weight matrix be $W \in \mathbb{R}^{C_{in} \times C_{out}}$, and the corresponding original output matrix be $Y = XW \in \mathbb{R}^{T \times C_{out}}$, where $T$ is the number of tokens, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels; the elements of $X$ and $W$ are written $x_{t,i}$ and $w_{i,j}$. To reduce the impact of activation outliers on the choice of quantization scale, an invertible smoothing matrix $K$ is introduced into the matrix multiplication. To reduce the complexity of matrix inversion and of deployment, $K$ is set as a diagonal matrix: $$K = \mathrm{diag}(k_1, k_2, \ldots, k_{C_{in}})$$ where $k_i$ is the smoothing coefficient of the $i$-th input channel and $k_i > 0$. The original output matrix is therefore equivalently rewritten as: $$Y = XW = (XK^{-1})(KW)$$ Based on this transformation, the quantization objects of the original linear layer change from $X$ and $W$ to: $$\tilde{X} = XK^{-1}, \qquad \tilde{W} = KW$$ That is, the activation values of the $i$-th input channel are scaled to $x_{t,i}/k_i$, and the corresponding $i$-th row of the weights is scaled to $k_i w_{i,j}$. For any element of the output matrix: $$y_{t,j} = \sum_{i=1}^{C_{in}} \frac{x_{t,i}}{k_i}\,(k_i w_{i,j}) = \sum_{i=1}^{C_{in}} x_{t,i}\, w_{i,j}$$ The smoothing transformation therefore preserves the matrix multiplication result in the floating-point domain while changing the numerical distribution of activations and weights before quantization. By choosing $k_i$ appropriately, the amplitude of abnormal activation channels can be reduced, resulting in a smaller quantization error during low-bit quantization.
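The equivalence in Step 31 can be checked numerically. The sketch below uses random data and arbitrarily chosen coefficients `k` (illustrative values, not the optimal coefficients derived in Step 3), dividing the activation columns by `k` and multiplying the weight rows by `k`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))      # T=4 tokens, C_in=3 input channels
X[:, 1] *= 50.0                  # inject an outlier channel
W = rng.normal(size=(3, 2))      # C_in=3, C_out=2
k = np.array([1.0, 8.0, 0.5])    # illustrative positive smoothing coefficients

X_smooth = X / k                 # scales column i of X by 1/k_i
W_smooth = W * k[:, None]        # scales row i of W by k_i

Y = X @ W
Y_smooth = X_smooth @ W_smooth

outlier_before = np.abs(X[:, 1]).max()
outlier_after = np.abs(X_smooth[:, 1]).max()
```

The two products agree up to floating-point rounding, while the outlier channel's magnitude shrinks by the factor `k[1]` and the matching weight row grows by the same factor.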

[0030] Step 32: Establish the optimization objective and solve for the smoothing coefficient; I. Error analysis and expected variance calculation: Since the activation values in large language models are usually asymmetrically distributed, while the weight values are usually approximately symmetrically distributed, this invention uses asymmetric quantization for the smoothed activations and symmetric quantization for the smoothed weights. For the smoothed activation matrix, let its maximum and minimum values be $x_{\max}$ and $x_{\min}$. Under $b$-bit quantization, the quantization and dequantization of an activation value $x$ are expressed as: $$\Delta_x = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad \hat{x} = \Delta_x \cdot \mathrm{round}\!\left(\frac{x - x_{\min}}{\Delta_x}\right) + x_{\min}$$ The activation quantization and dequantization error is: $$\varepsilon_x = \hat{x} - x$$ With round-to-nearest rounding, as shown in Figure 3, this error is a periodic function of $x$ lying within $[-\Delta_x/2,\ \Delta_x/2]$, and its magnitude is governed by the activation quantization scale $\Delta_x$. For the smoothed weight matrix, let its maximum absolute value be $w_{\max}$. Under $b$-bit signed symmetric quantization, the quantization and dequantization of a weight value $w$ are expressed as: $$\Delta_w = \frac{w_{\max}}{2^{b-1} - 1}, \qquad \hat{w} = \Delta_w \cdot \mathrm{round}\!\left(\frac{w}{\Delta_w}\right)$$ The weight quantization and dequantization error is: $$\varepsilon_w = \hat{w} - w$$ With round-to-nearest rounding, as shown in Figure 4, this error is a periodic function of $w$ lying within $[-\Delta_w/2,\ \Delta_w/2]$, and its magnitude is governed by the weight quantization scale $\Delta_w$. The floating-point result of a single multiplication is $y = xw$, and the result after quantization and dequantization is $\hat{y} = \hat{x}\hat{w}$. The quantization error of a single multiplication term is expressed as: $$\varepsilon = \hat{x}\hat{w} - xw = w\,\varepsilon_x + x\,\varepsilon_w + \varepsilon_x \varepsilon_w$$ The quantization error thus consists of three parts: the activation quantization error weighted by the weight, the weight quantization error amplified by the activation, and the coupling error term formed by the combined effect of the two.
Assuming that the entries of the activation matrix X and the weight matrix W are continuous and uniformly distributed, the expectation and variance of the error are derived analytically: the error terms are integrated over w and x respectively to obtain the variance of the error, and, after assuming that one term of this expression equals 0, the variance reduces to a simplified form. II. Analysis of the minimum error variance condition: the diagonal smoothing matrix influences the expectation and variance of the error through the vector k of smoothing coefficients, and minimizing the variance is equivalent to minimizing an objective function of k. Two quantities A(k) and B(k), each a maximum taken over the input channels, are defined so that the objective becomes minimizing the product A(k)B(k). It is then proved that, when the variance is minimized, the maxima in A(k) and B(k) are attained at the same channel index. In this embodiment of the invention, the proof proceeds by contradiction. Assume an optimal solution k exists for which the two maxima are attained at different indices. Make a local modification to k: change only the component at the index where B(k) attains its maximum, decreasing it by a very small positive amount while keeping all other components unchanged, which gives a new vector k'. If the decrease is taken small enough, the following two points hold: ① the index attaining the maximum of the first term is unchanged, that is, A(k') = A(k), because the component that dominates A is not modified and the small change to the other component cannot raise its term above the original maximum; ② the second term becomes smaller, that is, B(k') < B(k), because the component that attained its maximum has been reduced. For the new vector k', therefore, A(k')B(k') < A(k)B(k), which contradicts the assumption that k is optimal (that is, that k minimizes A·B). The assumption is therefore invalid, and at the optimum the two maxima must be attained at the same index.
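The three-part error decomposition of Step 32 can be verified numerically with a short sketch. The quantizers below are common textbook forms of asymmetric and symmetric round-to-nearest quantization, assumed here for illustration; the helper names, the bit width, and the random data are not taken from this description.

```python
import numpy as np

def fake_quant_asymmetric(x, bits=8):
    """Asymmetric quantize-dequantize over [min, max]; returns (x_hat, step)."""
    x_min, x_max = x.min(), x.max()
    step = (x_max - x_min) / (2 ** bits - 1)
    return np.round((x - x_min) / step) * step + x_min, step

def fake_quant_symmetric(w, bits=8):
    """Signed symmetric quantize-dequantize over [-max|w|, max|w|]; returns (w_hat, step)."""
    q_max = 2 ** (bits - 1) - 1
    step = np.abs(w).max() / q_max
    return np.clip(np.round(w / step), -q_max, q_max) * step, step

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 3.0, size=1000)   # asymmetric activation-like values
w = rng.uniform(-0.5, 0.5, size=1000)   # symmetric weight-like values

x_hat, step_x = fake_quant_asymmetric(x)
w_hat, step_w = fake_quant_symmetric(w)
eps_x, eps_w = x_hat - x, w_hat - w

total_error = x_hat * w_hat - x * w
decomposed = w * eps_x + x * eps_w + eps_x * eps_w
```

Each rounding error stays within half a quantization step, which is why shrinking the activation range through smoothing directly shrinks the first error term.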

[0031] Step 33: Solve for the smoothing coefficient; Following Nagel's definition of the per-channel precision, $p_i = r_i / R$, where $r_i$ is the value range of channel $i$ of the matrix and $R$ is the total value range of the matrix, the function to be solved is formulated and, with auxiliary parameters defined, converted into an equivalent problem. Because the condition proved in Step 32 holds when the error variance is minimized, the solution for the smoothing coefficient of each input channel is obtained.
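As a rough illustration of the kind of objective discussed in Steps 32 and 33, the sketch below assumes one concrete form: A(k) is the largest smoothed activation range, B(k) is the largest smoothed weight magnitude, and the coefficients are chosen as the square root of the ratio between activation range and weight magnitude. These definitions and the square-root rule are assumptions made for this example, not formulas quoted from this description.

```python
import numpy as np

def objective(k, act_range, w_max):
    """Assumed objective: A(k) * B(k) with A = max_i(r_i / k_i), B = max_i(k_i * m_i)."""
    return np.max(act_range / k) * np.max(k * w_max)

def balanced_coefficients(act_range, w_max):
    """Assumed rule k_i = sqrt(r_i / m_i), which equalizes r_i / k_i and k_i * m_i."""
    return np.sqrt(act_range / w_max)

act_range = np.array([0.5, 40.0, 1.0, 2.0])   # toy per-channel activation ranges
w_max = np.array([0.8, 0.1, 0.5, 0.4])        # toy per-channel weight maxima

k_identity = np.ones_like(act_range)
k_balanced = balanced_coefficients(act_range, w_max)

# For any positive k, A(k)B(k) >= max_i(r_i * m_i), since the same index can be used in both maxima.
lower_bound = np.max(act_range * w_max)
idx_a = int(np.argmax(act_range / k_balanced))
idx_b = int(np.argmax(k_balanced * w_max))
```

Under these assumptions the balanced coefficients reach the lower bound, and both maxima fall on the same channel, which matches the condition argued in the proof of Step 32.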

[0032] Step 4: Based on Step 3, perform joint reparameterization, collect static quantization scale statistics, and construct the INT8 inference model.

[0033] In this embodiment of the invention, step 4 includes: based on the optimal smoothing coefficients of each input channel obtained from the mean and variance analysis of the quantization error, joint reparameterization is performed on the original floating-point model so that smoothing is completed at the whole-model level. The static quantization scales required by each linear layer during INT8 inference are then collected from the calibration data, and the corresponding INT8 inference model is constructed, realizing the practical deployment of the smoothing quantization method in large language models.
Step 41: Joint reparameterization; After the smoothing coefficients of each input channel are obtained, joint reparameterization is performed on each normalization layer and its subsequent linear layers so that the equivalent transformation is folded explicitly into the model parameters: the normalization layer parameters are divided by the smoothing coefficients, and the weights of the subsequent linear layers are multiplied by the same smoothing coefficients. For LayerNorm, both the weight and the bias are adjusted; for RMSNorm, only the weight is adjusted. All decoder layers of the model to be quantized are traversed and processed in turn, completing the smoothing reparameterization of the entire model.
Step 42: Static quantization scale statistics; After whole-model smoothing is complete, the static quantization scales required for INT8 inference are collected from the calibration data and used for subsequent INT8 model construction and integer inference computation. Unlike the channel-wise activation scale statistics used in the smoothing stage, a whole-tensor maximum is used for each key tensor to build a scaling factor suited to static quantization inference at deployment. Within the representation range of signed INT8 integers, the maximum absolute value of each tensor is mapped to the integer range $[-127, 127]$, and the quantization scaling factor is computed accordingly. For the current tensor, the scaling factor is expressed as: $$\mathrm{scale} = \frac{x_{\mathrm{absmax}}}{127}$$ where $x_{\mathrm{absmax}}$ is the global maximum absolute value of the current tensor on the calibration data. In actual quantization, the floating-point tensor $x$ is converted to the INT8 tensor $x_q$ according to the scaling factor, satisfying: $$x_q = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{x}{\mathrm{scale}}\right),\ -127,\ 127\right)$$ where $\mathrm{round}$ denotes the rounding operation and $\mathrm{clamp}$ denotes truncation to the INT8 range.
Step 43: INT8 model construction and inference execution; After the smoothed floating-point model and the static quantization scale of each layer are obtained, the corresponding INT8 inference model is constructed and forward inference is performed with it. The linear layers of the original floating-point model are replaced with the corresponding INT8 linear calculation modules, and the previously collected input scale, output scale, and weight quantization parameters are written into each module for quantization, multiply-accumulate, rescaling, and dequantization in the integer domain. During inference, the input activation is first quantized into an INT8 representation using the input scaling factor. The INT8 linear operator is then invoked to perform matrix multiplication or multiply-accumulate operations on the INT8 activations and INT8 weights in the integer domain, producing an intermediate integer result. This intermediate result is rescaled by combining the input scaling factor, weight scaling factor, and output scaling factor. Depending on the requirements of the following module, the integer representation is either retained for further propagation or dequantized into a floating-point representation before being passed to the next calculation module. The above process is repeated layer by layer in the attention branch and the feedforward network branch of the Transformer model to complete the INT8 forward inference of the entire large language model.
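Steps 41 to 43 can be sketched end to end on a single normalization layer followed by one linear layer. This is a simplified NumPy illustration under stated assumptions: weights use the (input channels, output channels) layout, the coefficients `k` are random placeholders rather than the optimal coefficients, whole-tensor scales are taken from the data itself instead of a calibration set, and the result is dequantized directly instead of being requantized with an output scale.

```python
import numpy as np

def fold_smoothing(norm_weight, norm_bias, linear_weights, k):
    """Divide norm parameters by k and multiply the following weights by k.

    norm_bias is None for RMSNorm, which has no bias to adjust.
    """
    new_bias = None if norm_bias is None else norm_bias / k
    return norm_weight / k, new_bias, [w * k[:, None] for w in linear_weights]

def static_scale(tensor):
    """Whole-tensor scale that maps max |x| onto 127."""
    return float(np.abs(tensor).max()) / 127.0

def quantize_int8(tensor, scale):
    """Round to nearest and clamp to the symmetric INT8 range."""
    return np.clip(np.round(tensor / scale), -127, 127).astype(np.int8)

def int8_linear(x_q, w_q, x_scale, w_scale):
    """Integer matmul with int32 accumulation, then dequantize to float."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc * (x_scale * w_scale)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                    # normalized hidden states (tokens, C_in)
gamma, beta = rng.normal(size=16), rng.normal(size=16)
W = rng.normal(size=(16, 4)) * 0.1
k = rng.uniform(0.5, 4.0, size=16)              # placeholder smoothing coefficients

y_ref = (h * gamma + beta) @ W                  # original LayerNorm affine + linear
gamma_s, beta_s, (W_s,) = fold_smoothing(gamma, beta, [W], k)
x_smooth = h * gamma_s + beta_s                 # equals the original activation divided by k
y_smooth = x_smooth @ W_s

s_x, s_w = static_scale(x_smooth), static_scale(W_s)
x_q, w_q = quantize_int8(x_smooth, s_x), quantize_int8(W_s, s_w)
y_int8 = int8_linear(x_q, w_q, s_x, s_w)
max_err = float(np.abs(y_int8 - y_ref).max())
```

Folding the coefficients into the normalization layer means no extra division is executed at inference time; the only runtime cost is the quantize, integer matmul, and rescale path.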

[0034] To verify the effectiveness of the proposed smooth quantization method in low-bit quantization of large language models, the OPT-13B large language model was selected as the experimental object. Under the same test conditions, the unquantized FP16 model and the INT8 model quantized using the method of this invention were compared and tested. By comparing the task performance, single-sample inference latency, and model storage size of each model on multiple benchmark datasets, the beneficial effects of this invention in model compression, inference acceleration, and accuracy preservation were verified. To ensure the comparability and fairness of the experimental results, all models were tested under the same hardware and software environment and the same inference framework. Several representative datasets or tasks, such as LAMBADA, ARC-Easy, and ARC-Challenge, were selected in the experiment to evaluate the performance changes of the models before and after quantization from different perspectives. Simultaneously, the model storage size and single-sample inference latency of each model were recorded to comprehensively evaluate the application value of this invention in practical deployment. Furthermore, to further verify the advantages of this invention compared to existing quantization methods, typical quantization methods such as SmoothQuant were selected as comparative methods for extended verification. Evaluation dataset and task description: To fully verify the effectiveness of this invention, this experiment sets up evaluation tasks from three aspects: language modeling ability, context prediction ability, and downstream question-answering reasoning ability, as detailed below: 1. The LAMBADA dataset; the LAMBADA dataset is used to evaluate a model's ability to predict the last word of a sentence under long context conditions. During testing, the entire sentence is input into the model, and the last token is used as the prediction target. 
The prediction accuracy is then calculated. The evaluation metric for this task is accuracy. This dataset reflects the model's ability to understand and utilize semantic information from long contexts, and is used to verify the preservation of context modeling performance after quantization.

[0035] 2. ARC-Easy Dataset; ARC-Easy is a relatively easy multiple-choice task in the AI2 Reasoning Challenge, primarily testing the model's basic common sense understanding and simple reasoning ability. During testing, each question's candidate answers are scored, and the option with the highest score is selected as the result. The evaluation metric is accuracy. This dataset is used to verify the performance retention of the quantized model in basic downstream reasoning tasks.

[0036] 3. ARC-Challenge Dataset; ARC-Challenge is a more challenging subtask in ARC, requiring models to possess stronger knowledge integration and reasoning capabilities. In this experiment, accuracy is also used as the evaluation metric for this task, to test how well the quantized model retains performance on complex downstream reasoning tasks. Because this task is more sensitive to quantization errors, it better demonstrates the advantages of the method in this invention in terms of accuracy preservation.

[0037] In summary, LAMBADA is primarily used to evaluate the model's long-context prediction capabilities, while ARC-Easy and ARC-Challenge are mainly used to evaluate the model's downstream question-answering and reasoning capabilities. Together, these datasets and evaluation metrics allow the effectiveness of this invention in preserving model performance under low-bit quantization to be verified comprehensively.

[0038] Experimental procedure: In the specific experiment, the OPT-13B pre-trained model was first loaded as the baseline model and run in FP16 format to obtain the performance metrics of the unquantized model. Subsequently, based on the same model, the TOSQ method proposed in this invention was used for smoothing transformation and INT8 quantization to construct the quantized inference model. To ensure the fairness and consistency of the comparative experiment, the models before and after quantization were evaluated using the same input length settings, unified evaluation scripts, and consistent dataset partitioning, thereby eliminating the interference of non-quantization factors on the experimental results. At the same time, the quantization configuration (such as quantization granularity and calibration data scale) was unified to ensure the objectivity of the comparison results.

[0039] Quantizability analysis begins with an activation distribution statistical analysis of the model before quantization to guide the design of subsequent smoothing quantization strategies. Specifically, by registering forward hooks at each linear layer of the model, input activation tensors are collected layer by layer, and their statistical characteristics, including maximum value, mean, standard deviation, and quantiles, are calculated. Cross-sample aggregation statistics are performed based on a calibration dataset (512 randomly selected samples) to construct a global activation scale distribution. Furthermore, by analyzing the activation distribution characteristics of each layer and channel, outliers are detected, and key channels with significant outliers are identified. Experimental results show that the quantization error in the model is mainly dominated by high-amplitude outliers in a few channels, exhibiting a clear channel-level concentration characteristic. Therefore, it is necessary to introduce a channel-level smoothing mechanism during quantization to mitigate the impact of outliers on quantization accuracy.
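The statistics and outlier detection described in this paragraph can be sketched as follows. The example works on arrays that are assumed to have already been captured (for instance by forward hooks), computes the statistics on absolute values, and flags a channel when its maximum exceeds a multiple of the median channel maximum; the threshold `ratio=10.0` and the function names are hypothetical choices for illustration.

```python
import numpy as np

def channel_statistics(activations):
    """Aggregate per-channel statistics of |x| over a list of captured activations."""
    flat = np.abs(np.concatenate(
        [a.reshape(-1, a.shape[-1]) for a in activations], axis=0))
    return {
        "max": flat.max(axis=0),
        "mean": flat.mean(axis=0),
        "std": flat.std(axis=0),
        "p99": np.quantile(flat, 0.99, axis=0),
    }

def outlier_channels(stats, ratio=10.0):
    """Indices of channels whose max exceeds `ratio` times the median channel max."""
    channel_max = stats["max"]
    return np.flatnonzero(channel_max > ratio * np.median(channel_max))

rng = np.random.default_rng(0)
samples = [rng.normal(size=(16, 8)) for _ in range(2)]   # two toy calibration samples
for s in samples:
    s[:, 3] *= 100.0                                     # channel 3 carries the outliers

stats = channel_statistics(samples)
flagged = outlier_channels(stats)
```

Channels flagged this way are the ones whose range dominates a whole-tensor quantization scale, which is the motivation for channel-level smoothing.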

[0040] After obtaining activation scale information, the TOSQ smoothing quantization method proposed in this invention is used to quantize the model. Based on the maximum activation value of each channel, a diagonal smoothing matrix is constructed to collaboratively redistribute the activations and weights, making their numerical range more balanced. Then, a collaborative transformation of the activation and weight matrices is performed to reduce extreme values in the activations and improve quantization stability. For layers with significantly abnormal maximum activation values (such as some MLP layers), a stronger smoothing coefficient or selective protection (such as skipping quantization or increasing the bit width) can be used. Finally, INT8 quantization is performed on the smoothed activations and weights to construct the quantized inference model.

[0041] Multi-task evaluation: After model quantization, the model is systematically evaluated from three dimensions: language modeling ability, downstream inference ability, and inference efficiency. On the LAMBADA dataset, the complete context is input into the model, with the last word of the sentence as the prediction target. Prediction accuracy is calculated, and the average inference latency per sample is recorded based on the CUDA event timing mechanism. On the ARC-Easy and ARC-Challenge datasets, a unified evaluation framework is used to score the candidate answers for each multiple-choice question, and the option with the highest score is selected as the prediction result. The model's accuracy and corresponding inference latency on each task are recorded. In addition, throughout the experiment, the model parameter size and related buffer usage are simultaneously calculated to determine the overall storage size of the model, thereby comprehensively evaluating the performance of the quantization method in terms of storage compression and deployment efficiency.
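The multiple-choice scoring rule used for ARC-Easy and ARC-Challenge can be sketched in a few lines. The candidate scores below are made-up numbers standing in for model scores; latency timing is omitted because the CUDA event mechanism mentioned above requires a GPU runtime.

```python
def predict_option(scores):
    """Index of the highest-scoring candidate answer."""
    return max(range(len(scores)), key=scores.__getitem__)

def multiple_choice_accuracy(option_scores, gold_indices):
    """Accuracy in percent over a list of questions."""
    correct = sum(
        predict_option(scores) == gold
        for scores, gold in zip(option_scores, gold_indices)
    )
    return 100.0 * correct / len(gold_indices)

# Three toy questions with three candidates each (illustrative scores only).
toy_scores = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.3, 0.3, 0.4]]
toy_gold = [1, 0, 0]
toy_accuracy = multiple_choice_accuracy(toy_scores, toy_gold)
```

Using the same scoring function for the FP16 and INT8 models keeps the comparison limited to the effect of quantization.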

[0042] The following indicators were used to evaluate the model in this experiment: Model size: This metric characterizes the storage requirements of the model, measured in MB. It primarily reflects the effectiveness of quantized model storage compression. A smaller model size indicates a more suitable model for practical deployment, especially for resource-constrained devices or multi-instance inference scenarios.
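A back-of-the-envelope sketch of the model size metric is shown below. It assumes the size is dominated by parameter storage at 2 bytes per FP16 value and 1 byte per INT8 value, with an optional buffer term; the parameter count is a hypothetical round number, not a measured value.

```python
BYTES_PER_PARAM = {"fp16": 2, "int8": 1}

def model_size_mb(num_params, dtype, buffer_bytes=0):
    """Storage size in MB for parameters of one dtype plus optional buffers."""
    return (num_params * BYTES_PER_PARAM[dtype] + buffer_bytes) / (1024 ** 2)

n_params = 1_000_000   # hypothetical parameter count
size_ratio = model_size_mb(n_params, "int8") / model_size_mb(n_params, "fp16")
```

When every parameter is converted and buffers are negligible, the ratio is one half; layers kept in higher precision or additional scale buffers would push it slightly above that.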

[0043] Accuracy: Characterizes the model's performance on tasks such as LAMBADA, ARC-Easy, and ARC-Challenge, expressed as a percentage. In LAMBADA, accuracy represents the proportion of samples where the model correctly predicts the last target word; in ARC-Easy and ARC-Challenge, accuracy represents the model's correct answer rate on multiple-choice tasks. This metric is used to evaluate the ability of this invention to preserve model task performance under low-bit quantization conditions.

[0044] Per-sample latency: This metric characterizes model inference efficiency, measured in milliseconds. It reflects the time required for the model to complete inference for a single sample. A lower value indicates a faster model response, making it more suitable for online inference and low-latency deployment scenarios.

[0045] Among them, model size is used to evaluate the effect of the present invention on model compression, accuracy is used to evaluate the accuracy retention capability of the present invention under low bit quantization conditions, and single-sample inference latency is used to evaluate the improvement effect of the present invention on inference response speed in actual deployment.

[0046] Results Analysis and Benefits: The experimental results show that the method of the present invention has good effects in terms of model compression, inference efficiency and performance preservation.

[0047] Firstly, regarding model compression, this invention can effectively convert the original FP16 large language model into an INT8 model, significantly reducing the model storage size. This indicates that this invention can effectively reduce the storage pressure when deploying large models and improve the deployability of models in edge devices, resource-constrained servers, or multi-path concurrent inference scenarios.

[0048] Secondly, regarding inference efficiency, the quantized model exhibits lower single-sample inference latency across multiple evaluation tasks, demonstrating that this invention effectively reduces computational overhead during model forward propagation and improves model response speed and throughput. This has significant practical implications for applications with high real-time requirements, such as online question answering, intelligent interaction, and text generation.

[0049] Furthermore, regarding performance preservation, the quantized model of this invention maintains high accuracy on tasks such as LAMBADA, ARC-Easy, and ARC-Challenge, indicating that this invention can significantly compress the model while effectively preserving its original language modeling capabilities and downstream task performance. Particularly noteworthy is that on some tasks, the quantized model even outperforms the unquantized baseline model, demonstrating that the smooth quantization method proposed in this invention can, to some extent, improve the quantization error distribution and enhance the model's robustness.

[0050] Further analysis from the perspective of the quantization error mechanism shows that this invention introduces a channel-level smoothing matrix to transfer the extreme values in the activation tensor to the weight side, thereby dispersing and balancing the quantization error that was originally concentrated in a few abnormal channels. Since quantization error is positively correlated with the maximum value of the tensor, this strategy reduces the dynamic range of the activation distribution and thus the overall quantization error. This mechanism explains why this invention can maintain or even improve model performance on complex reasoning tasks such as ARC-Challenge, which are highly sensitive to numerical accuracy. The experimental results are shown in Table 1.

[0051] Table 1. Test results on the impact of quantization on model performance.

The above experimental examples and comparative examples demonstrate that the present invention has at least the following beneficial effects: it can effectively quantize large language models from FP16 to INT8, reducing the model storage size to about 50% of the original model, significantly improving storage efficiency while maintaining high accuracy; it can significantly reduce single-sample inference latency and improve the model's real-time response capability; in some tasks, the performance of the quantized model of the present invention is better than that of the unquantized baseline model, indicating that the present invention is more effective in suppressing quantization errors; and it achieves a better performance trade-off between model compression, inference efficiency, and performance preservation.

[0052] The key to this invention lies in establishing an analytical model of the quantization error after the smoothing transformation, starting from the mathematical mechanism of quantization error, and solving for the smoothing function that minimizes the quantization error by analyzing the mean and variance of the error. To address the problem that activation outliers easily increase the quantization error during neural network quantization, this invention represents the smoothed output error as a combination of the activation quantization error, the weight quantization error, and their coupling error term, and transforms the solution of the optimal smoothing matrix into a problem of minimizing error statistics. Compared with existing smoothing methods that rely on empirically set parameters, the smoothing strategy adopted in this invention has a clearer theoretical basis and better interpretability.

[0053] 1. A smoothing transformation model is constructed based on minimizing the quantization output error; this invention does not set the smoothing parameters empirically. Instead, starting from the matrix multiplication of a linear layer, it introduces an invertible smoothing matrix and transforms the original quantization targets, the activation matrix and the weight matrix, into a joint quantization problem over the smoothed activations and the smoothed weights. Furthermore, by minimizing the error between the quantized and dequantized output and the original floating-point output, an output-error-oriented smoothing quantization framework is established. This framework characterizes the mechanism of the smoothing transformation from the perspective of the model's final calculation result, so that the design of the smoothing parameters is no longer limited to empirical adjustment but rests on a clear mathematical objective.

[0054] 2. An analytical expression for quantization error is established, and the error is decomposed into activation error term, weight error term, and coupling error term. Within the smooth quantization framework, this invention analytically models the quantization error, representing the error between the quantized output and the original floating-point output as a combination of activation quantization error term, weight quantization error term, and their coupling error term. Through this error decomposition, the combined influence mechanism of activation outliers, weight distribution, and smoothing transformation on the final quantization result can be more clearly characterized. This error modeling method not only provides a foundation for subsequent theoretical derivations but also allows smoothing parameter optimization to directly focus on the error source, thereby improving the method's relevance and effectiveness.

[0055] 3. Solving for the optimal smoothing function based on mean and variance analysis of quantization error: Building upon the analytical expression of error, this invention further analyzes the statistical characteristics of quantization error. By calculating or estimating the mean and variance of the error, it studies the influence of smoothing transformation parameters on the distribution of quantization error. Furthermore, using the minimization of error variance or the overall minimization of error statistics as the optimization objective, it solves for the smoothing function or smoothing parameter that minimizes the quantization error. This technical approach provides a rigorous theoretical basis for determining the smoothing parameter, statistically reduces quantization error, and effectively alleviates the instability problems caused by relying solely on empirical formulas or manual parameter tuning in existing technologies.

[0056] 4. The error minimization problem is transformed into an extreme value optimization problem related to the activation range and weight range. To further improve the computability of solving the smoothing parameters, this invention transforms the quantization error statistic minimization problem into an extreme value optimization problem related to the activation channel value range and the weight channel value range. Through structural analysis of the objective function, the determination criteria and solution rules for the optimal smoothing parameters are obtained. In other words, this invention further transforms the originally complex error optimization problem into an analyzable and solvable mathematical optimization problem, giving the process of calculating the smoothing coefficients a clear analytical basis.

[0057] This invention derives the optimal smoothing factor through theoretical analysis, enabling adaptive adjustment of quantization parameters for different model characteristics, thereby significantly improving computational efficiency while maintaining accuracy. Unlike existing methods that rely on empirical selection of smoothing factors, the smoothing factor of this invention is derived through in-depth theoretical analysis, reducing the impact of outliers on quantization errors without increasing additional computational overhead, thus ensuring higher inference speed and accuracy.

[0058] The quantization method provided by this invention avoids the problem of reduced inference speed in existing methods and can maintain high computational efficiency during large-scale model inference. By optimizing the computation process in the outlier handling process, redundant calculations and complex operations are reduced, thereby improving the overall efficiency of the quantization process. Unlike existing methods that are not effective when handling specific models, this invention can adapt to the structure of different deep learning models, ensuring a balance between quantization accuracy and computational efficiency. The optimal smoothing factor obtained through theoretical analysis can ensure that high accuracy is maintained even when handling a large number of outliers, especially with near-lossless quantization accuracy in 8-bit quantization.

[0059] This invention can maintain model accuracy well under low bit quantization conditions. Compared with traditional smooth quantization methods, this invention handles activation distribution more reasonably, mitigating the adverse effects of outliers on quantization scale selection, thereby improving the task performance of the quantized model without significantly increasing quantization complexity. Overall, the advantages of this invention are mainly reflected in the following aspects: strong feasibility and ease of implementation; this invention can be directly applied to existing large language model quantization processes without relying on significant modifications to the original model structure, possessing a clear implementation path and strong engineering operability. Therefore, this invention can complete quantization deployment while maintaining the original model's main structure and inference process essentially unchanged, demonstrating good practical application feasibility; good compatibility and deployment friendliness; this invention is compatible with existing smooth quantization frameworks and can be directly improved based on methods like SmoothQuant, facilitating integration into existing inference deployment processes. Meanwhile, this invention requires no additional complex hardware support or changes to the original model's overall architecture, thus enabling easy implementation on existing inference platforms and quantization toolchains, demonstrating good engineering adaptability and promotional value. It exhibits strong robustness; addressing the issue of outliers in the activation matrix, this invention effectively suppresses the interference of outliers on the selection of quantization scale, thereby enhancing the quantization process's adaptability to activation distribution fluctuations and improving the stability of the quantized model under different tasks and data distributions. 
Compared with traditional smoothing quantization methods, this invention handles activation outliers in a more targeted way, achieving more stable quantization results while maintaining quantization efficiency.

Strong accuracy preservation: after the OPT-13B model is quantized from FP16 to INT8, this invention still maintains model accuracy well. On the LAMBADA dataset, the accuracy of the unquantized FP16 model was 80.6942%, while the accuracy after quantization with this invention was 79.8521%, a decrease of only 0.8421 percentage points. On the ARC-E dataset, the accuracy decreased from 61.8266% to 60.1880%, a decrease of only 1.6386 percentage points. On the ARC-C dataset, the accuracy after quantization with this invention reached 36.4334%, which is higher than both the 35.0683% of traditional SmoothQuant and the 35.6655% of the unquantized FP16 model, an improvement of 0.7679 percentage points over the latter. This shows that this invention effectively controls accuracy loss under low-bit quantization and, on some tasks, outperforms both the unquantized model and traditional quantization methods.

Low inference latency and fast response: while maintaining high accuracy, this invention also significantly reduces model inference latency and improves the model's real-time response capability. On the LAMBADA dataset, the single-sample inference latency of the unquantized FP16 model was 82.5500 ms; after quantization with this invention it fell to 69.5780 ms, a reduction of 12.9720 ms, or approximately 15.71%. On the ARC-C dataset, the latency decreased from 24.4662 ms to 21.2544 ms, a reduction of 3.2118 ms, or approximately 13.13%. On the ARC-E dataset, the latency decreased from 19.6424 ms to 17.0532 ms, a reduction of 2.5892 ms, or approximately 13.18%. These results show that this invention improves computational efficiency after quantization, yields a clear speed-up on multiple tasks, and is suitable for application scenarios with high requirements for inference speed and real-time response.
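The reported differences can be re-derived from the quoted figures. The following is a minimal sketch that only redoes the arithmetic on the numbers stated above; the dictionary layout and function names are illustrative choices, not part of the invention.

```python
# Figures quoted in the text (OPT-13B, FP16 baseline vs. INT8 with this method).
accuracy = {  # dataset: (FP16 accuracy in %, INT8 accuracy in %)
    "LAMBADA": (80.6942, 79.8521),
    "ARC-E": (61.8266, 60.1880),
    "ARC-C": (35.6655, 36.4334),
}
latency_ms = {  # dataset: (FP16 latency in ms, INT8 latency in ms)
    "LAMBADA": (82.5500, 69.5780),
    "ARC-C": (24.4662, 21.2544),
    "ARC-E": (19.6424, 17.0532),
}


def accuracy_change(fp16, int8):
    """Change in percentage points (negative means accuracy dropped)."""
    return round(int8 - fp16, 4)


def latency_reduction(fp16, int8):
    """Absolute reduction in ms and relative reduction in percent."""
    saved = fp16 - int8
    return round(saved, 4), round(100.0 * saved / fp16, 2)


accuracy_deltas = {name: accuracy_change(*pair) for name, pair in accuracy.items()}
latency_deltas = {name: latency_reduction(*pair) for name, pair in latency_ms.items()}
```

Relative latency reductions are computed against the FP16 baseline, which is the convention the quoted percentages follow.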

[0060] The technical solution provided by this invention includes obtaining a calibration dataset and an evaluation dataset; processing the large language model to be quantized and statistically analyzing the model parameter scale; designing a smoothing quantization strategy to obtain the optimal smoothing coefficient corresponding to each input channel; jointly reparameterizing, statistically analyzing the static quantization scale, and constructing an INT8 inference model. This method reduces quantization error while reducing computational overhead and improves quantization performance on different hardware platforms and in different task scenarios.

[0061] The various steps in the embodiments of the present invention can be performed by an electronic device. This electronic device includes, but is not limited to, tablet computers, portable PCs, and desktop computers.

[0062] This invention provides a computer-readable storage medium including a stored program, wherein, when the program is running, it controls the electronic device containing the computer-readable storage medium to execute the above-described embodiment of the large language model quantization method based on the theoretically optimal smoothing function.

[0063] Figure 5 is a schematic diagram of an electronic device provided in an embodiment of the present invention. As shown in Figure 5, the electronic device 21 includes a processor 211, a memory 212, and a computer program 213 stored in the memory 212 and executable on the processor 211. When executed by the processor 211, the computer program 213 implements the large language model quantization method based on the theoretically optimal smoothing function described in the embodiments. To avoid repetition, the details are not described again here.

[0064] The electronic device 21 includes, but is not limited to, the processor 211 and the memory 212. Those skilled in the art will understand that Figure 5 is merely an example of the electronic device 21 and does not constitute a limitation on it. The device may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device may also include input/output devices, network access devices, buses, etc.

[0065] The processor 211 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0066] The memory 212 can be an internal storage unit of the electronic device 21, such as a hard disk or RAM of the electronic device 21. The memory 212 can also be an external storage device of the electronic device 21, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or flash card equipped on the electronic device 21. Furthermore, the memory 212 can include both internal and external storage units of the electronic device 21. The memory 212 is used to store the computer program and other programs and data required by the electronic device 21. The memory 212 can also be used to temporarily store data that has been output or will be output.

[0067] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments and are not repeated here.

[0068] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A large language model quantization method based on a theoretically optimal smoothing function, characterized in that the method includes: Step 1: obtain a calibration dataset and an evaluation dataset; Step 2: based on Step 1, process the large language model to be quantized and compute the model parameter scale statistics; Step 3: based on Step 2, design a smoothing quantization strategy to obtain the optimal smoothing coefficient for each input channel; Step 4: based on Step 3, perform joint reparameterization, compute the static quantization scales, and construct the INT8 inference model.

2. The method of claim 1, wherein Step 1 includes:
Step 11: Calibration dataset: the calibration dataset is used to collect the activation information of each layer during forward propagation of the model and to compute the scale statistics of the model parameters for quantization;
a. Data selection: the Pile validation set is used as calibration data, and 512 text samples are randomly selected to balance statistical sufficiency and computational efficiency;
b. Data preprocessing: to match the model input, all text is encoded into token ID sequences by a tokenizer; all sequences are processed to a fixed length of 512, where sequences longer than 512 are truncated and sequences shorter than 512 are padded, yielding the model input tensor;
c. Forward propagation sampling: the calibration data is fed into the model and forward propagation is performed to collect the activation information of each layer;
Step 12: Evaluation datasets: to comprehensively evaluate the performance of the quantized model in language modeling, knowledge reasoning, and downstream tasks, benchmark datasets are selected for evaluation, including the LAMBADA dataset and the ARC-Easy and ARC-Challenge datasets. The LAMBADA dataset is used to evaluate long-range contextual dependency and word prediction; the key metric is last-word prediction accuracy; each sample is evaluated to determine whether the model correctly predicts the last word of the sentence. The ARC-Easy and ARC-Challenge datasets are used to evaluate scientific knowledge question answering and reasoning; the key metric is multiple-choice accuracy; standardized evaluation is carried out through a unified evaluation framework to ensure the comparability of results.
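The selection and fixed-length preprocessing of Step 11 can be sketched as follows. This is an illustrative sketch only: it operates on already tokenized sequences (lists of token IDs), and the padding ID `PAD_ID = 0`, right-side padding, the random seed, and all function names are assumptions, since the claim does not specify them.

```python
import random

MAX_LEN = 512      # fixed sequence length stated in the claim
NUM_SAMPLES = 512  # number of calibration samples stated in the claim
PAD_ID = 0         # assumption: the real padding ID depends on the tokenizer


def select_samples(corpus, num_samples=NUM_SAMPLES, seed=0):
    """Randomly pick calibration samples without replacement."""
    rng = random.Random(seed)
    return rng.sample(corpus, min(num_samples, len(corpus)))


def to_fixed_length(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate sequences longer than max_len, right-pad shorter ones."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def build_calibration_batch(tokenized_corpus, num_samples=NUM_SAMPLES, seed=0):
    """Return a list of equal-length token ID rows ready to become an input tensor."""
    chosen = select_samples(tokenized_corpus, num_samples, seed)
    return [to_fixed_length(ids) for ids in chosen]
```

In a real pipeline an attention mask would normally accompany the padded rows so that padding tokens do not contribute to the activation statistics; that detail is left out here.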

3. The method of claim 2, wherein Step 2 includes: first, the structure of the large language model to be quantized is analyzed to extract its linear layers; then, based on the calibration data, the input activations of each linear layer are collected to obtain the activation scale required for the subsequent smoothing operation, and the outlier characteristics of the activation distribution are analyzed.
Step 21: Process the large language model and represent it by its linear layers; a large language model with a Transformer architecture is selected as the model to be quantized; the model is structurally analyzed to extract its linear layer structure, and each linear layer is used as the basic object for subsequent smoothing quantization; all nn.Linear layers in the large language model are traversed, and a forward hook is registered for each linear layer; when the large language model performs forward propagation, the input activation tensor of the linear layer is automatically intercepted by the hook function;
Step 22: Activation scale statistics;
Activation tensor reshaping: the input activation tensor of each linear layer usually has the shape (B, L, C); it is reshaped into a two-dimensional activation matrix (B×L, C), so that all tokens are treated as samples and statistics are computed along the channel dimension, where B represents the batch size, L represents the sequence length, and C represents the channel dimension;
Channel scale calculation: for the reshaped activation matrix, the maximum absolute value is computed for each entry of the last dimension (the hidden or channel dimension), forming the activation scale of each input channel;
Cross-sample accumulation strategy: for the same linear layer, the above statistics are repeated over multiple input samples with a channel-wise maximum update strategy: if the channel scale of the current sample is greater than the historical scale, it is updated; otherwise, the original scale is kept unchanged; the global channel maximum of the current linear layer over all samples is finally obtained and used as the activation scale;
Activation scale storage: the statistics are stored in an activation scale dictionary that maps layer names to scale vectors;
Step 23: Weight scale statistics; for each group of linear layers participating in smoothing, the maximum absolute value of the weight tensor is first computed along the input channel dimension to obtain the channel-wise weight scale; for multiple linear layers in the same group, a channel-wise maximum aggregation strategy is used to obtain a unified weight scale vector.
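The scale statistics of Steps 22 and 23 can be sketched with NumPy as below. This is a sketch under stated assumptions: plain arrays stand in for the tensors that the forward hooks would intercept, weights are laid out as (C_in, C_out), and the function names are hypothetical.

```python
import numpy as np


def update_activation_scale(scales, layer_name, activation):
    """Fold one sample's (B, L, C) activation into the running per-channel maximum."""
    flat = activation.reshape(-1, activation.shape[-1])  # (B*L, C): every token is a sample
    current = np.abs(flat).max(axis=0)                   # per-channel max |x|, shape (C,)
    if layer_name in scales:
        # Channel-wise maximum update: keep the larger of history and current sample.
        current = np.maximum(scales[layer_name], current)
    scales[layer_name] = current
    return scales


def group_weight_scale(weights):
    """Unified per-input-channel weight scale for linear layers sharing one input.

    Each weight is assumed to have shape (C_in, C_out).
    """
    per_layer = [np.abs(w).max(axis=1) for w in weights]  # each of shape (C_in,)
    return np.maximum.reduce(per_layer)                    # channel-wise max over the group
```

With PyTorch's `nn.Linear`, whose weight is stored as (out_features, in_features), the reduction axis for the weight scale would be the other one.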

4. The method according to claim 3, characterized in that Step 3 includes: the TOSQ smoothing quantization strategy is designed with the goal of minimizing quantization error; the mean and variance of the quantization error are analyzed theoretically to derive the optimal smoothing coefficient for each input channel; on this basis, a channel-level invertible smoothing matrix is constructed to achieve adaptive, coordinated redistribution of activation and weight magnitudes along the channel dimension.
Step 31: Equivalent linear transformation and channel-level smoothing matrix construction;
For any linear layer in the large language model to be quantized, let its input activation matrix be $X$, its weight matrix be $W$, and the corresponding original output matrix be $Y = XW$, where $T$ represents the number of tokens, $C_{in}$ the number of input channels, and $C_{out}$ the number of output channels. The activation matrix $X$ and the weight matrix $W$ are represented respectively as: $X = [x_{t,j}] \in \mathbb{R}^{T \times C_{in}}$; $W = [w_{j,o}] \in \mathbb{R}^{C_{in} \times C_{out}}$;
An invertible smoothing matrix $S$ is introduced into the matrix multiplication. To reduce the complexity of matrix inversion and of deployment, $S$ is set as a diagonal matrix: $S = \mathrm{diag}(k_1, k_2, \dots, k_{C_{in}})$; where $k_j$ denotes the smoothing coefficient of the $j$-th input channel, and $k_j > 0$. The original output matrix is therefore equivalently rewritten as: $Y = XW = (XS^{-1})(SW)$;
Based on the above transformation, the objects to be quantized in the original linear layer change from $X$ and $W$ to: $\hat{X} = XS^{-1}$, $\hat{W} = SW$; that is, the activation values of the $j$-th input channel are scaled to $x_{t,j}/k_j$, and the corresponding weight row is scaled to $k_j w_{j,o}$. For any element of the output matrix: $y_{t,o} = \sum_{j=1}^{C_{in}} \frac{x_{t,j}}{k_j} \cdot k_j w_{j,o} = \sum_{j=1}^{C_{in}} x_{t,j} w_{j,o}$;
Step 32: Establish the optimization objective and solve for the smoothing coefficient;
I. Error analysis and expected variance calculation: asymmetric quantization is used for the smoothed activation, and symmetric quantization is used for the smoothed weights;
For the activation matrix $\hat{X}$, let its maximum and minimum values be $x_{\max}$ and $x_{\min}$ respectively; under $b$-bit quantization, the quantization and dequantization of the activation matrix are represented as: [formula missing]; the activation quantization and dequantization error is obtained as: [formula missing]; with round-to-nearest rounding, the error is a periodic function that lies within the range $[-\Delta_x/2, \Delta_x/2]$, and its magnitude is related to the activation quantization scale $\Delta_x$;
For the weight matrix $\hat{W}$, let its maximum absolute value be $w_{\max}$; under $b$-bit signed symmetric quantization, the quantization and dequantization of the weight matrix are represented as: [formula missing]; the weight quantization and dequantization error is obtained as: [formula missing]; with round-to-nearest rounding, the error is a periodic function that lies within the range $[-\Delta_w/2, \Delta_w/2]$, and its magnitude is related to the weight quantization scale $\Delta_w$;
For a single multiplication term, the floating-point result is $x \cdot w$ and the result after quantization and dequantization is $(x + \epsilon_x)(w + \epsilon_w)$, so the quantization error of a single multiplication term is expressed as: $e = (x + \epsilon_x)(w + \epsilon_w) - xw = w\,\epsilon_x + x\,\epsilon_w + \epsilon_x \epsilon_w$; the quantization error thus consists of three parts: the activation quantization error weighted by the weight, the weight quantization error amplified by the activation, and the coupling term formed by the two errors together.
Assuming that the activation matrix $X$ and the weight matrix $W$ are continuous and uniformly distributed, the expected value and variance of the error are derived as follows: [formulas missing]; integrating over $w$ and $x$ respectively yields the variance of the error, with the derivation as follows: [formulas missing]; in the above variance expression, assuming that [term missing] is 0, the variance simplifies to: [formula missing];
II. Analysis of the minimum error variance condition: the influence of the diagonal matrix $S$ on the expectation and variance of the error is analyzed; minimizing the variance is equivalent to minimizing [expression missing], and the objective function is expressed as: [formula missing]; define $A(k)$ and $B(k)$ as follows, so that the objective becomes minimizing $A(k)B(k)$: [formulas missing]; let [formula missing]; it is then proven that, when the variance is minimized, [condition missing];
Step 33: Solve for the smoothing coefficient;
Following Nagel's definition of the representation precision of each channel: $p_i = r_i / R$, where $r_i$ represents the value range of channel $i$ of the matrix and $R$ represents the total value range of the matrix; the function to be solved is therefore: [formula missing]; the other parameters are defined as follows: [formulas missing]; solving this function is equivalent to solving: [formula missing]; since [condition missing] holds when the error variance is minimized, the solution is obtained: [formula missing].
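The equivalent transformation of Step 31 and its effect on quantization error can be checked numerically. The sketch below rests on several assumptions that are not taken from the claim: the diagonal smoothing matrix is written as S = diag(k), with activations divided by k and weight rows multiplied by k; the quantizers are standard min-max fake quantizers; and, because the closed-form optimal coefficient is not reproduced in the text, the coefficient used is the SmoothQuant-style stand-in `k_j = sqrt(max|x_j| / max|w_j|)`, not the patent's solution.

```python
import numpy as np


def quantize_asymmetric(x, bits=8):
    """Min-max asymmetric fake quantization with round-to-nearest (assumed form)."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo


def quantize_symmetric(w, bits=8):
    """Signed symmetric fake quantization with round-to-nearest (assumed form)."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale


def smooth(x, w, k):
    """Return (X S^-1, S W) for S = diag(k)."""
    return x / k, w * k[:, None]


def output_error(x, w):
    """Mean squared error of the quantized product against the float product."""
    approx = quantize_asymmetric(x) @ quantize_symmetric(w)
    return float(np.mean((approx - x @ w) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))   # T = 64 tokens, C_in = 16
x[:, 3] *= 50.0                 # one synthetic outlier input channel
w = rng.normal(size=(16, 32))   # C_in = 16, C_out = 32

act_scale = np.abs(x).max(axis=0)
weight_scale = np.abs(w).max(axis=1)
k = np.sqrt(act_scale / weight_scale)  # stand-in coefficient (assumption)

x_s, w_s = smooth(x, w, k)
err_plain = output_error(x, w)
err_smooth = output_error(x_s, w_s)
```

The product `x_s @ w_s` matches `x @ w` up to floating-point rounding, which is the equivalence the step relies on. On this synthetic data, with a single strong outlier channel, the smoothed pair is expected to give the smaller output error, because the outlier no longer dictates the activation quantization step for every other channel.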

5. The method according to claim 4, characterized in that Step 4 includes: joint reparameterization is performed on the original floating-point model so that smoothing is completed at the whole-model level; then, based on the calibration data, the static quantization scale required by each linear layer during INT8 inference is computed, and the corresponding INT8 inference model is constructed.
Step 41: Joint reparameterization; after the smoothing coefficient of each input channel is obtained, joint reparameterization is performed on each normalization layer and its subsequent linear layers to fold the equivalent transformation explicitly into the model parameters: the normalization layer parameters are divided by the smoothing coefficient, and the weights of the subsequent linear layers are multiplied by the same smoothing coefficient; for LayerNorm, both its weight and bias are adjusted; for RMSNorm, only its weight is adjusted; all decoder layers of the model to be quantized are then traversed and the above operations are applied to each layer in turn, completing the smoothing reparameterization of the entire model.
Step 42: Static quantization scale statistics; after the overall model smoothing is completed, the static quantization scales required for INT8 inference are computed from the calibration data and used for subsequent INT8 model construction and integer inference. To meet deployment requirements, the maximum absolute value of each key tensor is computed at per-tensor granularity to construct scaling factors suitable for static quantization inference. Taking the representation range of signed INT8 integers as $[-127, 127]$, the maximum absolute value of each tensor is mapped to the integer range $[-127, 127]$, and the corresponding quantization scaling factor scale is computed accordingly; for the current tensor, the quantization scaling factor is expressed as: $\mathrm{scale} = \frac{x_{\max}}{127}$; where $x_{\max}$ represents the global maximum absolute value of the current tensor on the calibration data. In actual quantization, the floating-point tensor $x$ is converted to the INT8 tensor $x_{int8}$ according to the scaling factor, satisfying: $x_{int8} = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{\mathrm{scale}}\right), -127, 127\right)$; where $\mathrm{round}(\cdot)$ denotes the rounding operation and $\mathrm{clip}(\cdot)$ denotes truncation to the INT8 range.
Step 43: INT8 model construction and inference execution; after the smoothed floating-point model and the static quantization scale of each layer are obtained, the corresponding INT8 inference model is constructed and forward inference is performed on it. The linear layers in the original floating-point model are replaced with the corresponding INT8 linear computation modules, and the previously computed input scale, output scale, and weight quantization parameters are written into each module for quantization, multiply-accumulate, rescaling, and dequantization in the integer domain. During inference, the input activation is first quantized into an INT8 representation using the input scaling factor. The INT8 linear operator is then invoked to perform matrix multiplication or multiply-add operations on the INT8 activation and INT8 weights in the integer domain, producing an intermediate integer result. The intermediate integer result is then rescaled by combining the input scaling factor, weight scaling factor, and output scaling factor. Depending on the requirements of the subsequent modules, the integer representation is either retained for further propagation or dequantized into a floating-point representation before being fed into the next computation module. This process is repeated layer by layer in the attention branch and the feedforward network branch of the Transformer model to complete the INT8 forward inference of the entire large language model.
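The three steps of claim 5 can be sketched with NumPy for a single LayerNorm followed by one linear layer. This is a sketch under stated assumptions: weights are laid out as (C_in, C_out), the normalized hidden states are given directly rather than computed by a normalization layer, the output is dequantized to floating point rather than requantized with an output scale, and all function names and the example coefficients are hypothetical.

```python
import numpy as np


def fold_smoothing(ln_weight, ln_bias, linear_weights, k):
    """Divide LayerNorm weight and bias by k; multiply the input rows of each
    following linear weight (shape (C_in, C_out)) by the same k."""
    folded = [w * k[:, None] for w in linear_weights]
    return ln_weight / k, ln_bias / k, folded


def static_scale(samples):
    """Per-tensor scale: global max |x| over the calibration samples, mapped to 127."""
    return max(float(np.abs(s).max()) for s in samples) / 127.0


def quantize_int8(x, scale):
    """Round to nearest and clip to the symmetric range [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def int8_linear(x, w_q, in_scale, w_scale):
    """Quantize the input, multiply in the integer domain with int32
    accumulation, then dequantize the result to floating point."""
    x_q = quantize_int8(x, in_scale)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * (in_scale * w_scale)


rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))              # normalized hidden states (before the affine step)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
w = rng.normal(size=(4, 6))
k = np.array([0.5, 2.0, 1.0, 4.0])       # example smoothing coefficients (hypothetical)

g2, b2, (w2,) = fold_smoothing(gamma, beta, [w], k)
y_ref = (h * gamma + beta) @ w           # original floating-point output
y_folded = (h * g2 + b2) @ w2            # output after reparameterization

x_s = h * g2 + b2                        # smoothed activation entering the linear layer
in_scale = static_scale([x_s])
w_scale = float(np.abs(w2).max()) / 127.0
y_int8 = int8_linear(x_s, quantize_int8(w2, w_scale), in_scale, w_scale)
```

`y_folded` agrees with `y_ref` up to floating-point rounding, so the reparameterization leaves the floating-point model unchanged, while `y_int8` differs from it only by the rounding error of the two INT8 quantizers. For RMSNorm the same folding applies without the bias term.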

6. The method according to claim 4, characterized in that the proof that [condition missing] holds when the variance is minimized includes: assume there exists an optimal solution $k$ such that [condition missing]; consider a local modification of $k$: only the $q$-th component is changed, with $k_q$ decreased by a very small positive number $\epsilon$ and the other components left unchanged, giving a new vector $k'$: $k' = (k_1, \dots, k_q - \epsilon, \dots, k_{C_{in}})$; if $\epsilon$ is taken arbitrarily small, the following two points are guaranteed to hold: ① the index that attains the maximum in the first term is unchanged, that is, $A(k')$ is still determined by the original $p$-th term; ② the second term $B(k)$ becomes smaller, because $k_q$, the component at the index that attains its maximum, is reduced, which reduces [term missing] and thereby gives $B(k') < B(k)$; for the new $k'$, we have: [formula missing], then: $A(k')B(k') < A(k)B(k)$; this contradicts the assumption that $k$ is the optimal solution, so the assumption is invalid, and it is necessary that [condition missing].
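The local-modification argument can be illustrated numerically. The definitions of A(k) and B(k) are not reproduced in the text, so this sketch assumes max-type forms, `A(k) = max_j a_j / k_j` and `B(k) = max_j k_j * w_j`, where `a_j` and `w_j` are hypothetical per-channel activation and weight ranges. Under that assumption, shrinking only the component that sets B lowers the product whenever A and B are set by different channels.

```python
import math


def objective(a, w, k):
    """A(k) * B(k) under the assumed forms A = max_j a_j / k_j, B = max_j k_j * w_j."""
    big_a = max(aj / kj for aj, kj in zip(a, k))
    big_b = max(kj * wj for kj, wj in zip(k, w))
    return big_a * big_b


a = [8.0, 2.0]  # hypothetical per-channel activation ranges
w = [1.0, 4.0]  # hypothetical per-channel weight ranges

k = [1.0, 1.0]              # A is set by channel 0, B by channel 1
k_prime = [1.0, 1.0 - 0.1]  # shrink only the component that sets B
before = objective(a, w, k)
after = objective(a, w, k_prime)

# Balancing every channel with k_j = sqrt(a_j / w_j) makes a_j / k_j equal k_j * w_j.
k_balanced = [math.sqrt(aj / wj) for aj, wj in zip(a, w)]
balanced = objective(a, w, k_balanced)

# For any k, A(k) * B(k) >= (a_j / k_j) * (k_j * w_j) = a_j * w_j for every j.
lower_bound = max(aj * wj for aj, wj in zip(a, w))
```

In this example the product falls from 32 to 28.8 after the modification, and the balanced choice reaches the lower bound `max_j a_j * w_j = 8`, which no choice of `k` can beat under the assumed forms.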

7. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored program, wherein, when the program is executed, it controls the device on which the computer-readable storage medium is located to perform the large language model quantization method based on the theoretically optimal smoothing function as described in any one of claims 1 to 6.

8. An electronic device, characterized in that it includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory and include instructions that, when executed by the device, cause the device to perform the large language model quantization method based on the theoretically optimal smoothing function as described in any one of claims 1 to 6.