Quantization method and device for feature data in a model, storage medium and electronic equipment

By constructing inter-block and intra-block orthogonal transformation matrices, the disclosure addresses the variance imbalance of activation data and the imbalance in codebook utilization found in existing quantization techniques, improving quantization accuracy, resolving the codebook collapse phenomenon under the MXFP4 format, and improving quantization efficiency.

CN122242595A · Pending · Publication Date: 2026-06-19 · NANJING HOUMO TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING HOUMO TECH CO LTD
Filing Date
2026-03-10
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing quantization techniques suffer from variance imbalance between blocks of activation data and from unbalanced codebook utilization within blocks, resulting in low quantization accuracy, especially with the Microscaling MXFP4 format, where codebook collapse occurs.

Method used

By constructing inter-block orthogonal transformation matrices and intra-block orthogonal transformation matrices, feature data is transformed between blocks and within blocks, activation energy is evenly distributed and codebook utilization is optimized, and quantization is performed using a pre-constructed normalized coefficient matrix to eliminate the influence of high-variance blocks on quantized data, thus achieving uniform codebook mapping.

Benefits of technology

It significantly improves quantization accuracy, eliminates codebook collapse, enhances quantization efficiency, achieves near-floating-point performance, and constructs analytical parameters using only a small amount of calibration data without the need for training data.

✦ Generated by Eureka AI based on patent content.

Figure CN122242595A_ABST

Abstract

This disclosure provides a method, apparatus, storage medium, and electronic device for quantizing feature data in a model. The method includes: acquiring feature data to be quantized from a target model deployed on a target device, wherein the feature data includes at least two data blocks; performing an inter-block transformation on the feature data using a pre-constructed inter-block orthogonal transformation matrix to obtain inter-block transformed data; performing an intra-block transformation on each data block included in the inter-block transformed data using a pre-constructed intra-block orthogonal transformation matrix to obtain intra-block transformed data; and quantizing the intra-block transformed data to obtain quantized feature data. This disclosure can distribute the energy of the feature data evenly across all blocks, eliminating the influence of high-variance blocks on the shared exponent used in quantization, and can redistribute the intra-block data so that it is evenly mapped to each codeword interval of the codebook, solving the codebook collapse problem and thus significantly improving quantization accuracy.

Description

Technical Field

[0001] This disclosure relates to neural network model compression technology and data quantization technology, and in particular to a method, apparatus, storage medium and electronic device for quantizing feature data in a model.

Background Technology

[0002] With the development of large language models (LLMs), the parameter size and memory bandwidth requirements during inference have become major bottlenecks. To efficiently run LLMs on resource-constrained edge devices or large-scale data centers, post-training quantization (PTQ) has become the industry standard paradigm. While traditional INT4 quantization has achieved success in weight compression, its limited dynamic range makes it difficult to accommodate the long-tailed outliers prevalent in the activation data of the model.

[0003] Therefore, low-precision floating-point formats with higher dynamic range are gaining wider adoption, especially the Microscaling FP4 (MXFP4 / e2m1) standard. MXFP4 employs block-wise shared exponent technology, providing near-FP8 representation capabilities while maintaining a 4-bit compression ratio and supporting a wider range of hardware types.

[0004] Currently used quantization techniques suffer from variance imbalance between blocks of activation data and from unbalanced utilization of the codebook within blocks.

Summary of the Invention

[0005] To address the aforementioned technical problems, this disclosure is proposed. Embodiments of this disclosure provide a method, apparatus, storage medium, and electronic device for quantizing feature data in a model.

[0006] Embodiments of this disclosure provide a method for quantizing feature data in a model. The method includes: obtaining feature data to be quantized from a target model deployed on a target device, wherein the feature data includes at least two data blocks; performing an inter-block transformation on the feature data using a pre-constructed inter-block orthogonal transformation matrix to obtain inter-block transformed data; performing an intra-block transformation on each data block included in the inter-block transformed data using a pre-constructed intra-block orthogonal transformation matrix to obtain intra-block transformed data; and quantizing the intra-block transformed data to obtain quantized feature data.

[0007] In some embodiments, quantizing the intra-block transformed data to obtain quantized feature data includes: obtaining a pre-constructed normalization coefficient matrix; for each data block in the intra-block transformed data, extracting the normalization coefficient corresponding to that data block from the normalization coefficient matrix; normalizing the data block using the normalization coefficient to obtain a normalized data block; quantizing the normalized data block based on a preset data quantization strategy to obtain a quantized normalized data block; performing an inverse normalization operation on the quantized normalized data block based on the normalization coefficient to obtain a quantized data block; and combining the obtained quantized data blocks into quantized feature data.

[0008] In some embodiments, before obtaining the feature data to be quantized from the target model deployed on the target device, the method further includes: obtaining first sample feature data and an initialized inter-block orthogonal transformation matrix; determining the covariance matrix of the first sample feature data; updating the inter-block orthogonal transformation matrix based on the covariance matrix, with the objective of equal variance of each data block in the first sample feature data, and returning the updated inter-block orthogonal transformation matrix if the current return condition is met.

[0009] In some embodiments, updating the inter-block orthogonal transformation matrix based on the covariance matrix, with the objective of equal variance across the data blocks of the first sample feature data, and returning the updated inter-block orthogonal transformation matrix when the return condition is met, includes performing the following first update step: determining the average of the variance values in the covariance matrix, and the absolute value of the difference between each variance value and that average variance; if the absolute value meets a preset threshold condition, selecting two target variance values from the covariance matrix, one greater than the average variance and the other less than the average variance; determining a matrix rotation angle based on the two target variance values, and constructing a first rotation matrix based on the matrix rotation angle; updating the covariance matrix and the inter-block orthogonal transformation matrix based on the first rotation matrix; and continuing to perform the first update step based on the updated covariance matrix and inter-block orthogonal transformation matrix; if the absolute value does not meet the threshold condition, returning the current inter-block orthogonal transformation matrix.
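The first update step can be illustrated with a minimal NumPy sketch. It is not the patented procedure: the disclosure does not give the angle formula in this passage, so the closed-form angle below (chosen so that the larger of the two selected variances lands exactly on the average), the choice of the most extreme pair, and the names `equalize_block_variances`, `tol`, and `max_iters` are all assumptions made for illustration.

```python
import numpy as np

def equalize_block_variances(C, tol=1e-8, max_iters=10_000):
    """Sketch: find an orthogonal R such that diag(R C R^T) is nearly constant.

    C is a symmetric B x B covariance matrix whose diagonal holds the block variances.
    """
    C = np.array(C, dtype=float)
    B = C.shape[0]
    R = np.eye(B)
    mean = np.trace(C) / B  # the trace is unchanged by orthogonal rotations
    for _ in range(max_iters):
        dev = np.diag(C) - mean
        if np.max(np.abs(dev)) <= tol:      # every variance is close to the average
            break
        i = int(np.argmax(dev))             # a variance above the average
        j = int(np.argmin(dev))             # a variance below the average
        a, b, c = C[i, i], C[j, j], C[i, j]
        # Assumed angle rule: solve a*cos^2 + 2c*sin*cos + b*sin^2 = mean for theta.
        half_diff = 0.5 * (a - b)
        r = np.hypot(half_diff, c)
        phi = np.arctan2(c, half_diff)
        cos_arg = np.clip((mean - 0.5 * (a + b)) / r, -1.0, 1.0)
        theta = 0.5 * (phi + np.arccos(cos_arg))
        # Plane (Givens) rotation acting on coordinates i and j only.
        G = np.eye(B)
        cs, sn = np.cos(theta), np.sin(theta)
        G[i, i], G[i, j] = cs, sn
        G[j, i], G[j, j] = -sn, cs
        C = G @ C @ G.T                     # update the covariance matrix
        R = G @ R                           # accumulate the inter-block rotation
    return R, C
```

Under this angle rule each rotation moves one variance onto the average, so at most B-1 rotations are needed in exact arithmetic; other pair-selection or angle rules would fit the same loop.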

[0010] In some embodiments, before obtaining the feature data to be quantized from the target model deployed on the target device, the method further includes: obtaining an initialized intra-block orthogonal transformation matrix, an initialized normalized coefficient matrix, and second sample feature data, wherein the second sample feature data is feature data after transforming the original sample feature data using an inter-block orthogonal transformation matrix; performing the following second update step: mapping the second sample feature data to a normalized space and quantizing it based on the intra-block orthogonal transformation matrix and the normalized coefficient matrix to obtain sample quantized data; statistically analyzing each codeword in a preset codebook based on the sample quantized data to obtain the empirical occupancy probability of each codeword; calculating a loss value based on a preset codebook occupancy balancing loss function using the empirical occupancy probability and the average empirical occupancy probability of the codewords in the codebook; and updating the intra-block orthogonal transformation matrix and the normalized coefficient matrix based on the loss value.
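As an illustration of the empirical occupancy probability and a balancing loss, the sketch below counts how often each codeword is hit and penalizes deviation from the average occupancy. The squared-deviation loss, the nearest-codeword rounding, and the function names are assumptions; the disclosure only states that a preset codebook occupancy balancing loss function is used. The codebook lists the values representable in the e2m1 format.

```python
import numpy as np

# Values representable by a 4-bit e2m1 element (+0 and -0 collapse to one entry).
CODEBOOK = np.array([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                     0.5, 1, 1.5, 2, 3, 4, 6], dtype=float)

def occupancy_probabilities(Z, s):
    """Empirical probability that a normalized value is mapped to each codeword.

    Z: (B, K) sample data; s: (B,) positive per-block normalization coefficients.
    """
    A = Z / s[:, None]                                   # map into the normalized space
    idx = np.argmin(np.abs(A[..., None] - CODEBOOK), axis=-1)
    counts = np.bincount(idx.ravel(), minlength=CODEBOOK.size)
    return counts / counts.sum()

def occupancy_balance_loss(p):
    """Assumed loss: squared deviation of each occupancy from the average occupancy."""
    return float(np.sum((p - p.mean()) ** 2))

rng = np.random.default_rng(0)
Z = rng.laplace(scale=0.05, size=(16, 32))               # sharply peaked sample data
Z[:, 0] = 4.0                                            # one large value per block
s = np.max(np.abs(Z), axis=1) / CODEBOOK.max()
p = occupancy_probabilities(Z, s)
loss = occupancy_balance_loss(p)                         # larger when few codewords are used
```

With this loss, perfectly uniform occupancy gives zero and full collapse onto a single codeword gives the maximum value.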

[0011] In some embodiments, updating the intra-block orthogonal transformation matrix and normalized coefficient matrix based on the loss value includes: if the loss value meets the iterative update condition, updating the normalized coefficient matrix based on the maximum codeword in the codebook and the maximum value of each data block in the second sample feature data; selecting first target column data and second target column data that meet the optimization condition from the second sample feature data; constructing a second rotation matrix based on the first target column data and the second target column data; updating the intra-block orthogonal transformation matrix based on the second rotation matrix; continuing to execute the second update step based on the updated intra-block orthogonal transformation matrix and normalized coefficient matrix; if the loss value does not meet the iterative update condition, returning the current intra-block orthogonal transformation matrix and normalized coefficient matrix.

[0012] In some embodiments, selecting a first target column and a second target column of data that meet the optimization conditions from the second sample feature data includes: for any column of data in the second sample feature data, determining the empirical occupancy probability of each codeword corresponding to that column of data; determining the codebook occupancy imbalance score corresponding to that column of data based on the empirical occupancy probability and the average empirical occupancy probability; determining M columns of candidate data from the second sample feature data based on the obtained codebook occupancy imbalance scores; for any two columns of candidate data in the M columns of candidate data, determining the complementarity score between the two columns of candidate data based on the empirical occupancy probability and the average empirical occupancy probability of the codewords corresponding to the two columns of candidate data respectively; determining the selection score of the two columns of candidate data based on the complementarity score and the codebook occupancy imbalance score corresponding to the two columns of candidate data respectively; and selecting N pairs of candidate data from the M columns of candidate data based on the obtained selection scores, and determining each selected pair of candidate data as the first target column and the second target column of data that meet the optimization conditions.

[0013] In some embodiments, constructing a second rotation matrix based on the first target column data and the second target column data includes: selecting a target rotation angle that minimizes the loss value from a preset set of rotation angles; and constructing the second rotation matrix based on the target rotation angle, the first target column data, and the second target column data.
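Selecting the target rotation angle from a preset set amounts to a small grid search. The sketch below assumes the second rotation matrix is a plane rotation acting on the two target columns and that the loss is supplied as a callable; `givens`, `best_rotation`, and `loss_fn` are hypothetical names, and the peak-magnitude loss in the usage lines is only a stand-in for the codebook occupancy loss.

```python
import numpy as np

def givens(K, p, q, theta):
    """K x K plane rotation that mixes columns p and q when applied on the right."""
    G = np.eye(K)
    c, s = np.cos(theta), np.sin(theta)
    G[p, p], G[p, q] = c, -s
    G[q, p], G[q, q] = s, c
    return G

def best_rotation(Z, p, q, angles, loss_fn):
    """Pick the angle from `angles` whose rotation of columns p, q minimizes loss_fn."""
    K = Z.shape[1]
    theta = min(angles, key=lambda t: loss_fn(Z @ givens(K, p, q, t)))
    return theta, givens(K, p, q, theta)

# Usage with a stand-in loss: column 0 carries all the energy, column 1 none.
Z = np.zeros((4, 3))
Z[:, 0] = 1.0
angles = [0.0, np.pi / 8, np.pi / 4, 3 * np.pi / 8]      # includes "no rotation"
theta, G = best_rotation(Z, 0, 1, angles, loss_fn=lambda M: np.max(np.abs(M)))
# An intra-block matrix could then be updated as R_intra = R_intra @ G.
```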

[0014] In some embodiments, after quantizing the intra-block transformed data to obtain quantized feature data, the method further includes: inversely rotating the inter-block orthogonal transformation matrix and the intra-block orthogonal transformation matrix to obtain the inter-block orthogonal transformation inverse rotation matrix and the intra-block orthogonal transformation inverse rotation matrix, respectively; fusing them with the weight data included in the linear layer of the target model to obtain fused weight data; and using the fused weight data to perform a linear transformation on the quantized feature data to obtain the feature data output by the linear layer.
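Why the inverse rotations can be folded into the weights is easiest to see in a small numerical sketch. It assumes a single token of dimension d = B·K split into B row-major blocks and a linear layer computing y = W x, and it leaves out the quantization step so that the equality is exact; these layout choices and all variable names are assumptions, not details taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, m = 4, 8, 5
d = B * K

def random_orthogonal(size):
    """Placeholder orthogonal matrix obtained from a QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    return q

R_inter, R_intra = random_orthogonal(B), random_orthogonal(K)
W = rng.standard_normal((m, d))          # linear layer: y = W @ x
x = rng.standard_normal(d)

# Rotate the activation block-wise: Z = R_inter X R_intra, flattened row-major.
z = (R_inter @ x.reshape(B, K) @ R_intra).reshape(-1)

# The same rotation written as one d x d orthogonal matrix on the flat vector.
Q = np.kron(R_inter, R_intra.T)

# Fuse the inverse rotation (Q^-1 = Q^T) into the weights once, offline.
W_fused = W @ Q.T

y_ref = W @ x                            # original layer on original activations
y_fused = W_fused @ z                    # fused layer on rotated activations
```

In a real deployment `z` would be quantized before the multiplication, so `y_fused` would match `y_ref` only up to quantization error.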

[0015] According to another aspect of the present disclosure, a quantization apparatus for feature data in a model is provided. The apparatus includes: a first acquisition module, configured to acquire feature data to be quantized from a target model deployed on a target device, wherein the feature data includes at least two data blocks; a first transformation module, configured to perform an inter-block transformation on the feature data using a pre-constructed inter-block orthogonal transformation matrix to obtain inter-block transformed data; a second transformation module, configured to perform an intra-block transformation on each data block included in the inter-block transformed data using a pre-constructed intra-block orthogonal transformation matrix to obtain intra-block transformed data; and a quantization module, configured to quantize the intra-block transformed data to obtain quantized feature data.

[0016] According to another aspect of the present disclosure, a computer-readable storage medium is provided that stores computer program instructions thereon, which, when executed by a processor, implement the steps of the quantization method for feature data in the above-described model.

[0017] According to another aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to read the executable instructions from the memory and execute them to implement the above method for quantizing feature data in a model.

[0018] According to another aspect of the present disclosure, a computer program product is provided, including computer program instructions that, when executed by a processor, implement the steps of the quantization method for feature data in the above-described model.

[0019] Based on the quantization method, apparatus, storage medium, and electronic device for feature data in a model provided in the above embodiments of this disclosure, an inter-block orthogonal transformation matrix and an intra-block orthogonal transformation matrix are constructed in advance. Performing an inter-block transformation on the feature data in the model with the inter-block orthogonal transformation matrix distributes the energy of the feature data evenly across all blocks, eliminating the influence of high-variance blocks on the shared exponent used in quantization. Performing an intra-block transformation on each data block of the inter-block transformed data with the intra-block orthogonal transformation matrix redistributes the local data within the block, so that the intra-block data is evenly mapped to each codeword interval of the codebook, solving the codebook collapse problem and thus significantly improving quantization accuracy. In addition, the quantization method provided in the embodiments of this disclosure requires no training data and uses only a small amount of calibration data to construct the parameters analytically, greatly improving quantization efficiency.

[0020] The technical solutions of this disclosure will be further described in detail below with reference to the accompanying drawings and embodiments.

Description of the Drawings

[0021] The above and other objects, features, and advantages of this disclosure will become more apparent from the more detailed description of the embodiments thereof in conjunction with the accompanying drawings. The drawings are provided to further illustrate the embodiments of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the disclosure and do not constitute a limitation thereof. In the drawings, the same reference numerals generally represent the same components or steps;

[0022] Figure 1 is a flowchart of a method for quantizing feature data in a model provided by an exemplary embodiment of this disclosure;

[0023] Figure 2 is a flowchart of a method for quantizing feature data in a model provided by another exemplary embodiment of this disclosure;

[0024] Figure 3 is a flowchart of a method for quantizing feature data in a model provided by yet another exemplary embodiment of this disclosure;

[0025] Figure 4 is a flowchart of a method for quantizing feature data in a model provided by yet another exemplary embodiment of this disclosure;

[0026] Figure 5 is a flowchart of a method for quantizing feature data in a model provided by yet another exemplary embodiment of this disclosure;

[0027] Figure 6 is a flowchart of a method for quantizing feature data in a model provided by yet another exemplary embodiment of this disclosure;

[0028] Figure 7 is a flowchart of a method for quantizing feature data in a model provided by yet another exemplary embodiment of this disclosure;

[0029] Figure 8 is a flowchart of yet another exemplary embodiment of this disclosure;

[0030] Figure 9 is a schematic structural diagram of an apparatus for quantizing feature data in a model provided by an exemplary embodiment of this disclosure;

[0031] Figure 10 is a structural diagram of an electronic device provided by an exemplary embodiment of this disclosure.

Detailed Description

[0032] Hereinafter, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present disclosure, and not all embodiments of the present disclosure, and it should be understood that the present disclosure is not limited to the exemplary embodiments described herein.

[0033] It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of this disclosure.

[0034] Those skilled in the art will understand that the terms "first," "second," etc., in the embodiments of this disclosure are only used to distinguish different steps, devices, or modules, and do not represent any specific technical meaning, nor do they indicate a necessary logical order between them.

[0035] It should also be understood that in the embodiments disclosed herein, "a plurality of" may refer to two or more, and "at least one" may refer to one, two or more.

[0036] It should also be understood that any component, data or structure mentioned in the embodiments of this disclosure can generally be understood as one or more unless expressly defined or given to the contrary in the context.

[0037] Furthermore, the term "and / or" in this disclosure is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this disclosure generally indicates that the preceding and following related objects have an "or" relationship.

[0038] It should also be understood that the description of the various embodiments in this disclosure emphasizes the differences between them; for identical or similar parts, the embodiments may be referred to one another. For the sake of brevity, these parts will not be described again.

[0039] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0040] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.

[0041] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0042] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0043] The embodiments disclosed herein can be applied to electronic devices such as terminal devices, computer systems, and servers, and can operate together with a wide range of other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and / or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments including any of the above systems, etc.

[0044] Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Typically, program modules can include routines, programs, object programs, components, logic, data structures, etc., which perform specific tasks or implement specific abstract data types. Computer systems / servers can be implemented in distributed cloud computing environments, where tasks are executed by remote processing devices linked through communication networks. In distributed cloud computing environments, program modules can reside on local or remote computing system storage media, including storage devices.

[0045] Application Overview

[0046] Related rotation-based quantization methods (such as QuaRot) are primarily designed to suppress outliers to accommodate a uniform distribution of integers, ignoring the topology of block floating-point data. However, MXFP4 format data has a unique "shared exponent + logarithmic quantization" structure, making existing integer-optimized rotation strategies often ineffective.

[0047] Existing quantization methods have not addressed the imbalance in inter-block variance. Specifically, with a fixed block size (e.g., block size = 32), the block variance of the activation data exhibits a pronounced long-tailed distribution. A few high-energy blocks dominate the overall quantization error and force a sharp increase in the shared exponent, causing more than 90% of the small values within the same block to be zeroed out due to insufficient resolution.

[0048] Existing quantization methods have also not addressed unbalanced codebook utilization within blocks. In real-world scenarios, the distribution of activation data in the model is typically highly concentrated around zero (a peaked distribution), while the e2m1 codebook design of MXFP4 assumes that the data follows an ideal log-normal distribution. As a result, the vast majority of values fall into the minimum quantization interval, while many large-value codewords with strong representational capability remain idle, causing severe "codebook collapse."

[0049] The embodiments disclosed herein aim to solve the above problems. By constructing an inter-block orthogonal transformation matrix and an intra-block orthogonal transformation matrix, they achieve variance balancing of the activation data at the macro level (distributing activation energy evenly across all blocks and eliminating the dominant influence of high-variance blocks on the shared exponent) and codebook alignment at the micro level (reshaping the intra-block data distribution so that it is evenly mapped to each codeword interval of MXFP4), thereby significantly improving quantization accuracy and achieving near-floating-point performance.

[0050] Exemplary methods

[0051] Figure 1 is a flowchart of a method for quantizing feature data in a model provided by an exemplary embodiment of this disclosure. This embodiment can be applied to various types of electronic devices, such as... As shown in Figure 1, the method includes the following steps:

[0052] Step 101: Obtain the feature data to be quantized from the target model deployed on the target device.

[0053] In this embodiment, the target device can be various types of electronic devices, such as user terminal devices, edge computing devices, backend servers, etc. This method can be executed by the target device or by other electronic devices connected to the target device.

[0054] The target model described above can be a neural network model of various structures, such as a large language model or a multimodal model. A large language model is a deep learning model trained on a large amount of text data, enabling it to generate natural language text or understand language text. It can process various types of text content in the input text data and supports applications such as conversational AI, chatbots, marketing content, and code assistants. A multimodal model is a model trained on a combination of text, images, video, audio, and other multimodal information. The input and output of a multimodal model can take multiple forms, such as text, images, audio, and video.

[0055] The aforementioned feature data can be data generated internally by the target model during the execution of inference tasks. This data is typically obtained by the activation function of a certain layer within the model. Therefore, the feature data processed in this embodiment can be activation data. For example, when the target model performs a natural language processing task, the input is text. The target model processes the text features layer by layer, and the activation data is obtained after the activation function is calculated for a certain layer of features. As another example, when the target model performs an image recognition task, the input is an image. The target model processes the image features layer by layer, and the activation data is obtained after the activation function is calculated for a certain layer of features.

[0056] The feature data in this embodiment may include at least two data blocks. For example, the original feature data is x, a matrix of size n×d generated by a certain layer in the target model, where n is the number of tokens and d is the dimension of each token. The electronic device can reshape x into X of size B×K, where B is the number of blocks and K is the size of each block (e.g., K=32).
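A minimal sketch of this reshaping is given below. It assumes that blocks are formed from K consecutive values in row-major order, so that B = n·d/K; the disclosure does not specify the layout, and the helper name `to_blocks` is hypothetical.

```python
import numpy as np

def to_blocks(x: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Reshape an (n, d) activation matrix into (B, K) blocks of K consecutive values."""
    if x.size % block_size != 0:
        raise ValueError("number of elements must be divisible by the block size")
    return x.reshape(-1, block_size)

x = np.arange(4 * 64, dtype=np.float32).reshape(4, 64)   # n = 4 tokens, d = 64
X = to_blocks(x, block_size=32)
print(X.shape)  # (8, 32): B = n*d/K = 8 blocks of K = 32 values
```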

[0057] Step 102: Using the pre-constructed inter-block orthogonal transformation matrix, perform an inter-block transformation on the feature data to obtain the inter-block transformed data.

[0058] In this embodiment, the inter-block orthogonal transformation matrix is used to reduce the difference in variance between data blocks, so that the activation energy is evenly distributed across all blocks and, when the feature data is quantized (e.g., according to the MXFP4 standard), high-variance blocks no longer dominate the shared exponent used in the quantization process.

[0059] The inter-block orthogonal transformation matrix can be pre-constructed manually (e.g., obtained through extensive testing and adjustment of matrix elements), or it can be automatically generated by electronic devices using a preset algorithm (e.g., with minimizing inter-block variance as the optimization objective, the algorithm parameters are adjusted through supervised training to output the inter-block orthogonal transformation matrix).

[0060] Let X be the feature data to be quantized and $R_{\mathrm{inter}}$ be the inter-block orthogonal transformation matrix, of size B×B. After the matrix rotation performed in this step, the resulting inter-block transformed data is $Y = R_{\mathrm{inter}} X$, of size B×K.

[0061] Step 103: Using the pre-constructed intra-block orthogonal transformation matrix, perform intra-block transformation on each data block included in the inter-block transformed data to obtain the intra-block transformed data.

[0062] In this embodiment, the purpose of using the intra-block orthogonal transformation matrix is to adjust the data distribution state within each data block, so that after the data in each data block is quantized, it can be evenly distributed to each codeword interval in the codebook, thus avoiding codebook collapse.

[0063] The intra-block orthogonal transformation matrix can be pre-constructed manually (e.g., obtained through extensive testing and adjustment of matrix elements), or it can be automatically generated by electronic devices using a preset algorithm (e.g., using the balance of data occupancy in the codebook within the block as the optimization objective, adjusting algorithm parameters through supervised training, and outputting the intra-block orthogonal transformation matrix).

[0064] After the inter-block transformed data Y is processed in this step, the intra-block transformed data $Z = Y R_{\mathrm{intra}}$ is obtained, where $R_{\mathrm{intra}}$ is the intra-block orthogonal transformation matrix of size K×K, and Z is of size B×K.
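Steps 102 and 103 together amount to a two-sided rotation of the B×K matrix. The sketch below uses random orthogonal matrices purely as placeholders for the pre-constructed matrices (the variable and helper names are illustrative) and shows that the rotation preserves the total energy and can be undone exactly.

```python
import numpy as np

def random_orthogonal(size: int, rng: np.random.Generator) -> np.ndarray:
    """Placeholder orthogonal matrix from a QR decomposition (not the patent's construction)."""
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    return q

rng = np.random.default_rng(0)
B, K = 8, 32
X = rng.standard_normal((B, K))
R_inter = random_orthogonal(B, rng)      # B x B, mixes blocks (rows of X)
R_intra = random_orthogonal(K, rng)      # K x K, mixes values inside each block

Y = R_inter @ X                          # inter-block transformed data, B x K
Z = Y @ R_intra                          # intra-block transformed data, B x K

# Orthogonality: the inverse of each rotation is its transpose.
X_back = R_inter.T @ Z @ R_intra.T
```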

[0065] Step 104: Quantize the intra-block transformed data to obtain quantized feature data.

[0066] In this embodiment, this step can be implemented using relevant data quantization techniques. For example, a quantization method conforming to the MXFP4 standard can be used.
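For orientation, a simplified sketch of MXFP4-style block quantization follows: each block shares one power-of-two scale, and each element is rounded to the nearest e2m1 value. It is an illustrative approximation rather than a conformant implementation; tie-breaking and special values are not modeled, and the function name is hypothetical.

```python
import numpy as np

# Magnitudes representable by an e2m1 element, mirrored to negative values.
E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
CODEBOOK = np.concatenate([-E2M1_MAGNITUDES[:0:-1], E2M1_MAGNITUDES])

def quantize_block_mxfp4(block: np.ndarray) -> np.ndarray:
    """Quantize one 1-D block with a shared power-of-two scale and dequantize it."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block)
    # Shared exponent: exponent of the block maximum minus the largest e2m1 exponent (2).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(block / scale, -6.0, 6.0)           # saturate at the largest codeword
    idx = np.argmin(np.abs(scaled[:, None] - CODEBOOK[None, :]), axis=1)
    return CODEBOOK[idx] * scale

# One outlier raises the shared scale to 16, so the small values round to zero.
block = np.full(32, 0.01)
block[-1] = 100.0
out = quantize_block_mxfp4(block)
# out[-1] is 96.0 (6 * 16) and the other 31 entries are 0.0
```

The usage lines reproduce the failure mode described in the Application Overview: a single high-energy value forces a large shared exponent and the remaining values in the block lose all resolution.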

[0067] The quantized feature data can be directly output by the target device or further input into the lower layer of the target model to perform subsequent inference tasks.

[0068] It should be understood that the method provided in this embodiment can be applied to each layer of the target model. That is, by executing this method multiple times, the feature data of each layer in the target model can be transformed and quantized.

[0069] In the method provided in the above embodiments of this disclosure, an inter-block orthogonal transformation matrix and an intra-block orthogonal transformation matrix are constructed in advance. Performing an inter-block transformation on the feature data in the model with the inter-block orthogonal transformation matrix distributes the energy of the feature data evenly across all blocks, eliminating the influence of high-variance blocks on the shared exponent used in quantization. Performing an intra-block transformation on each data block of the inter-block transformed data with the intra-block orthogonal transformation matrix redistributes the local data within the block, so that the data within the block is evenly mapped to each codeword interval of the codebook, solving the codebook collapse problem and thus significantly improving quantization accuracy. Furthermore, the quantization method provided in the embodiments of this disclosure requires no training data and uses only a small amount of calibration data to construct the parameters analytically, greatly improving quantization efficiency.

[0070] In some optional implementations, as shown in Figure 2, step 104 includes:

[0071] Step 1041: Obtain the pre-constructed normalized coefficient matrix.

[0072] The normalization coefficient matrix can be denoted S and can be a diagonal matrix of size B×B, for example S = diag(s_1, ..., s_B), where each element s_b is the normalization coefficient shared by the elements of data block b. The normalization coefficient matrix can be constructed in advance manually (e.g., through extensive testing and adjustment of its elements), or it can be generated automatically by an electronic device using a preset algorithm (e.g., with balanced data occupancy within the codebook as the optimization objective, the algorithm parameters are adjusted through supervised training to output the normalization coefficient matrix).

[0073] Step 1042: For each data block in the intra-block transformed data, extract the normalization coefficient corresponding to the data block from the normalization coefficient matrix; normalize the data block using the normalization coefficient to obtain a normalized data block; quantize the normalized data block based on a preset data quantization strategy to obtain a quantized normalized data block; and perform an inverse normalization operation on the quantized normalized data block based on the normalization coefficient to obtain a quantized data block.

[0074] Specifically, when performing this step, a data block is first extracted. For example, the extracted data block is represented as z_b = [z_{b,1}, ..., z_{b,K}]^T.

[0075] Then, for data block b, the corresponding normalization coefficient s_b is obtained from S.

[0076] Next, a normalization operation is performed to obtain the normalized data block a_b = z_b / s_b.

[0077] Next, a quantization operation is performed on the normalized data block a_b to obtain a quantized normalized data block.

[0078] Finally, the quantized normalized data block is denormalized, i.e., multiplied by s_b, to obtain the quantized data block.

[0079] Step 1043: Combine the obtained quantized data blocks into quantized feature data.

[0080] The final quantized feature data has size B×K.
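Steps 1041 to 1043 can be sketched as follows. This is a minimal sketch, not the disclosed implementation: it assumes the preset quantization strategy is round-to-nearest onto the signed magnitudes of the 4-bit E2M1 element format used by MXFP4, and the function names and sample values are hypothetical.

```python
# Magnitudes representable by the 4-bit E2M1 element format used in MXFP4.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_to_codebook(value):
    """Round one normalized value to the nearest signed codeword (assumed strategy)."""
    magnitude = min(FP4_MAGNITUDES, key=lambda c: abs(abs(value) - c))
    return -magnitude if value < 0 else magnitude

def quantize_block(z_b, s_b):
    """Normalize by s_b, quantize, then denormalize one data block (step 1042)."""
    a_b = [z / s_b for z in z_b]                     # normalization: a_b = z_b / s_b
    a_hat = [quantize_to_codebook(a) for a in a_b]   # quantization in normalized space
    return [s_b * a for a in a_hat]                  # inverse normalization

def quantize_features(z, scales):
    """Quantize every block and stack the results into a B x K matrix (step 1043)."""
    return [quantize_block(z_b, s_b) for z_b, s_b in zip(z, scales)]

# Hypothetical intra-block transformed data (B = 2, K = 3) and diagonal entries of S.
Z = [[0.9, -2.6, 12.0], [0.1, 0.26, -0.3]]
scales = [2.0, 0.05]
Z_hat = quantize_features(Z, scales)
```

With a separate coefficient per block, the small-valued second block is spread over the codebook instead of collapsing onto the codewords nearest zero.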

[0081] This embodiment uses a pre-constructed normalization coefficient matrix to normalize the transformed data within the block before quantization. This allows for more targeted codebook mapping of the data within the block during the quantization process, which helps to uniformly map the quantized data within the block to each codeword interval, prevents codebook collapse, and improves quantization accuracy.

[0082] In some optional implementations, as shown in Figure 3, prior to step 101, the method further includes:

[0083] Step 105: Obtain the first sample feature data and the initialized inter-block orthogonal transformation matrix.

[0084] The first sample feature data may be the same as or different from the feature data to be quantized. The first sample feature data can be obtained by reshaping the original feature data into a B×K matrix, where B is the number of blocks and K is the size of each block.

[0085] The initialized inter-block orthogonal transformation matrix can be preset; for example, it can be an identity matrix.

[0086] Step 106: Determine the covariance matrix of the first sample feature data.

[0087] The covariance matrix Σ of the first sample feature data can be determined by following the method for calculating the covariance matrix. In this embodiment, the size of the covariance matrix can be B×B, which can represent the variance of each data block and the covariance between different data blocks.
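The reshaping and the B×B covariance matrix can be sketched as below. This is a simplified illustration under stated assumptions: a single calibration vector is used, the K positions inside a block are treated as the observations, and the population form (division by K) is chosen. The function names are hypothetical, and an actual calibration would accumulate statistics over many samples.

```python
def reshape_to_blocks(flat, block_size):
    """Split a flat feature vector into B rows of K = block_size values."""
    if len(flat) % block_size:
        raise ValueError("feature length must be a multiple of the block size")
    return [flat[i:i + block_size] for i in range(0, len(flat), block_size)]

def block_covariance(x):
    """Return the B x B covariance matrix; its diagonal holds the per-block variances."""
    k = len(x[0])
    centered = [[v - sum(row) / k for v in row] for row in x]
    return [[sum(a * b for a, b in zip(row_i, row_j)) / k for row_j in centered]
            for row_i in centered]

# Hypothetical feature vector with B = 2 blocks of size K = 2.
X = reshape_to_blocks([1.0, -1.0, 2.0, -2.0], block_size=2)
sigma = block_covariance(X)
```

Here the diagonal of `sigma` gives the variance of each block and the off-diagonal entries give the covariance between blocks, matching the role of Σ described above.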

[0088] Step 107: Based on the covariance matrix, with the goal of equal variance of each data block in the first sample feature data, update the inter-block orthogonal transformation matrix, and if the current return condition is met, return the updated inter-block orthogonal transformation matrix.

[0089] Specifically, when updating the inter-block orthogonal transformation matrix, an adjustment strategy (e.g., adjustment step size, adjustment direction) can be set for each element in the matrix. After each adjustment, the variances of the data blocks are compared. If the difference between them is less than or equal to a preset threshold, the return condition is determined to be met, and the adjusted inter-block orthogonal transformation matrix is used as the inter-block orthogonal transformation matrix in steps 101-104 above.

[0090] The update objective described above can be set based on convex optimization theory. According to convex optimization theory, the mean square error (MSE) of the quantized data is minimized if and only if the variances of all blocks are equal. That is, the update objective, shown in equation (1) below, is to find an orthogonal matrix that makes the diagonal elements of the transformed covariance matrix equal:

[0091] (1)

[0092] In equation (1), the covariance term is the covariance matrix at position k of each data block, and the ones term is a vector consisting entirely of 1s.

[0093] It should be noted that the process of determining the inter-block orthogonal transformation matrix in this embodiment can be executed offline. That is, in actual application scenarios, when steps 101-104 are executed online, the inter-block orthogonal transformation matrix obtained offline can be used directly, improving data quantization efficiency and model inference efficiency. The inter-block orthogonal transformation matrix is calculated for the same layer of the target model as the layer whose feature data is quantized in steps 101-104.

[0094] This embodiment determines the inter-block orthogonal transformation matrix in advance using an iterative update method, which can accurately and efficiently obtain the inter-block orthogonal transformation matrix, thereby improving the accuracy and efficiency of feature data quantization.

[0095] In some optional implementations, as shown in Figure 4, step 107 includes:

[0096] Perform the following first update steps:

[0097] Step 1071: Determine the variance mean, i.e., the mean of the variance values in the covariance matrix, and the absolute value of the difference between each variance value in the covariance matrix and the variance mean.

[0098] The variance mean is expressed as: c = tr(Σ) / B, where tr(Σ) represents the trace of the covariance matrix Σ, that is, the sum of the elements on the main diagonal of the matrix, and B is the number of data blocks.

[0099] The absolute value of the difference between a variance value and the variance mean is expressed as |σ_ii - c|, where σ_ii represents the variance of the i-th data block.

[0100] Step 1072: If the absolute value meets the preset threshold condition, select two target variance values from the covariance matrix.

[0101] Of the two target variance values, one is greater than the variance mean and the other is less than the variance mean.

[0102] The above threshold condition can be |σ_ii - c| > ε. In this case, the two target variance values can be selected according to the condition (σ_ii - c)(σ_jj - c) < 0, where σ_ii and σ_jj are the two selected target variance values.

[0103] Step 1073: Based on the two target variance values, determine the matrix rotation angle, and construct the first rotation matrix based on the matrix rotation angle.

[0104] Specifically, the rotation angle can be calculated using the following formula:

[0105] (2)

[0106] Here, σ_ij can be extracted from the covariance matrix.

[0107] Based on the matrix rotation angle θ and the indices i and j, the first rotation matrix can be constructed using a relevant rotation matrix construction method. Optionally, the Givens rotation construction method can be used to construct a Givens matrix G_ij(θ) as the first rotation matrix.

[0108] Step 1074: Update the covariance matrix and the inter-block orthogonal transformation matrix based on the first rotation matrix.

[0109] Specifically, G^T Σ G can be calculated, and the result is used as the updated covariance matrix Σ; R_inter G can be calculated, and the result is used as the updated inter-block orthogonal transformation matrix R_inter.

[0110] Step 1075: Based on the updated covariance matrix and inter-block orthogonal transformation matrix, continue with the first update step.

[0111] That is, if the threshold condition is met, the process is repeated from step 1071 until the threshold condition is no longer met, thereby achieving iterative updates of the covariance matrix and the inter-block orthogonal transformation matrix.

[0112] Step 1076: If the absolute value does not meet the threshold condition, return the current inter-block orthogonal transformation matrix.

[0113] That is, when |σ_ii - c| ≤ ε, the variances of the data blocks can be regarded as equal, and the inter-block orthogonal transformation matrix R_inter is returned to serve as the inter-block orthogonal transformation matrix used in steps 101-104.

[0114] This embodiment provides an algorithm for updating the inter-block orthogonal transformation matrix. Based on this algorithm, the inter-block orthogonal transformation matrix can be accurately obtained, making the variance of each data block of the transformed feature data approximately equal, thereby improving the accuracy and efficiency of determining the inter-block orthogonal transformation matrix.
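The first update steps (1071 to 1076) can be sketched as the loop below. Several points are assumptions rather than the disclosed formulas: the rotation angle is chosen so that the i-th diagonal element becomes exactly c, by solving (σ_jj - c)t² + 2σ_ij t + (σ_ii - c) = 0 for t = tan θ, which may differ in form from equation (2); R_inter starts from the identity matrix; i is taken as the block with the largest deviation and j as the block with the largest opposite-sign deviation. All function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def givens(n, i, j, theta):
    """n x n identity with a plane rotation by theta embedded at indices (i, j)."""
    g = [[float(r == c) for c in range(n)] for r in range(n)]
    g[i][i] = g[j][j] = math.cos(theta)
    g[i][j], g[j][i] = -math.sin(theta), math.sin(theta)
    return g

def equalize_variances(sigma, eps=1e-9, max_steps=1000):
    """Rotate until every diagonal element of sigma is within eps of the variance mean."""
    n = len(sigma)
    r_inter = [[float(r == c) for c in range(n)] for r in range(n)]  # assumed init
    for _ in range(max_steps):
        c = sum(sigma[k][k] for k in range(n)) / n            # variance mean tr(S)/B
        dev = [sigma[k][k] - c for k in range(n)]
        i = max(range(n), key=lambda k: abs(dev[k]))
        if abs(dev[i]) <= eps:                                # step 1076: return
            break
        j = max(range(n), key=lambda k: -dev[k] * dev[i])     # opposite-sign partner
        a, b, d = dev[j], sigma[i][j], dev[i]
        t = (-b + math.sqrt(b * b - a * d)) / a               # real root since a*d < 0
        g = givens(n, i, j, math.atan(t))
        sigma = matmul(transpose(g), matmul(sigma, g))        # S <- G^T S G
        r_inter = matmul(r_inter, g)                          # R_inter <- R_inter G
    return r_inter, sigma

sigma0 = [[9.0, 1.0, 0.5], [1.0, 1.0, 0.2], [0.5, 0.2, 2.0]]
R_inter, sigma_eq = equalize_variances(sigma0)
```

Because each rotation sets one diagonal element to c while preserving the trace, this particular choice of angle needs at most B - 1 rotations in exact arithmetic; other angle rules may converge differently.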

[0115] In some optional implementations, as shown in Figure 5, prior to step 101, the method further includes:

[0116] Step 108: Obtain the initialized intra-block orthogonal transformation matrix, the initialized normalized coefficient matrix, and the second sample feature data.

[0117] The second sample feature data is the feature data obtained by transforming the original sample feature data using the inter-block orthogonal transformation matrix. For example, the inter-block orthogonal transformation matrix R_inter can be used to transform the first sample feature data described in the embodiment corresponding to Figure 4 above, yielding the second sample feature data.

[0118] The initialized intra-block orthogonal transformation matrix R_intra and the initialized normalization coefficient matrix S can each be an identity matrix. The size of R_intra is K×K, and the size of S is B×B.

[0119] Next, the following second update step can be performed:

[0120] Step 109: Based on the intra-block orthogonal transformation matrix and the normalization coefficient matrix, the second sample feature data is mapped to the normalization space and quantized to obtain sample quantized data.

[0121] Specifically, the second sample feature data X can be mapped to the normalized space and rotated within the block according to the following formula:

[0122] Z = S^{-1} X R_intra (3)

[0123] Then, Z is quantized to obtain the sample quantized data:

[0124] (4)

[0125] The result of equation (4) is the sample quantized data.

[0126] Step 110: Based on the sample quantized data, count the usage of each codeword in the preset codebook to obtain the empirical occupancy probability of each codeword.

[0127] The empirical occupancy probability of a codeword is the ratio of the number of times the codeword is used during the quantization process to the total number of times all codewords are used.

[0128] Step 111: Based on the preset codebook occupancy balancing loss function, calculate the loss value using the empirical occupancy probability and the average empirical occupancy probability of the codewords in the codebook.

[0129] The loss value calculated using the codebook occupancy balancing loss function represents how evenly the codewords in the codebook are used. As an example, the codebook occupancy balancing loss function is expressed as:

[0130] (5)

[0131] Here, J is the number of codewords in the codebook, 1/J is the average empirical occupancy probability, and the remaining term is the empirical occupancy probability of the j-th codeword. Determining this empirical occupancy probability requires the intra-block orthogonal transformation matrix R_intra and the normalization coefficient matrix S.
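Steps 110 and 111 can be illustrated with the sketch below. The counting of codeword usage follows the definition above; the loss is an assumed form (the sum of squared deviations of each occupancy probability from 1/J) chosen only for illustration, and the function names are hypothetical.

```python
from collections import Counter

def empirical_occupancy(codeword_indices, num_codewords):
    """Fraction of quantized elements that landed on each of the J codewords."""
    counts = Counter(codeword_indices)
    total = len(codeword_indices)
    return [counts[j] / total for j in range(num_codewords)]

def balance_loss(probabilities):
    """Assumed balancing loss: squared deviation of each probability from 1/J."""
    uniform = 1.0 / len(probabilities)
    return sum((p - uniform) ** 2 for p in probabilities)

# Codeword indices chosen by a quantizer for four elements, with J = 4 codewords.
collapsed = empirical_occupancy([0, 0, 0, 1], num_codewords=4)
balanced = empirical_occupancy([0, 1, 2, 3], num_codewords=4)
loss_collapsed = balance_loss(collapsed)
loss_balanced = balance_loss(balanced)
```

Under this assumed loss, a collapsed codebook (most elements on one codeword) scores higher than a uniformly used one, which scores zero.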

[0132] Step 112: Based on the loss value, update the intra-block orthogonal transformation matrix and normalized coefficient matrix.

[0133] Specifically, minimizing the aforementioned loss value can be used as the update objective. In each iteration, the values of the elements in the intra-block orthogonal transformation matrix and the normalization coefficient matrix are adjusted according to a preset adjustment strategy (e.g., adjustment step size, adjustment direction), so that the two matrices are updated iteratively. The resulting normalization coefficient matrix can be used when performing step 1041 above.

[0134] It should be noted that the process of determining the intra-block orthogonal transformation matrix and the normalization coefficient matrix in this embodiment can be executed offline. That is, in actual application scenarios, when steps 101-104 are executed online, the matrices obtained offline can be used directly, improving data quantization efficiency and model inference efficiency. The intra-block orthogonal transformation matrix and the normalization coefficient matrix are calculated for the same layer of the target model as the layer whose feature data is quantized in steps 101-104.

[0135] This embodiment determines the intra-block orthogonal transformation matrix and normalization coefficient matrix in advance using an iterative update method, which can accurately and efficiently obtain the intra-block orthogonal transformation matrix and normalization coefficient matrix, thereby improving the accuracy and efficiency of feature data quantization.

[0136] In some optional implementations, as shown in Figure 6, step 112 includes:

[0137] Step 1121: If the loss value meets the iterative update condition, update the normalization coefficient matrix based on the maximum codeword in the codebook and the maximum value of each data block in the second sample feature data.

[0138] Typically, the iterative update condition can be: the loss value is greater than a preset loss threshold and the current iteration number is less than a preset number threshold.

[0139] When updating the normalization coefficient matrix, the intra-block orthogonal transformation matrix R_intra can be fixed. For each data block in the second sample feature data, the maximum value within that data block is mapped to the maximum codeword, which yields the updated normalization coefficient corresponding to that data block.
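One way to read "mapping the maximum value to the maximum codeword" is to set each coefficient to the block's largest magnitude divided by the largest codeword, so that the normalized block peaks exactly at that codeword. The sketch below follows that reading; the use of absolute values, the fallback for an all-zero block, and the names `MAX_CODEWORD` and `update_scales` are assumptions.

```python
MAX_CODEWORD = 6.0  # largest magnitude in the E2M1 (FP4) codebook

def update_scales(rotated_blocks, max_codeword=MAX_CODEWORD):
    """Return one coefficient per block so that max|block| / s_b equals max_codeword."""
    scales = []
    for block in rotated_blocks:
        peak = max(abs(v) for v in block)
        # An all-zero block has no peak to map; fall back to a neutral coefficient.
        scales.append(peak / max_codeword if peak > 0 else 1.0)
    return scales

# Hypothetical blocks after the intra-block rotation (R_intra held fixed).
blocks = [[3.0, -12.0, 6.0], [0.3, 0.15, -0.06], [0.0, 0.0, 0.0]]
scales = update_scales(blocks)
normalized_peaks = [max(abs(v) for v in b) / s for b, s in zip(blocks, scales)]
```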

[0140] Step 1122: Select the first target column data and the second target column data that meet the optimization conditions from the second sample feature data.

[0141] In this case, the size of the second sample feature data is B×K, meaning that each column contains data from the same position in all blocks.

[0142] Optimization conditions can be preset. For example, from the second sample feature data, the two most imbalanced columns after mapping to the codebook can be selected as the first and second target columns. The imbalance can be measured by the difference between the empirical occupancy probability and the average empirical occupancy probability of each codeword corresponding to a column of data; the larger the difference, the greater the imbalance.

[0143] Step 1123: Construct a second rotation matrix based on the first target column data and the second target column data.

[0144] When constructing the second rotation matrix, the rotation angle can be determined first; it can be adjusted to minimize the aforementioned loss value. Then, based on the resulting rotation angle θ and the column indices p and q of the first and second target column data, the second rotation matrix is constructed using a relevant rotation matrix construction method. Optionally, the Givens rotation construction method can be used to construct a Givens matrix G_pq(θ) as the second rotation matrix.

[0145] Step 1124: Update the intra-block orthogonal transformation matrix based on the second rotation matrix.

[0146] Specifically, R_intra G_pq(θ) can be calculated, and the result is used as the updated intra-block orthogonal transformation matrix R_intra.

[0147] Step 1125: Based on the updated intra-block orthogonal transformation matrix and normalized coefficient matrix, continue with the second update step.

[0148] That is, if the iterative update condition is met, the process is repeated from step 109 until the iterative update condition is no longer met, thereby iteratively updating the intra-block orthogonal transformation matrix and the normalization coefficient matrix.

[0149] Step 1126: If the loss value does not meet the iterative update condition, return the current intra-block orthogonal transformation matrix and normalized coefficient matrix.

[0150] That is, when the loss value is not greater than the preset loss threshold, or when the current iteration count reaches the preset number threshold, the intra-block orthogonal transformation matrix R_intra and the normalization coefficient matrix S are returned for use when performing feature data quantization.

[0151] This embodiment provides an algorithm for updating the intra-block orthogonal transformation matrix and the normalized coefficient matrix. Based on this algorithm, the intra-block orthogonal transformation matrix and the normalized coefficient matrix can be accurately obtained, so that the transformed feature data can be uniformly mapped to the codeword interval contained in the codebook within each data block, thereby improving the accuracy and efficiency of determining the intra-block orthogonal transformation matrix and the normalized coefficient matrix.

[0152] In some alternative implementations, step 1122 above can be performed as follows:

[0153] First, for any column of data in the second sample feature data, determine the empirical occupancy probability of each codeword corresponding to that column of data; based on the empirical occupancy probability and the average empirical occupancy probability, determine the codebook occupancy imbalance score corresponding to that column of data.

[0154] The empirical occupancy probability of each codeword corresponding to this column of data can be computed as follows: for a given codeword, it is the ratio of the number of elements in this column that are mapped to that codeword to the total number of elements in this column.

[0155] For a given column of data, the codebook imbalance score can be calculated using the following formula:

[0156] (6)

[0157] Here, k denotes a column of data, J is the total number of codewords in the codebook, and the remaining term is the empirical occupancy probability of the j-th codeword in the k-th column of data. A higher codebook occupancy imbalance score indicates that the data in that column is highly concentrated in certain codeword intervals; such a column can therefore be prioritized for optimization.

[0158] Then, based on the obtained codebook occupancy imbalance scores, M columns of candidate data are determined from the second sample feature data.

[0159] Here, M is a preset positive integer; that is, the M columns of data with the highest codebook occupancy imbalance scores are selected from the second sample feature data as candidate data.

[0160] Then, for any two candidate data columns in the M candidate data columns, the complementarity score between the two candidate data columns is determined based on the empirical occupancy probability and the average empirical occupancy probability of the codewords corresponding to the two candidate data columns respectively; the selection score of the two candidate data columns is determined based on the complementarity score and the codebook occupancy imbalance score corresponding to the two candidate data columns respectively.

[0161] Complementarity between two candidate data columns means that when one column over-occupies a specific codeword, the other column tends to under-occupy that codeword. The higher the complementarity score, the more complementary the two columns are, and the more these two candidate data columns are favored for optimization.

[0162] For a given two columns of data, the complementarity score between them can be calculated using the following formula:

[0163] (7)

[0164] Where k and l are the column labels of the two candidate data columns.

[0165] The selection scores for these two columns of candidate data can be calculated using the following formula:

[0166] (8)

[0167] Where λ is the preset weight.

[0168] Finally, based on the obtained selection scores, N pairs of candidate data are selected from the M columns of candidate data, and each selected pair is determined as first target column data and second target column data that meet the optimization conditions.

[0169] Here, N is a preset positive integer. From the M columns of candidate data, the N pairs with the highest selection scores can be selected, and the two columns in each pair are used as the first target column data and the second target column data for the optimization operation.

[0170] This embodiment calculates the codebook occupancy imbalance score and the complementarity score, and from them the selection score. This allows column data with a high degree of codebook occupancy imbalance to be accurately selected from the second sample feature data, enabling more targeted optimization and improving the accuracy of determining the intra-block orthogonal transformation matrix and the normalization coefficient matrix.
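The column-pair selection can be sketched as follows. The three scoring functions are assumed stand-ins, not the disclosed equations (6) to (8): imbalance is taken as the squared deviation from 1/J, complementarity as the negated inner product of the two columns' deviations (positive when one column overuses what the other underuses), and the selection score as the sum of the two imbalance scores plus λ times the complementarity. All names are hypothetical.

```python
from itertools import combinations

def imbalance_score(p):
    """Assumed imbalance score: squared deviation of occupancy from the uniform 1/J."""
    uniform = 1.0 / len(p)
    return sum((pj - uniform) ** 2 for pj in p)

def complementarity_score(p_k, p_l):
    """Assumed complementarity: positive when the two columns deviate in opposite ways."""
    uniform = 1.0 / len(p_k)
    return -sum((a - uniform) * (b - uniform) for a, b in zip(p_k, p_l))

def select_pairs(column_probs, m, n, lam=1.0):
    """Keep the m most imbalanced columns, then return the n best-scoring pairs."""
    scores = {k: imbalance_score(p) for k, p in enumerate(column_probs)}
    candidates = sorted(scores, key=scores.get, reverse=True)[:m]
    ranked = []
    for k, l in combinations(sorted(candidates), 2):
        comp = complementarity_score(column_probs[k], column_probs[l])
        ranked.append((scores[k] + scores[l] + lam * comp, (k, l)))
    ranked.sort(reverse=True)
    return [pair for _, pair in ranked[:n]]

# Per-column occupancy probabilities over a toy codebook with J = 2 codewords.
column_probs = [
    [0.9, 0.1],  # column 0 overuses codeword 0
    [0.1, 0.9],  # column 1 overuses codeword 1
    [0.5, 0.5],  # column 2 is already balanced
    [0.8, 0.2],  # column 3 overuses codeword 0
]
pairs = select_pairs(column_probs, m=3, n=1)
```

In this toy input, columns 0 and 1 are both strongly imbalanced and deviate in opposite directions, so they form the top-ranked pair.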

[0171] In some alternative implementations, step 1123 can be performed as follows:

[0172] First, select the target rotation angle that minimizes the loss value from the preset set of rotation angles.

[0173] Since the above loss function is piecewise constant in the rotation angle, the globally optimal rotation angle can be found by enumerating a finite set of "critical angles" (the angles at which a data point crosses a quantization boundary). This avoids the local optima problem of gradient descent.

[0174] Specifically, the range from 0 to 2π is divided into M intervals by the critical angles, i.e., 0 ≤ τ_1 < τ_2 < ... < τ_M < 2π. The critical angles are expressed as:

[0175] (9)

[0176] The globally optimal rotation angle can be found using the following formula:

[0177] (10)

[0178] It should be understood that, referring to the above embodiments, if N pairs of candidate data are obtained, each pair consists of first target column data and second target column data. This embodiment is therefore executed for each pair of candidate data, and each time the first target column data and the second target column data are obtained, the operation of updating the intra-block orthogonal transformation matrix and the normalization coefficient matrix is performed once.

[0179] Then, based on the target rotation angle, the first target column data, and the second target column data, a second rotation matrix is constructed.

[0180] The Givens rotation construction method can be used to construct a Givens matrix as the second rotation matrix.

[0181] This embodiment searches for the target rotation angle through finite enumeration, which avoids the local optimum problem of gradient descent and updates the intra-block orthogonal transformation matrix and normalization coefficient matrix more efficiently and accurately.
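The enumeration of candidate angles can be sketched as below. This is a simplified stand-in: a uniform grid of angles replaces the critical angles, the data is assumed to be already normalized, the codebook is the signed E2M1 value set, and the loss is an assumed squared-deviation form evaluated only on the two rotated columns. All names are hypothetical.

```python
import math

# Signed values of the E2M1 (FP4) element format, used here as the codebook.
CODEBOOK = [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
            0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def nearest_index(value):
    return min(range(len(CODEBOOK)), key=lambda j: abs(value - CODEBOOK[j]))

def occupancy_loss(values):
    """Assumed loss: squared deviation of codeword occupancy from the uniform 1/J."""
    counts = [0] * len(CODEBOOK)
    for v in values:
        counts[nearest_index(v)] += 1
    uniform = 1.0 / len(CODEBOOK)
    return sum((c / len(values) - uniform) ** 2 for c in counts)

def rotate_columns(x, p, q, theta):
    """Right-multiply x by the Givens matrix G_pq(theta); only columns p, q change."""
    c, s = math.cos(theta), math.sin(theta)
    out = [row[:] for row in x]
    for row_in, row_out in zip(x, out):
        row_out[p] = c * row_in[p] + s * row_in[q]
        row_out[q] = -s * row_in[p] + c * row_in[q]
    return out

def pair_loss(x, p, q, theta):
    rotated = rotate_columns(x, p, q, theta)
    return occupancy_loss([v for row in rotated for v in (row[p], row[q])])

def best_angle(x, p, q, angles):
    """Return the candidate angle with the smallest loss (finite enumeration)."""
    return min(angles, key=lambda theta: pair_loss(x, p, q, theta))

# Hypothetical normalized data: column 0 carries all the signal, column 1 is empty.
X = [[4.0, 0.0], [2.0, 0.0], [-4.0, 0.0], [-2.0, 0.0]]
angles = [k * math.pi / 8 for k in range(16)]
theta_star = best_angle(X, 0, 1, angles)
```

Because the search only compares loss values at a finite list of angles, it needs no gradients; replacing the uniform grid with the exact critical angles would make the enumeration exhaustive for a piecewise constant loss.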

[0182] In some optional implementations, as shown in Figure 7, after step 104, the method further includes:

[0183] Step 113: Inversely rotate the inter-block orthogonal transformation matrix and the intra-block orthogonal transformation matrix respectively to obtain the inter-block orthogonal transformation inverse rotation matrix and the intra-block orthogonal transformation inverse rotation matrix.

[0184] Specifically, let the inter-block orthogonal transformation matrix be R_inter and the intra-block orthogonal transformation matrix be R_intra. Then the inter-block orthogonal transformation inverse rotation matrix is R_inter^T and the intra-block orthogonal transformation inverse rotation matrix is R_intra^T, that is, the transposes of R_inter and R_intra, respectively.

[0185] Step 114: Fuse the inter-block orthogonal transformation inverse rotation matrix and the intra-block orthogonal transformation inverse rotation matrix with the weight data of the linear layer in the target model to obtain the fused weight data.

[0186] The fusion process is shown in the following formula:

[0187] (11)

[0188] Here, ⊗ denotes the Kronecker product.

[0189] Step 115: Using the fused weight data, perform a linear transformation on the quantized feature data to obtain the feature data output by the linear layer.

[0190] The feature data output by the linear layer can be input into the next layer of the target model, where the method can be repeated, or a specific task (such as classification or prediction) can be performed on it.

[0191] This embodiment integrates the inter-block orthogonal transformation matrix and the intra-block orthogonal transformation matrix with the weight data of the model's linear layers. This enables matrix multiplication to be performed directly during model inference without the need for inverse rotation, thereby eliminating the computational overhead caused by inverse rotation and improving inference efficiency.
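The reason the rotations can be folded into the weights is that a two-sided rotation of the B×K matrix is a single orthogonal map on the flattened vector, expressible as a Kronecker product. The sketch below checks this numerically under assumed conventions that may differ from equation (11): the two-level rotation is written Z = P X Q with P standing in for the inter-block transform and Q for R_intra, vectors are flattened row by row, and quantization is omitted so that the equality is exact. All names and values are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

P = rotation(0.3)                    # inter-block transform, B = 2 (assumed)
Q = rotation(1.1)                    # intra-block transform R_intra, K = 2 (assumed)
X = [[1.0, 2.0], [3.0, 4.0]]         # original feature data, B x K
W = [[0.5, -1.0, 2.0, 0.25]]         # linear-layer weight acting on the flat vector

Z = matmul(matmul(P, X), Q)                       # two-level rotation of the features
x_flat = [[v] for row in X for v in row]          # row-major column vectors
z_flat = [[v] for row in Z for v in row]

# Fold the inverse rotations into the weight once, offline.
W_fused = matmul(W, kron(transpose(P), Q))

y_reference = matmul(W, x_flat)[0][0]             # layer output on original features
y_fused = matmul(W_fused, z_flat)[0][0]           # layer output on rotated features
```

With the fused weight, the rotated (and, in practice, quantized) features feed the linear layer directly, so no inverse rotation has to be executed at inference time.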

[0192] The quantization method for feature data in the model provided in this disclosure is applicable to various models deployed on electronic devices. While maintaining high quantization accuracy, it is particularly suitable for industrial-grade large model inference scenarios that require high throughput and low latency, such as intelligent assistants and code generation services.

[0193] Figure 8 shows a flowchart of the method provided in this disclosure embodiment, which implements Two-level Orthogonal Rotation for Quantization (TORQ). The framework comprises an offline part and an online part. In the offline part, the Kronecker product of the inter-block orthogonal transformation matrix R_inter and the intra-block orthogonal transformation matrix R_intra is calculated to obtain the fused weight data. In the online part, the original feature data is first reshaped to obtain feature data containing multiple data blocks. Then, the inter-block orthogonal transformation matrix R_inter is used to apply an inter-block rotation (macro-equilibrium rotation) to the feature data so that the variances of the data blocks are approximately equal. Next, the intra-block orthogonal transformation matrix R_intra is used to apply an intra-block rotation (micro-alignment rotation) to each data block, and the result is quantized so that the data is mapped uniformly into the codebook space. Finally, the fused weight data is used to perform a linear transformation on the quantized feature data to obtain the feature data output by the linear layer.

[0194] The TORQ method enables covariance estimation and rotation matrix construction using a small amount of calibration data (e.g., 128 sequences). TORQ's construction is analytical, eliminating the need for backpropagation and significantly improving calibration speed. The model generated by this method is fully compatible with hardware supporting the MXFP4 format, requiring no modification to the hardware kernel.

[0195] Exemplary device

[0196] Figure 9 is a schematic diagram of the structure of a device for quantizing feature data in a model provided by an exemplary embodiment of this disclosure. This embodiment can be applied to an electronic device. As shown in Figure 9, the device for quantizing feature data in a model includes:

[0197] The first acquisition module 901 is used to acquire feature data to be quantized from the target model deployed on the target device, wherein the feature data includes at least two data blocks;

[0198] The first transformation module 902 is used to perform inter-block transformation on the feature data using a pre-constructed inter-block orthogonal transformation matrix to obtain the data after inter-block transformation.

[0199] The second transformation module 903 is used to perform intra-block transformation on each data block included in the inter-block transformed data using a pre-constructed intra-block orthogonal transformation matrix to obtain intra-block transformed data.

[0200] The quantization module 904 is used to quantize the intra-block transformed data to obtain quantized feature data.

[0201] In some optional implementations, the quantization module includes: an acquisition unit for acquiring a pre-constructed normalization coefficient matrix; a quantization unit for, for each data block in the intra-block transformed data, extracting the normalization coefficient corresponding to the data block from the normalization coefficient matrix, normalizing the data block using the normalization coefficient to obtain a normalized data block, quantizing the normalized data block based on a preset data quantization strategy to obtain a quantized normalized data block, and performing an inverse normalization operation on the quantized normalized data block based on the normalization coefficient to obtain a quantized data block; and a combination unit for combining the obtained quantized data blocks into quantized feature data.

[0202] In some optional implementations, the device further includes: a second acquisition module for acquiring first sample feature data and an initialized inter-block orthogonal transformation matrix; a determination module for determining the covariance matrix of the first sample feature data; and a first update module for updating the inter-block orthogonal transformation matrix based on the covariance matrix, with the objective of equal variances of each data block in the first sample feature data, and returning the updated inter-block orthogonal transformation matrix if the current return condition is met.

[0203] In some optional implementations, the first update module is further configured to perform the following first update steps: determine the variance mean of the variance values in the covariance matrix, and the absolute value of the difference between each variance value in the covariance matrix and the variance mean; if the absolute value meets a preset threshold condition, select two target variance values from the covariance matrix, wherein one of the two target variance values is greater than the variance mean and the other is less than the variance mean; determine the matrix rotation angle based on the two target variance values, and construct a first rotation matrix based on the matrix rotation angle; update the covariance matrix and the inter-block orthogonal transformation matrix based on the first rotation matrix; continue to perform the first update steps based on the updated covariance matrix and the inter-block orthogonal transformation matrix; and if the absolute value does not meet the threshold condition, return the current inter-block orthogonal transformation matrix.

[0204] In some optional implementations, the device further includes: a third acquisition module, used to acquire an initialized intra-block orthogonal transformation matrix, an initialized normalized coefficient matrix, and second sample feature data, wherein the second sample feature data is feature data transformed by the inter-block orthogonal transformation matrix on the original sample feature data; and a second update module, used to perform the following second update steps: mapping the second sample feature data to a normalized space and quantizing it based on the intra-block orthogonal transformation matrix and the normalized coefficient matrix to obtain sample quantized data; statistically analyzing each codeword in a preset codebook based on the sample quantized data to obtain the empirical occupancy probability of each codeword; calculating a loss value based on a preset codebook occupancy balancing loss function using the empirical occupancy probability and the average empirical occupancy probability of the codewords in the codebook; and updating the intra-block orthogonal transformation matrix and the normalized coefficient matrix based on the loss value.

[0205] In some optional implementations, the second update module is further configured to: if the loss value meets the iterative update conditions, update the normalized coefficient matrix based on the maximum codeword in the codebook and the maximum value of each data block in the second sample feature data; select the first target column data and the second target column data that meet the optimization conditions from the second sample feature data; construct a second rotation matrix based on the first target column data and the second target column data; update the intra-block orthogonal transformation matrix based on the second rotation matrix; continue to execute the second update step based on the updated intra-block orthogonal transformation matrix and the normalized coefficient matrix; if the loss value does not meet the iterative update conditions, return the current intra-block orthogonal transformation matrix and the normalized coefficient matrix.
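As a non-limiting illustration, the update of the normalized coefficient matrix may be sketched as follows. It is assumed here that each coefficient is the ratio of the largest absolute value in a data block to the maximum codeword, so that the normalized block just fills the codebook range; the function name `update_scales`, the default `max_codeword=6.0` (the largest FP4 E2M1 magnitude), and the guard `eps` are hypothetical.

```python
import numpy as np

def update_scales(x_blocks, max_codeword=6.0, eps=1e-12):
    """Assumed update rule: scale = max |value| in the block / maximum codeword.

    x_blocks: (num_samples, num_blocks, block_size) second sample feature data,
              already transformed by the intra-block orthogonal matrix.
    Returns one normalization coefficient per block, shape (num_blocks,).
    """
    block_max = np.abs(x_blocks).max(axis=(0, 2))
    # eps avoids a zero coefficient for an all-zero block.
    return np.maximum(block_max, eps) / max_codeword
```

Dividing a block by its coefficient then maps its largest-magnitude element onto the maximum codeword, and multiplying the quantized values by the same coefficient performs the inverse normalization.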

[0206] In some optional implementations, the second update module is further configured to: for any column of data in the second sample feature data, determine the empirical occupancy probability of each codeword corresponding to that column of data; based on the empirical occupancy probability and the average empirical occupancy probability, determine the codebook occupancy imbalance score corresponding to that column of data; based on the obtained codebook occupancy imbalance scores, determine M columns of candidate data from the second sample feature data; for any two columns of candidate data in the M columns of candidate data, determine the complementarity score between the two columns of candidate data based on the empirical occupancy probability and the average empirical occupancy probability of the codewords corresponding to the two columns of candidate data respectively; based on the complementarity score and the codebook occupancy imbalance score corresponding to the two columns of candidate data respectively, determine the selection score of the two columns of candidate data; based on the obtained selection scores, select N pairs of candidate data from the M columns of candidate data, and determine each selected pair of candidate data as the first target column data and the second target column data that meet the optimization conditions.
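As a non-limiting illustration, the selection of target column pairs may be sketched as follows. The concrete score definitions are assumptions, since this section does not fix them: the imbalance score of a column is taken as the squared deviation of its occupancy probabilities from the average occupancy probability, the complementarity score of two columns as the negative inner product of their deviations (large when one column overuses the codewords the other underuses), and the selection score as the sum of the three. The names `column_occupancy` and `select_pairs`, and the restriction to disjoint pairs, are hypothetical.

```python
import numpy as np
from itertools import combinations

def column_occupancy(indices, num_codewords):
    """Per-column empirical occupancy; indices has shape (num_rows, num_cols)."""
    num_rows, num_cols = indices.shape
    p = np.zeros((num_cols, num_codewords))
    for col in range(num_cols):
        p[col] = np.bincount(indices[:, col], minlength=num_codewords) / num_rows
    return p

def select_pairs(p, m, n):
    """Pick up to n disjoint column pairs among the m most imbalanced columns."""
    dev = p - 1.0 / p.shape[1]                  # deviation from average occupancy
    imbalance = (dev ** 2).sum(axis=1)          # assumed imbalance score
    candidates = np.argsort(-imbalance, kind="stable")[:m]
    scored = []
    for a, b in combinations(candidates, 2):
        complementarity = -float(dev[a] @ dev[b])   # assumed complementarity score
        score = complementarity + imbalance[a] + imbalance[b]
        scored.append((float(score), int(a), int(b)))
    scored.sort(reverse=True)
    pairs, used = [], set()
    for _, a, b in scored:
        if a not in used and b not in used:     # keep pairs disjoint (assumption)
            pairs.append((a, b))
            used.update((a, b))
        if len(pairs) == n:
            break
    return pairs
```

Restricting the search to M candidate columns keeps the pairwise scoring at M(M-1)/2 evaluations rather than scanning every pair of columns in the block.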

[0207] In some optional implementations, the second update module is further used to: for any pair of candidate data in N pairs of candidate data, select the target rotation angle that minimizes the loss value from a preset set of rotation angles; and construct a second rotation matrix based on the target rotation angle, the first target column data, and the second target column data.
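As a non-limiting illustration, the choice of the target rotation angle may be sketched as a search over the preset set of rotation angles. The sketch assumes the data passed in is already in the normalized space, reuses an assumed signed FP4 (E2M1) codebook and an assumed squared-deviation balancing loss, and builds the second rotation matrix as a Givens rotation acting on the two target columns; the names `balance_loss`, `givens`, and `best_rotation` are hypothetical.

```python
import numpy as np

# Assumed codebook: signed FP4 (E2M1) values.
CODEBOOK = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                     0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def balance_loss(z):
    """Assumed loss: squared deviation of codeword occupancy from uniform."""
    idx = np.abs(z[..., None] - CODEBOOK).argmin(axis=-1)
    p = np.bincount(idx.ravel(), minlength=CODEBOOK.size) / idx.size
    return float(np.sum((p - 1.0 / CODEBOOK.size) ** 2))

def givens(n, a, b, theta):
    """n x n rotation by theta in the plane of columns a and b."""
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[a, a] = c
    g[b, b] = c
    g[a, b] = s
    g[b, a] = -s
    return g

def best_rotation(z, a, b, angles):
    """Return (target angle, second rotation matrix, loss) minimizing the loss."""
    n = z.shape[1]
    losses = [balance_loss(z @ givens(n, a, b, t)) for t in angles]
    k = int(np.argmin(losses))
    return angles[k], givens(n, a, b, angles[k]), losses[k]
```

Including the angle 0 in the preset set ensures, under these assumptions, that the selected rotation never increases the loss value relative to leaving the pair unrotated.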

[0208] In some optional implementations, the device further includes: an inverse rotation module for inversely rotating the inter-block orthogonal transformation matrix and the intra-block orthogonal transformation matrix respectively to obtain an inter-block orthogonal transformation inverse rotation matrix and an intra-block orthogonal transformation inverse rotation matrix; a fusion module for fusing the inter-block orthogonal transformation inverse rotation matrix and the intra-block orthogonal transformation inverse rotation matrix with the weight data included in the linear layer of the target model to obtain fused weight data; and a calculation module for using the fused weight data to perform a linear transformation on the quantized feature data to obtain the feature data output by the linear layer.
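As a non-limiting illustration, the fusion of the inverse rotation matrices into the linear-layer weights may be sketched as follows. The sketch assumes that both transformations are expressed as full-width orthogonal matrices acting on the feature dimension (the intra-block matrix being block-diagonal), that the linear layer computes `x @ w`, and that the inverse of an orthogonal matrix is its transpose; the name `fuse_weights` and the shapes are hypothetical, and quantization error is ignored in order to show the algebraic identity.

```python
import numpy as np

def fuse_weights(w, q_inter, r_intra_full):
    """Fold the inverse rotations into the weights: w_fused = R^T Q^T w.

    w: (d, out) linear-layer weight data
    q_inter: (d, d) inter-block orthogonal transformation matrix
    r_intra_full: (d, d) block-diagonal intra-block orthogonal matrix
    """
    # For orthogonal matrices the inverse rotation is the transpose.
    return r_intra_full.T @ q_inter.T @ w

# Usage: the transformed features times the fused weights reproduce x @ w.
rng = np.random.default_rng(0)
d, block_size, out_dim = 8, 4, 5
q_inter, _ = np.linalg.qr(rng.normal(size=(d, d)))
r_block, _ = np.linalg.qr(rng.normal(size=(block_size, block_size)))
r_intra_full = np.kron(np.eye(d // block_size), r_block)
x = rng.normal(size=(3, d))
w = rng.normal(size=(d, out_dim))
x_transformed = x @ q_inter @ r_intra_full
y = x_transformed @ fuse_weights(w, q_inter, r_intra_full)
```

Because the fusion can be performed once offline, the inverse rotations add no extra matrix multiplication at inference time under these assumptions.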

[0209] The exemplary embodiments of this device correspond to the exemplary method section described above in terms of implementation. The corresponding content between the two can be referenced, combined, and cited, and will not be repeated here. The beneficial technical effects corresponding to the exemplary embodiments of this device can be found in the corresponding beneficial technical effects of the exemplary method section described above, and will not be repeated here.

[0210] Exemplary electronic devices

[0211] Figure 10 is a structural diagram of an electronic device 1000 provided by the present disclosure. The electronic device 1000 includes at least one processor 1001 and a memory 1002.

[0212] The processor 1001 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 1000 to perform desired functions.

[0213] The memory 1002 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1001 may execute the one or more computer program instructions to implement the method for quantizing feature data in a model according to the various embodiments of this disclosure described above, and / or other desired functions.

[0214] In one example, the electronic device 1000 may also include an input device 1003 and an output device 1004, which are interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0215] The input device 1003 may include, for example, a keyboard, a mouse, etc.

[0216] The output device 1004 can output various information to the outside and may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0217] Of course, for the sake of simplicity, Figure 10 shows only some of the components of the electronic device 1000 relevant to this disclosure, omitting components such as buses, input / output interfaces, etc. In addition, the electronic device 1000 may include any other suitable components depending on the specific application.

[0218] Exemplary computer program products and computer-readable storage media

[0219] In addition to the methods and apparatus described above, embodiments of this disclosure may also be computer program products comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the methods for quantizing feature data in models according to various embodiments of this disclosure as described in the foregoing portions of this specification.

[0220] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this disclosure. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on a user's computing device, partially on a user's computing device, as a standalone software package, partially on a user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0221] Furthermore, embodiments of this disclosure may also be computer-readable storage media storing computer program instructions that, when executed by a processor, cause the processor to perform steps in the quantization methods for feature data in models according to various embodiments of this disclosure as described in the "Exemplary Methods" section above.

[0222] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof.

[0223] The basic principles of this disclosure have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this disclosure are merely examples and not limitations, and should not be considered as essential features of each embodiment of this disclosure. Furthermore, the specific details disclosed above are provided only for purposes of illustration and ease of understanding, and are not limitations. These details do not require that this disclosure be implemented using the aforementioned specific details.

[0224] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. For similar or identical parts, the embodiments may be referred to one another. Since the system embodiments largely correspond to the method embodiments, their description is relatively brief; for relevant parts, refer to the descriptions in the method embodiments.

[0225] The block diagrams of devices, apparatuses, equipment, and systems disclosed herein are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” “having,” etc., are open-ended terms meaning “including but not limited to,” and are used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and / or,” and are used interchangeably with it unless the context clearly indicates otherwise. The term “such as” as used herein means “such as but not limited to,” and is used interchangeably with it.

[0226] The methods and apparatus of this disclosure may be implemented in many ways. For example, they may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of steps for the methods is for illustrative purposes only, and the steps of the methods of this disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, this disclosure may also be implemented as a program recorded on a recording medium, the program including machine-readable instructions for implementing the methods according to this disclosure. Thus, this disclosure also covers recording media storing programs for performing the methods according to this disclosure.

[0227] It should also be noted that in the apparatus, devices, and methods of this disclosure, the components or steps can be disassembled and / or recombined. These disassemblies and / or recombinations should be considered as equivalent solutions to this disclosure.

[0228] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of this disclosure. Therefore, this disclosure is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0229] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this disclosure to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.

Claims

1. A method for quantizing feature data in a model, comprising: obtaining feature data to be quantized from a target model deployed on a target device, wherein the feature data includes at least two data blocks; performing an inter-block transformation on the feature data using a pre-constructed inter-block orthogonal transformation matrix to obtain inter-block transformed data; performing an intra-block transformation on each data block included in the inter-block transformed data using a pre-constructed intra-block orthogonal transformation matrix to obtain intra-block transformed data; and quantizing the intra-block transformed data to obtain quantized feature data.

2. The method according to claim 1, wherein quantizing the intra-block transformed data to obtain quantized feature data includes: obtaining a pre-constructed normalization coefficient matrix; for each data block in the intra-block transformed data, extracting the normalization coefficient corresponding to the data block from the normalization coefficient matrix, normalizing the data block using the normalization coefficient to obtain a normalized data block, quantizing the normalized data block based on a preset data quantization strategy to obtain a quantized normalized data block, and inversely normalizing the quantized normalized data block based on the normalization coefficient to obtain a quantized data block; and combining the obtained quantized data blocks into the quantized feature data.

3. The method according to claim 1, wherein before obtaining the feature data to be quantized from the target model deployed on the target device, the method further includes: obtaining first sample feature data and an initialized inter-block orthogonal transformation matrix; determining a covariance matrix of the first sample feature data; and based on the covariance matrix, updating the inter-block orthogonal transformation matrix with the objective of equalizing the variances of the data blocks in the first sample feature data, and returning the updated inter-block orthogonal transformation matrix if a current return condition is met.

4. The method according to claim 3, wherein updating the inter-block orthogonal transformation matrix based on the covariance matrix with the objective of equalizing the variances of the data blocks in the first sample feature data, and returning the updated inter-block orthogonal transformation matrix if the current return condition is met, includes performing the following first update step: determining the average variance of the variance values in the covariance matrix, and the absolute value of the difference between each variance value in the covariance matrix and the average variance; if the absolute value meets a preset threshold condition, selecting two target variance values from the covariance matrix, wherein one of the two target variance values is greater than the average variance and the other is less than the average variance; determining a matrix rotation angle based on the two target variance values, and constructing a first rotation matrix based on the matrix rotation angle; updating the covariance matrix and the inter-block orthogonal transformation matrix based on the first rotation matrix; continuing to perform the first update step based on the updated covariance matrix and inter-block orthogonal transformation matrix; and if the absolute value does not meet the threshold condition, returning the current inter-block orthogonal transformation matrix.

5. The method according to claim 1, wherein before obtaining the feature data to be quantized from the target model deployed on the target device, the method further includes: obtaining an initialized intra-block orthogonal transformation matrix, an initialized normalization coefficient matrix, and second sample feature data, wherein the second sample feature data is feature data obtained by transforming original sample feature data using the inter-block orthogonal transformation matrix; and performing the following second update step: based on the intra-block orthogonal transformation matrix and the normalization coefficient matrix, mapping the second sample feature data to a normalized space and quantizing it to obtain sample quantized data; based on the sample quantized data, counting the occurrences of each codeword in a preset codebook to obtain the empirical occupancy probability of each codeword; based on a preset codebook occupancy balancing loss function, calculating a loss value using the empirical occupancy probability and the average empirical occupancy probability of the codewords in the codebook; and based on the loss value, updating the intra-block orthogonal transformation matrix and the normalization coefficient matrix.

6. The method according to claim 5, wherein updating the intra-block orthogonal transformation matrix and the normalization coefficient matrix based on the loss value includes: if the loss value meets an iterative update condition, updating the normalization coefficient matrix based on the maximum codeword in the codebook and the maximum value of each data block in the second sample feature data; selecting, from the second sample feature data, first target column data and second target column data that meet an optimization condition; constructing a second rotation matrix based on the first target column data and the second target column data; updating the intra-block orthogonal transformation matrix based on the second rotation matrix; continuing to perform the second update step based on the updated intra-block orthogonal transformation matrix and normalization coefficient matrix; and if the loss value does not meet the iterative update condition, returning the current intra-block orthogonal transformation matrix and normalization coefficient matrix.

7. The method according to claim 6, wherein selecting, from the second sample feature data, the first target column data and the second target column data that meet the optimization condition includes: for any column of data in the second sample feature data, determining the empirical occupancy probability of each codeword corresponding to that column of data, and determining the codebook occupancy imbalance score corresponding to that column of data based on the empirical occupancy probability and the average empirical occupancy probability; based on the obtained codebook occupancy imbalance scores, determining M columns of candidate data from the second sample feature data; for any two columns of candidate data in the M columns of candidate data, determining the complementarity score between the two columns of candidate data based on the empirical occupancy probabilities of the codewords respectively corresponding to the two columns of candidate data and the average empirical occupancy probability, and determining the selection score of the two columns of candidate data based on the complementarity score and the codebook occupancy imbalance scores respectively corresponding to the two columns of candidate data; and based on the obtained selection scores, selecting N pairs of candidate data from the M columns of candidate data, and determining each selected pair of candidate data as the first target column data and the second target column data that meet the optimization condition.

8. The method according to claim 6, wherein constructing the second rotation matrix based on the first target column data and the second target column data includes: selecting, from a preset set of rotation angles, the target rotation angle that minimizes the loss value; and constructing the second rotation matrix based on the target rotation angle, the first target column data, and the second target column data.

9. The method according to any one of claims 1-8, wherein after quantizing the intra-block transformed data to obtain quantized feature data, the method further includes: inversely rotating the inter-block orthogonal transformation matrix and the intra-block orthogonal transformation matrix, respectively, to obtain an inter-block orthogonal transformation inverse rotation matrix and an intra-block orthogonal transformation inverse rotation matrix; fusing the inter-block orthogonal transformation inverse rotation matrix and the intra-block orthogonal transformation inverse rotation matrix with the weight data included in a linear layer of the target model to obtain fused weight data; and using the fused weight data, performing a linear transformation on the quantized feature data to obtain the feature data output by the linear layer.

10. A device for quantizing feature data in a model, comprising: a first acquisition module, used to acquire feature data to be quantized from a target model deployed on a target device, wherein the feature data includes at least two data blocks; a first transformation module, used to perform an inter-block transformation on the feature data using a pre-constructed inter-block orthogonal transformation matrix to obtain inter-block transformed data; a second transformation module, used to perform an intra-block transformation on each data block included in the inter-block transformed data using a pre-constructed intra-block orthogonal transformation matrix to obtain intra-block transformed data; and a quantization module, used to quantize the intra-block transformed data to obtain quantized feature data.

11. An electronic device, comprising: a memory, used to store a computer program product; and a processor, used to execute the computer program product stored in the memory, wherein the computer program product, when executed, implements the method described in any one of claims 1-9.

12. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method described in any one of claims 1-9.