Low-bit quantization method and system for large language model

A two-stage rotation transformation of the activation data, followed by low-rank decomposition of the rotated weights, addresses the accuracy loss that massive activations and uneven weight distribution cause in ultra-low-bit quantization of large language models. The result is efficient model compression with improved accuracy, suitable for both edge-device and cloud deployment.

CN122197988A · Pending · Publication Date: 2026-06-12 · BEIJING SILICONFLOW TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: BEIJING SILICONFLOW TECHNOLOGY CO LTD
Filing Date: 2026-03-12
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing large language models suffer severe accuracy loss during ultra-low bit quantization due to massive activations and uneven weight distribution. Existing methods cannot effectively solve the data distribution problem within the model, leading to a catastrophic decrease in model accuracy after quantization.

Method used

A two-stage rotation transformation is optimized on the activation data matrix of a large language model: uniform preprocessing first, then data-driven fine optimization. This reduces the grid-to-standard-deviation ratio of the activation values. The weights are then processed with low-rank decomposition and row-level fine-tuning, which optimizes the distribution of the weights and of the residual, and finally low-bit quantization is performed.
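The effect of a rotation on an outlier-heavy activation matrix can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from the patent: a random orthogonal matrix stands in for the uniform preprocessing stage, the data-driven fine optimization stage is not shown, and the ratio being reduced is approximated by a hypothetical proxy (largest absolute value divided by standard deviation), because the source does not define it. All function and variable names are illustrative.

```python
import numpy as np

def random_orthogonal(n, seed=0):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix.
    Assumed stand-in for the 'uniform preprocessing' rotation."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs; q stays orthogonal

def range_to_std(x):
    """Hypothetical proxy metric: largest |value| divided by the standard deviation."""
    return np.abs(x).max() / x.std()

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 256))    # activations: tokens x hidden size
X[:, 3] *= 100.0                      # one channel with massive activations
W = rng.standard_normal((128, 256))   # weights: out features x in features

Q = random_orthogonal(256)
X_rot = X @ Q                         # rotate activations
W_rot = W @ Q                         # apply the same rotation to the weights

# The layer output is unchanged: (X Q)(W Q)^T = X Q Q^T W^T = X W^T
outputs_match = np.allclose(X @ W.T, X_rot @ W_rot.T)

ratio_before = range_to_std(X)
ratio_after = range_to_std(X_rot)
print(f"output preserved: {outputs_match}")
print(f"proxy ratio before: {ratio_before:.1f}, after: {ratio_after:.1f}")
```

Because the rotation is orthogonal, it spreads the energy of the outlier channel across all channels without changing the product of activations and weights, which is why the same matrix has to be applied to both.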

Benefits of technology

It significantly improves the accuracy of large language models in ultra-low-bit scenarios, reduces model storage, and improves inference efficiency, making it suitable for resource-constrained edge devices and cloud deployments.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122197988A_ABST
Patent Text Reader

Abstract

The application belongs to the technical field of large language models and relates to a low-bit quantization method and system for a large language model. A two-stage rotation transformation, consisting of uniform preprocessing followed by data-driven fine optimization, is applied to the activation data of the model to reduce the grid-to-standard-deviation ratio of the activation values, so that their distribution better fits the quantization grid. The rotation matrix is then applied to the weights, and low-rank decomposition and residual distribution processing split the weights into a low-rank part retained in high precision and a residual part to be quantized, which is optimized by row-level fine-tuning. Finally, low-bit quantization is performed on the residual to generate a deployable model. This addresses the severe accuracy loss that large language models suffer in ultra-low-bit post-training quantization (such as W4A4) because of massive activations and unevenly distributed weights. Without retraining the model, the method significantly improves ultra-low-bit quantization accuracy, model storage, memory footprint and inference efficiency.
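The split into a high-precision low-rank part and a quantized residual can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: a truncated SVD is assumed for the low-rank decomposition, symmetric per-row round-to-nearest is assumed for the 4-bit quantizer, and the row-level fine-tuning step is omitted. The rank, bit width, and all names are hypothetical.

```python
import numpy as np

def quantize_rowwise(m, bits=4):
    """Symmetric per-row round-to-nearest quantization (assumed quantizer).
    Returns the dequantized matrix so the error can be measured directly."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(m).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard against all-zero rows
    q = np.clip(np.round(m / scale), -qmax - 1, qmax)
    return q * scale

def lowrank_plus_quantized_residual(w, rank=8, bits=4):
    """Keep the top-`rank` SVD component in full precision, quantize the rest."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    low = (u[:, :rank] * s[:rank]) @ vt[:rank]   # high-precision low-rank part
    resid_q = quantize_rowwise(w - low, bits)    # low-bit residual part
    return low, resid_q

rng = np.random.default_rng(0)
# Synthetic weight: a strong rank-8 component plus unit-variance noise
W = 5.0 * rng.standard_normal((128, 8)) @ rng.standard_normal((8, 256))
W += rng.standard_normal((128, 256))

low, resid_q = lowrank_plus_quantized_residual(W, rank=8, bits=4)
err_split = np.linalg.norm(W - (low + resid_q))
err_direct = np.linalg.norm(W - quantize_rowwise(W, bits=4))
print(f"error, low-rank + 4-bit residual: {err_split:.2f}")
print(f"error, direct 4-bit quantization: {err_direct:.2f}")
```

Removing the dominant low-rank component shrinks the dynamic range of what remains, so the same 4-bit grid covers the residual with a much finer step than it would the full weight matrix. The cost is storing the two low-rank factors in high precision.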