Low-bit quantization method and system for large language model
The method applies a two-stage rotation transformation and low-rank decomposition to the activation data matrix of a large language model. This addresses the accuracy loss that massive activations and uneven weight distributions cause in ultra-low-bit quantization of large language models. The result is efficient model compression with improved accuracy, suitable for both edge-device and cloud deployment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- BEIJING SILICONFLOW TECHNOLOGY CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-12
AI Technical Summary
Large language models suffer severe accuracy loss under ultra-low-bit quantization because of massive activations and uneven weight distributions. Existing methods do not adequately handle these skewed data distributions inside the model, so model accuracy can drop catastrophically after quantization.
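To see why a single massive activation is so damaging, consider the following minimal sketch. It uses symmetric per-tensor round-to-nearest 4-bit quantization, which is a generic scheme assumed here for illustration because the summary does not name a quantizer. The values and the function names `quantize_per_tensor` and `mean_abs_error` are hypothetical.

```python
# Assumed quantizer: symmetric, per-tensor, round-to-nearest (illustration only).

def quantize_per_tensor(values, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax       # step size set by the largest magnitude
    deq = [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
    return deq, scale

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

ordinary = [0.12, -0.35, 0.50, -0.08, 0.27, -0.44, 0.03, 0.19]
massive = ordinary + [60.0]                          # one hypothetical massive activation

deq_plain, scale_plain = quantize_per_tensor(ordinary, bits=4)
deq_massive, scale_massive = quantize_per_tensor(massive, bits=4)

# Error is measured on the eight ordinary values only.
err_plain = mean_abs_error(ordinary, deq_plain)
err_massive = mean_abs_error(ordinary, deq_massive[:len(ordinary)])
print(f"step {scale_plain:.4f} -> error on ordinary values {err_plain:.4f}")
print(f"step {scale_massive:.4f} -> error on ordinary values {err_massive:.4f}")
```

The outlier stretches the quantization step from about 0.071 to about 8.57. As a result, every ordinary value rounds to zero, and the error on those values becomes their full magnitude.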
The method applies a two-stage rotation transformation to the activation data matrix of the large language model. The first stage is uniform preprocessing and the second is data-driven fine optimization, and together they reduce the range-to-standard-deviation ratio of the activations. Low-rank decomposition and row-level fine-tuning then optimize the distributions of the weights and of the residual, after which low-bit quantization is performed.
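The summary does not specify the rotations, the rank, the fine-tuning objective, or the quantizer. The following is therefore only a minimal NumPy sketch of the general pipeline, built on several assumptions:

- The "uniform preprocessing" stage is modeled as a normalized Walsh-Hadamard rotation.
- The "data-driven fine optimization" stage keeps the best of a few random orthogonal rotations on calibration activations.
- The range-to-standard-deviation ratio serves as the outlier metric.
- "Row-level fine-tuning" is replaced by per-row quantization scales.

All function names are hypothetical. The sketch relies on the fact that for an orthogonal `R`, `(x R)(R^T w)` equals `x w`, so rotating activations and weights together leaves the layer output unchanged.

```python
import numpy as np

# Assumptions (not the patented procedure):
#   stage 1 "uniform preprocessing"  -> fixed normalized Walsh-Hadamard rotation
#   stage 2 "data-driven refinement" -> best of a few random orthogonal rotations
#   "row-level fine-tuning"          -> stand-in: per-row quantization scales only

def range_to_std(x):
    """Value range divided by standard deviation; smaller means milder outliers."""
    return (x.max() - x.min()) / x.std()

def hadamard(d):
    """Normalized Walsh-Hadamard matrix (Sylvester construction); d must be a power of 2."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(d)

def random_orthogonal(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def fake_quant(x, bits, axis=None):
    """Symmetric round-to-nearest quantize/dequantize; axis=1 uses one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max(axis=axis, keepdims=(axis is not None))
    scale = np.where(amax == 0, 1.0, amax / qmax)
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def rel_err(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(a)

def two_stage_rotation(x, rng, n_candidates=16):
    """Hadamard first, then keep whichever candidate (identity included) gives the
    lowest range/std on the calibration activations x."""
    d = x.shape[1]
    h = hadamard(d)
    candidates = [np.eye(d)] + [random_orthogonal(d, rng) for _ in range(n_candidates)]
    best = min(candidates, key=lambda q: range_to_std(x @ h @ q))
    return h @ best

def lowrank_plus_quantized_residual(w, rank, bits):
    """Keep a rank-`rank` SVD approximation in full precision; quantize the residual per row."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    low = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return low + fake_quant(w - low, bits, axis=1)

rng = np.random.default_rng(0)
n, d, k, r = 256, 64, 32, 4
x = rng.standard_normal((n, d))
x[:, 5] += 50.0                                   # one channel of "massive activations"
w = rng.standard_normal((d, r)) @ rng.standard_normal((r, k)) + 0.05 * rng.standard_normal((d, k))

rot = two_stage_rotation(x, rng)
x_rot, w_rot = x @ rot, rot.T @ w                 # (x R)(R^T w) = x w because R is orthogonal

w_hat_direct = fake_quant(w_rot, 4, axis=1)
w_hat_lowrank = lowrank_plus_quantized_residual(w_rot, r, 4)

print("range/std:", range_to_std(x), "->", range_to_std(x_rot))
print("4-bit activation error:", rel_err(x, fake_quant(x, 4)), "->", rel_err(x_rot, fake_quant(x_rot, 4)))
print("4-bit weight error:", rel_err(w_rot, w_hat_direct), "->", rel_err(w_rot, w_hat_lowrank))
```

Because the identity is among the candidates, the second stage here can only match or improve on the Hadamard-only result. A real data-driven stage would presumably optimize the rotation directly, which this sketch does not attempt. The synthetic weights are deliberately constructed as low rank plus small noise. Under that construction, quantizing only the residual is much more accurate than quantizing the rotated weights directly.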
The method significantly improves the accuracy of large language models in ultra-low-bit scenarios while delivering efficient storage compression and inference. This makes it suitable for resource-constrained edge devices as well as cloud deployments.
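The storage side of this claim comes down to simple bit counting. The sketch below estimates the footprint of one hypothetical 4096 x 4096 linear layer stored at 2 bits per weight. It assumes one FP16 scale per row and rank-32 low-rank factors kept in FP16. The layer shape, bit-width, rank, and side-data format are illustrative assumptions, not figures from the patent, and `layer_bits` is a hypothetical helper.

```python
# Hypothetical storage estimate for ONE linear layer (all sizes are assumptions).

def layer_bits(rows, cols, weight_bits, rank=0, side_bits=16, per_row_scale=True):
    """Bits for quantized weights plus optional per-row scales and low-rank factors."""
    weights = rows * cols * weight_bits
    scales = rows * side_bits if per_row_scale else 0
    low_rank = (rows + cols) * rank * side_bits      # factors: (rows x rank) and (rank x cols)
    return weights + scales + low_rank

rows = cols = 4096
fp16 = layer_bits(rows, cols, 16, per_row_scale=False)
two_bit = layer_bits(rows, cols, 2, rank=32)

print(f"FP16 layer: {fp16 / 8 / 2**20:.2f} MiB")
print(f"2-bit + rank-32 factors + row scales: {two_bit / 8 / 2**20:.2f} MiB")
print(f"compression ratio: {fp16 / two_bit:.2f}x")
```

Under these assumptions, the layer shrinks from 32 MiB to about 4.5 MiB, a ratio of roughly 7.1x. The FP16 low-rank factors and scales keep the ratio below the ideal 8x of a pure 16-bit to 2-bit conversion.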

Figure CN122197988A_ABST