Method for compressing weight matrix of neural network model

By utilizing both upper and lower singular values for weight matrix compression, the method addresses data loss issues in Transformer networks, achieving reduced memory usage and computational load while maintaining performance.

WO2026127218A1 · PCT designated stage · Publication Date: 2026-06-18 · FOUND FOR RES & BUSINESS SEOUL NAT UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: FOUND FOR RES & BUSINESS SEOUL NAT UNIV OF SCI & TECH
Filing Date: 2025-04-21
Publication Date: 2026-06-18

Abstract

The present invention relates to a method for compressing a weight matrix of a neural network model by using low-rank approximation (LRA) and quantization. The method for compressing a weight matrix of a neural network model according to an embodiment of the present invention comprises the steps of: generating a first compression matrix by using upper singular values of a weight matrix; generating a second compression matrix by adjusting the upper singular values by means of lower singular values of the weight matrix; performing knowledge distillation on the second compression matrix on the basis of an output of the neural network model using the weight matrix; and applying quantization to the knowledge-distilled second compression matrix.

Description

Method to compress the weight matrix of a neural network model

[0001] The present invention relates to a method for compressing the weight matrix of a neural network model using Low-Rank Approximation (LRA) and quantization.

[0002]

[0003] Recently, Transformer network architectures have been attracting attention in the fields of computer vision and natural language processing. However, Transformer network models with a vast number of parameters present a problem in that they are difficult to utilize in environments with limited memory resources, such as edge and mobile environments.

[0004] Low-Rank Approximation (LRA) can be used to simplify the Transformer network structure, but conventional LRA methods keep only the upper singular values, which causes significant data loss.

[0005] To address this, a method is needed that simplifies the Transformer network structure by utilizing both upper and lower singular values when performing low-rank approximation.

[0006]

[0007] The present invention aims to compress the weight matrix of a neural network model by performing low-rank approximation using both upper and lower singular values.

[0008] The objects of the present invention are not limited to those mentioned above, and other unmentioned objects and advantages of the present invention may be understood from the following description and will be more clearly understood by the embodiments of the present invention. Furthermore, it will be readily apparent that the objects and advantages of the present invention can be realized by the means and combinations thereof set forth in the claims.

[0009]

[0010] A method for compressing a weight matrix of a neural network model according to an embodiment of the present invention for achieving the aforementioned purpose comprises the steps of: generating a first compression matrix using upper singular values of the weight matrix; generating a second compression matrix by adjusting the upper singular values using lower singular values of the weight matrix; performing knowledge distillation on the second compression matrix based on the output of the neural network model using the weight matrix; and applying quantization to the knowledge-distilled second compression matrix.

[0011] The step of generating the first compression matrix may include: a step of selecting the upper singular values by performing singular value decomposition on the weight matrix; and a step of generating the first compression matrix using the upper singular values.

[0012] The step of generating the second compression matrix may include: a step of selecting the lower singular values by performing singular value decomposition on the weight matrix; a step of generating a restoration matrix using the lower singular values; and a step of generating the second compression matrix by combining the first compression matrix and the restoration matrix.

[0013] The step of performing the knowledge distillation may include: a step of calculating a first weight distribution of the weight matrix; a step of calculating a second weight distribution of the second compression matrix; and a step of performing knowledge distillation on the second compression matrix such that the difference between the first and second weight distributions is minimized.

[0014] The step of applying the quantization may include: a step of determining a scaling factor based on the weight distribution of the second compression matrix; and a step of applying quantization to the second compression matrix based on the scaling factor.

[0015]

[0016] The present invention has the effect of reducing memory usage of a neural network model by performing low-rank approximation using both upper and lower singular values.

[0017] In addition to the effects described above, the specific effects of the present invention are described together with the specific details for implementing the invention below.

[0018]

[0019] FIG. 1 is a flowchart illustrating a weight matrix compression method of a neural network model according to an embodiment of the present invention.

[0020] FIG. 2 is a diagram illustrating the process of generating first and second compression matrices using upper singular values and lower singular values.

[0021] FIG. 3 is a diagram illustrating the process of performing knowledge distillation.

[0022] FIG. 4 is a diagram illustrating the final second compression matrix.

[0023] FIG. 5 is a drawing visually illustrating the overall process according to one embodiment of the present invention.

[0024]

[0025] The aforementioned objectives, features, and advantages are described in detail below with reference to the attached drawings, thereby enabling those skilled in the art to easily implement the technical concept of the present invention. In describing the present invention, detailed descriptions of known technologies related to the present invention are omitted if it is determined that such descriptions would unnecessarily obscure the essence of the invention. Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the attached drawings. In the drawings, the same reference numerals are used to indicate the same or similar components.

[0026] In this specification, terms such as "first," "second," etc. are used to describe various components, but these components are not limited by these terms. These terms are used merely to distinguish one component from another, and unless specifically stated otherwise, the first component may be the second component.

[0027] Additionally, in this specification, the statement that any configuration is disposed at the "upper portion (or lower portion)" of a component or "on (or under)" a component may mean not only that the configuration is disposed in contact with the upper (or lower) surface of the component, but also that another configuration may be interposed between the component and the configuration disposed on (or under) it.

[0028] Furthermore, where it is stated in this specification that one component is "connected," "coupled," or "joined" to another component, it should be understood that the components may be directly connected or joined to each other, that another component may be "interposed" between them, or that each component may be "connected," "coupled," or "joined" through another component.

[0029] Additionally, singular expressions used in this specification include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "composed of" or "comprising" should not be interpreted as necessarily including all of the various components or steps described in the specification, and should be interpreted as meaning that some of the components or steps may not be included, or that additional components or steps may be included.

[0030] Additionally, in this specification, "A and / or B" means A, B, or A and B unless specifically stated otherwise, and "C to D" means C or more and D or less, unless specifically stated otherwise.

[0031] The present invention relates to a method for compressing the weight matrix of a neural network model using Low-Rank Approximation (LRA) and quantization. Hereinafter, a method for compressing the weight matrix of a neural network model according to an embodiment of the present invention will be described in detail with reference to FIGS. 1 to 5.

[0032] FIG. 1 is a flowchart illustrating a weight matrix compression method of a neural network model according to one embodiment of the present invention.

[0033] FIG. 2 is a diagram illustrating the process of generating first and second compression matrices using upper singular values and lower singular values.

[0034] FIG. 3 is a diagram illustrating the process of performing knowledge distillation.

[0035] FIG. 4 is a diagram illustrating the final second compression matrix.

[0036] FIG. 5 is a drawing that visually illustrates the overall process according to one embodiment of the present invention.

[0037] Referring to FIG. 1, a weight matrix compression method of a neural network model may include a step of generating a first compression matrix using upper singular values of a weight matrix (S100), a step of generating a second compression matrix by adjusting the upper singular values using lower singular values of a weight matrix (S200), a step of performing knowledge distillation on the second compression matrix based on the output of the neural network model using the weight matrix (S300), and a step of applying quantization to the knowledge-distilled second compression matrix (S400).

[0038] However, the weight matrix compression method of the neural network model illustrated in FIG. 1 is according to one embodiment, and the steps constituting the invention are not limited to the embodiment illustrated in FIG. 1, and some steps may be added, changed, or deleted as needed.

[0039] Each step illustrated in FIG. 1 can be performed by a processor, and the processor may include at least one physical element among application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microprocessors, and microcontrollers to perform the operations of the invention described below.

[0040] A neural network model according to one embodiment of the present invention is a Vision Transformer model that may be pre-trained so that, when an image is input, it divides the image into patches and performs vision tasks such as image classification and object detection.

[0041] Meanwhile, the neural network model is not limited to a Vision Transformer model and can be any of various types of neural network models, such as a CNN model or an RNN model.

[0042] A weight matrix may be a mathematical structure that represents the relationship between input data and output data of a neural network model. For example, in an embodiment of the present invention, if an image input to the neural network model is represented as an input vector, the weight matrix can be multiplied by the input vector to produce an output.

[0043] To compress the weight matrix, the processor can generate a first compression matrix using the upper singular values of the weight matrix (S100).

[0044] Referring to FIG. 2, the processor can perform singular value decomposition (SVD) on the weight matrix (100) to select upper singular values (10). Here, upper singular values (10) may be values containing key information that make up the weight matrix.

[0045] Referring to FIG. 2, the singular value decomposition of the weight matrix (100) can be expressed as [Equation 1].

[0046] [Equation 1]

[0047] $$W_{\text{original}} = U S V^{T}$$

[0048] ($W_{\text{original}}$: weight matrix, $U$: left singular vector matrix ($m \times m$), $S$: diagonal singular value matrix ($m \times n$), $V$: right singular vector matrix ($n \times n$))

[0049] The processor can generate a first compression matrix using the upper singular values (10). In other words, the processor can first compress the weight matrix using only the upper singular values (10) as in [Equation 2], which is also called a low-rank approximation.

[0050] [Equation 2]

[0051] $$W_{\text{top}} = U_{\text{top}} S_{\text{top}} V_{\text{top}}^{T}$$

[0052] ($W_{\text{top}}$: first compression matrix, $U_{\text{top}}$, $V_{\text{top}}$: singular vectors corresponding to the upper singular values, $S_{\text{top}}$: upper singular values)

[0053] When the weight matrix (100) is compressed using upper singular values (10), the important characteristics of the weight matrix (100) are maintained, while the reduction in computational load and lightweighting of the neural network model can be expected. However, this method has a problem in that lower singular values (20) containing detailed information of the weight matrix (100) are removed, which may result in the loss of detailed information of the weight matrix (100).
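The trade-off described above can be illustrated with a minimal NumPy sketch of the first compression step. The matrix size, the rank `k`, and the function name `first_compression` are illustrative assumptions, not values from the specification; the sketch also uses the thin SVD rather than the full $m \times m$ and $n \times n$ factors, which yields the same rank-`k` matrix.

```python
import numpy as np

def first_compression(weight, k):
    """Rank-k approximation of `weight` built from its k largest singular values."""
    # np.linalg.svd returns singular values sorted in descending order.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    u_top, s_top, vt_top = u[:, :k], s[:k], vt[:k, :]
    w_top = (u_top * s_top) @ vt_top
    return w_top, (u_top, s_top, vt_top)

rng = np.random.default_rng(0)
w_original = rng.standard_normal((64, 48))  # stand-in weight matrix
k = 8                                       # hypothetical number of upper singular values
w_top, (u_top, s_top, vt_top) = first_compression(w_original, k)

# Storing the factors (U_top * S_top) and V_top^T takes k * (m + n) values instead of m * n.
m, n = w_original.shape
params_full = m * n
params_low_rank = k * (m + n)

# Relative Frobenius error: the part of the matrix carried by the discarded lower singular values.
rel_error = np.linalg.norm(w_original - w_top) / np.linalg.norm(w_original)
print(params_full, params_low_rank, round(float(rel_error), 3))
```

The relative error printed at the end corresponds to the detailed information that is lost when the lower singular values are dropped.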

[0054] To solve this problem, the processor can generate a second compression matrix (30) by adjusting the upper singular values (10) using the lower singular values (20) of the weight matrix (100) (S200).

[0055] The processor can select lower singular values (20) from the previously decomposed weight matrix (100) and generate a restoration matrix from the selected lower singular values (20) through [Equation 3].

[0056] [Equation 3]

[0057]

[0058] ($W_{\text{low}}$: restoration matrix, $U_{\text{low}}$, $V_{\text{low}}$: singular vectors corresponding to the lower singular values)

[0059] Next, the processor can generate a second compression matrix (30) by combining the first compression matrix and the restoration matrix as in [Equation 4].

[0060] [Equation 4]

[0061]

[0062] ($W_{\text{compressed}}$: second compression matrix, $W_{\text{top}}$: first compression matrix, $W_{\text{low}}$: restoration matrix)

[0063] In this way, by utilizing both the upper singular values (10) and the lower singular values (20) to compress the weight matrix (100), the second compression matrix (30) can include both the main information of the weight matrix (100) carried by the upper singular values (10) and the detailed information of the weight matrix (100) carried by the lower singular values (20).
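A minimal sketch of this step follows, under two assumptions that the text here does not confirm: the restoration matrix is built from a chosen subset of the lower singular values in the same way as the first compression matrix, and "combining" means element-wise addition. The parameter `low_indices` and the function name `second_compression` are hypothetical.

```python
import numpy as np

def second_compression(weight, k, low_indices):
    """Combine a rank-k matrix with a restoration matrix from selected lower singular values."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    w_top = (u[:, :k] * s[:k]) @ vt[:k, :]
    # Restoration matrix built only from the selected lower singular values.
    w_low = (u[:, low_indices] * s[low_indices]) @ vt[low_indices, :]
    # Assumption: the two matrices are combined by addition.
    return w_top, w_low, w_top + w_low

rng = np.random.default_rng(0)
w_original = rng.standard_normal((64, 48))
k = 8
low_indices = np.arange(k, k + 4)  # hypothetical choice: the 4 singular values just below the top k
w_top, w_low, w_compressed = second_compression(w_original, k, low_indices)

err_top = np.linalg.norm(w_original - w_top)
err_compressed = np.linalg.norm(w_original - w_compressed)
print(round(float(err_top), 3), round(float(err_compressed), 3))
```

Under these assumptions the combined matrix has rank `k + len(low_indices)`, so every lower singular value that is kept reduces the approximation error at the cost of extra storage. Keeping all of them would simply reproduce the original matrix, so the selection has to be partial for the result to remain compressed.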

[0064] Meanwhile, the processor can perform knowledge distillation (KD) so that the second compression matrix (30) maintains performance similar to that of the weight matrix.

[0065] Referring to FIG. 3, the processor can perform knowledge distillation on the second compression matrix (30) based on the output of the neural network model using the weight matrix (100) (S300).

[0066] First, the processor can calculate the first weight distribution (40) of the weight matrix as in [Equation 5].

[0067] [Equation 5]

[0068] $$P_{\text{original}} = f(W_{\text{original}})$$

[0069] ($P_{\text{original}}$: distribution of the weight matrix, $W_{\text{original}}$: weight matrix of the neural network model, $f(x)$: function that calculates the weight distribution)

[0070] Next, the processor can calculate the second weight distribution (50) of the second compression matrix (30) as in [Equation 6].

[0071] [Equation 6]

[0072] $$P_{\text{compressed}} = f(W_{\text{compressed}})$$

[0073] ($P_{\text{compressed}}$: distribution of the second compression matrix, $W_{\text{compressed}}$: weight matrix of the compressed neural network model, $f(x)$: function that calculates the weight distribution)

[0074] The processor can define a loss function as [Equation 7] to measure the difference between the first weight distribution (40) and the second weight distribution (50).

[0075] [Equation 7]

[0076]

[0077] ($L_{\text{distill}}$: difference between the weight distributions, $P_{\text{original}}$: first weight distribution, $P_{\text{compressed}}$: second weight distribution)

[0078] Referring further to FIG. 3, the processor can perform knowledge distillation on the second compression matrix (30) such that the difference between the first weight distribution (40) and the second weight distribution (50) is minimized. In other words, the processor can train the second compression matrix (30) such that the result of the loss function is minimized.

[0079] Here, training can be performed by inputting the same image to the model using the weight matrix (100) and to the model using the second compression matrix (30), outputting the first weight distribution (40) and the second weight distribution (50) in parallel, and updating the second compression matrix (30) so that the difference between the first weight distribution (40) and the second weight distribution (50) is minimized.

[0080] As a result, as shown in FIG. 4, the second compression matrix (30) can be trained to have characteristics similar to the weight matrix (100).
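The training loop described above can be sketched as follows. Several choices are assumptions made only for illustration: the distribution function $f$ is taken to be a softmax over the layer outputs, the loss is the Kullback-Leibler divergence, the student is a low-rank matrix stored as two factors `a` and `b` and standing in for the second compression matrix, and random vectors stand in for image inputs. The published formulas for the loss may differ.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def kl_divergence(log_p, log_q):
    """Mean KL(p || q) over the batch, computed from log-probabilities."""
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=1)))

rng = np.random.default_rng(0)
m, n, rank, batch = 16, 32, 4, 256
w_original = rng.standard_normal((m, n)) / np.sqrt(n)  # teacher weights
x = rng.standard_normal((batch, n))                    # stand-in for image inputs

# The student starts from a truncated SVD and is kept as two factors so it stays low-rank.
u, s, vt = np.linalg.svd(w_original, full_matrices=False)
a = u[:, :rank] * np.sqrt(s[:rank])            # shape (m, rank)
b = np.sqrt(s[:rank])[:, None] * vt[:rank, :]  # shape (rank, n)

log_p_original = log_softmax(x @ w_original.T)  # teacher distribution, held fixed
losses = []
learning_rate = 0.1
for _ in range(300):
    log_p_compressed = log_softmax(x @ (a @ b).T)
    losses.append(kl_divergence(log_p_original, log_p_compressed))
    # Gradient of the mean KL with respect to the student outputs is (p_student - p_teacher) / batch.
    grad_logits = (np.exp(log_p_compressed) - np.exp(log_p_original)) / batch
    grad_w = grad_logits.T @ x                   # gradient with respect to a @ b
    grad_a, grad_b = grad_w @ b.T, a.T @ grad_w  # chain rule through the two factors
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(round(losses[0], 4), round(losses[-1], 4))
```

Both models receive the same batch `x`, and only the student factors are updated, which mirrors the parallel evaluation of the two distributions described in the text.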

[0081] Finally, the processor can apply quantization to the knowledge-distilled second compression matrix (30) (S400).

[0082] Quantization is performed to reduce memory usage and computational complexity of neural network models, and the weights and activation values of the second compression matrix can be represented as low-bit integers.

[0083] To perform this, the processor can determine a scaling factor based on the weight distribution of the second compression matrix (30). Specifically, the processor can determine the scaling factor as shown in [Equation 8] by analyzing the weight distribution of the second compression matrix (30) to calculate the maximum value (α) and the minimum value (β).

[0084] [Equation 8]

[0085]

[0086] ($s$: scaling factor, $b$: number of bits used for quantization, $\alpha$, $\beta$: maximum and minimum values of the weight distribution of the second compression matrix)

[0087] The scaling factor can minimize the error resulting from quantization by mapping the range of weight values to fixed integer values.

[0088] The processor can apply quantization to the second compression matrix (30) as in [Equation 9] based on the scaling factor.

[0089] [Equation 9]

[0090]

[0091] ($\operatorname{round}(x)$: $x$ rounded to the nearest integer, $\beta$: minimum value of the weight distribution, $Q(w)$: quantized weight value, $s$: scaling factor)

[0092] Through this process, weight values are converted into a fixed integer range, which can reduce the memory usage and computational load of the neural network model.
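As a worked illustration, one common form that is consistent with the symbols defined for [Equation 8] and [Equation 9] is asymmetric min-max quantization. This form is an assumption, and the published formulas may differ. With hypothetical values $\alpha = 1$, $\beta = -1$, and $b = 8$:

```latex
% Assumed min-max form of the scaling factor and the quantizer
s = \frac{\alpha - \beta}{2^{b} - 1},
\qquad
Q(w) = \operatorname{round}\!\left(\frac{w - \beta}{s}\right)

% Worked example with \alpha = 1, \beta = -1, b = 8
s = \frac{1 - (-1)}{2^{8} - 1} = \frac{2}{255} \approx 0.00784,
\qquad
Q(0.5) = \operatorname{round}\!\left(\frac{0.5 + 1}{2/255}\right)
       = \operatorname{round}(191.25) = 191
```

Under this reading every weight is mapped to an integer between 0 and $2^{b} - 1 = 255$, which is what allows the matrix to be stored with $b$ bits per entry.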

[0093] Meanwhile, the processor can perform a weight-aware distribution scaling process to reduce quantization error. The weight-aware distribution scaling process is a process that reduces quantization error by adjusting the distribution of quantized weight values and can be performed by [Equation 10].

[0094] [Equation 10]

[0095]

[0096] ($w'$: restored weight value, $Q(w)$: quantized weight value, $s$: scaling factor, $\beta$: minimum value)

[0097] By performing such a restoration process, the processor can minimize information loss caused by weight quantization.
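The quantization and restoration steps can be sketched as a round trip. The sketch assumes min-max quantization, $Q(w) = \operatorname{round}((w - \beta)/s)$ with $s = (\alpha - \beta)/(2^{b} - 1)$, and restoration $w' = s \cdot Q(w) + \beta$; these forms are inferred from the symbol definitions and are not confirmed by the text. The function names are illustrative.

```python
import numpy as np

def quantize(weights, bits):
    """Map weights to integers in [0, 2**bits - 1] using the min-max range."""
    alpha, beta = float(weights.max()), float(weights.min())
    scale = (alpha - beta) / (2 ** bits - 1)
    q = np.round((weights - beta) / scale).astype(np.int64)
    return q, scale, beta

def dequantize(q, scale, beta):
    """Restore approximate weight values from the integers."""
    return q * scale + beta

rng = np.random.default_rng(0)
w_compressed = rng.standard_normal((64, 48))  # stand-in for the second compression matrix
q, scale, beta = quantize(w_compressed, bits=8)
w_restored = dequantize(q, scale, beta)

# Rounding moves each value by at most half a quantization step.
max_error = float(np.abs(w_compressed - w_restored).max())
print(int(q.min()), int(q.max()), max_error <= scale / 2 + 1e-9)
```

Under these assumptions the restoration error of any single weight is bounded by half the scaling factor, so a narrower weight range or a larger bit width directly tightens the bound.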

[0098] Meanwhile, the processor can minimize the quantization error between the input activation values and the weight matrix by scaling the activation values. Specifically, the processor sets a scaling vector to reduce the quantization error between the input activation values and the weight matrix, and the scaling vector can be optimized based on the loss function of [Equation 11].

[0099] [Equation 11]

[0100]

[0101] ($X$: input activation value, $W$: weight matrix, $Q(\cdot)$: quantization function, $L(\alpha)$: quantization loss between activation and weight)

[0102] In particular, the processor can further optimize activation scaling when the weight quantization error occurring in a specific layer is greater than a threshold value. To do this, the processor can determine the optimal scaling vector by considering both the weight quantization error and the activation quantization error as shown in [Equation 12].

[0103] [Equation 12]

[0104]

[0105] In summary, the processor can minimize quantization error by optimizing channel scaling vectors based on the loss function. In particular, if the weight quantization error in a specific layer exceeds a threshold, the error can be compensated for through additional activation scaling optimization, thereby reducing quantization errors arising from weights and activation values.
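The channel scaling idea can be sketched as a small search. Because [Equation 11] and [Equation 12] are not reproduced here, the sketch substitutes a loss of the form $\lVert (X / a)\, Q(W \cdot a)^{T} - X W^{T} \rVert^{2}$ and a grid search over an exponent `gamma`, in the style of activation-aware weight quantization. The loss form, the search, the `threshold` value, and all function names are assumptions for illustration only.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Min-max quantize and restore, so the result carries the quantization error."""
    alpha, beta = w.max(), w.min()
    scale = (alpha - beta) / (2 ** bits - 1)
    return np.round((w - beta) / scale) * scale + beta

def scaling_loss(x, w, channel_scale, bits=4):
    """Mean squared output error after scaling channels and quantizing the weights."""
    y_ref = x @ w.T
    y_q = (x / channel_scale) @ fake_quantize(w * channel_scale, bits).T
    return float(np.mean((y_q - y_ref) ** 2))

def search_channel_scale(x, w, bits=4, grid=11):
    magnitude = np.abs(x).mean(axis=0)         # per-channel activation magnitude
    best_scale, best_loss = None, np.inf
    for gamma in np.linspace(0.0, 1.0, grid):  # gamma = 0 means no scaling
        candidate = magnitude ** gamma
        loss = scaling_loss(x, w, candidate, bits)
        if loss < best_loss:
            best_scale, best_loss = candidate, loss
    return best_scale, best_loss

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 32)) * rng.uniform(0.1, 10.0, size=32)  # uneven channel magnitudes
w = rng.standard_normal((16, 32))

baseline_loss = scaling_loss(x, w, np.ones(32))
weight_error = float(np.mean((fake_quantize(w) - w) ** 2))
threshold = 1e-3  # hypothetical per-layer threshold
if weight_error > threshold:
    # Extra optimization only for layers whose weight quantization error is too large.
    channel_scale, tuned_loss = search_channel_scale(x, w)
else:
    channel_scale, tuned_loss = np.ones(32), baseline_loss
print(tuned_loss <= baseline_loss)
```

Because the grid includes the unscaled case, the selected scaling vector can never do worse than no scaling under this loss.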

[0106] FIG. 5 illustrates the overall process of the weight matrix compression method of a neural network model. In summary, the processor can (a) generate a second compression matrix using upper and lower singular values, and (b) perform knowledge distillation on the second compression matrix to (c) generate a final second compression matrix. Subsequently, the processor can (d) apply quantization to the final second compression matrix.

[0107] By compressing the weight matrix of a neural network model based on this process, the model can output results similar to those obtained with the weight matrix before compression, even when using the compressed matrix. Additionally, since the compressed neural network model is lightweight, it has the advantage of being able to operate even in environments with limited memory resources (e.g., edge and mobile).

[0108] Meanwhile, when the present invention was implemented on an Android smartphone, a 3.2x inference acceleration effect and a 875 model compression effect were obtained compared to the existing model.

[0109] In addition, when the present invention was implemented on the Nvidia Jetson Xavier platform, a representative edge device, a 2.5x inference acceleration effect was obtained.

[0110] Although the present invention has been described above with reference to the illustrated drawings, the present invention is not limited by the embodiments and drawings disclosed in this specification, and it is obvious that various modifications can be made by a person skilled in the art within the scope of the technical concept of the present invention. Furthermore, even if the effects of the configuration of the present invention were not explicitly described while explaining the embodiments of the present invention above, it is natural to acknowledge that the effects predictable by said configuration should also be recognized.

Claims

1. A method for compressing a weight matrix of a pre-trained neural network model, the method comprising: generating a first compression matrix using upper singular values of the weight matrix; generating a second compression matrix by adjusting the upper singular values using lower singular values of the weight matrix; performing knowledge distillation on the second compression matrix based on an output of the neural network model using the weight matrix; and applying quantization to the knowledge-distilled second compression matrix.

2. The method of claim 1, wherein generating the first compression matrix comprises: selecting the upper singular values by performing singular value decomposition on the weight matrix; and generating the first compression matrix using the upper singular values.

3. The method of claim 1, wherein generating the second compression matrix comprises: selecting the lower singular values by performing singular value decomposition on the weight matrix; generating a restoration matrix using the lower singular values; and generating the second compression matrix by combining the first compression matrix and the restoration matrix.

4. The method of claim 1, wherein performing the knowledge distillation comprises: calculating a first weight distribution of the weight matrix; calculating a second weight distribution of the second compression matrix; and performing knowledge distillation on the second compression matrix such that the difference between the first and second weight distributions is minimized.

5. The method of claim 1, wherein applying the quantization comprises: determining a scaling factor based on the weight distribution of the second compression matrix; and applying quantization to the second compression matrix based on the scaling factor.