Quantization method and quantization apparatus for generative AI model

CN122242587A · Pending · Publication Date: 2026-06-19 · SAMSUNG (CHINA) SEMICONDUCTOR CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SAMSUNG (CHINA) SEMICONDUCTOR CO LTD
Filing Date
2026-02-09
Publication Date
2026-06-19


Abstract

A quantization method and apparatus for a generative AI model are disclosed. The quantization method includes: inputting multiple preset sample images into the generative AI model and obtaining the input data of the first and last layers of a denoising network as a calibration dataset; performing initial quantization on the weights of each convolutional and linear layer of the denoising network, and determining the quantization parameters of each convolutional and linear layer based on the outputs of each convolutional and linear layer before and after quantization obtained from the calibration dataset; quantizing the weights of each convolutional and linear layer based on the quantization parameters; setting at least two optional quantization layers for each activation layer of the denoising network, and constructing all optional quantization layers into a quantization supernet; and using a preset optimization algorithm, selecting a quantization subnet that meets preset conditions as the quantization model for all activation layers based on the data distribution of the output data of each activation layer and the output data of the quantization subnet in the quantization supernet.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence (AI) technology, and more specifically, to methods and apparatus for quantizing generative AI models.

Background Technology

[0002] With the rapid development of AI technology, generative AI has become a focus of attention across various industries, and its application scale is gradually expanding. Generative AI models have a huge number of parameters. In order to better deploy them on mobile electronic devices and improve the inference speed of generative AI models, it is usually necessary to compress them to minimize model complexity and storage space requirements.

[0003] For generative AI models, quantization can be considered a relatively effective compression method. It can effectively reduce the model's storage space by representing floating-point models as low-bit-width models, thereby accelerating model inference while maintaining model accuracy. Currently, the main quantization methods include fixed-bit-width quantization and mixed-precision quantization. However, neither of these methods can simultaneously and effectively solve the problems of quantization efficiency, model accuracy, and inference time after deployment.

Summary of the Invention

[0004] This disclosure provides a quantization method and apparatus for generative AI models, which can achieve a good balance between quantization effect and quantization efficiency.

[0005] According to embodiments of this disclosure, a quantization method for a generative AI model is provided. The quantization method includes: inputting multiple preset sample images into the generative AI model, and obtaining input data of the first and last layers of a denoising network included in the AI model as a calibration dataset for quantization; performing initial quantization on the weights of each convolutional and linear layer of the denoising network, and determining quantization parameters of each convolutional and linear layer based on the outputs of each convolutional and linear layer before and after quantization obtained based on the calibration dataset; quantizing the weights of each convolutional and linear layer of the denoising network based on the determined quantization parameters; setting at least two optional quantization layers for each activation layer of the denoising network, and constructing all optional quantization layers of all activation layers into a quantization supernet, wherein the quantization bits of the at least two optional quantization layers are different from each other; using a preset optimization algorithm, selecting a quantization subnet that meets preset conditions based on the data distribution of the output data of each activation layer and the output data of the quantization subnet in the quantization supernet, as the quantization model for all activation layers of the denoising network, wherein the quantization subnet includes an optional quantization layer corresponding to each activation layer.

[0006] Optionally, the step of initial quantization of the weights of each convolutional layer and linear layer of the denoising network includes: using asymmetric quantization and main channel quantization methods to initially quantize the weights of each convolutional layer and linear layer of the denoising network.

[0007] Optionally, the step of determining the quantization parameters includes: obtaining the outputs before and after quantization of each convolutional layer and linear layer of the denoising network based on the calibration dataset; calculating the loss value based on the outputs before and after quantization of each convolutional layer and linear layer; and adjusting the quantization parameters of each convolutional layer and linear layer with the goal of minimizing the loss value.

[0008] Optionally, the mean square error of the output before and after quantization of each convolutional and linear layer is calculated as the loss value.

[0009] Optionally, the step of selecting a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network includes: using a genetic algorithm, based on the similarity of the data distribution between the output data of each activation layer and the output data of the quantization subnet in the quantization supernet, selecting a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network.

[0010] Optionally, the step of selecting quantization subnets that meet preset conditions further includes: selecting several quantization subnets from the quantization supernet as individuals in the initial population based on a first preset rule, and constructing an initial population; iteratively performing crossover and mutation operations on the individuals in the initial population until individuals that meet preset conditions are generated, which serve as the quantization model for all activation layers of the denoising network.

[0011] Optionally, based on a first preset rule, the step of selecting several quantization subnets from the quantization supernet as individuals in the initial population and constructing the initial population includes: randomly selecting multiple quantization subnets from the quantization supernet; determining the similarity of the data distribution between the output data of each activation layer and the output data of the multiple quantization subnets according to the KS test; adding the N quantization subnets with the highest similarity as individuals in the undetermined initial population, where N is a natural number greater than 1 and N is less than the number of the multiple quantization subnets; determining whether the number of individuals in the undetermined initial population reaches a preset number; in response to the number of individuals in the undetermined initial population not reaching the preset number, returning to the step of randomly selecting multiple quantization subnets from the quantization supernet; and in response to the number of individuals in the undetermined initial population reaching the preset number, determining the undetermined initial population as the initial population.

[0012] Optionally, the step of iteratively performing crossover and mutation operations on individuals in the initial population until an individual that meets a preset condition is generated as the quantization model for all activation layers of the denoising network includes: performing crossover and mutation operations on individuals in the current population to generate the next generation population; determining the similarity between the data distribution of the output data of each activation layer and the data distribution of the output data of each individual in the next generation population according to the KS test; determining whether the highest similarity is greater than a preset similarity threshold; in response to the highest similarity being greater than the preset similarity threshold, determining the individual with the highest similarity as the quantization model for all activation layers of the denoising network; in response to the highest similarity being less than or equal to the preset similarity threshold, using the next generation population as the current population, and returning to the step of performing crossover and mutation operations on individuals in the current population.

[0013] Optionally, the denoising network is a UNet model.

[0014] According to another embodiment of this disclosure, a quantization apparatus for a generative AI model is provided. The quantization apparatus includes: a calibration dataset generation unit configured to: input multiple preset sample images into the generative AI model and acquire input data of the first and last layers of a denoising network included in the AI model as a calibration dataset for quantization; a first quantization parameter determination unit configured to: initially quantize the weights of each convolutional layer and linear layer of the denoising network, and determine the quantization parameters of each convolutional layer and linear layer based on the pre-quantization and post-quantization outputs of each convolutional layer and linear layer obtained based on the calibration dataset; a first quantization unit configured to: quantize the weights of each convolutional and linear layer of the denoising network based on the determined quantization parameters; a quantization supernet construction unit configured to: set at least two optional quantization layers for each activation layer of the denoising network, and construct all optional quantization layers of all activation layers into a quantization supernet, wherein the quantization bits of the at least two optional quantization layers are different from each other; and a quantization model determination unit configured to: use a preset optimization algorithm, based on the data distribution of the output data of each activation layer and the output data of the quantization subnet in the quantization supernet, to select a quantization subnet that meets preset conditions as the quantization model of all activation layers of the denoising network, wherein the quantization subnet includes an optional quantization layer corresponding to each activation layer.

[0015] Optionally, the first quantization parameter determination unit is configured to: perform initial quantization on the weights of each convolutional and linear layer of the denoising network using asymmetric quantization and main channel quantization methods.

[0016] Optionally, the first quantization parameter determination unit is configured to: obtain the outputs before and after quantization of each convolutional layer and linear layer of the denoising network based on the calibration dataset; calculate the loss value based on the outputs before and after quantization of each convolutional layer and linear layer; and adjust the quantization parameters of each convolutional layer and linear layer with the goal of minimizing the loss value.

[0017] Optionally, the first quantization parameter determination unit is configured to: calculate the mean square error of the outputs before and after quantization of each convolutional layer and linear layer, as the loss value.

[0018] Optionally, the quantization model determination unit is configured to: use a genetic algorithm to select a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network, based on the similarity of the data distribution between the output data of each activation layer and the output data of the quantization subnet in the quantization supernet.

[0019] Optionally, the quantization model determination unit is further configured to: select several quantization subnets from the quantization supernet as individuals in the initial population based on a first preset rule, and construct an initial population; iteratively perform crossover and mutation operations on the individuals in the initial population until individuals that meet preset conditions are generated as the quantization model for all activation layers of the denoising network.

[0020] Optionally, the quantization model determination unit is further configured to: randomly select multiple quantization subnets from the quantization supernet; determine the similarity of the data distribution between the output data of each activation layer and the output data of the multiple quantization subnets according to the KS test; add the N quantization subnets with the highest similarity as individuals to the undetermined initial population, where N is a natural number greater than 1 and N is less than the number of the multiple quantization subnets; determine whether the number of individuals in the undetermined initial population reaches a preset number; in response to the number of individuals in the undetermined initial population not reaching the preset number, return to the operation of randomly selecting multiple quantization subnets from the quantization supernet; in response to the number of individuals in the undetermined initial population reaching the preset number, determine the undetermined initial population as the initial population.

[0021] Optionally, the quantization model determination unit is further configured to: perform crossover and mutation operations on individuals in the current population to generate the next generation population; determine the similarity between the data distribution of the output data of each activation layer and the data distribution of the output data of each individual in the next generation population according to the KS test; determine whether the highest similarity is greater than a preset similarity threshold; in response to the highest similarity being greater than the preset similarity threshold, determine the individual with the highest similarity as the quantization model of all activation layers of the denoising network; in response to the highest similarity being less than or equal to the preset similarity threshold, use the next generation population as the current population, and return to perform the crossover and mutation operations on individuals in the current population.

[0022] Optionally, the denoising network is a UNet model.

[0023] According to another embodiment of this disclosure, a computer-readable storage medium storing instructions is provided that, when executed by a processor, implements the quantization method as described above.

[0024] According to another embodiment of this disclosure, a computing device is provided, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the quantization method as described above.

[0025] The quantization method and apparatus for generative AI models according to embodiments of the present disclosure can achieve a good balance between quantization effect and quantization efficiency. Furthermore, it can significantly improve quantization efficiency without significantly reducing quantization effect, which is helpful for deploying generative AI models on mobile electronic devices.

[0026] Further aspects and/or advantages of the general concept of this disclosure will be set forth in part in the description which follows, and in part will be clear from the description or may be learned by practice of the general concept of this disclosure.

Attached Figure Description

[0027] The above and other objects and features of exemplary embodiments of this disclosure will become clearer from the following description taken in conjunction with the accompanying drawings, which exemplarily illustrate the embodiments, wherein:

Figure 1 is a flowchart illustrating a quantization method for a generative AI model according to an embodiment of the present disclosure;
Figure 2 is a schematic diagram showing an example of each attention module;
Figure 3 is a flowchart illustrating a method for constructing an initial population according to an embodiment of the present disclosure;
Figure 4 is a flowchart illustrating a method for generating a quantization subnet that satisfies preset conditions as the quantization model for all activation layers according to an embodiment of the present disclosure;
Figure 5 is a block diagram illustrating a quantization apparatus for a generative AI model according to an embodiment of the present disclosure;
Figure 6 is a block diagram illustrating a computing device according to an embodiment of the present disclosure.

Detailed Implementation

[0028] The following detailed embodiments are provided to aid the reader in gaining a comprehensive understanding of the methods, apparatus, and / or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatus, and / or systems described herein will become apparent upon understanding this disclosure. For example, the order of operations described herein is merely illustrative and is not limited to those orders set forth herein, but may be changed as will become clear upon understanding this disclosure, except for operations that must occur in a specific order. Furthermore, for clarity and conciseness, descriptions of features known in the art may be omitted.

[0029] The features described herein may be implemented in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein are provided only to illustrate some of the many feasible ways of implementing the methods, apparatus, and / or systems described herein, which will become clear upon understanding the disclosure of this application.

[0030] The terminology used herein is for the purpose of describing various examples only and is not intended to limit disclosure. Unless the context clearly indicates otherwise, the singular form is intended to include the plural form as well. The terms “comprising,” “including,” and “having” indicate the presence of the described features, quantities, operations, components, elements, and / or combinations thereof, but do not preclude the presence or addition of one or more other features, quantities, operations, components, elements, and / or combinations thereof.

[0031] Unless otherwise defined, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains upon understanding this disclosure. Unless expressly defined herein, terms (such as those defined in a general dictionary) shall be interpreted as having a meaning consistent with their meaning in the context of the relevant field and in this disclosure, and shall not be interpreted in an idealized or overly formalistic manner.

[0032] Furthermore, in the description of the examples, detailed descriptions of well-known related structures or functions will be omitted when it is believed that such detailed descriptions would lead to a vague interpretation of this disclosure.

[0033] This disclosure provides a quantization method and quantization apparatus for generative AI models to achieve a good balance between quantization effect and quantization efficiency.

[0034] Figure 1 is a flowchart illustrating a quantization method for a generative AI model according to an embodiment of the present disclosure.

[0035] The generative AI model according to embodiments of this disclosure can be an AI model for generating images, such as, but not limited to, a Stable Diffusion model, whose goal is to learn the latent structure of a dataset by modeling how data points diffuse in the latent space. In computer vision, this means training a neural network to learn the inverse diffusion process so that it can denoise images with superimposed Gaussian noise, typically applied to text-to-image generation scenarios. When quantizing such a model, the denoising process often needs to be repeated multiple times, which is very resource-intensive and time-consuming. Therefore, the focus of quantization is on quantizing the denoising network of the generative AI model (e.g., the UNet model).

[0036] Referring to Figure 1, in step S101, multiple preset sample images are input into the generative AI model, and the input data of the first and last layers of the denoising network in the AI model are obtained as a calibration dataset for quantization.

[0037] For example, multiple (e.g., but not limited to 500) preset sample images can be sequentially input into a generative AI model for inference, and the input data of the first and last layers of the denoising network can be saved as a calibration dataset. As an example only, in addition to the preset sample images, the input to the generative AI model can also include corresponding text descriptions.

[0038] According to embodiments of this disclosure, the denoising network can be, for example, but not limited to, the UNet model, a convolutional neural network (CNN) architecture specifically designed for image processing. Research has shown that although the input data at each step (or layer) in the denoising process changes and affects the input data of the next step, the overall data distribution remains largely consistent. Therefore, in embodiments of this disclosure, the input data of the first and last steps (or layers) of the denoising network are used as calibration datasets. This allows for effective model quantization while saving time in preparing calibration datasets and model quantization.
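As a rough illustration of step S101, the sketch below records the denoiser's input only at the first and last denoising step of each sample. It reads "first and last layers" as the first and last steps of the iterative denoising loop, which is an assumption; `denoise_step`, `collect_calibration`, and the toy latents are hypothetical stand-ins, not part of the disclosure.

```python
import random

def denoise_step(latent, t):
    """Hypothetical stand-in for one pass through the denoising network."""
    return [0.9 * x + 0.01 * t for x in latent]

def collect_calibration(samples, num_steps):
    """Keep the denoiser input at the first and last step of every sample."""
    calibration = []
    for latent in samples:
        for t in range(num_steps):
            if t == 0 or t == num_steps - 1:
                calibration.append((t, list(latent)))  # copy before it is overwritten
            latent = denoise_step(latent, t)
    return calibration

rng = random.Random(0)
# Five toy "latents" play the role of the preset sample images (assumed sizes).
samples = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
calib = collect_calibration(samples, num_steps=10)
```

Only two records per sample are stored, which is where the saving in calibration time described above would come from.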

[0039] In step S102, the weights of each convolutional and linear layer of the denoising network are initially quantized, and the quantization parameters of each convolutional and linear layer are determined based on the outputs of each convolutional and linear layer before and after quantization obtained from the calibration dataset.

[0040] Specifically, in optimizing the quantization parameters of each convolutional and linear layer of the denoising network, since the input data of the first and last steps of the denoising network (i.e., the calibration dataset) are known, the pre-quantization outputs of each convolutional and linear layer of the denoising network can be obtained based on the calibration dataset. On the other hand, the quantization bits of the weights can be set (e.g., but not limited to 8 bits), and asymmetric quantization and main channel quantization methods can be used to initially quantize the weights of each convolutional and linear layer of the denoising network. In this way, the quantized outputs of each convolutional and linear layer of the denoising network can be obtained based on the calibration dataset. Next, the loss value can be calculated based on the pre-quantization and post-quantization outputs of each convolutional and linear layer. For example, the mean squared error (MSE) of the pre-quantization and post-quantization outputs of each convolutional and linear layer can be calculated as the loss value. Finally, the quantization parameters of each convolutional and linear layer can be adjusted with the goal of minimizing the loss value. Thus, by iteratively obtaining the pre-quantization and post-quantization outputs, calculating the loss value, and adjusting the quantization parameters, the optimal quantization parameters can ultimately be determined.
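A minimal sketch of step S102 for a single linear layer follows. It assumes that "main channel quantization" means quantizing each output channel with its own scale and zero point, and it replaces the iterative parameter adjustment with a small grid search over a hypothetical clipping ratio `clip`. Both choices are illustrative assumptions, not the method stated in the disclosure.

```python
import random

def quantize_channel(weights, bits, clip):
    """Asymmetric uniform fake-quantization of one output channel.

    `clip` in (0, 1] shrinks the [min, max] range and is the tunable
    quantization parameter in this sketch (an assumption).
    """
    lo, hi = min(weights) * clip, max(weights) * clip
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0      # avoid a zero scale for constant rows
    zero_point = round(-lo / scale)
    out = []
    for w in weights:
        q = min(max(round(w / scale) + zero_point, 0), levels)
        out.append((q - zero_point) * scale)
    return out

def channel_output(row, inputs):
    """Output of one channel of a linear layer for every calibration input."""
    return [sum(w * x for w, x in zip(row, vec)) for vec in inputs]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def channel_loss(row, inputs, bits, clip):
    """MSE between the pre-quantization and post-quantization outputs."""
    quantized = quantize_channel(row, bits, clip)
    return mse(channel_output(row, inputs), channel_output(quantized, inputs))

def calibrate(weight, inputs, bits=8, candidates=(1.0, 0.95, 0.9, 0.8, 0.7)):
    """Pick, per channel, the clipping ratio with the lowest output MSE."""
    return [min(candidates, key=lambda c: channel_loss(row, inputs, bits, c))
            for row in weight]

rng = random.Random(0)
weight = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
calib_inputs = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
clips = calibrate(weight, calib_inputs)
```

Because the unclipped range (`clip = 1.0`) is among the candidates, the selected parameter can never give a larger loss than the initial quantization.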

[0041] Next, in step S103, the weights of each convolutional and linear layer of the denoising network are quantized based on the determined quantization parameters.

[0042] In step S104, at least two optional quantization layers are set for each activation layer of the denoising network, and all optional quantization layers of all activation layers are constructed into a quantization supernet, wherein the quantization bits of the at least two optional quantization layers of each activation layer are different from each other.

[0043] According to embodiments of this disclosure, given that different quantization bit selections for quantization layers result in different quantization effects, a neural architecture search method can be used to assign different quantization bits to each quantization layer to balance model quantization accuracy and inference time (i.e., quantization efficiency). To this end, multiple optional quantization layers with different quantization bits can be set for each activation layer, and these quantization layers constitute a supernet. Then, a preset optimization algorithm (e.g., but not limited to, a genetic algorithm) can be used to search the quantization supernet, ultimately selecting the optimal quantization subnet as the quantization model for all activation layers of the denoising network. The above search process will be described in detail later.

[0044] For example, the UNet model has 16 attention modules, and the two softmax layers (i.e., activation layers) of each attention module can be considered as a single unit. Therefore, when building a quantized supernet for the softmax layers, a total of, for example, 48 optional quantization layers can be constructed. That is, each attention module's activation layer can have three optional quantization layers: 4-bit, 8-bit, and 16-bit quantization layers. However, this is merely an example, and this disclosure is not limited thereto. For example, two, four, or more optional quantization layers can be set for each attention module's activation layer.

[0045] Figure 2 is a schematic diagram illustrating an example of each attention module. Referring to Figure 2, the attention module includes multiple linear layers and a softmax layer, with three linear layers serving as input layers and one linear layer as the output layer. Within the attention module, a batch matrix multiplication (BMM) operation can be performed on the outputs of two of the linear layers that serve as input layers, and the result is input to the softmax layer. Then, a BMM operation can be performed on the output of the softmax layer and the output of the remaining linear layer that serves as an input layer, and the result is input to the linear layer that serves as the output layer, thereby outputting the final result. According to embodiments of this disclosure, in the attention module, all linear layers use 8-bit quantization, while the softmax layer uses three optional quantization bits: 4 bits, 8 bits, and 16 bits.
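The data flow of the attention module can be sketched as below, with the softmax output optionally fake-quantized to 4, 8, or 16 bits. This is a toy illustration: `q`, `k`, and `v` stand in for the outputs of the three input linear layers, the output linear layer and the 8-bit quantization of the linear layers are omitted, and the uniform [0, 1] quantizer is an assumed form of the optional quantization layer.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def fake_quant_unit(values, bits):
    """Uniform quantization of values in [0, 1] onto 2**bits levels."""
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in values]

def attention(q, k, v, softmax_bits=None):
    scores = matmul(q, [list(col) for col in zip(*k)])   # first BMM: Q @ K^T
    probs = [softmax(row) for row in scores]
    if softmax_bits is not None:                         # optional quantization layer
        probs = [fake_quant_unit(row, softmax_bits) for row in probs]
    return matmul(probs, v)                              # second BMM

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

q, k, v = rand_matrix(4, 8), rand_matrix(4, 8), rand_matrix(4, 8)
reference = attention(q, k, v)
errors = {}
for bits in (4, 8, 16):
    out = attention(q, k, v, softmax_bits=bits)
    errors[bits] = max(abs(x - y)
                       for r1, r2 in zip(reference, out)
                       for x, y in zip(r1, r2))
```

`errors` maps each candidate bit width to the largest deviation from the unquantized output, which is the kind of accuracy difference the supernet search trades against inference cost.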

[0046] Returning to Figure 1, in step S105, a preset optimization algorithm is used to select a quantization subnet that meets the preset conditions as the quantization model for all activation layers of the denoising network, based on the data distribution of the output data of each activation layer and the output data of the quantization subnet in the quantization supernet. The quantization subnet includes an optional quantization layer corresponding to each activation layer.

[0047] According to embodiments of this disclosure, multiple quantization subnets can be constructed by sampling subnets from the quantization supernet. For example, for each activation layer, one optional quantization layer is first sampled (i.e., selected) from its three optional quantization layers. The sampled optional quantization layers of all activation layers are then grouped together to form a quantization subnet. It should be noted that the quantization subnets differ from each other in the optional quantization layers they include; that is, no two quantization subnets in a quantization supernet are exactly the same. Further, if the number of activation layers is N and each activation layer has M optional quantization layers, a total of M^N quantization subnets can exist.
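The supernet and subnet counts from the UNet example (16 attention modules, three bit widths each) can be checked with a short sketch. Representing a subnet as a tuple of bit widths, one entry per activation layer, is an assumed encoding; `build_supernet` and `sample_subnets` are hypothetical helper names.

```python
import random

BIT_CHOICES = (4, 8, 16)       # M = 3 optional quantization layers per activation layer
NUM_ACTIVATION_LAYERS = 16     # N = 16, one softmax unit per attention module

def build_supernet(num_layers, bit_choices):
    """Map each activation layer index to its optional bit widths."""
    return {layer: tuple(bit_choices) for layer in range(num_layers)}

def sample_subnets(supernet, count, rng):
    """Draw `count` distinct subnets, one bit width per activation layer."""
    total = 1
    for options in supernet.values():
        total *= len(options)
    if count > total:
        raise ValueError("supernet has fewer subnets than requested")
    seen = set()
    while len(seen) < count:
        seen.add(tuple(rng.choice(options) for options in supernet.values()))
    return sorted(seen)

supernet = build_supernet(NUM_ACTIVATION_LAYERS, BIT_CHOICES)
total_layers = sum(len(options) for options in supernet.values())   # 16 * 3 = 48
total_subnets = len(BIT_CHOICES) ** NUM_ACTIVATION_LAYERS           # 3**16 = 43046721
subnets = sample_subnets(supernet, 30, random.Random(0))
```

With more than 43 million candidate subnets, exhaustive evaluation is impractical, which motivates the search procedure described next.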

[0048] According to embodiments of this disclosure, in order to select a quantization subnet that meets preset conditions as the quantization model for all activation layers, a genetic algorithm can be used to select a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network based on the similarity of the data distribution between the output data of each activation layer and the output data of the quantization subnet in the quantization supernet.

[0049] More specifically, based on a first preset rule, several quantization subnets can be selected from the quantization supernet as individuals in the initial population to construct the initial population. Then, crossover and mutation operations can be iteratively performed on the individuals in the initial population until individuals that meet preset conditions are generated, which serve as the quantization model for all activation layers of the denoising network. The process of constructing the initial population and generating quantization subnets that meet preset conditions is described in detail below with reference to Figure 3 and Figure 4.

[0050] Figure 3 is a flowchart illustrating a method for constructing an initial population according to an embodiment of the present disclosure. Here, the process of constructing the initial population may correspond to the selection operation of a genetic algorithm.

[0051] Referring to Figure 3, in step S301, multiple quantization subnets (i.e., individuals) are randomly selected from the quantization supernet. For example, 30 quantization subnets can be randomly sampled from the quantization supernet.

[0052] In step S302, the similarity between the output data of each activation layer and the output data of multiple quantization subnets is determined according to the KS (Kolmogorov-Smirnov) test. Specifically, the P-value of the KS test can be used as the similarity between the output data of each activation layer and the output data of multiple quantization subnets. The larger the P-value, the higher the similarity between the data distributions of the output data before and after quantization, that is, the greater the probability that the data distributions of the output data before and after quantization belong to the same distribution. According to the embodiments of this disclosure, the similarity between the output data of each activation layer and the output data of the corresponding optional quantization layer in a quantization subnet can be determined, and all the obtained similarities can be statistically analyzed as the similarity between the output data of each activation layer and the output data of a quantization subnet. Alternatively, the similarity between the final output data of the network composed of all activation layers and the final output data of a quantization subnet can be determined as the similarity between the output data of each activation layer and the output data of a quantization subnet.
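The similarity measure of step S302 can be illustrated with a self-contained two-sample KS test. The P-value below uses the standard asymptotic Kolmogorov series with a small-sample correction, which is an approximation. The uniform toy data and the `fake_quant_unit` quantizer are assumptions standing in for real activation outputs. In practice a library routine such as SciPy's `ks_2samp` could be used instead.

```python
import math
import random

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:   # advance past ties in both samples
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n, m):
    """Asymptotic P-value of the two-sample KS test (approximation)."""
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:                 # the series is numerically 1 in this regime
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)

def fake_quant_unit(values, bits):
    """Uniform quantization of values in [0, 1] (assumed quantizer)."""
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in values]

rng = random.Random(0)
float_out = [rng.random() for _ in range(2000)]   # toy pre-quantization outputs
p_values = {}
for bits in (4, 8, 16):
    quant_out = fake_quant_unit(float_out, bits)
    d = ks_statistic(float_out, quant_out)
    p_values[bits] = ks_pvalue(d, len(float_out), len(quant_out))
```

A larger entry in `p_values` means the quantized outputs are harder to distinguish from the floating-point outputs, matching the interpretation of the P-value given above.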

[0053] In step S303, the N quantized subnets with the highest similarity are selected as individuals in the undetermined initial population and added to it. Here, N is a natural number greater than 1, and N is less than the number of randomly selected quantized subnets. For example, the 5 quantized subnets with the highest similarity can be selected as individuals in the undetermined initial population and added to it.

[0054] Next, in step S304, it is determined whether the number of individuals in the undetermined initial population has reached a preset number (for example, but not limited to, 100).

[0055] If the number of individuals in the undetermined initial population has not reached the preset number, the process returns to step S301 and multiple quantization subnets are again randomly selected from the quantization supernet. If the number of individuals in the undetermined initial population has reached the preset number, in step S305, the undetermined initial population is determined as the initial population.
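Steps S301 to S305 can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the bit-width options, the number of activation layers, the tuple encoding of a subnet, and `toy_similarity` (a hypothetical stand-in for the KS P-value) are not taken from the disclosure; only the sample size of 30, N = 5, and the preset number of 100 come from the examples above.

```python
import random

BIT_CHOICES = (4, 6, 8)  # assumed bit-width options per activation layer
NUM_LAYERS = 12          # assumed number of activation layers


def sample_subnet(rng):
    """A subnet picks one optional quantization layer (bit width) per activation layer."""
    return tuple(rng.choice(BIT_CHOICES) for _ in range(NUM_LAYERS))


def build_initial_population(similarity, rng, sample_size=30, top_n=5, target=100):
    population = []
    while len(population) < target:                                    # S304
        candidates = [sample_subnet(rng) for _ in range(sample_size)]  # S301
        candidates.sort(key=similarity, reverse=True)                  # S302
        population.extend(candidates[:top_n])                          # S303
    return population                                                  # S305


def toy_similarity(subnet):
    """Hypothetical stand-in for the KS P-value: more bits, higher score."""
    return sum(subnet) / (max(BIT_CHOICES) * len(subnet))


rng = random.Random(0)
initial_population = build_initial_population(toy_similarity, rng)
```

With 5 individuals added per round and a preset number of 100, the loop runs 20 rounds. The sketch does not remove duplicate subnets, since the text above does not say whether duplicates are allowed.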

[0056] Figure 4 is a flowchart illustrating a method for generating a quantization subnet that satisfies preset conditions as the quantization model for all activation layers, according to an embodiment of the present disclosure. Here, this generation process can correspond to the crossover and mutation operations of a genetic algorithm.

[0057] Referring to Figure 4, in step S401, crossover and mutation operations are performed on the individuals (i.e., the quantization subnets) in the current population to generate the next generation population. For example, crossover and mutation operations can be performed on the 100 individuals in the initial population to generate 200 individuals as the next generation population.

[0058] In step S402, the similarity between the data distribution of the output data of each activation layer and the data distribution of the output data of each individual in the next generation population is determined according to the KS test. For example, the similarity between the data distribution of the output data of each activation layer and the data distribution of the output data of each individual in the next generation population can be determined in a manner similar to step S302.

[0059] Next, in step S403, it is determined whether the highest similarity determined in step S402 is greater than a preset similarity threshold. For example, it can be determined whether the maximum P-value of the KS test determined in step S402 is greater than 0.95.

[0060] In response to the highest similarity being greater than the preset similarity threshold, in step S404, the individual with the highest similarity (i.e., a quantization subnet) is determined as the quantization model for all activation layers. In response to the highest similarity being less than or equal to the preset similarity threshold, in step S405, the next generation population is used as the current population, and the process returns to step S401 to repeat the crossover and mutation operations and generate a further next generation population.
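The loop formed by steps S401 to S405 can be sketched as follows. Several details are assumptions rather than part of the disclosure: single-point crossover, per-layer mutation with a fixed rate, binary tournament parent selection (the text names no selection rule), the `max_generations` safety cap, and `toy_similarity` as a hypothetical stand-in for the KS P-value. The threshold of 0.95 and the 100-to-200 population sizes follow the examples above.

```python
import random

BIT_CHOICES = (4, 6, 8)  # assumed bit-width options per activation layer
NUM_LAYERS = 12          # assumed number of activation layers


def crossover(parent_a, parent_b, rng):
    """Single-point crossover of two subnets."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(subnet, rng, rate=0.1):
    """Resample each layer's bit width with probability `rate`."""
    return tuple(rng.choice(BIT_CHOICES) if rng.random() < rate else bits
                 for bits in subnet)


def pick_parent(population, similarity, rng):
    """Binary tournament (an assumption; the source names no selection rule)."""
    return max(rng.sample(population, 2), key=similarity)


def evolve(population, similarity, rng, threshold=0.95, offspring=200,
           max_generations=50):
    current = list(population)
    best = max(current, key=similarity)
    for _generation in range(max_generations):  # safety cap, not in the source
        next_gen = []
        for _ in range(offspring):              # S401
            child = crossover(pick_parent(current, similarity, rng),
                              pick_parent(current, similarity, rng), rng)
            next_gen.append(mutate(child, rng))
        best = max(next_gen, key=similarity)    # S402
        if similarity(best) > threshold:        # S403
            return best                         # S404
        current = next_gen                      # S405
    return best  # best individual of the last generation if the cap is hit


def toy_similarity(subnet):
    """Hypothetical stand-in for the KS P-value: more bits, higher score."""
    return sum(subnet) / (max(BIT_CHOICES) * len(subnet))


rng = random.Random(0)
start = [tuple(rng.choice(BIT_CHOICES) for _ in range(NUM_LAYERS))
         for _ in range(100)]
chosen = evolve(start, toy_similarity, rng)
```

The generation cap only keeps the sketch from running indefinitely when no individual exceeds the threshold; the flow described above has no such cap.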

[0061] According to embodiments of this disclosure, a search space is established only for the activation layers, and the KS hypothesis test is used to determine whether the data distribution of the output of a quantization model is similar to the data distribution of the output of the activation layers before quantization, so that the optimal quantization model can be selected. Restricting the search to the activation layers greatly reduces the search space and search time. Thus, the quantization method for generative AI models according to embodiments of this disclosure can achieve a good balance between quantization effect and quantization efficiency: quantization efficiency is significantly improved without significantly reducing quantization effect, which is beneficial for deploying generative AI models on mobile electronic devices.

[0062] Figure 5 is a block diagram illustrating a quantization apparatus for a generative AI model according to an embodiment of the present disclosure.

[0063] Referring to Figure 5, the quantization apparatus 500 includes a calibration dataset generation unit 501, a first quantization parameter determination unit 502, a first quantization unit 503, a quantization supernet construction unit 504, and a quantization model determination unit 505.

[0064] The calibration dataset generation unit 501 inputs multiple preset sample images into the generative AI model and obtains the input data of the first and last layers of the denoising network included in the generative AI model, as the calibration dataset for quantization. Here, the denoising network can be a UNet model.

[0065] The first quantization parameter determination unit 502 performs initial quantization on the weights of each convolutional and linear layer of the denoising network, and determines the quantization parameters of each convolutional and linear layer based on the outputs of each convolutional and linear layer before and after quantization obtained from the calibration dataset.

[0066] According to embodiments of this disclosure, the first quantization parameter determination unit 502 may employ asymmetric quantization and main channel quantization methods to initially quantize the weights of each convolutional and linear layer of the denoising network. Optionally, the first quantization parameter determination unit 502 may obtain the outputs of each convolutional and linear layer of the denoising network before and after quantization based on a calibration dataset; calculate the loss value based on the outputs of each convolutional and linear layer before and after quantization; and adjust the quantization parameters of each convolutional and linear layer with the goal of minimizing the loss value. Here, the first quantization parameter determination unit 502 may calculate the mean square error of the outputs of each convolutional and linear layer before and after quantization as the loss value.
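The weight calibration performed by unit 502 can be sketched for a single linear layer. The sketch rests on several assumptions: "main channel quantization" is read here as quantizing each output channel separately, the adjustable quantization parameter is a clipping ratio chosen by grid search (the disclosure does not state how the parameters are adjusted), and the weights and calibration inputs are random placeholders.

```python
import random


def quantize_row(row, bits, clip):
    """Asymmetric quantize-dequantize of one output channel's weights."""
    center = (max(row) + min(row)) / 2.0
    half = (max(row) - min(row)) / 2.0 * clip
    lo, hi = center - half, center + half
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    quantized = []
    for w in row:
        q = min(max(round(w / scale) + zero_point, 0), qmax)
        quantized.append((q - zero_point) * scale)
    return quantized


def channel_mse(row, q_row, calib):
    """MSE between one channel's outputs before and after quantization."""
    errors = []
    for x in calib:
        y = sum(w * v for w, v in zip(row, x))
        y_q = sum(w * v for w, v in zip(q_row, x))
        errors.append((y - y_q) ** 2)
    return sum(errors) / len(errors)


def calibrate(weights, calib, bits=4, clips=(1.0, 0.9, 0.8, 0.7)):
    """Per channel, keep the clipping ratio with the lowest output MSE."""
    result = []
    for row in weights:
        candidates = [(clip, quantize_row(row, bits, clip)) for clip in clips]
        clip, q_row = min(candidates,
                          key=lambda c: channel_mse(row, c[1], calib))
        result.append((clip, q_row))
    return result


# Placeholder layer (4 output channels, 8 inputs) and calibration inputs.
rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
calib = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
tuned = calibrate(weights, calib)
```

Because the unclipped range (ratio 1.0) is one of the candidates, the selected parameters never give a higher output MSE on the calibration data than the initial min/max quantization.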

[0067] The first quantization unit 503 quantizes the weights of each convolutional and linear layer of the denoising network based on determined quantization parameters.

[0068] The quantization supernet construction unit 504 sets at least two optional quantization layers for each activation layer of the denoising network, and constructs all optional quantization layers of all activation layers into a quantization supernet, wherein the quantization bits of at least two optional quantization layers are different from each other.

[0069] The quantization model determination unit 505 uses a preset optimization algorithm to select a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network, based on the data distribution of the output data of each activation layer and the output data of the quantization subnet in the quantization supernet. The quantization subnet includes an optional quantization layer corresponding to each activation layer.
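The relationship between the quantization supernet and its quantization subnets can be pictured with a small data-structure sketch. The layer names, the dictionary encoding, and the bit-width options are hypothetical; the only properties taken from the text are that each activation layer has at least two optional quantization layers with different bit widths and that a subnet contains one option per activation layer.

```python
import math
import random


def build_supernet(layer_names, bit_choices=(4, 8)):
    """Map each activation layer to its optional quantization bit widths."""
    if len(set(bit_choices)) < 2:
        raise ValueError("each activation layer needs at least two distinct bit widths")
    return {name: tuple(bit_choices) for name in layer_names}


def count_subnets(supernet):
    """Size of the search space: one independent choice per activation layer."""
    return math.prod(len(choices) for choices in supernet.values())


def sample_subnet(supernet, rng):
    """A subnet selects exactly one optional quantization layer per activation layer."""
    return {name: rng.choice(choices) for name, choices in supernet.items()}


# Hypothetical activation layer names.
supernet = build_supernet(["act_0", "act_1", "act_2"])
subnet = sample_subnet(supernet, random.Random(0))
search_space = count_subnets(supernet)
```

With k options for each of L activation layers the supernet contains k^L subnets, which is why the search is handled by an optimization algorithm rather than by exhaustive enumeration.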

[0070] According to embodiments of this disclosure, the quantization model determination unit 505 can utilize a genetic algorithm to select a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network, based on the similarity of the data distribution between the output data of each activation layer and the output data of the quantization subnets in the quantization supernet. Optionally, the quantization model determination unit 505 can select several quantization subnets from the quantization supernet as individuals based on a first preset rule to construct an initial population, and iteratively perform crossover and mutation operations on the individuals in the initial population until an individual that meets the preset conditions is generated as the quantization model for all activation layers of the denoising network.

[0071] According to embodiments of this disclosure, the quantization model determination unit 505 can randomly select multiple quantization subnets from the quantization supernet; determine the similarity of the data distribution between the output data of each activation layer and the output data of the selected multiple quantization subnets based on the KS test; add the N quantization subnets with the highest similarity as individuals to the undetermined initial population, where N is a natural number greater than 1 and N is less than the number of selected multiple quantization subnets; determine whether the number of individuals in the undetermined initial population reaches a preset number; in response to the number of individuals in the undetermined initial population not reaching the preset number, return to the operation of randomly selecting multiple quantization subnets from the quantization supernet; in response to the number of individuals in the undetermined initial population reaching the preset number, determine the undetermined initial population as the initial population.

[0072] According to embodiments of this disclosure, the quantization model determination unit 505 can perform crossover and mutation operations on individuals in the current population to generate the next generation population; determine the similarity between the data distribution of the output data of each activation layer and the data distribution of the output data of each individual in the next generation population based on the KS test; determine whether the highest similarity is greater than a preset similarity threshold; in response to the highest similarity being greater than the preset similarity threshold, determine the individual with the highest similarity as the quantization model of all activation layers of the denoising network; in response to the highest similarity being less than or equal to the preset similarity threshold, take the next generation population as the current population, and return to perform the crossover and mutation operations on individuals in the current population.

[0073] The effects of the quantization method and quantization apparatus for generative AI models according to embodiments of the present disclosure are briefly described below.

[0074] Model quantization performance is generally judged by two metrics: PSNR (Peak Signal-to-Noise Ratio) and FID (Fréchet Inception Distance). PSNR is one of the metrics for measuring image quality; a higher PSNR value indicates better image quality. FID is a metric used to evaluate the quality of a model that generates images; a lower FID value indicates that the model generates images of better quality. Model quantization efficiency, on the other hand, is measured by BMAC (bit / MAC, where MAC denotes multiply-accumulate operations). For a fixed model, BMAC is a key factor affecting model inference time. Compared with the best existing model quantization methods, the quantization method according to the embodiments of this disclosure does not significantly change the PSNR and FID metrics, but significantly improves inference efficiency and significantly reduces memory access. Therefore, the generative AI model quantization method and quantization apparatus according to the embodiments of this disclosure can achieve a good balance between quantization performance and quantization efficiency.
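For reference, PSNR is commonly computed from the mean squared error between a reference image and a test image as 10·log10(MAX² / MSE). The sketch below uses that standard definition on flat lists of pixel values; the disclosure does not state how its PSNR figures are computed, so the 8-bit peak value of 255 and the pairing of a full-precision output with a quantized output are assumptions.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixels from the full-precision model and from a quantized model.
full_precision = [12, 200, 37, 255]
quantized = [13, 198, 37, 254]
score = psnr(full_precision, quantized)
```

A higher value means the quantized model's image is closer to the full-precision image, which matches the reading of PSNR given above.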

[0075] Figure 6 is a block diagram illustrating a computing device according to an embodiment of the present disclosure. Referring to Figure 6, the computing device 600 includes a processor 601 and a memory 602. The memory 602 stores computer-executable instructions. When executed by the processor 601, the computer-executable instructions cause the processor 601 to perform a quantization method for a generative AI model according to embodiments of the present disclosure.

[0076] As an example, computing device 600 may be a PC, tablet, personal digital assistant, smartphone, or other device capable of executing the aforementioned set of instructions. Here, computing device 600 is not necessarily a single electronic device, but may be any collection of devices or circuits capable of executing the aforementioned instructions (or instruction sets) individually or in combination. Computing device 600 may also be part of an integrated control system or system manager, or may be configured to interconnect with a portable electronic device locally or remotely (e.g., via wireless transmission) through an interface. Furthermore, computing device 600 may include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of computing device 600 may be interconnected via a bus and/or network.

[0077] Processor 601 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, processor 601 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, etc.

[0078] The processor 601 can execute instructions or code stored in memory 602, which can also store data. Instructions and data can also be sent and received via a network through a network interface device, which can employ any known transmission protocol.

[0079] The memory 602 may be integrated with the processor 601, for example, by placing RAM or flash memory within an integrated circuit microprocessor. Alternatively, the memory 602 may include a separate device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory 602 and the processor 601 may be operatively coupled, or may communicate with each other, for example, via I/O ports, network connections, etc., enabling the processor 601 to read files stored in the memory 602.

[0080] According to embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein the instructions, when executed by a processor, cause the processor to perform a quantization method for a generative AI model according to embodiments of the present disclosure.

[0081] The quantization method for generative AI models according to embodiments of this disclosure can be written as a computer program and stored on a computer-readable storage medium. Examples of computer-readable storage media include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card storage (such as multimedia cards, secure digital (SD) cards, or extreme digital (xD) cards), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state drive, and any other device configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer programs. In one example, the computer programs and any associated data, data files, and data structures are distributed across a networked computer system, such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner through one or more processors or computers.

[0082] The specific embodiments of this disclosure have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that modifications and variations can be made to these embodiments without departing from the principles and spirit of this disclosure, which are defined by the claims and their equivalents. Such modifications and variations should also be within the protection scope of the claims of this disclosure.

Claims

1. A quantization method for generative AI models, characterized in that the quantization method includes: inputting multiple preset sample images into the generative AI model, and obtaining the input data of the first and last layers of the denoising network included in the generative AI model as a calibration dataset for quantization; performing initial quantization on the weights of each convolutional and linear layer of the denoising network, and determining the quantization parameters of each convolutional and linear layer based on the outputs of each convolutional and linear layer before and after quantization obtained from the calibration dataset; quantizing the weights of each convolutional and linear layer of the denoising network based on the determined quantization parameters; setting at least two optional quantization layers for each activation layer of the denoising network, and constructing all optional quantization layers of all activation layers into a quantization supernet, wherein the quantization bits of the at least two optional quantization layers are different from each other; and using a preset optimization algorithm to select, based on the data distribution of the output data of each activation layer and the output data of the quantization subnets in the quantization supernet, a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network, wherein the quantization subnet includes an optional quantization layer corresponding to each activation layer.

2. The quantization method as described in claim 1, characterized in that the step of performing initial quantization on the weights of each convolutional and linear layer of the denoising network includes: using asymmetric quantization and main channel quantization methods to initially quantize the weights of each convolutional and linear layer of the denoising network.

3. The quantization method as described in claim 1, characterized in that the step of determining the quantization parameters includes: obtaining, based on the calibration dataset, the outputs of each convolutional and linear layer of the denoising network before and after quantization; calculating a loss value based on the outputs of each convolutional and linear layer before and after quantization; and adjusting the quantization parameters of each convolutional and linear layer with the goal of minimizing the loss value.

4. The quantization method as described in claim 3, characterized in that the mean squared error of the outputs of each convolutional and linear layer before and after quantization is calculated and used as the loss value.

5. The quantization method as described in claim 1, characterized in that the step of selecting a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network includes: using a genetic algorithm to select, based on the similarity of the data distribution between the output data of each activation layer and the output data of the quantization subnets in the quantization supernet, a quantization subnet that meets the preset conditions as the quantization model for all activation layers of the denoising network.

6. The quantization method as described in claim 5, characterized in that the step of selecting a quantization subnet that meets the preset conditions further includes: selecting, based on a first preset rule, several quantization subnets from the quantization supernet as individuals in an initial population to construct the initial population; and iteratively performing crossover and mutation operations on the individuals in the initial population until an individual that meets the preset conditions is generated as the quantization model for all activation layers of the denoising network.

7. The quantization method as described in claim 6, characterized in that the step of selecting, based on the first preset rule, several quantization subnets from the quantization supernet as individuals in the initial population to construct the initial population includes: randomly selecting multiple quantization subnets from the quantization supernet; determining, based on the KS test, the similarity of the data distribution between the output data of each activation layer and the output data of the multiple quantization subnets; adding the N quantization subnets with the highest similarity as individuals to the undetermined initial population, where N is a natural number greater than 1 and N is less than the number of the multiple quantization subnets; determining whether the number of individuals in the undetermined initial population has reached a preset number; in response to the number of individuals in the undetermined initial population not reaching the preset number, returning to the step of randomly selecting multiple quantization subnets from the quantization supernet; and in response to the number of individuals in the undetermined initial population reaching the preset number, determining the undetermined initial population as the initial population.

8. The quantization method as described in claim 6, characterized in that the step of iteratively performing crossover and mutation operations on the individuals in the initial population until an individual that meets the preset conditions is generated as the quantization model for all activation layers of the denoising network includes: performing crossover and mutation operations on the individuals in the current population to generate the next generation population; determining, based on the KS test, the similarity between the data distribution of the output data of each activation layer and the data distribution of the output data of each individual in the next generation population; determining whether the highest similarity is greater than a preset similarity threshold; in response to the highest similarity being greater than the preset similarity threshold, determining the individual with the highest similarity as the quantization model for all activation layers of the denoising network; and in response to the highest similarity being less than or equal to the preset similarity threshold, taking the next generation population as the current population and returning to the step of performing crossover and mutation operations on the individuals in the current population.

9. The quantization method according to any one of claims 1-8, characterized in that the denoising network is a UNet model.

10. A quantization device for a generative AI model, characterized in that the quantization device includes: a calibration dataset generation unit configured to input multiple preset sample images into the generative AI model, and obtain the input data of the first and last layers of the denoising network included in the generative AI model as a calibration dataset for quantization; a first quantization parameter determination unit configured to perform initial quantization on the weights of each convolutional and linear layer of the denoising network, and determine the quantization parameters of each convolutional and linear layer based on the outputs of each convolutional and linear layer before and after quantization obtained from the calibration dataset; a first quantization unit configured to quantize the weights of each convolutional and linear layer of the denoising network based on the determined quantization parameters; a quantization supernet construction unit configured to set at least two optional quantization layers for each activation layer of the denoising network, and construct all optional quantization layers of all activation layers into a quantization supernet, wherein the quantization bits of the at least two optional quantization layers are different from each other; and a quantization model determination unit configured to use a preset optimization algorithm to select, based on the data distribution of the output data of each activation layer and the output data of the quantization subnets in the quantization supernet, a quantization subnet that meets preset conditions as the quantization model for all activation layers of the denoising network, wherein the quantization subnet includes an optional quantization layer corresponding to each activation layer.

11. A computer-readable storage medium storing instructions, characterized in that, when the instructions are executed by a processor, the quantization method as described in any one of claims 1-9 is implemented.

12. A computing device, characterized in that the computing device includes: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the quantization method as described in any one of claims 1-9.