An image processing method based on a multi-scaling factor quantization visual Transformer model

By optimizing the visual Transformer model through multi-scaling factor quantization and iterative grid search strategies, the problems of high computational complexity and accuracy loss on resource-constrained devices are solved, and the model achieves efficient inference performance and accuracy improvement under low bit quantization.

CN119963970B · Active · Publication Date: 2026-06-30 · UNIT 32002 OF THE CHINESE PEOPLES LIBERATION ARMY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: UNIT 32002 OF THE CHINESE PEOPLES LIBERATION ARMY
Filing Date: 2024-12-31
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing visual Transformer models have high computational complexity on resource-constrained devices, and quantization with a single scaling factor causes significant accuracy loss. In particular, when the activation distribution is asymmetric, a single scaling factor cannot capture the activation features evenly.

Method used

A multi-scaling factor quantization method is adopted: a logarithmic quantizer is constructed for the post-softmax activation layer, and logarithmic and uniform quantizers are constructed for different value ranges of the post-GELU activation layer. The optimal scaling factor and logarithmic base are selected dynamically through an iterative grid search strategy to optimize the quantization process.

Benefits of technology

It significantly improves the inference performance and accuracy of the visual Transformer model under low bit quantization conditions, adapts to resource-constrained environments, and maintains high efficiency.



Abstract

The application discloses an image processing method based on a visual Transformer model with grouped quantization, and belongs to the field of image processing. The method comprises the following steps: generating an image dataset and dividing the image dataset into a training set and a test set; using image instances in the training set, performing quantization scaling training on a visual Transformer model based on a model quantization bit width and a quantization scaling factor allowed by a resource-limited device carrying the visual Transformer model, to obtain a visual Transformer model based on multi-scaling factor quantization; and testing image instances in the test set by using the visual Transformer model based on multi-scaling factor quantization. The application improves the inference performance and accuracy of the visual Transformer model under low-bit quantization while retaining the high-efficiency characteristics of the visual Transformer model, and provides technical support for image processing in a resource-limited environment.

Description

Technical Field

[0001] This invention belongs to the field of image processing technology, and in particular relates to an image processing method based on a visual Transformer model with multiple scaling factor quantization.

Background Technology

[0002] In recent years, Transformer models have made significant progress in the field of image processing. The self-attention mechanism of Transformers enables them to efficiently handle long-range dependencies and global features, demonstrating powerful capabilities, especially in Natural Language Processing (NLP), where they excel in tasks such as text generation, summarization, translation, and question answering. However, these large-scale Transformer models rely on substantial computational resources, massive amounts of training data, and long training times, making them difficult to apply directly in resource-constrained scenarios.

[0003] The Visual Transformer (ViT) model is a deep neural network based on a self-attention mechanism, which can effectively handle global features, especially excelling in tasks such as image classification, semantic segmentation, and object detection. Compared to traditional convolutional neural networks (CNNs), ViT is better at modeling long-range dependencies in images, giving it an advantage in processing global contextual information. However, the computational complexity of the ViT model is high, especially in the multi-head self-attention mechanism, where computational complexity increases quadratically with the number of image patches. This limits its application on resource-constrained devices, despite its excellent performance on high-performance hardware.
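As a rough illustration of the quadratic growth described in [0003], the sketch below counts image patches and the entries of one attention map per head. The 16x16 patch size and the two image sizes are assumptions chosen for illustration; they are not values stated in the patent.

```python
def attention_entries(image_size, patch_size=16):
    """Return (number of patches N, entries in one N x N attention map)."""
    patches_per_side = image_size // patch_size
    n = patches_per_side ** 2
    return n, n * n


# Doubling the image side quadruples N and multiplies the attention map by 16.
for size in (224, 448):
    n, entries = attention_entries(size)
    print(f"{size}x{size} image: {n} patches, {entries} attention entries per head")
```

Under these assumptions, the attention map grows with the square of the patch count, which is the cost that low-bit quantization of the post-softmax activations is meant to reduce.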

[0004] In practical applications, especially in resource-constrained scenarios such as drones and edge devices, the high computational demands and inference latency of ViT models become bottlenecks. These scenarios require models to maintain high accuracy while possessing rapid response capabilities. To address this challenge, quantization methods have become an important means of improving the efficiency of ViT models.

[0005] By quantizing model parameters and activation values, especially by employing low-bit quantization and multi-scale quantization factor techniques, the computational complexity and memory requirements of the model can be significantly reduced, thus better adapting to resource-constrained devices while maintaining relatively high accuracy.

[0006] Existing quantization methods for visual Transformer models typically employ a single scaling factor and use bias reparameterization to adapt the model to that single scaling factor. While these methods can simplify implementation in some cases, bias reparameterization introduces errors that accumulate at each layer, affecting the accuracy of the quantized model and ultimately reducing image processing accuracy while increasing complexity. This cumulative effect is particularly pronounced in Transformer models with complex activation distributions. For post-softmax activations, the output is often highly concentrated, leading to significant accuracy loss during quantization. Furthermore, the positive and negative post-GELU activations are distributed differently: positive activation values are mostly concentrated in the positive range, while negative activation values are more dispersed. With such an asymmetric activation distribution, quantization using a single scaling factor may fail to capture the characteristics of the different activations evenly.

Summary of the Invention

[0007] To address the aforementioned technical problems, this invention proposes an image processing scheme based on a visual Transformer model using multi-scaling factor quantization.

[0008] The first aspect of this invention discloses an image processing method based on a visual Transformer model with multiple scaling factor quantization, the method comprising:

[0009] Step S1: Generate an image dataset and divide the image dataset into a training set and a test set; wherein, the image dataset includes several image instances;

[0010] Step S2: Using image instances in the training set, based on the model quantization bit width and quantization scaling factor allowed by the resource-constrained device carrying the visual Transformer model, perform quantization scaling training on the visual Transformer model to obtain a visual Transformer model based on multi-scaling factor quantization.

[0011] Step S3: Use the visual Transformer model based on multi-scaling factor quantization to test the image instances in the test set.

[0012] According to a method of a first aspect of the present invention, the visual Transformer model includes a normalization layer, an attention layer, a post-softmax activation layer, a linear transformation layer, a feedforward neural network layer, a post-GELU activation layer, a Transformer processing layer, and an output layer; wherein:

[0013] In step S2, a multi-scaling factor quantizer is constructed for the post-softmax activation layer and the post-GELU activation layer. When the post-softmax activation layer and the post-GELU activation layer process image instances in the training set, the multi-scaling factor quantizer is used for quantization scaling training.

[0014] According to the method of the first aspect of the present invention, in step S2, when constructing the multi-scaling factor quantizer:

[0015] The activation value distribution characteristics of the post-softmax activation layer, the post-GELU positive activation layer, and the post-GELU negative activation layer are analyzed separately.

[0016] Based on the analysis results of the distribution characteristics, a logarithmic quantizer is constructed for the post-softmax activation layer, a uniform quantizer is constructed for the post-GELU negative activation layer, a logarithmic quantizer is constructed for post-GELU positive activations distributed between 0 and 1, and a uniform quantizer is constructed for post-GELU positive activations greater than 1.

[0017] The individual logarithmic quantizers and uniform quantizers are integrated into a multi-scaling factor quantizer.

[0018] According to the method of the first aspect of the present invention, in step S2, for the logarithmic quantizer:

[0019] Given an activation A, a logarithmic base b, a model quantization bit width bit, and a quantization scaling factor s, the quantization process for activation A is represented as follows:

[0020]

[0021] The dequantization process is represented as:

[0022]

[0023] where the quantizer output is the quantized activation value, A is the original activation value, b is the adaptive logarithmic base, s is the scaling factor, and bit is the quantization bit width.
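The expressions for [0020] and [0022] are not reproduced in this text. As a hedged sketch only, a logarithmic quantizer that uses the symbols defined in [0023] is commonly written as below. The symbols A_q and \hat{A}, the negative sign of the logarithm, and the clipping range [0, 2^{bit} - 1] are assumptions for illustration, not the patent's confirmed expressions.

```latex
A_{q} = \operatorname{clip}\!\left( \left\lfloor -\log_{b}\frac{A}{s} \right\rceil,\; 0,\; 2^{bit}-1 \right),
\qquad
\hat{A} = s \cdot b^{-A_{q}}
```

In this form, activations close to zero receive the finest resolution, which suits a distribution concentrated near zero such as the post-softmax output.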

[0024] According to the method of the first aspect of the present invention, in step S2, for the uniform quantizer:

[0025] Given an activation A', a model quantization bit width bit, and a quantization scaling factor s', the quantization process for activation A' is represented as follows:

[0026] Quantization:

[0027] The dequantization process is represented as:

[0028]

[0029] where the quantizer output is the quantized activation value, A' is the original activation value, s' is the scaling factor, and bit is the quantization bit width.
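The expressions for [0026] and [0028] are likewise not reproduced in this text. A standard uniform quantizer that uses the symbols of [0029] can be sketched as follows. The symbols A'_q and \hat{A}', the unsigned clipping range, and the absence of a zero-point term are assumptions, not the patent's confirmed expressions.

```latex
A'_{q} = \operatorname{clip}\!\left( \left\lfloor \frac{A'}{s'} \right\rceil,\; 0,\; 2^{bit}-1 \right),
\qquad
\hat{A}' = s' \cdot A'_{q}
```

Here the quantization levels are equally spaced by s', which suits activations spread relatively evenly over their range.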

[0030] According to the method of the first aspect of the present invention, in step S2, the scaling factor and the logarithmic base are dynamically selected by an iterative grid search strategy, which specifically includes:

[0031] The search space of the scaling factor and zero point is divided uniformly to generate an initial grid, where each grid point represents a combination of hyperparameters, and the candidate set is the set of grid points.

[0032] For each hyperparameter combination in the grid, a quantization loss function is calculated based on the image instance, and the loss value of each combination is recorded. The hyperparameter combination with the smallest loss is selected as the initial search result based on the quantization loss.

[0033] Based on the optimal hyperparameter combination, a smaller search grid is generated, reducing the grid spacing and refining the search range to cover more hyperparameter combinations with finer granularity; a new candidate set is generated using the refined grid to cover the neighborhood of the best hyperparameter in the initial search.

[0034] The same quantization loss calculation and search are performed on the refined grid. The search range is gradually narrowed through continuous iteration until the search range reaches the preset range. The hyperparameter combination with the minimum quantization loss is selected as the final optimal solution.

[0035] According to the method of the first aspect of the present invention, the visual Transformer model includes ViT-S, DeiT-T, and Swin-S models; the image dataset includes the corresponding image sets of ImageNet and COCO.

[0036] A second aspect of this invention discloses an image processing system based on a visual Transformer model with multiple scaling factor quantization, the system comprising a processing unit configured to perform:

[0037] An image dataset is generated and divided into a training set and a test set; wherein the image dataset includes several image instances;

[0038] Using image instances in the training set, the visual Transformer model is trained by quantization scaling based on the model quantization bit width and quantization scaling factor allowed by the resource-constrained device carrying the visual Transformer model, to obtain a visual Transformer model based on multi-scaling factor quantization.

[0039] The image instances in the test set are tested using the visual Transformer model based on multi-scaling factor quantization.

[0040] A third aspect of this invention discloses an electronic device. The electronic device includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the image processing method based on a visual Transformer model with multi-scaling factor quantization as described in the first aspect of this disclosure.

[0041] A fourth aspect of this invention discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program, which, when executed by a processor, implements the image processing method based on a visual Transformer model with multi-scaling factor quantization as described in the first aspect of this disclosure.

[0042] In summary, the technical solution provided by this invention aims to improve the performance and accuracy of the quantized model. Specifically, for the processing of post-softmax activation values, a logarithmic quantizer is employed to more effectively capture the characteristics of the activation values, thereby reducing accuracy loss during quantization. Furthermore, the method is specifically optimized for the characteristics of post-GELU activation values. During quantization, the positive and negative activation values are processed using a logarithmic quantizer and a uniform quantizer, respectively, thus fully utilizing the advantages of both quantization methods to achieve a more accurate representation of the activation values. To ensure that each part uses the most suitable scaling factor, an iterative grid search strategy is introduced. This strategy can efficiently explore the potential scaling factor space and quickly locate the optimal scaling factor configuration. This method can significantly improve the inference performance and accuracy of the visual Transformer model under low-bit quantization conditions while preserving its high efficiency characteristics, providing reliable technical support for the application of the model in resource-constrained environments.

Attached Figure Description

[0043] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0044] Figure 1 is a schematic diagram of multi-scaling factor quantization for a visual Transformer model.

Detailed Implementation

[0045] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0046] Existing quantization methods for visual Transformer models often employ a single scaling factor and then reparameterize using bias to make them applicable to that single scaling factor. However, bias reparameterization introduces errors that accumulate layer by layer. To address this, this invention proposes an image processing scheme for visual Transformer models based on multi-scaling factor quantization.

[0047] The first aspect of this invention discloses an image processing method based on a visual Transformer model with multiple scaling factor quantization, the method comprising:

[0048] Step S1: Generate an image dataset and divide the image dataset into a training set and a test set; wherein, the image dataset includes several image instances;

[0049] Step S2: Using image instances in the training set, based on the model quantization bit width and quantization scaling factor allowed by the resource-constrained device carrying the visual Transformer model, perform quantization scaling training on the visual Transformer model to obtain a visual Transformer model based on multi-scaling factor quantization.

[0050] Step S3: Use the visual Transformer model based on multi-scaling factor quantization to test the image instances in the test set.

[0051] According to a method of a first aspect of the present invention, the visual Transformer model includes a normalization layer, an attention layer, a post-softmax activation layer, a linear transformation layer, a feedforward neural network layer, a post-GELU activation layer, a Transformer processing layer, and an output layer; wherein:

[0052] In step S2, a multi-scaling factor quantizer is constructed for the post-softmax activation layer and the post-GELU activation layer. When the post-softmax activation layer and the post-GELU activation layer process image instances in the training set, the multi-scaling factor quantizer is used for quantization scaling training.

[0053] According to the method of the first aspect of the present invention, in step S2, when constructing the multi-scaling factor quantizer:

[0054] The activation value distribution characteristics of the post-softmax activation layer, the post-GELU positive activation layer, and the post-GELU negative activation layer are analyzed separately.

[0055] Based on the analysis results of the distribution characteristics, a logarithmic quantizer is constructed for the post-softmax activation layer, a uniform quantizer is constructed for the post-GELU negative activation layer, a logarithmic quantizer is constructed for post-GELU positive activations distributed between 0 and 1, and a uniform quantizer is constructed for post-GELU positive activations greater than 1.

[0056] The individual logarithmic quantizers and uniform quantizers are integrated into a multi-scaling factor quantizer.

[0057] According to the method of the first aspect of the present invention, in step S2, for the logarithmic quantizer:

[0058] Given an activation A, a logarithmic base b, a model quantization bit width bit, and a quantization scaling factor s, the quantization process for activation A is represented as follows:

[0059]

[0060] The dequantization process is represented as:

[0061]

[0062] where the quantizer output is the quantized activation value, A is the original activation value, b is the adaptive logarithmic base, s is the scaling factor, and bit is the quantization bit width.

[0063] According to the method of the first aspect of the present invention, in step S2, for the uniform quantizer:

[0064] Given an activation A', a model quantization bit width bit, and a quantization scaling factor s', the quantization process for activation A' is represented as follows:

[0065] Quantization:

[0066] The dequantization process is represented as:

[0067]

[0068] where the quantizer output is the quantized activation value, A' is the original activation value, s' is the scaling factor, and bit is the quantization bit width.

[0069] According to the method of the first aspect of the present invention, in step S2, the scaling factor and the logarithmic base are dynamically selected by an iterative grid search strategy, which specifically includes:

[0070] The search space of the scaling factor and zero point is divided uniformly to generate an initial grid, where each grid point represents a combination of hyperparameters, and the candidate set is the set of grid points.

[0071] For each hyperparameter combination in the grid, a quantization loss function is calculated based on the image instance, and the loss value of each combination is recorded. The hyperparameter combination with the smallest loss is selected as the initial search result based on the quantization loss.

[0072] Based on the optimal hyperparameter combination, a smaller search grid is generated, reducing the grid spacing and refining the search range to cover more hyperparameter combinations with finer granularity; a new candidate set is generated using the refined grid to cover the neighborhood of the best hyperparameter in the initial search.

[0073] The same quantization loss calculation and search are performed on the refined grid. The search range is gradually narrowed through continuous iteration until the search range reaches the preset range. The hyperparameter combination with the minimum quantization loss is selected as the final optimal solution.

[0074] According to the method of the first aspect of the present invention, the visual Transformer model includes ViT-S, DeiT-T, and Swin-S models; the image dataset includes the corresponding image sets of ImageNet and COCO.

[0075] First embodiment (as shown in Figure 1)

[0076] S1: Start by selecting a full-precision visual Transformer model that has been pre-trained on a large dataset, such as ViT-S, DeiT-T, and Swin-S, and prepare a calibration dataset (a corresponding subset of ImageNet or COCO). This dataset typically contains hundreds to thousands of images to determine the quantization parameters.

[0077] S2: Analyze the activation values in the Transformer model, focusing on the activation distributions of the different modules. Specifically, across modules, the post-softmax activation distribution lies entirely between zero and one and is mostly concentrated near zero. The positive post-GELU activation distribution is mostly concentrated near zero but spans a relatively wide range. The negative post-GELU activation distribution lies in the interval (-0.17, 0], with most values toward the two ends but otherwise relatively uniform. The conclusion is that the post-softmax, positive post-GELU, and negative post-GELU activation distributions differ significantly.
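The lower bound of about -0.17 quoted for the negative post-GELU activations follows from the shape of the GELU function itself. The sketch below checks this numerically using the exact GELU definition x·Φ(x); the scan range and step size are arbitrary choices made for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Scan a dense grid over [-6, 6] and keep the input with the smallest output.
xs = [i / 1000.0 for i in range(-6000, 6001)]
x_min = min(xs, key=gelu)
print(f"min GELU = {gelu(x_min):.4f} at x = {x_min:.3f}")
```

Because the negative outputs are confined to such a narrow interval, a uniform quantizer with its own small scaling factor can cover them with fine resolution.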

[0078] S3: Based on the different activation distribution characteristics, a multi-scaling factor quantization method is designed for the post-Softmax and post-GELU layers, applying uniform quantization and logarithmic quantization independently to different parts of the post-GELU activation. Specifically, across modules, the post-Softmax activation distribution lies between zero and one, is mostly concentrated near zero, and follows a power law; therefore, a logarithmic quantizer is designed for it. The post-GELU negative activation distribution lies within the interval (-0.17, 0]; therefore, a uniform quantizer is designed for it. The post-GELU positive activation distribution is divided into two parts: one part mostly lies between zero and one, and the other part is greater than one. If logarithmic quantizers were designed for both parts, values less than one and values greater than one could receive the same quantization result; therefore, a logarithmic quantizer is used for the part between zero and one and a uniform quantizer is used for the part greater than one.

[0079] Logarithmic quantizer: given an activation A, a logarithmic base b, a quantization bit width bit, and a scaling factor s, the quantization and dequantization of A are formulated as (1) and (2):

[0080] Quantization:

[0081] Dequantization:

[0082] where the quantizer output is the quantized activation value, A is the original activation value, b is the adaptive logarithmic base, s is the scaling factor, and bit is the quantization bit width.
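Formulas (1) and (2) are not reproduced in this text, so the following is a minimal sketch of a logarithmic quantizer under assumptions, not the patent's confirmed implementation. The function names `log_quantize` and `log_dequantize`, the sign convention, the clipping range, and the handling of non-positive inputs are all hypothetical.

```python
import math


def log_quantize(a, s, b, bit):
    """Map a positive activation to an integer code in [0, 2**bit - 1].

    Assumed form: code = clip(round(-log_b(a / s)), 0, 2**bit - 1).
    """
    max_code = 2 ** bit - 1
    if a <= 0.0:
        return max_code  # assumption: non-positive inputs take the smallest level
    code = round(-math.log(a / s, b))
    return max(0, min(max_code, code))


def log_dequantize(code, s, b):
    """Recover the approximate activation s * b**(-code)."""
    return s * b ** (-code)


# With s = 1, every value at or above 1 is clipped to code 0.
for a in (0.01, 0.25, 1.0, 3.0):
    q = log_quantize(a, s=1.0, b=2.0, bit=4)
    print(a, q, log_dequantize(q, s=1.0, b=2.0))
```

In this sketch the inputs 1.0 and 3.0 both map to code 0, which illustrates why the part of the post-GELU activation greater than one is given a uniform quantizer instead.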

[0083] Uniform quantizer: given an activation A, a quantization bit width bit, and a scaling factor s, the quantization and dequantization of A are formulated as (3) and (4):

[0084] Quantization:

[0085] Dequantization:

[0086] where the quantizer output is the quantized activation value, A is the original activation value, s is the scaling factor, and bit is the quantization bit width.
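Formulas (3) and (4) are also not reproduced here. The sketch below shows a standard unsigned uniform quantizer that uses the same symbols; the function names, the clipping range, and the omission of a zero point are assumptions made for illustration.

```python
def uniform_quantize(a, s, bit):
    """Map an activation to an integer code in [0, 2**bit - 1].

    Assumed form: code = clip(round(a / s), 0, 2**bit - 1).
    """
    code = round(a / s)
    return max(0, min(2 ** bit - 1, code))


def uniform_dequantize(code, s):
    """Recover the approximate activation s * code."""
    return s * code


for a in (1.3, 100.0, -1.0):
    q = uniform_quantize(a, s=0.5, bit=4)
    print(a, q, uniform_dequantize(q, s=0.5))
```

Note that Python's built-in `round` rounds exact halves to the nearest even integer, so a production implementation may prefer an explicit rounding rule.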

[0087] The logarithmic quantizer and uniform quantizer designed above are integrated into a multi-scaling factor quantizer, which is then applied separately to the post-Softmax activation and the post-GELU activation of the visual Transformer model.
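One way to picture the integration described in [0087] is a function that routes each post-GELU activation to the quantizer for its value range, each with its own scaling factor. This is a sketch under assumptions: the name `fake_quant_gelu`, the parameters `s_neg`, `s_log`, and `s_pos`, the quantization of the negative part by magnitude, and the assumed quantizer forms are all hypothetical and not taken from the patent.

```python
import math


def _clip(code, bit):
    """Clamp an integer code to the unsigned range [0, 2**bit - 1]."""
    return max(0, min(2 ** bit - 1, code))


def fake_quant_gelu(a, s_neg, s_log, b, s_pos, bit):
    """Quantize then dequantize one post-GELU activation, region by region."""
    if a < 0.0:
        # Negative part, in (-0.17, 0]: uniform quantizer on the magnitude.
        code = _clip(round(-a / s_neg), bit)
        return -s_neg * code
    if a == 0.0:
        return 0.0
    if a <= 1.0:
        # Positive part in (0, 1]: logarithmic quantizer with base b.
        code = _clip(round(-math.log(a / s_log, b)), bit)
        return s_log * b ** (-code)
    # Positive part above 1: uniform quantizer with its own scaling factor.
    code = _clip(round(a / s_pos), bit)
    return s_pos * code


params = dict(s_neg=0.17 / 15, s_log=1.0, b=2.0, s_pos=0.5, bit=4)
for a in (-0.1, 0.0, 0.5, 2.0):
    print(a, fake_quant_gelu(a, **params))
```

The post-softmax activation, which lies entirely between zero and one, would use only the logarithmic branch in this picture.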

[0088] S4: To further improve quantization accuracy and speed, the scaling factor and the base of the logarithm should be dynamically adjusted.

[0089] S5: Apply step S3 to the input visual Transformer model, that is, use the multi-scaling factor quantizer for the post-Softmax and post-GELU activations, combined with the iterative grid search strategy of S4, which dynamically selects the scaling factor and logarithmic base. All parts other than the post-Softmax and post-GELU activations are quantized with a uniform quantizer and the iterative grid search strategy. Finally, the quantized visual Transformer model is output as the result.

[0090] The above outputs are deployed on resource-constrained devices such as drones to perform image classification tasks on the ImageNet dataset and object detection tasks on the COCO dataset. This achieves significant model compression and inference speed improvements without significantly reducing the accuracy of these tasks.

[0091] Second embodiment (iterative grid search strategy for dynamically selecting the scaling factor and logarithmic base)

[0092] Iterative grid search strategy:

[0093] Step 1: First, uniformly divide the two hyperparameters (such as scaling factor and zero point) in the entire search space to generate an initial grid. Each grid point represents a combination of hyperparameters, and the candidate set is the set of these grid points.

[0094] Step 2: For each hyperparameter combination in the grid, calculate the quantization loss function, i.e., mean square error, based on the calibration data, and record the loss value of each combination. Based on the quantization loss, select the hyperparameter combination with the smallest loss, which will be used as the preliminary search result.

[0095] Step 3: Generate a smaller search grid around the optimal hyperparameter combination found in Step 2. Specifically, reduce the grid spacing to refine the search range and cover more fine-grained hyperparameter combinations. Use the refined grid to generate a new candidate set to cover the neighborhood of the optimal hyperparameter in the initial search.

[0096] Step 4: Repeat Steps 2 and 3, performing the same quantization loss calculation and search on the refined grid. Each iteration performs a finer search near the optimal hyperparameters of the previous round, gradually narrowing the search range until the range is small enough or the preset number of iterations is reached. The hyperparameter combination with the minimum quantization loss is then selected as the final optimal solution.
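Steps 1 to 4 can be sketched as a coarse-to-fine search over two hyperparameters. The code below is a minimal illustration under assumptions: the function name, the grid size, the refinement rule (one old grid step on either side of the best point), the stopping thresholds, the calibration values, and the asymmetric uniform quantizer used in the loss are all hypothetical choices, not details from the patent.

```python
def iterative_grid_search(loss_fn, s_range, z_range, points=5,
                          max_rounds=10, min_width=1e-3):
    """Coarse-to-fine grid search that minimises loss_fn(s, z)."""
    s_lo, s_hi = s_range
    z_lo, z_hi = z_range
    best, best_loss = None, float("inf")
    for _ in range(max_rounds):
        # Steps 1 and 3: build a uniform grid over the current search window.
        s_step = (s_hi - s_lo) / (points - 1)
        z_step = (z_hi - z_lo) / (points - 1)
        # Step 2: evaluate the loss at every grid point and keep the best.
        for i in range(points):
            for j in range(points):
                cand = (s_lo + i * s_step, z_lo + j * z_step)
                loss = loss_fn(*cand)
                if loss < best_loss:
                    best, best_loss = cand, loss
        # Step 3: shrink the window around the best point, within the bounds.
        s_lo, s_hi = max(s_range[0], best[0] - s_step), min(s_range[1], best[0] + s_step)
        z_lo, z_hi = max(z_range[0], best[1] - z_step), min(z_range[1], best[1] + z_step)
        # Step 4: stop once the window is small enough.
        if max(s_hi - s_lo, z_hi - z_lo) < min_width:
            break
    return best, best_loss


calib = [0.05, 0.2, 0.4, 0.9, 1.3, 2.1]  # hypothetical calibration activations


def mse(s, z, bit=4):
    """Mean square error of an assumed asymmetric uniform quantizer."""
    levels, zp = 2 ** bit - 1, round(z)
    err = 0.0
    for a in calib:
        q = max(0, min(levels, round(a / s) + zp))
        err += (a - s * (q - zp)) ** 2
    return err / len(calib)


best, loss = iterative_grid_search(mse, s_range=(0.01, 1.0), z_range=(0.0, 8.0))
print(best, loss)
```

Each round evaluates only points × points candidates, so the total cost grows linearly with the number of rounds while the grid spacing halves each time.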

[0097] A second aspect of this invention discloses an image processing system based on a visual Transformer model with multiple scaling factor quantization, the system comprising a processing unit configured to perform:

[0098] An image dataset is generated and divided into a training set and a test set; wherein the image dataset includes several image instances;

[0099] Using image instances in the training set, the visual Transformer model is trained by quantization scaling based on the model quantization bit width and quantization scaling factor allowed by the resource-constrained device carrying the visual Transformer model, to obtain a visual Transformer model based on multi-scaling factor quantization.

[0100] The image instances in the test set are tested using the visual Transformer model based on multi-scaling factor quantization.

[0101] A third aspect of this invention discloses an electronic device. The electronic device includes a memory and a processor. The memory stores a computer program, and when the processor executes the computer program, it implements the image processing method based on a visual Transformer model with multi-scaling factor quantization as described in the first aspect of this disclosure.

[0102] A fourth aspect of this invention discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program, which, when executed by a processor, implements the image processing method based on a visual Transformer model with multi-scaling factor quantization as described in the first aspect of this disclosure.

[0103] In summary, the technical solution provided by this invention aims to improve the performance and accuracy of the quantized model. Specifically, for the processing of post-softmax activation values, a logarithmic quantizer is employed to more effectively capture the characteristics of the activation values, thereby reducing accuracy loss during quantization. Furthermore, the method is specifically optimized for the characteristics of post-GELU activation values. During quantization, the positive and negative activation values are processed using a logarithmic quantizer and a uniform quantizer, respectively, thus fully utilizing the advantages of both quantization methods to achieve a more accurate representation of the activation values. To ensure that each part uses the most suitable scaling factor, an iterative grid search strategy is introduced. This strategy can efficiently explore the potential scaling factor space and quickly locate the optimal scaling factor configuration. This method can significantly improve the inference performance and accuracy of the visual Transformer model under low-bit quantization conditions while preserving its high efficiency characteristics, providing reliable technical support for the application of the model in resource-constrained environments.

[0104] This invention designs a multi-scaling factor quantization method for visual Transformer models. The method includes a multi-scaling factor quantizer, quantization of the post-Softmax activation and post-GELU activation of the visual Transformer model, and an iterative grid search strategy.

[0105] This invention designs a multi-scaling factor quantizer. Based on the distinct distributions of the post-Softmax activation and post-GELU activation of the visual Transformer model, the distributions are further subdivided and quantized separately. The multi-scaling factor quantizer combines a logarithmic quantizer and a uniform quantizer, thereby improving the accuracy of the quantized model. Specifically, it improves the Top-1 accuracy of image classification on the ImageNet dataset and the accuracy of object detection on the COCO dataset.

[0106] This invention designs an iterative grid search strategy to select the scaling factor and zero point for each layer of quantization, thereby improving quantization and inference speed, and thus enhancing real-time performance on image classification tasks on the ImageNet dataset and object detection tasks on the COCO dataset.

[0107] Please note that the technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments have been described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this specification. The above embodiments only illustrate several implementation methods of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be pointed out that for those skilled in the art, several modifications and improvements can be made without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. An image processing method based on a visual Transformer model with multi-scaling factor quantization, characterized in that the method includes:
Step S1: generate an image dataset and divide the image dataset into a training set and a test set, wherein the image dataset includes several image instances;
Step S2: using image instances in the training set, and based on the model quantization bit width and quantization scaling factor allowed by the resource-constrained device carrying the visual Transformer model, perform quantization scaling training on the visual Transformer model to obtain a visual Transformer model based on multi-scaling factor quantization;
Step S3: use the visual Transformer model based on multi-scaling factor quantization to test image instances in the test set;
wherein the visual Transformer model includes a normalization layer, an attention layer, a post-softmax activation layer, a linear transformation layer, a feedforward neural network layer, a post-GELU activation layer, a Transformer processing layer, and an output layer;
in step S2, a multi-scaling factor quantizer is constructed for the post-softmax activation layer and the post-GELU activation layer, and when the post-softmax activation layer and the post-GELU activation layer process image instances in the training set, the multi-scaling factor quantizer is used to perform the quantization scaling training;
in step S2, when constructing the multi-scaling factor quantizer: the activation value distribution characteristics of the post-softmax activation layer, the post-GELU positive activation layer, and the post-GELU negative activation layer are analyzed respectively; based on the analysis results of the distribution characteristics, a logarithmic quantizer is constructed for the post-softmax activation layer, a uniform quantizer is constructed for the post-GELU negative activation layer, a logarithmic quantizer is constructed for the post-GELU positive activation layer whose activation values lie between 0 and 1, and a uniform quantizer is constructed for the post-GELU positive activation layer whose activation values are greater than 1; and the individual logarithmic quantizers and uniform quantizers are integrated into the multi-scaling factor quantizer.
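The region-based assignment of quantizers to post-GELU activations described in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `fake_quant_post_gelu`, the three separate scaling factors `s_neg`, `s_log` and `s_pos`, the default base `b=2.0`, and the exact rounding and clamping rules are all assumptions made for illustration. The sketch quantizes and immediately dequantizes a single scalar so the effect of each region's quantizer is visible.

```python
import math


def _clamp(v, lo, hi):
    """Restrict an integer code to the representable range [lo, hi]."""
    return max(lo, min(hi, v))


def fake_quant_post_gelu(a, s_neg, s_log, s_pos, b=2.0, bit=4):
    """Quantize then dequantize one post-GELU activation (illustrative only).

    Assumed region rules, following the split named in claim 1:
      a < 0        -> uniform quantizer with scaling factor s_neg
      0 <= a <= 1  -> logarithmic quantizer with base b and scaling factor s_log
      a > 1        -> uniform quantizer with scaling factor s_pos
    Returns (region_name, reconstructed_value).
    """
    qmax = 2 ** bit - 1
    if a < 0.0:
        # Uniform quantization of the magnitude; the sign is restored afterwards.
        code = _clamp(round(-a / s_neg), 0, qmax)
        return "uniform_neg", -s_neg * code
    if a <= 1.0:
        # log(0) is undefined, so zero is mapped to the smallest level (assumption).
        code = qmax if a == 0.0 else _clamp(round(-math.log(a / s_log, b)), 0, qmax)
        return "log", s_log * b ** (-code)
    code = _clamp(round(a / s_pos), 0, qmax)
    return "uniform_pos", s_pos * code


if __name__ == "__main__":
    for value in (-0.05, 0.25, 2.0):
        print(value, fake_quant_post_gelu(value, s_neg=0.01, s_log=1.0, s_pos=0.5))
```

Under these assumptions each region has its own scaling factor, so a coarse step for large positive values does not degrade the resolution available to the small negative tail.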

2. The image processing method based on a visual Transformer model with multi-scaling factor quantization according to claim 1, characterized in that, in step S2, for the logarithmic quantizer: given an activation A, a logarithmic base b, a model quantization bit width bit, and a quantization scaling factor s, the quantization process for activation A and the corresponding dequantization process are each represented by a formula [formulas not reproduced in this text; the surviving fragment of the dequantization formula begins with the factor s], wherein the result of the quantization process is the quantized activation value, A is the original activation value, b is the adaptive logarithmic base, s is the scaling factor, and bit is the quantization bit width.
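Because the formulas of claim 2 are not available in this text, the following Python sketch shows only one common form of a logarithmic quantizer that uses the same symbols (A, b, s, bit). The assumed form is code = clamp(round(-log_b(A / s)), 0, 2^bit - 1) for quantization and s · b^(-code) for dequantization. The function names and the handling of non-positive inputs are hypothetical and should not be read as the claimed formulas.

```python
import math


def log_quantize(a, b, s, bit):
    """Map an activation a to an integer code in [0, 2**bit - 1].

    Assumed form: code = clamp(round(-log_b(a / s)), 0, 2**bit - 1).
    """
    qmax = 2 ** bit - 1
    if a <= 0.0:
        # Assumption: non-positive inputs (e.g. an underflowed softmax output)
        # are mapped to the smallest representable level.
        return qmax
    code = round(-math.log(a / s, b))
    return max(0, min(qmax, code))


def log_dequantize(code, b, s):
    """Reconstruct the activation from its code: s * b ** (-code)."""
    return s * b ** (-code)


if __name__ == "__main__":
    for a in (1.0, 0.25, 0.01, 1e-9):
        q = log_quantize(a, b=2.0, s=1.0, bit=4)
        print(a, q, log_dequantize(q, b=2.0, s=1.0))
```

In this form the quantization levels are dense near zero and sparse near s, which suits post-softmax outputs where most values are small; the base b controls how quickly the levels spread out.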

3. The image processing method based on a visual Transformer model with multi-scaling factor quantization according to claim 2, characterized in that, in step S2, for the uniform quantizer: given an activation A', a model quantization bit width bit, and a quantization scaling factor s', the quantization process for activation A' and the corresponding dequantization process are each represented by a formula [formulas not reproduced in this text; the surviving fragment of the dequantization formula begins with the factor s'], wherein the result of the quantization process is the quantized activation value, A' is the original activation value, s' is the scaling factor, and bit is the quantization bit width.
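As with claim 2, the formulas of claim 3 are not available in this text. The Python sketch below shows a standard uniform quantizer without a zero point, assuming the form code = clamp(round(A' / s'), 0, 2^bit - 1) and reconstruction s' · code. The function names are hypothetical, and the use of a negative scaling factor to cover a negative range is an illustrative assumption.

```python
def uniform_quantize(a, s, bit):
    """Map an activation a to an integer code in [0, 2**bit - 1].

    Assumed form: code = clamp(round(a / s), 0, 2**bit - 1).
    Python's round() rounds exact halves to the nearest even integer.
    """
    qmax = 2 ** bit - 1
    code = round(a / s)
    return max(0, min(qmax, code))


def uniform_dequantize(code, s):
    """Reconstruct the activation from its code: s * code."""
    return s * code


if __name__ == "__main__":
    # Positive range with a positive scaling factor.
    q = uniform_quantize(1.5, s=0.5, bit=4)
    print(q, uniform_dequantize(q, s=0.5))
    # Negative range covered by a negative scaling factor (assumption).
    q = uniform_quantize(-0.1, s=-0.01, bit=4)
    print(q, uniform_dequantize(q, s=-0.01))
```

The levels are evenly spaced at multiples of s', which fits ranges where the activations are spread roughly evenly, such as post-GELU values above 1.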

4. The image processing method based on a visual Transformer model with multi-scaling factor quantization according to claim 3, characterized in that, in step S2, the iterative grid search strategy dynamically selects the scaling factor and the logarithmic base, and specifically includes:
uniformly dividing the scaling factor and zero point across the entire search space to generate an initial grid, wherein each grid point represents one combination of hyperparameters and the candidate set is the set of grid points;
for each hyperparameter combination in the grid, calculating a quantization loss function based on the image instances and recording the loss value of each combination;
selecting the hyperparameter combination with the smallest quantization loss as the initial search result;
based on this optimal hyperparameter combination, generating a smaller search grid with reduced grid spacing and a refined search range, so that hyperparameter combinations are covered at a finer granularity, and generating a new candidate set from the refined grid that covers the neighborhood of the best hyperparameter combination found in the initial search;
performing the same quantization loss calculation and search on the refined grid;
gradually narrowing the search range through continued iteration until the search range reaches a preset range; and
selecting the hyperparameter combination with the minimum quantization loss as the final optimal solution.
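The coarse-to-fine search in claim 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented procedure. It searches over a scaling factor s and a logarithmic base b for a logarithmic quantizer, and it uses the mean squared reconstruction error on a list of sample activations as the quantization loss. The names `iterative_grid_search`, `quant_loss`, `points` and `min_width`, the halving of the search range on each pass, and the choice of loss are all hypothetical.

```python
import math


def log_fake_quant(a, b, s, bit):
    """Quantize and dequantize a positive activation with an assumed log quantizer."""
    qmax = 2 ** bit - 1
    code = max(0, min(qmax, round(-math.log(a / s, b))))
    return s * b ** (-code)


def quant_loss(acts, b, s, bit):
    """Assumed quantization loss: mean squared reconstruction error."""
    return sum((a - log_fake_quant(a, b, s, bit)) ** 2 for a in acts) / len(acts)


def linspace(lo, hi, n):
    """Return n evenly spaced grid points covering [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def iterative_grid_search(acts, s_range, b_range, bit=4, points=5, min_width=1e-3):
    """Coarse-to-fine search for (s, b); returns (best_loss, best_s, best_b).

    s_range must be positive and b_range must lie above 1 so the log is defined.
    """
    (s_min, s_max), (b_min, b_max) = s_range, b_range
    s_lo, s_hi, b_lo, b_hi = s_min, s_max, b_min, b_max
    best = None
    while True:
        # Evaluate every hyperparameter combination on the current grid.
        for s in linspace(s_lo, s_hi, points):
            for b in linspace(b_lo, b_hi, points):
                loss = quant_loss(acts, b, s, bit)
                if best is None or loss < best[0]:
                    best = (loss, s, b)
        # Stop once the search range has shrunk to the preset width.
        if s_hi - s_lo <= min_width and b_hi - b_lo <= min_width:
            return best
        # Refine: halve the range, centre it on the best point, and keep it
        # inside the original search space.
        _, s_best, b_best = best
        s_half, b_half = (s_hi - s_lo) / 4, (b_hi - b_lo) / 4
        s_lo, s_hi = max(s_min, s_best - s_half), min(s_max, s_best + s_half)
        b_lo, b_hi = max(b_min, b_best - b_half), min(b_max, b_best + b_half)


if __name__ == "__main__":
    samples = [0.5, 0.25, 0.125, 0.0625]
    print(iterative_grid_search(samples, s_range=(0.5, 1.5), b_range=(1.5, 2.5)))
```

Compared with a single dense grid, this refinement evaluates only `points * points` combinations per pass and concentrates later passes around the best candidate found so far, which is the source of the reduced search cost.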

5. The image processing method based on a visual Transformer model with multi-scaling factor quantization according to claim 4, characterized in that the visual Transformer model includes the ViT-S, DeiT-T, and Swin-S models, and the image dataset includes the image sets of ImageNet and COCO.

6. An image processing system based on a visual Transformer model with multi-scaling factor quantization, characterized in that the system includes a processing unit configured to perform the following:
generating an image dataset and dividing the image dataset into a training set and a test set, wherein the image dataset includes several image instances;
using image instances in the training set, and based on the model quantization bit width and quantization scaling factor allowed by the resource-constrained device carrying the visual Transformer model, performing quantization scaling training on the visual Transformer model to obtain a visual Transformer model based on multi-scaling factor quantization;
testing image instances in the test set using the visual Transformer model based on multi-scaling factor quantization;
wherein the visual Transformer model includes a normalization layer, an attention layer, a post-softmax activation layer, a linear transformation layer, a feedforward neural network layer, a post-GELU activation layer, a Transformer processing layer, and an output layer;
a multi-scaling factor quantizer is constructed for the post-softmax activation layer and the post-GELU activation layer, and when the post-softmax activation layer and the post-GELU activation layer process image instances in the training set, the multi-scaling factor quantizer is used to perform the quantization scaling training;
when constructing the multi-scaling factor quantizer: the activation value distribution characteristics of the post-softmax activation layer, the post-GELU positive activation layer, and the post-GELU negative activation layer are analyzed respectively; based on the analysis results of the distribution characteristics, a logarithmic quantizer is constructed for the post-softmax activation layer, a uniform quantizer is constructed for the post-GELU negative activation layer, a logarithmic quantizer is constructed for the post-GELU positive activation layer whose activation values lie between 0 and 1, and a uniform quantizer is constructed for the post-GELU positive activation layer whose activation values are greater than 1; and the individual logarithmic quantizers and uniform quantizers are integrated into the multi-scaling factor quantizer.

7. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, it implements the image processing method based on a visual Transformer model with multi-scaling factor quantization as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the image processing method based on a visual Transformer model with multi-scaling factor quantization as described in any one of claims 1-5.