An image classification method, device, equipment and computer readable storage medium

By constructing an ultra-lightweight, efficient convolutional neural network for image classification that uses pointwise channel sliding convolution kernels, the large parameter count and computational cost of depthwise separable convolutional networks are reduced, enabling efficient image classification.

CN116580251B · Active · Publication Date: 2026-06-26 · SHANDONG YUNHAI GUOCHUANG CLOUD COMPUTING EQUIP IND INNOVATION CENT CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHANDONG YUNHAI GUOCHUANG CLOUD COMPUTING EQUIP IND INNOVATION CENT CO LTD
Filing Date: 2023-06-09
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing depthwise separable convolutional networks, pointwise convolution has a large number of parameters and computational cost, which leads to an imbalance between spatial feature extraction capability and channel feature fusion capability, affecting the inference speed of convolutional networks.

Method used

An ultra-lightweight, efficient image classification convolutional neural network is constructed using pointwise channel sliding convolution kernels. Combining depthwise convolution with pointwise channel sliding convolution reduces the number of parameters and the computational cost, and balances the network's spatial feature extraction and channel feature fusion capabilities.

Benefits of technology

It greatly reduces the number of network parameters and computational cost, improves image classification efficiency, maintains high accuracy, and has extremely low time and space complexity.



Abstract

The application relates to the technical field of computer vision and discloses an image classification method, apparatus, device, and computer-readable storage medium. An acquired image to be classified is preprocessed to obtain a standard image. The standard image is processed by depthwise convolution and pointwise channel sliding convolution using an image classification model to obtain an output feature map, wherein the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than both the number of output channels of the image classification model and the number of input channels of the pointwise channel sliding convolution kernel. The output feature map is subjected to dimensionality reduction and classification to determine the image category to which the image to be classified belongs. By using a pointwise channel sliding convolution kernel whose channel counts are smaller than the model's input and output channel counts, an ultra-lightweight, efficient image classification convolutional neural network with extremely low time and space complexity can be constructed, improving image classification efficiency.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to an image classification method, apparatus, device, and computer-readable storage medium.

Background Technology

[0002] Convolutional neural networks have achieved great success in the field of computer vision, such as image classification, object detection, and image segmentation. However, with the development of neural networks, the requirements for network accuracy are becoming increasingly higher, leading to deeper and deeper networks, resulting in a greater computational load and a larger number of parameters.

[0003] Depthwise separable convolutions (DSCs) are widely used in lightweight convolutional architectures due to their superior performance. While DSCs reduce the computational cost and parameter count compared to traditional convolutions, the computational cost and parameter count of the pointwise convolutions within DSCs remain significant, so convolutional networks built from DSCs are still relatively expensive. Furthermore, within a DSC, the computational cost and parameter count of the pointwise convolution (PWC), responsible for fusing channel features, are far greater than those of the depthwise convolution (DWC), responsible for extracting spatial features. This imbalance between the spatial feature extraction and channel feature fusion capabilities of the convolutional network negatively impacts its inference speed.

[0004] It is evident that how to reduce the number of parameters and computational cost of convolutional networks to improve image classification efficiency is a problem that needs to be solved by those skilled in the art.

Summary of the Invention

[0005] The purpose of this application is to provide an image classification method, apparatus, device, and computer-readable storage medium that can reduce the number of parameters and computational load of convolutional networks, thereby improving image classification efficiency.

[0006] To address the aforementioned technical problems, embodiments of this application provide an image classification method, including:

[0007] The acquired images to be classified are preprocessed to obtain standard images;

[0008] The standard image is processed by depthwise convolution and pointwise channel sliding convolution using an image classification model to obtain an output feature map; wherein the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel.

[0009] The output feature map is subjected to dimensionality reduction and classification processing to determine the image category to which the image to be classified belongs.

[0010] In one aspect, the process of using an image classification model to perform depthwise convolution and pointwise channel sliding convolution on the standard image to obtain the output feature map includes:

[0011] Spatial dimension analysis is performed on the standard image to extract spatial feature information;

[0012] The spatial feature information is subjected to batch normalization to obtain batch normalized feature information;

[0013] The batch normalized feature information is processed by sliding convolution according to the pointwise channel sliding convolution kernel to obtain fused feature information;

[0014] The fused feature information is subjected to layer normalization processing based on the layer normalization parameters to obtain layer normalized feature information;

[0015] The layer-normalized feature information is converted into nonlinear feature information.

[0016] In one aspect, the step of performing sliding convolution processing on the batch normalized feature information according to the pointwise channel sliding convolution kernel to obtain fused feature information includes:

[0017] The number of output channels of the pointwise channel sliding convolution kernel is used as the sliding stride;

[0018] According to the sliding stride and the pointwise channel sliding convolution kernel, the batch normalized feature information is subjected to sliding convolution in the channel dimension and spatial dimension respectively to obtain fused feature information.

[0019] In one aspect, converting the layer-normalized feature information into nonlinear feature information includes:

[0020] The feature values that are greater than zero in the layer-normalized feature information are retained, and the feature values that are less than zero in the layer-normalized feature information are adjusted to zero to obtain nonlinear feature information.

[0021] In one aspect, after converting the layer-normalized feature information into nonlinear feature information, the method further includes:

[0022] For each transformation that yields nonlinear feature information, the iteration count is incremented by one.

[0023] If the current iteration count meets the preset threshold, then the step of performing dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs is executed;

[0024] If the current iteration count does not meet the preset threshold, the process returns to the step of performing spatial dimension analysis on the standard image to extract spatial feature information.

[0025] In one aspect, regarding the process of setting the number of input channels and the number of output channels of the pointwise channel sliding convolution kernel, the method further includes:

[0026] The product of the number of input channels of the image classification model and the selected input ratio value is used as the number of input channels of the pointwise channel sliding convolution kernel;

[0027] The product of the number of input channels of the pointwise channel sliding convolution kernel and the selected output ratio value is calculated;

[0028] If the product value is less than the number of output channels of the image classification model, the product value is used as the number of output channels of the pointwise channel sliding convolution kernel;

[0029] If the product value is not less than the number of output channels of the image classification model, the product of the number of output channels of the image classification model and the selected output ratio value is used as the number of output channels of the pointwise channel sliding convolution kernel; wherein, both the input ratio value and the output ratio value are less than 1.
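The channel-setting rule in [0026] to [0029] can be sketched as a small function. This is only an illustrative sketch: the function name `pcs_kernel_channels`, the use of exact `Fraction` ratios, and rounding down to an integer are assumptions, since the text does not say how non-integer products are handled.

```python
from fractions import Fraction


def pcs_kernel_channels(model_in, model_out, in_ratio, out_ratio):
    """Return (m, d), the kernel's input and output channel counts.

    in_ratio and out_ratio should be exact Fractions below 1.
    Rounding down (int) is an assumption, not stated in the source.
    """
    if not (0 < in_ratio < 1 and 0 < out_ratio < 1):
        raise ValueError("both ratio values must lie strictly between 0 and 1")
    m = int(model_in * in_ratio)        # [0026]: model input channels x input ratio
    product = int(m * out_ratio)        # [0027]: kernel input channels x output ratio
    if product < model_out:             # [0028]: product is small enough, use it
        d = product
    else:                               # [0029]: fall back to model output channels x output ratio
        d = int(model_out * out_ratio)
    return m, d


print(pcs_kernel_channels(64, 64, Fraction(3, 4), Fraction(1, 3)))
```

With 64 model input and output channels and the ratio values 3/4 and 1/3 mentioned later in the description, the first branch applies.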

[0030] In one aspect, the method further includes:

[0031] A list of correspondences between different image types and different ratio groups is pre-established;

[0032] When the image to be classified is obtained, a matching target ratio group is queried from the correspondence list based on the image type to which the image to be classified belongs; wherein, the target ratio group includes an input ratio value and an output ratio value.

[0033] This application also provides an image classification device, including a preprocessing unit, an ultra-lightweight convolution unit, and a determination unit;

[0034] The preprocessing unit is used to preprocess the acquired image to be classified to obtain a standard image;

[0035] The ultra-lightweight convolutional unit is used to perform depthwise convolution and pointwise channel sliding convolution on the standard image using an image classification model to obtain an output feature map; wherein, the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel;

[0036] The determining unit is used to perform dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs.

[0037] In one aspect, the ultra-lightweight convolutional unit includes an extraction subunit, a batch normalization subunit, a fusion subunit, a layer normalization subunit, and a transformation subunit;

[0038] The extraction subunit is used to perform spatial dimension analysis on the standard image to extract spatial feature information;

[0039] The batch normalization subunit is used to perform batch normalization processing on the spatial feature information to obtain batch normalized feature information.

[0040] The fusion subunit is used to perform sliding convolution processing on the batch normalized feature information according to the pointwise channel sliding convolution kernel to obtain fused feature information;

[0041] The layer normalization subunit is used to perform layer normalization processing on the fused feature information according to the layer normalization parameters to obtain layer normalized feature information.

[0042] The transformation subunit is used to convert the layer normalized feature information into nonlinear feature information.

[0043] In one aspect, the fusion subunit is used to take the number of output channels of the pointwise channel sliding convolution kernel as the sliding stride;

[0044] According to the sliding stride and the pointwise channel sliding convolution kernel, the batch normalized feature information is subjected to sliding convolution in the channel dimension and spatial dimension respectively to obtain fused feature information.

[0045] In one aspect, the transformation subunit is used to retain the feature values greater than zero in the layer-normalized feature information, and adjust the feature values less than zero in the layer-normalized feature information to zero, so as to obtain nonlinear feature information.

[0046] In one aspect, the device also includes an accumulation unit;

[0047] The accumulation unit increments the iteration count by one for each nonlinear feature information obtained from the transformation; if the current iteration count meets a preset threshold, the determination unit is triggered to perform the step of dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs; if the current iteration count does not meet the preset threshold, the extraction subunit is triggered to perform the step of spatial dimension analysis on the standard image to extract spatial feature information.

[0048] In one aspect, regarding the process of setting the number of input channels and the number of output channels of the pointwise channel sliding convolution kernel, the device further includes a first unit, a calculation unit, a second unit, and a third unit;

[0049] The first unit is used to take the product of the number of input channels of the image classification model and the selected input ratio value as the number of input channels of the pointwise channel sliding convolution kernel;

[0050] The calculation unit is used to calculate the product of the number of input channels of the pointwise channel sliding convolution kernel and the selected output ratio value;

[0051] The second unit is used to take the product value as the number of output channels of the pointwise channel sliding convolution kernel when the product value is less than the number of output channels of the image classification model.

[0052] The third unit is used to take the product of the number of output channels of the image classification model and the selected output ratio value as the number of output channels of the pointwise channel sliding convolution kernel when the product value is not less than the number of output channels of the image classification model; wherein, both the input ratio value and the output ratio value are less than 1.

[0053] In one aspect, the device also includes an establishment unit and a query unit;

[0054] The establishment unit is used to pre-establish a list of correspondences between different image types and different ratio groups;

[0055] The query unit is used to query a matching target ratio group from the correspondence list based on the image type to which the image to be classified belongs when the image to be classified is obtained; wherein, the target ratio group includes an input ratio value and an output ratio value.

[0056] This application also provides an electronic device, including:

[0057] A memory for storing a computer program;

[0058] A processor for executing the computer program to implement the steps of the image classification method described above.

[0059] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the image classification method described above.

[0060] As can be seen from the above technical solution, the acquired image to be classified is preprocessed to obtain a standard image; the standard image is then processed by depthwise convolution and pointwise channel sliding convolution using an image classification model to obtain an output feature map; wherein, the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel. Dimensionality reduction and classification processing are performed on the output feature map to determine the image category to which the image to be classified belongs. In this technical solution, by setting the channel counts of the pointwise channel sliding convolution kernel to less than the number of input and output channels of the image classification model, an ultra-lightweight and efficient image classification convolutional neural network with extremely low time and space complexity can be constructed, reducing the number of parameters and computational cost of pointwise convolution in depthwise separable convolution, balancing the network's spatial feature extraction capability and channel feature fusion capability, and greatly reducing the number of network parameters and computational cost. Compared with traditional depthwise separable convolutional neural networks, the ultra-lightweight and efficient convolutional neural network proposed in this application has extremely low time and space complexity, and completes spatial feature extraction and channel feature fusion with very few parameters and little computation, thereby improving the efficiency of image classification.

Attached Figure Description

[0061] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0062] Figure 1 is a flowchart of an image classification method provided in an embodiment of this application;

[0063] Figure 2 is a schematic diagram of a pointwise channel sliding convolution structure provided in an embodiment of this application;

[0064] Figure 3 is a schematic diagram of the structure of an image classification device provided in an embodiment of this application;

[0065] Figure 4 is a structural diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0066] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.

[0067] The terms “comprising” and “having” in the specification, claims, and accompanying drawings of this application, and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the steps or units listed, but may include steps or units not listed.

[0068] In traditional image classification depthwise separable convolutional networks, the number of parameters and computational cost of pointwise convolution is relatively large, resulting in an imbalance between the network's spatial feature extraction capability and channel feature fusion capability, which affects the inference speed of the convolutional network.

[0069] To this end, embodiments of this application provide an image classification method, apparatus, device, and computer-readable storage medium. Based on pointwise channel sliding convolution kernels, an ultra-lightweight and efficient image classification convolutional neural network with extremely low time and space complexity is constructed. This reduces the number of parameters and computational cost of pointwise convolution in depthwise separable convolution, balances the network's spatial feature extraction capability and channel feature fusion capability, and greatly reduces the number of parameters and computational cost of the network.

[0070] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0071] Next, an image classification method provided by an embodiment of this application is described in detail. Figure 1 is a flowchart of an image classification method provided in an embodiment of this application. The method includes:

[0072] S101: Preprocess the acquired images to be classified to obtain standard images.

[0073] Preprocessing may include converting the dimensions of the images to be classified to a uniform size that meets the requirements of the image classification model.

[0074] In this application example, the CIFAR-10 image classification dataset can be used as the images to be classified. The uniform size of each image can be set to 32×32. The dataset contains a total of 50,000 training images and 10,000 test images, with 10 classes in total. Each class has 6,000 images, including 5,000 training images and 1,000 test images. The images are preprocessed to a size of 32×32, and each channel is normalized separately.
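The per-channel normalization mentioned above can be illustrated with a short sketch. The text only states that each channel is normalized separately, so the choice of zero-mean, unit-variance normalization computed per image is an assumption, as are the function name `normalize_per_channel` and the nested-list image layout (channels, rows, columns).

```python
import math


def normalize_per_channel(image, eps=1e-8):
    """Normalize each channel of an image independently.

    image: list of channels, each a list of rows of pixel values.
    Zero-mean / unit-variance statistics are an assumption; the source
    only says each channel is normalized separately.
    """
    normalized = []
    for channel in image:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var) + eps  # eps avoids division by zero on flat channels
        normalized.append([[(v - mean) / std for v in row] for row in channel])
    return normalized


# Tiny 2-channel, 2x2 "image" used only for illustration.
sample = [[[0, 2], [4, 6]], [[1, 1], [1, 1]]]
result = normalize_per_channel(sample)
```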

[0075] S102: Use an image classification model to perform depthwise convolution and pointwise channel sliding convolution on the standard image to obtain the output feature map.

[0076] In this embodiment of the application, in order to reduce the number of parameters and computational cost of the convolutional network and improve the efficiency of image classification, an ultra-lightweight convolutional unit can be used. The number of input channels of the pointwise channel sliding convolution kernel corresponding to the ultra-lightweight convolutional unit is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel.

[0077] Based on the functions required, the convolutional network can be divided into five layers. The first layer uses depthwise convolution to perform spatial dimension analysis on the standard image to extract spatial feature information. The second layer is batch normalization, which processes the spatial feature information to obtain batch-normalized feature information. The third layer is pointwise channel sliding convolution, which processes the batch-normalized feature information using a pointwise channel sliding convolution kernel to obtain fused feature information. The fourth layer is a layer normalization layer, which processes the fused feature information according to the layer normalization parameters to obtain layer-normalized feature information. The fifth layer is an activation layer, which converts the layer-normalized feature information into nonlinear feature information.

[0078] The kernel size of the first-layer depthwise convolution can be uniformly set to 3×3, which can achieve spatial feature extraction without changing the number of channels in the feature map. Assuming the input feature map size is M×R×C, the parameters of the depthwise convolution are M×9, and the computational cost is M×R×C×9, both of which are very small.
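As a rough illustration of the first layer, the sketch below applies one 3×3 kernel per channel and reports the parameter and computation counts given in [0078]. Zero padding at the borders (so that the R×C size is preserved) and the names `depthwise_conv3x3` and `depthwise_cost` are assumptions made for this example.

```python
def depthwise_conv3x3(x, kernels):
    """Apply one 3x3 kernel per channel; channel count is unchanged.

    x: M channels, each R x C (nested lists). kernels: M kernels, each 3 x 3.
    Zero padding at the borders is an assumption for this sketch.
    """
    out = []
    for channel, k in zip(x, kernels):
        rows, cols = len(channel), len(channel[0])
        result = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols:
                            acc += k[dr + 1][dc + 1] * channel[rr][cc]
                result[r][c] = acc
        out.append(result)
    return out


def depthwise_cost(M, R, C):
    """Parameter count M*9 and computation M*R*C*9, as stated in [0078]."""
    return M * 9, M * R * C * 9


print(depthwise_cost(64, 32, 32))
```

For a hypothetical 64-channel, 32×32 feature map this gives 576 parameters and 589,824 multiply-accumulate operations.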

[0079] Batch normalization in the second layer can accelerate the convergence of the model, alleviate the vanishing gradient problem in deep networks, and make the image classification model easier to train and more stable.

[0080] The pointwise channel sliding convolution of the third layer is a novel convolution method. In its implementation, the number of output channels of the pointwise channel sliding convolution kernel can be used as the sliding stride. According to the sliding stride and the pointwise channel sliding convolution kernel, sliding convolution is performed on the batch normalized feature information in the channel dimension and the spatial dimension respectively to obtain the fused feature information.

[0081] Figure 2 is a structural schematic diagram of the pointwise channel sliding convolution provided by the embodiment of the present application. Assume that the dimension of the input feature map is M×R×C and the dimension of the output feature map is N×R×C, where M is the number of input channels of the model, N is the number of output channels of the model, and R and C are the spatial sizes of the feature map. The spatial size of the pointwise channel sliding convolution kernel is 1×1, its number of input channels is m (m < M), and its number of output channels is d (d < N). The kernel, of size d×m×1×1, performs sliding convolution in the channel dimension and the spatial dimension respectively, with a sliding stride of d in the channel dimension, to obtain the fused feature information.

[0082] The pointwise channel sliding convolution kernel not only shares parameters in the spatial dimension, but also realizes sparse connection and parameter sharing in the channel dimension. The parameter quantity and computational complexity of the pointwise channel sliding convolution are less than those of traditional pointwise convolution, grouped convolution and other lightweight convolutions. It is a more lightweight convolution form. The convolution kernel slides N / d times in the channel dimension, and d output feature maps are obtained each time. Finally, N output feature maps are obtained to realize the fusion of channel dimension features.
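One possible reading of the channel sliding described in [0081] and [0082] is sketched below. A single shared d×m kernel is moved along the channel axis with stride d, N / d times, and each position yields d output maps. The text does not say what happens when the m-channel window runs past the last input channel, so wrapping around to the first channels is an assumption here, as are the function name `pcs_conv` and the ordering of output channels.

```python
def pcs_conv(x, kernel, n_out):
    """Pointwise channel sliding convolution, sketched with nested lists.

    x: M input channels, each R x C. kernel: d rows of m weights (d x m x 1 x 1).
    n_out: N, the number of output channels; must be a multiple of d.
    Wrap-around indexing in the channel dimension is an assumption.
    """
    M = len(x)
    d = len(kernel)      # kernel output channels, also the channel stride
    m = len(kernel[0])   # kernel input channels
    if n_out % d != 0:
        raise ValueError("N must be divisible by d")
    rows, cols = len(x[0]), len(x[0][0])
    out = []
    for s in range(n_out // d):              # N / d slides along the channel axis
        start = s * d
        window = [x[(start + j) % M] for j in range(m)]
        for weights in kernel:                # the same d x m weights at every slide
            out.append([
                [sum(w * ch[r][c] for w, ch in zip(weights, window))
                 for c in range(cols)]
                for r in range(rows)
            ])
    return out


# Four 1x1 input channels and a kernel with d = 1, m = 3 (illustrative values).
x = [[[1]], [[2]], [[3]], [[4]]]
y = pcs_conv(x, [[1, 1, 1]], n_out=4)
```

Because the same weights are reused at every channel position, the kernel holds only d×m parameters regardless of M and N.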

[0083] The parameter quantity of traditional pointwise convolution is M×N, and the computational complexity is M×N×R×C. The parameter quantity of the pointwise channel sliding convolution provided by the embodiment of the present application is d×m, and the computational complexity is d×N×R×C. The parameter quantity of the pointwise channel sliding convolution is dm / MN of the traditional pointwise convolution, and the computational complexity is d / M of the traditional pointwise convolution. In the example of the present application, through experimental testing, the input ratio value can be set to 3 / 4, that is, m takes the value of 3M / 4; the output ratio value is set to 1 / 3, that is, d takes the value of m / 3.
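The ratios quoted in [0083] can be evaluated for concrete channel counts. The sketch below simply plugs numbers into the expressions as stated in the text (dm / MN for parameters, d / M for computation); the function name `stated_ratios` and the example value M = N = 64 are assumptions, and d = m × output ratio presumes that product is below N.

```python
from fractions import Fraction


def stated_ratios(M, N, in_ratio, out_ratio):
    """Evaluate the ratios from [0083] exactly, using Fractions."""
    m = M * in_ratio                      # kernel input channels
    d = m * out_ratio                     # kernel output channels (assumes d < N)
    param_ratio = (d * m) / (M * N)       # stated as dm / MN
    compute_ratio = d / M                 # stated as d / M
    return m, d, param_ratio, compute_ratio


m, d, p, c = stated_ratios(64, 64, Fraction(3, 4), Fraction(1, 3))
print(m, d, p, c)
```

With M = N = 64 this gives m = 48 and d = 16, so the stated parameter ratio evaluates to 3/16 and the stated computation ratio to 1/4 of a traditional pointwise convolution.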

[0084] The layer normalization layer of the fourth layer uses the same normalization parameters throughout the layer, so it is easy to fuse with the parameters of the pointwise channel sliding convolution, which shares parameters in the channel dimension, facilitating acceleration of the operation.
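A minimal sketch of layer normalization with a single scale and shift shared by the whole layer is shown below. Computing the statistics over all channels and spatial positions of one feature map, and the names `layer_norm`, `gamma`, `beta`, and `eps`, are assumptions; the text only states that the normalization parameters are the same throughout the layer.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a feature map (channels x rows x cols) as one layer.

    gamma and beta are scalars shared by every element, mirroring the
    statement that the layer has the same normalization parameters
    throughout; the exact statistics used are an assumption.
    """
    values = [v for ch in x for row in ch for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = gamma / math.sqrt(var + eps)
    return [[[(v - mean) * scale + beta for v in row] for row in ch] for ch in x]


normalized = layer_norm([[[1.0, 3.0]], [[5.0, 7.0]]])
```

Because the scale and shift are scalars, they could in principle be folded into the shared kernel weights and a bias at inference time, which is one way to read the remark about fusing the parameters.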

[0085] The activation function of the fifth layer can uniformly adopt the ReLU activation function in the example of the present application. The ReLU activation function can retain the feature values greater than zero in the layer-normalized feature information and adjust the feature values less than zero in the layer-normalized feature information to zero to obtain non-linear feature information.

[0086] Using the ReLU activation function to achieve the non-linear expression of the network improves the network expression ability, and the ReLU activation function has simple calculation and small computational complexity, which is beneficial to improving the calculation speed.
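The ReLU step described above amounts to an element-wise comparison. A minimal sketch over a nested-list feature map (the function name `relu` and the data layout are assumptions):

```python
def relu(x):
    """Keep values greater than zero; set all other values to zero."""
    return [[[v if v > 0 else 0.0 for v in row] for row in ch] for ch in x]


activated = relu([[[-1.5, 0.0, 2.0]]])
```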

[0087] S103: Perform dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs.

[0088] After obtaining the output feature map, the output feature map can be subjected to 2×2 max pooling to reduce the size of the feature map. Then, it can be classified through the global average pooling layer and fully connected layer of the image classification model to obtain the image category.
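The dimensionality reduction and classification step in [0088] can be sketched as three small functions followed by an argmax. The function names, the nested-list layout, the example weights, and taking the highest score as the predicted category are assumptions made for illustration.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    return [
        [[max(ch[r][c], ch[r][c + 1], ch[r + 1][c], ch[r + 1][c + 1])
          for c in range(0, len(ch[0]) - 1, 2)]
         for r in range(0, len(ch) - 1, 2)]
        for ch in x
    ]


def global_average_pool(x):
    """Reduce each channel to the mean of its values."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0])) for ch in x]


def fully_connected(features, weights, biases):
    """One linear layer: a score per class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def classify(x, weights, biases):
    """Return the index of the highest-scoring class (an assumed decision rule)."""
    pooled = global_average_pool(max_pool_2x2(x))
    scores = fully_connected(pooled, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)


# Two 2x2 channels and a 2-class linear layer with illustrative weights.
feature_map = [[[1, 2], [3, 4]], [[0, 0], [0, 8]]]
label = classify(feature_map, [[1, 0], [0, 1]], [0, 0])
```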

[0089] Taking m = 3M / 4 and d = m / 3 as an example, compared with a traditional depthwise separable convolutional image classification network, the embodiment of this application uses only 15% of the weights and less than one-third of the computational cost, with an accuracy decrease of less than 4%. The ratio of parameters used for channel feature fusion to parameters used for spatial feature extraction improves from 74:1 to 10:1. Running on a GPU, there is a 3× speedup. With the pointwise channel sliding convolution kernel provided in the embodiments of this application, a very lightweight and efficient image classification convolutional neural network can be constructed while maintaining high accuracy and achieving faster inference.

[0090] As can be seen from the above technical solution, the acquired image to be classified is preprocessed to obtain a standard image; the standard image is then processed by depthwise convolution and pointwise channel sliding convolution using an image classification model to obtain an output feature map; wherein, the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel. Dimensionality reduction and classification processing are performed on the output feature map to determine the image category to which the image to be classified belongs. In this technical solution, by setting the channel counts of the pointwise channel sliding convolution kernel to less than the number of input and output channels of the image classification model, an ultra-lightweight and efficient image classification convolutional neural network with extremely low time and space complexity can be constructed, reducing the number of parameters and computational cost of pointwise convolution in depthwise separable convolution, balancing the network's spatial feature extraction capability and channel feature fusion capability, and greatly reducing the number of network parameters and computational cost. Compared with traditional depthwise separable convolutional neural networks, the ultra-lightweight and efficient convolutional neural network proposed in this application has extremely low time and space complexity, and completes spatial feature extraction and channel feature fusion with very few parameters and little computation, thereby improving the efficiency of image classification.

[0091] In this embodiment, to fully extract image features, it is necessary to repeatedly execute the operations corresponding to the five layers of the aforementioned convolutional network. In practical applications, the number of repetitions can be determined based on the complexity of the classification task.

[0092] In the specific implementation, the iteration count is incremented by one each time the transformation yields nonlinear feature information. If the current iteration count meets the preset threshold, the number of repetitions has met the requirement. At this point, the step of performing dimensionality reduction and classification processing on the output feature map can be executed to determine the image category to which the image to be classified belongs.

[0093] If the current number of iterations does not meet the preset threshold, it means that the number of repetitions has not yet met the requirements. At this time, it is necessary to return to the step of performing spatial dimension analysis on the standard image to extract spatial feature information, and repeat the operation corresponding to the five layers of the convolutional network.
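The repetition logic described above can be sketched as a simple counting loop. This is a minimal illustration, not the implementation of this application: `run_blocks` and `block` are hypothetical names, "meets the preset threshold" is interpreted as the count being greater than or equal to the threshold, and each repetition is assumed to consume the output of the previous one.

```python
def run_blocks(standard_image, block, threshold):
    """Repeat the five-layer block until the iteration count meets the threshold.

    block: callable standing in for the five layers (depthwise convolution,
           batch normalization, pointwise channel sliding convolution,
           layer normalization, activation).
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    features = standard_image
    iteration_count = 0
    while True:
        features = block(features)  # one pass through the five layers
        iteration_count += 1        # one more set of nonlinear features obtained
        if iteration_count >= threshold:
            # Requirement met: hand over to dimensionality reduction
            # and classification.
            return features, iteration_count
        # Otherwise return to the spatial feature extraction step.


# Toy stand-in block that adds 1 to every value, to show the control flow.
result, count = run_blocks([0, 0], lambda x: [v + 1 for v in x], threshold=3)
print(result, count)
```

A larger threshold corresponds to a deeper network, which matches the statement that the number of repetitions depends on the complexity of the classification task.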

[0094] Considering that the difficulty of analysis varies across image types in practical applications, corresponding input and output ratio values can be set for different types of images. Both the input ratio value and the output ratio value are less than 1.

[0095] In this embodiment of the application, a list of correspondences between different image types and different ratio groups can be pre-established; when an image to be classified is obtained, a matching target ratio group is queried from the list of correspondences based on the image type to which the image to be classified belongs; wherein, the target ratio group includes an input ratio value and an output ratio value.

[0096] Different image types require different levels of accuracy for image classification and recognition. In practice, the optimal input and output ratios for each image type can be determined through preliminary experimental testing.

[0097] Image types can be categorized based on different image classifications. For example, image types can include building image types, text content image types, and face image types.
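The correspondence list and its query can be sketched as a dictionary lookup. This is an illustrative sketch only: the type names follow the examples above, while the ratio values and the names `RATIO_GROUPS` and `query_ratio_group` are hypothetical placeholders. In practice the ratio values would come from the preliminary experimental testing mentioned in the text.

```python
# Hypothetical correspondence list: image type -> (input ratio, output ratio).
# Every ratio must be less than 1.
RATIO_GROUPS = {
    "building": (0.75, 0.5),
    "text": (0.5, 0.5),
    "face": (0.75, 0.25),
}


def query_ratio_group(image_type):
    """Return the target ratio group matching the given image type."""
    try:
        input_ratio, output_ratio = RATIO_GROUPS[image_type]
    except KeyError:
        raise ValueError(f"no ratio group registered for image type {image_type!r}")
    if not (0 < input_ratio < 1 and 0 < output_ratio < 1):
        raise ValueError("input and output ratio values must be between 0 and 1")
    return input_ratio, output_ratio


print(query_ratio_group("face"))
```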

[0098] In the specific implementation, the operator can select the desired input and output ratio values based on the image type to which the image to be classified belongs. The product of the number of input channels of the image classification model and the selected input ratio value is used as the number of input channels of the pointwise channel sliding convolution kernel. The product of the number of input channels of the pointwise channel sliding convolution kernel and the selected output ratio value is then calculated. If this product is less than the number of output channels of the image classification model, it is used as the number of output channels of the pointwise channel sliding convolution kernel; otherwise, the product of the number of output channels of the image classification model and the selected output ratio value is used as the number of output channels of the pointwise channel sliding convolution kernel.
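The channel-count rule in this paragraph can be sketched as a short function. The function name and the use of flooring for non-integer products are assumptions made for illustration; the text does not state how fractional products are rounded.

```python
def select_kernel_channels(model_in, model_out, input_ratio, output_ratio):
    """Return (kernel_in, kernel_out) for the pointwise channel sliding kernel."""
    if not (0 < input_ratio < 1 and 0 < output_ratio < 1):
        raise ValueError("both ratio values must be less than 1 (and positive)")
    # Kernel input channels: model input channels times the input ratio.
    kernel_in = int(model_in * input_ratio)
    # Candidate output channels: kernel input channels times the output ratio.
    product = int(kernel_in * output_ratio)
    if product < model_out:
        kernel_out = product
    else:
        # Fallback: scale the model output channels instead, which keeps
        # kernel_out below model_out.
        kernel_out = int(model_out * output_ratio)
    return kernel_in, kernel_out


print(select_kernel_channels(64, 64, 0.75, 0.5))   # candidate is below model_out
print(select_kernel_channels(256, 32, 0.75, 0.5))  # fallback branch
```

Because both ratio values are less than 1, either branch yields an output channel count below the model output channel count, and the first branch also keeps it below the kernel input channel count.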

[0099] By presetting the input and output ratio values for different types of images, users can easily select appropriate values. This avoids arbitrarily chosen ratio values that would lead to unreasonable input and output channel counts for the pointwise channel sliding convolution kernel.

[0100] Figure 3 is a schematic structural diagram of an image classification device provided in an embodiment of this application. The device includes a preprocessing unit 31, an ultra-lightweight convolutional unit 32, and a determination unit 33;

[0101] Preprocessing unit 31 is used to preprocess the acquired image to be classified to obtain a standard image;

[0102] The ultra-lightweight convolutional unit 32 is used to perform depthwise convolution and pointwise channel sliding convolution on a standard image using an image classification model to obtain an output feature map. The number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel.

[0103] The determination unit 33 is used to perform dimensionality reduction and classification processing on the output feature map in order to determine the image category to which the image to be classified belongs.

[0104] In some embodiments, the ultralightweight convolutional unit includes an extraction subunit, a batch normalization subunit, a fusion subunit, a layer normalization subunit, and a transformation subunit;

[0105] The extraction subunit is used to perform spatial dimension analysis on the standard image in order to extract spatial feature information;

[0106] The batch normalization subunit is used to perform batch normalization processing on the spatial feature information to obtain batch normalized feature information;

[0107] The fusion subunit is used to perform sliding convolution processing on the batch normalized feature information according to the pointwise channel sliding convolution kernel to obtain fused feature information;

[0108] The layer normalization subunit is used to perform layer normalization processing on the fused feature information according to the layer normalization parameters to obtain layer normalized feature information;

[0109] The transformation subunit is used to convert layer-normalized feature information into nonlinear feature information.
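As an illustration of the layer normalization step, the following minimal sketch normalizes the channel vector of a single spatial position. It assumes the standard layer normalization formula with scalar scale and shift parameters (`gamma`, `beta`) and a small stabilizing constant `eps`; these names and the choice of normalizing over the channel axis are assumptions, since this section only refers to "layer normalization parameters" without listing them.

```python
import math


def layer_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of fused feature values to zero mean and unit variance,
    then apply the scale (gamma) and shift (beta) parameters."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]


normalized = layer_norm([1.0, 2.0, 3.0])
print(normalized)
```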

[0110] In some embodiments, the batch normalization subunit is used to take the number of output channels of the pointwise channel sliding convolution kernel as the sliding stride;

[0111] According to the sliding stride and the pointwise channel sliding convolution kernel, sliding convolution is performed on the batch normalized feature information in the channel dimension and the spatial dimension respectively to obtain the fused feature information.
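The sliding convolution described here can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the exact operator of this application: the kernel is taken to be 1x1 in the spatial dimension with m input channels and d output channels, the same weights are reused at every position, the window moves along the channel axis with stride d, and no padding or wrap-around is applied at the end of the channel axis. The function names are hypothetical.

```python
def channel_sliding_conv(pixel, kernel):
    """Slide an m x d kernel along the channel axis of one spatial position.

    pixel:  list of M channel values.
    kernel: m rows by d columns (m input channels, d output channels).
    The stride along the channel axis equals d, the kernel output channel count.
    """
    m, d = len(kernel), len(kernel[0])
    if not d < m <= len(pixel):
        raise ValueError("expected d < m <= number of channels")
    out = []
    for start in range(0, len(pixel) - m + 1, d):  # stride = d
        window = pixel[start:start + m]
        for j in range(d):
            out.append(sum(window[i] * kernel[i][j] for i in range(m)))
    return out


def sliding_conv_feature_map(feature_map, kernel):
    """Apply the channel sliding kernel at every spatial position (1x1 spatially)."""
    return [[channel_sliding_conv(px, kernel) for px in row] for row in feature_map]


kernel = [[1, 0], [0, 1], [1, 0], [0, 1]]  # m = 4, d = 2
pixel = [1, 2, 3, 4, 5, 6]                 # M = 6 channels
print(channel_sliding_conv(pixel, kernel))
```

With M = 6, m = 4, and d = 2, the window fits at channel offsets 0 and 2, so two windows each contribute two output channels.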

[0112] In some embodiments, the transformation subunit is used to retain the feature values greater than zero in the layer normalized feature information, and to set the feature values less than zero in the layer normalized feature information to zero, so as to obtain nonlinear feature information.
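The transformation described here keeps positive values and zeroes out negative ones, which is the behavior commonly known as a rectified linear unit (ReLU). A minimal sketch, with a hypothetical function name:

```python
def to_nonlinear(layer_normalized):
    """Keep values greater than zero; set all other values to zero."""
    return [v if v > 0 else 0.0 for v in layer_normalized]


print(to_nonlinear([-1.5, 0.0, 2.0]))
```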

[0113] In some embodiments, an accumulation unit is also included;

[0114] The accumulation unit increments the iteration count by one for each transformation that yields nonlinear feature information. If the current iteration count meets a preset threshold, the determination unit is triggered to perform dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs. If the current iteration count does not meet the preset threshold, the extraction subunit is triggered to perform spatial dimension analysis on the standard image to extract spatial feature information.

[0115] In some embodiments, for the process of setting the number of input channels and the number of output channels of the pointwise channel sliding convolution kernel, the apparatus further includes a first processing unit, a calculation unit, a second processing unit, and a third processing unit.

[0116] The first processing unit is used to take the product of the number of input channels of the image classification model and the selected input ratio value as the number of input channels of the pointwise channel sliding convolution kernel;

[0117] The calculation unit is used to calculate the product of the number of input channels of the pointwise channel sliding convolution kernel and the selected output ratio value;

[0118] The second processing unit is used to take the product value as the number of output channels of the pointwise channel sliding convolution kernel when the product value is less than the number of output channels of the image classification model;

[0119] The third processing unit is used to take the product of the number of output channels of the image classification model and the selected output ratio value as the number of output channels of the pointwise channel sliding convolution kernel when the product value is not less than the number of output channels of the image classification model; wherein both the input ratio value and the output ratio value are less than 1.

[0120] In some embodiments, it further includes an establishment unit and a query unit;

[0121] The establishment unit is used to pre-establish a list of correspondences between different image types and different ratio groups;

[0122] The query unit is used to query the matching target ratio group from the correspondence list based on the image type to which the image to be classified belongs when the image to be classified is obtained; wherein, the target ratio group includes the input ratio value and the output ratio value.

[0123] For a description of the features of the embodiment corresponding to Figure 3, refer to the relevant description of the embodiment corresponding to Figure 1, which is not repeated here.

[0124] As can be seen from the above technical solution, the acquired image to be classified is preprocessed to obtain a standard image. The standard image is then processed by depthwise convolution and pointwise channel sliding convolution using an image classification model to obtain an output feature map, wherein the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than both the number of output channels of the image classification model and the number of input channels of the pointwise channel sliding convolution kernel. Dimensionality reduction and classification processing are performed on the output feature map to determine the image category to which the image to be classified belongs. In this technical solution, by setting the input and output channel counts of the pointwise channel sliding convolution kernel below the input and output channel counts of the image classification model, an ultra-lightweight and efficient image classification convolutional neural network with extremely low time and space complexity can be constructed. This reduces the number of parameters and the computational cost of the pointwise convolution in depthwise separable convolution, balances the network's spatial feature extraction capability against its channel feature fusion capability, and greatly reduces the overall number of network parameters and computational cost. Compared with traditional depthwise separable convolutional neural networks, the network proposed in this application completes spatial feature extraction and channel feature fusion with very few parameters and very little computation, thereby improving the efficiency of image classification.

[0125] Figure 4 is a structural diagram of an electronic device provided in an embodiment of this application. As shown in Figure 4, the electronic device includes: a memory 40 for storing computer programs;

[0126] The processor 41 is used to execute computer programs to implement the steps of the image classification method as described in the above embodiments.

[0127] The electronic devices provided in this embodiment may include, but are not limited to, smartphones, tablets, laptops, or desktop computers.

[0128] The processor 41 may include one or more processing cores, such as a quad-core processor or an octa-core processor. The processor 41 may be implemented in at least one hardware form selected from a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 41 may also include a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 41 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor 41 may also include an AI (Artificial Intelligence) processor, which handles computational operations related to machine learning.

[0129] The memory 40 may include one or more computer-readable storage media, which may be non-transitory. The memory 40 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In this embodiment, the memory 40 is used to store at least the following computer program 401, which, after being loaded and executed by the processor 41, is capable of implementing the relevant steps of the image classification method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 40 may also include an operating system 402 and data 403, and the storage method may be temporary or permanent storage. The operating system 402 may include Windows, Unix, Linux, etc. The data 403 may include, but is not limited to, the number of model input channels, the number of model output channels, and image categories.

[0130] In some embodiments, the electronic device may further include a display screen 42, an input / output interface 43, a communication interface 44, a power supply 45, and a communication bus 46.

[0131] Those skilled in the art will understand that the structure shown in Figure 4 does not constitute a limitation on the electronic device, which may include more or fewer components than those shown.

[0132] It is understood that if the image classification method in the above embodiments is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and, when run, executes all or part of the steps of the methods in the various embodiments of this application. The aforementioned storage medium includes: a USB flash drive, a mobile hard disk, read-only memory (ROM), random access memory (RAM), electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, a magnetic disk, an optical disk, or any other medium capable of storing program code.

[0133] Based on this, embodiments of this application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the image classification method described above.

[0134] The foregoing has provided a detailed description of an image classification method, apparatus, device, and computer-readable storage medium provided in the embodiments of this application. The various embodiments are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.

[0135] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0136] The foregoing has provided a detailed description of an image classification method, apparatus, device, and computer-readable storage medium provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.

Claims

1. An image classification method, characterized in that it comprises:
preprocessing an acquired image to be classified to obtain a standard image;
performing depthwise convolution and pointwise channel sliding convolution on the standard image using an image classification model to obtain an output feature map, wherein the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and also less than the number of input channels of the pointwise channel sliding convolution kernel; the image classification model consists of five layers: the first layer is a depthwise convolution layer, the second layer is a batch normalization layer, the third layer is a pointwise channel sliding convolution layer, the fourth layer is a layer normalization layer, and the fifth layer is an activation layer;
performing dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs;
wherein performing depthwise convolution and pointwise channel sliding convolution on the standard image using the image classification model to obtain the output feature map comprises:
performing spatial dimension analysis on the standard image to extract spatial feature information;
performing batch normalization on the spatial feature information to obtain batch normalized feature information;
performing sliding convolution processing on the batch normalized feature information according to the pointwise channel sliding convolution kernel to obtain fused feature information;
performing layer normalization processing on the fused feature information according to layer normalization parameters to obtain layer normalized feature information;
converting the layer normalized feature information into nonlinear feature information;
wherein performing sliding convolution processing on the batch normalized feature information according to the pointwise channel sliding convolution kernel to obtain the fused feature information comprises:
using the number of output channels of the pointwise channel sliding convolution kernel as the sliding stride;
performing, according to the sliding stride and the pointwise channel sliding convolution kernel, sliding convolution on the batch normalized feature information in the channel dimension and the spatial dimension respectively to obtain the fused feature information.

2. The image classification method according to claim 1, characterized in that the step of converting the layer normalized feature information into nonlinear feature information comprises: retaining the feature values greater than zero in the layer normalized feature information, and setting the feature values less than zero in the layer normalized feature information to zero, to obtain the nonlinear feature information.

3. The image classification method according to claim 1, characterized in that, after converting the layer normalized feature information into nonlinear feature information, the method further comprises: incrementing an iteration count by one each time the transformation yields nonlinear feature information; if the current iteration count meets a preset threshold, executing the step of performing dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs; if the current iteration count does not meet the preset threshold, returning to the step of performing spatial dimension analysis on the standard image to extract spatial feature information.

4. The image classification method according to any one of claims 1 to 3, characterized in that, for the process of setting the number of input channels and the number of output channels of the pointwise channel sliding convolution kernel, the method further comprises: using the product of the number of input channels of the image classification model and a selected input ratio value as the number of input channels of the pointwise channel sliding convolution kernel; calculating the product of the number of input channels of the pointwise channel sliding convolution kernel and a selected output ratio value; if the product value is less than the number of output channels of the image classification model, using the product value as the number of output channels of the pointwise channel sliding convolution kernel; if the product value is not less than the number of output channels of the image classification model, using the product of the number of output channels of the image classification model and the selected output ratio value as the number of output channels of the pointwise channel sliding convolution kernel; wherein both the input ratio value and the output ratio value are less than 1.

5. The image classification method according to claim 4, characterized in that the method further comprises: pre-establishing a list of correspondences between different image types and different ratio groups; when the image to be classified is obtained, querying a matching target ratio group from the correspondence list based on the image type to which the image to be classified belongs; wherein the target ratio group includes an input ratio value and an output ratio value.

6. An image classification device, characterized in that it comprises a preprocessing unit, an ultra-lightweight convolutional unit, and a determination unit;
the preprocessing unit is used to preprocess an acquired image to be classified to obtain a standard image;
the ultra-lightweight convolutional unit is used to perform depthwise convolution and pointwise channel sliding convolution on the standard image using an image classification model to obtain an output feature map, wherein the number of input channels of the pointwise channel sliding convolution kernel is less than the number of input channels of the image classification model, and the number of output channels of the pointwise channel sliding convolution kernel is less than the number of output channels of the image classification model and less than the number of input channels of the pointwise channel sliding convolution kernel; the image classification model consists of five layers: the first layer is a depthwise convolution layer, the second layer is a batch normalization layer, the third layer is a pointwise channel sliding convolution layer, the fourth layer is a layer normalization layer, and the fifth layer is an activation layer;
the determination unit is used to perform dimensionality reduction and classification processing on the output feature map to determine the image category to which the image to be classified belongs;
the ultra-lightweight convolutional unit includes an extraction subunit, a batch normalization subunit, a fusion subunit, a layer normalization subunit, and a transformation subunit;
the extraction subunit is used to perform spatial dimension analysis on the standard image to extract spatial feature information;
the batch normalization subunit is used to perform batch normalization processing on the spatial feature information to obtain batch normalized feature information;
the fusion subunit is used to perform sliding convolution processing on the batch normalized feature information according to the pointwise channel sliding convolution kernel to obtain fused feature information;
the layer normalization subunit is used to perform layer normalization processing on the fused feature information according to layer normalization parameters to obtain layer normalized feature information;
the transformation subunit is used to convert the layer normalized feature information into nonlinear feature information;
the batch normalization subunit is used to take the number of output channels of the pointwise channel sliding convolution kernel as the sliding stride, and, according to the sliding stride and the pointwise channel sliding convolution kernel, sliding convolution is performed on the batch normalized feature information in the channel dimension and the spatial dimension respectively to obtain the fused feature information.

7. An electronic device, characterized in that it comprises: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of the image classification method according to any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the image classification method according to any one of claims 1 to 5.