Training method of image classification model, image classification method, device and equipment
By encoding image features with an attention mechanism and training the image classification model on difference information between the resulting features, the method addresses the high cost and low efficiency caused by the reliance on labels in image classification model training, making the training process more efficient.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TENCENT TECHNOLOGY (SHENZHEN) CO LTD
- Filing Date
- 2022-03-17
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, the training of image classification models relies on a large number of manually labeled tags, resulting in high costs and low efficiency.
The image classification model encodes the image features of the sample images based on the attention mechanism to obtain the first weight, and uses the difference information between the domain features and the classification task features for training, thus avoiding reliance on labels.
This reduces the training cost of image classification models and improves training efficiency.
Figure CN116824196B_ABST
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a training method for an image classification model, an image classification method, an apparatus, and a device.
Background Technology
[0002] With the development of computer technology, image recognition technology is being applied in increasingly wider scenarios. For example, image recognition technology is used in face recognition, object detection, and image classification.
[0003] In related technologies, image classification often involves training an image classification model to classify images. During training, the model is built upon the image features and labels of sample images.
[0004] However, the training methods in related technologies rely on a large number of manually labeled tags, resulting in high costs and low efficiency in training image classification models.
Summary of the Invention
[0005] This application provides a training method for an image classification model, an image classification method, an apparatus, and a device, which can improve the training effect of the image classification model. The technical solution is as follows:
[0006] In one aspect, a method for training an image classification model is provided, the method comprising:
[0007] The image features of multiple sample images are input into an image classification model. The image classification model encodes the image features of the multiple sample images based on an attention mechanism to obtain the first weights of the multiple sample images.
[0008] The image classification model processes the image features of the multiple sample images based on the first weights of the multiple sample images to obtain the domain features and classification task features of the multiple sample images. The domain features are used to represent the environment in which the sample images were captured, and the classification task features are used to represent the type of the sample images.
[0009] The image classification model is trained based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images.
[0010] In one aspect, an image classification method is provided, the method comprising:
[0011] The target image is input into an image classification model, and the image features of the target image are extracted by the image classification model.
[0012] The image features of the target image are processed by the image classification model to obtain the classification task features of the target image, which are used to represent the type of the target image;
[0013] The image classification model predicts the target image's label based on its classification task features and outputs the target image's label.
[0014] In one aspect, a training device for an image classification model is provided, the device comprising:
[0015] The weight acquisition module is used to input the image features of multiple sample images into the image classification model, and through the image classification model, encode the image features of the multiple sample images based on the attention mechanism to obtain the first weight of the multiple sample images;
[0016] The first feature processing module is used to process the image features of the multiple sample images based on the first weights of the multiple sample images through the image classification model, so as to obtain the domain features and classification task features of the multiple sample images. The domain features are used to represent the environment in which the sample images were captured, and the classification task features are used to represent the type of the sample images.
[0017] The training module is used to train the image classification model based on the first difference information between each pair of domain features of the plurality of sample images and the second difference information between each pair of classification task features of the plurality of sample images.
[0018] In one possible implementation, the first feature processing module is configured to, for any sample image among the plurality of sample images, multiply the first weight of the sample image with the image features of the sample image through the image classification model to obtain the domain features of the sample image; and multiply the second weight of the sample image with the image features of the sample image to obtain the classification task features of the sample image, wherein the sum of the second weight and the first weight is a target value.
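The decomposition described in this implementation can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `decompose`, the per-element form of the first weight, and the target value of 1.0 are assumptions made for the example.

```python
def decompose(features, first_weight, target_value=1.0):
    """Split one image-feature vector into (domain, task) features.

    Assumption: the first weight is applied element by element and the
    target value is 1.0. The second weight is target_value - first_weight,
    so the two weights sum to the target value as the text requires.
    """
    if len(features) != len(first_weight):
        raise ValueError("features and first_weight must have the same length")
    second_weight = [target_value - w for w in first_weight]
    # Domain features: first weight multiplied with the image features.
    domain = [w * x for w, x in zip(first_weight, features)]
    # Classification task features: second weight multiplied with the image features.
    task = [w * x for w, x in zip(second_weight, features)]
    return domain, task


# Hypothetical values chosen so the arithmetic is exact.
features = [2.0, -4.0, 8.0]
first_weight = [0.25, 0.5, 1.0]
domain, task = decompose(features, first_weight)
```

With a target value of 1.0, the domain and classification task features add back up to the original image features, which is one way to read "decomposing" the features.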
[0019] In one possible implementation, the training module is used to train the image classification model based on the first difference information between the domain features of every two positive sample images in a plurality of positive sample image groups, each of the positive sample image groups comprising three positive sample images, the positive sample images being sample images of the target type among the plurality of sample images;
[0020] The image classification model is trained based on the second difference information between the classification task features of every two sample images in multiple sample image groups, where each sample image group includes three sample images from the multiple sample images.
[0021] In one possible implementation, the image classification model includes a first image classification sub-model and a second image classification sub-model. The training module is used to train the first image classification sub-model for any positive sample image group among the plurality of positive sample image groups, based on a first difference information between the domain features of the first sample image and the domain features of the second sample image, and a first difference information between the domain features of the first sample image and the domain features of the third sample image. The first sample image, the second sample image, and the third sample image all belong to the positive sample image group.
[0022] For any of the plurality of sample image groups, the second image classification sub-model is trained based on the second difference information between the classification task features of the target sample image and the classification task features of similar sample images, the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image, and the gradient of the loss function of the first image classification sub-model. The similar sample images and the target sample image are sample images of the same type, and the relative sample images and the target sample image are sample images of different types. The target sample image, the similar sample image, and the relative sample image all belong to the plurality of sample images.
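One familiar way to combine the two pieces of second difference information (target versus similar, target versus relative) is a triplet margin loss. The passage above does not give a loss formula, so the sketch below uses that loss purely as a stand-in; the Euclidean distance, the margin of 1.0, and all names are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(target, similar, relative, margin=1.0):
    """Assumed stand-in loss: pull the similar sample toward the target and
    push the relative (different-type) sample at least `margin` further away."""
    d_similar = euclidean(target, similar)    # difference to the same-type sample
    d_relative = euclidean(target, relative)  # difference to the different-type sample
    return max(0.0, d_similar - d_relative + margin)


# Hypothetical classification task features.
target = [0.0, 0.0]
similar = [3.0, 4.0]    # distance 5 from target
relative = [6.0, 8.0]   # distance 10 from target

loss_separated = triplet_margin_loss(target, similar, relative)
# Swapping the roles makes the "similar" sample the far one, which is penalized.
loss_violating = triplet_margin_loss(target, relative, similar)
```

The loss is zero once the different-type sample is sufficiently further from the target than the same-type sample, and positive otherwise.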
[0023] In one possible implementation, the training module is configured to determine first training parameters based on first difference information between a first domain feature of the first sample image and a second domain feature of the second sample image, wherein the first domain feature of the first sample image is a domain feature extracted by the first image classification sub-model, and the second domain feature of the second sample image is a domain feature extracted by the second image classification sub-model; determine second training parameters based on the first difference information between the first domain feature of the first sample image and a third domain feature of the third sample image, wherein the third domain feature of the third sample image is a domain feature extracted by the second image classification sub-model; determine the gradient of the loss function of the first image classification sub-model based on the first training parameters and the second training parameters; and train the first image classification sub-model based on the gradient of the loss function of the first image classification sub-model.
[0024] In one possible implementation, the training module is used to determine the gradient of the loss function of the second image classification sub-model based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample image, and the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image; and to train the second image classification sub-model based on the gradient of the loss function of the first image classification sub-model and the gradient of the loss function of the second image classification sub-model.
[0025] In one possible implementation, the training module is used to determine a third training parameter based on the second difference information between a first classification task feature of the target sample image and a second classification task feature of the similar sample image, wherein the first classification task feature of the target sample image is a classification task feature extracted by the first image classification sub-model, and the second classification task feature of the similar sample image is a classification task feature extracted by the second image classification sub-model; determine a fourth training parameter based on the second difference information between the first classification task feature of the target sample image and a third classification task feature of the relative sample image, wherein the third classification task feature of the relative sample image is a classification task feature extracted by the second image classification sub-model; and determine the gradient of the loss function of the second image classification sub-model based on the third training parameter and the fourth training parameter.
[0026] In one possible implementation, the device further includes:
[0027] The encoding module is used to encode the classification task features of the multiple sample images based on the attention mechanism to obtain the target classification task features of the multiple sample images.
[0028] The training module is used to train the image classification model based on the first difference information between the domain features of the multiple sample images and the second difference information between the target classification task features of the multiple sample images.
[0029] In one possible implementation, the training module is further configured to train the image classification model based on the classification task features of multiple sample images and the target labels of the multiple sample images.
[0030] In one possible implementation, the training module is further configured to use the image classification model to predict the classification task features of the plurality of sample images and output the predicted labels of the plurality of sample images; and to train the image classification model based on the third difference information between the predicted labels and the target labels of the plurality of sample images.
[0031] In one aspect, an image classification device is provided, the device comprising:
[0032] The feature extraction module is used to input the target image into the image classification model, extract features from the target image through the image classification model, and output the image features of the target image.
[0033] The second feature processing module is used to process the image features of the target image through the image classification model to obtain the classification task features of the target image, wherein the classification task features are used to represent the type of the target image;
[0034] The prediction module is used to predict the label of the target image based on the classification task features of the target image using the image classification model.
[0035] In one aspect, a computer device is provided, the computer device including one or more processors and one or more memories, the one or more memories storing at least one computer program, the computer program being loaded and executed by the one or more processors to implement the training method for the image classification model or the image classification method.
[0036] In one aspect, a computer-readable storage medium is provided, wherein at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the training method for the image classification model or the image classification method.
[0037] In one aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the above-mentioned training method for the image classification model or the image classification method.
[0038] The technical solution provided in this application encodes image features of sample images based on an attention mechanism during image classification model training, thereby obtaining first weights. These first weights are then used to process the image features, yielding domain features and classification task features of the sample images. When training the image classification model based on the first difference information between domain features and the second difference information between classification task features, it does not rely on labels, thus reducing the cost of training the image classification model and improving training efficiency.
Description of the Drawings
[0039] To illustrate the technical solutions in the embodiments of this application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The accompanying drawings described below show only some embodiments of this application. Those skilled in the art can obtain other drawings based on these drawings without creative effort.
[0040] Figure 1 is a schematic diagram of the implementation environment of a training method for an image classification model provided in an embodiment of this application;
[0041] Figure 2 is a flowchart of a training method for an image classification model provided in an embodiment of this application;
[0042] Figure 3 is a flowchart of a training method for an image classification model provided in an embodiment of this application;
[0043] Figure 4 is a schematic diagram of training an image classification model provided in an embodiment of this application;
[0044] Figure 5 is a schematic diagram of a contrastive learning method provided in an embodiment of this application;
[0045] Figure 6 is a schematic diagram of training an image classification model provided in an embodiment of this application;
[0046] Figure 7 is a flowchart of a training method for an image classification model provided in an embodiment of this application;
[0047] Figure 8 is a flowchart of a training method for an image classification model provided in an embodiment of this application;
[0048] Figure 9 is a flowchart of an image classification method provided in an embodiment of this application;
[0049] Figure 10 is a schematic diagram of the structure of a training device for an image classification model provided in an embodiment of this application;
[0050] Figure 11 is a schematic diagram of the structure of an image classification device provided in an embodiment of this application;
[0051] Figure 12 is a schematic diagram of the structure of a terminal provided in an embodiment of this application;
[0052] Figure 13 is a schematic diagram of the structure of a server provided in an embodiment of this application.
Detailed Description
[0053] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.
[0054] In this application, the terms "first," "second," etc., are used to distinguish identical or similar items with essentially the same function. It should be understood that there is no logical or temporal dependency between "first," "second," and "nth," nor are there any restrictions on quantity or execution order.
[0055] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.
[0056] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.
[0057] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
[0058] Semantic features: Features used to represent the semantics expressed by text. Different texts can correspond to the same semantic features; for example, two differently worded questions that both ask about today's weather can correspond to the same semantic feature. Computer devices can map characters in text into character vectors, and combine and operate on these character vectors according to the relationships between characters to obtain the semantic features of the text. For example, computer devices can use Bidirectional Encoder Representations from Transformers (BERT) to obtain the semantic features of text.
[0059] Mask: A mask is a string of binary code that is multiplied element by element with a target field to either hide or show specific characters within the target field. For example, if the target field is (1, 1, 0, 1) and the mask is (1, 0, 1, 0), multiplying the target field and the mask yields (1, 0, 0, 0). This means the first and third characters of the target field are preserved, while the second and fourth characters are "masked" and become 0. The mask thus determines which characters in the target field are preserved and which are "masked."
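The masking example can be checked with a few lines of Python. The function name `apply_mask` is chosen for illustration only.

```python
def apply_mask(target_field, mask):
    """Multiply a target field by a binary mask, element by element."""
    if len(target_field) != len(mask):
        raise ValueError("target field and mask must have the same length")
    return tuple(t * m for t, m in zip(target_field, mask))


target_field = (1, 1, 0, 1)
mask = (1, 0, 1, 0)
masked = apply_mask(target_field, mask)

# 1-based positions that the mask preserves and hides.
kept_positions = [i + 1 for i, m in enumerate(mask) if m == 1]
hidden_positions = [i + 1 for i, m in enumerate(mask) if m == 0]
```

Positions 1 and 3 pass through unchanged (the third character was already 0), while positions 2 and 4 are forced to 0.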
[0060] Normalization: Mapping sequences with different value ranges to the interval (0, 1) to facilitate data processing. In some cases, normalized values can be directly expressed as probabilities.
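As one common example of normalization whose outputs can be read as probabilities, the sketch below applies a softmax to a list of scores. The text does not say which normalization is used, so softmax is an assumption made for illustration.

```python
import math


def softmax(values):
    """Map arbitrary real scores to positive values that sum to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical unnormalized scores
probs = softmax(scores)
```

Each entry of `probs` lies in (0, 1), the entries sum to 1, and the ordering of the original scores is preserved.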
[0061] The Gaussian distribution, also known as the normal distribution, has a bell-shaped curve that is high in the middle and low at both ends. The expected value μ determines the position of the center of the curve, while the standard deviation σ determines how widely the curve spreads. The Gaussian distribution with μ = 0 and σ = 1 is the standard Gaussian distribution.
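For reference, the probability density function of the Gaussian distribution with expected value μ and standard deviation σ, followed by the standard case μ = 0, σ = 1, is:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
\varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^{2}}{2}\right)
```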
[0062] Learning rate: Used to control the learning progress of the model. During gradient descent, the learning rate scales how far the network weights are adjusted along the gradient of the loss function. If the learning rate is too large, an update may overshoot the optimum, so the loss stays large or even diverges. If the learning rate is too small, the loss changes very slowly, which greatly increases the time the network needs to converge and makes it easy to get trapped in local minima or saddle points.
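The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x², whose gradient is 2x; the function, the starting point, and the three learning rates are all chosen for illustration and are not taken from the application. Because this toy function is convex, it shows overshooting and slow progress but not local minima or saddle points.

```python
def gradient_descent(lr, start=1.0, steps=20):
    """Minimize f(x) = x**2 with a fixed learning rate; return the final x."""
    x = start
    for _ in range(steps):
        grad = 2.0 * x     # derivative of x**2
        x = x - lr * grad  # step against the gradient, scaled by lr
    return x


x_good = gradient_descent(lr=0.1)         # steady approach toward the optimum at 0
x_too_large = gradient_descent(lr=1.1)    # each step overshoots; |x| grows
x_too_small = gradient_descent(lr=0.001)  # barely moves in 20 steps
```

Each update multiplies x by (1 − 2·lr), so the iterate shrinks when that factor has magnitude below 1 and grows when it exceeds 1.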
[0063] Embedded coding, mathematically speaking, represents a correspondence, that is, mapping data in space X to space Y using a function F. This function F is injective, and the mapping result preserves the structure. An injective function means that the mapped data uniquely corresponds to the original data, and preserving the structure means that the order of the original data remains the same. For example, if there are data X1 and X2 before mapping, after mapping we get Y1 corresponding to X1 and Y2 corresponding to X2. If the original data X1 > X2, then correspondingly, the mapped data Y1 > Y2. For words, this means mapping words to another space to facilitate subsequent machine learning and processing.
[0064] Attention weights represent the importance of a piece of data during training or prediction. Importance indicates the magnitude of the influence of input data on output data. Data with high importance corresponds to higher attention weights, while data with low importance corresponds to lower attention weights. The importance of data varies in different scenarios, and training the model to assign attention weights is essentially the process of determining data importance.
[0065] It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, data stored, data displayed, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0066] Figure 1 is a schematic diagram illustrating the implementation environment of a training method for an image classification model provided in an embodiment of this application. Referring to Figure 1, the implementation environment may include terminal 110 and server 140.
[0067] Terminal 110 is connected to server 140 via a wireless or wired network. Optionally, terminal 110 may be an in-vehicle terminal, smart TV, smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, etc., but is not limited to these. Terminal 110 has an application that supports image classification installed and running.
[0068] Server 140 is a standalone physical server, or a server cluster or distributed system consisting of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms.
[0069] Optionally, there may be multiple terminals 110 and multiple servers 140.
[0070] After introducing the implementation environment of the embodiments of this application, the application scenarios of the embodiments of this application will be introduced in conjunction with the above implementation environment. In the following description, the terminal is the terminal 110 in the above implementation environment, and the server is the server 140 in the above implementation environment.
[0071] The technical solutions provided in this application can be applied to various scenarios where image classification is required, such as distinguishing between real and non-real images, or distinguishing between images from different sources.
[0072] When the technical solution provided in this application is applied to a scenario that distinguishes between real and non-real images, the server can train an image classification model using the training method of the image classification model provided in this application, and then use the image classification model to classify images. During training, the server inputs the image features of multiple sample images into the image classification model. The image classification model encodes the image features of the multiple sample images based on an attention mechanism to obtain the first weights of the multiple sample images. These multiple sample images include both real and non-real images. The server processes the image features of the multiple sample images based on the first weights using the image classification model to obtain the domain features and classification task features of the multiple sample images. When the server processes the image features of the multiple sample images based on the first weights, it decomposes the image features of each sample image into domain features and classification task features. The domain features represent the environment in which the sample image was captured. In some embodiments, this environment includes an external environment and an internal environment. The external environment includes brightness and lighting, etc. The internal environment includes parameters of the capturing device used to capture the sample image, etc. The classification task features represent the type of the sample image, which includes both real and non-real images. The server trains an image classification model based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images. 
The first difference information between domain features represents the differences in the shooting environment of the sample images, and the second difference information between classification task features represents the differences in the types of the sample images. After training the image classification model, the server can input the target image to be classified into the image classification model, and the image classification model will output the type of the target image, that is, whether the target image is a real image or a non-real image.
[0073] In some embodiments, the aforementioned real image is an image obtained by photographing a real person, while the non-real image is an image obtained by photographing a photograph, a composite image, or the like. In this case, the technical solution provided by the embodiments of this application can be applied to identity authentication scenarios: through the image classification model, the server can predict whether an obtained image was captured from a real person (a live subject), captured from a photograph, or synthesized.
[0074] It should be noted that the following description of the technical solution provided in this application uses a server as the execution subject as an example. In other possible implementations, a terminal may also be used as the execution subject to execute the technical solution provided in this application. The embodiments of this application do not limit the type of execution subject.
[0075] Figure 2 is a flowchart of a training method for an image classification model provided in an embodiment of this application. Referring to Figure 2, and taking the server as the executing entity as an example, the method includes:
[0076] 201. The server inputs the image features of multiple sample images into the image classification model. The image classification model encodes the image features of the multiple sample images based on the attention mechanism to obtain the first weight of the multiple sample images.
[0077] Here, the sample images are used to train the image classification model, and the image features, also known as image embedding features, are used to represent the characteristics of the image. In some embodiments, the image features are in vector form. The image classification model is used to classify images based on the input image features. The first weight is the weight used to decompose the image features.
[0078] 202. The server uses the image classification model to process the image features of the multiple sample images based on the first weight of the multiple sample images, and obtains the domain features and classification task features of the multiple sample images. The domain features are used to represent the environment in which the sample images were captured, and the classification task features are used to represent the type of the sample images.
[0079] The process of processing the image features of multiple sample images based on a first weight involves decomposing the image features into domain features and classification task features based on this first weight. Domain features represent the environment in which the sample images were captured, including both external and internal environments. Classification task features represent the type of the sample images; the image classification model can predict and output the image type based on these classification task features.
[0080] 203. The server trains the image classification model based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images.
[0081] Here, the first difference information between each pair of domain features represents the differences in the shooting environment of the sample images, while the second difference information between each pair of classification task features represents the differences in the types of the sample images. Training the image classification model based on the first difference information improves its ability to recognize the environment of the image, while training it based on the second difference information improves its ability to classify images without environmental interference, thus enhancing the overall image classification ability of the model.
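Computing difference information "between each pair" of features can be sketched as below. The application does not state in this passage how a difference is measured, so Euclidean distance is assumed here, and the function name and sample values are hypothetical.

```python
import math
from itertools import combinations


def pairwise_differences(features):
    """Return {(i, j): distance} for every pair of feature vectors with i < j.

    Assumption: the "difference information" is the Euclidean distance.
    """
    return {
        (i, j): math.dist(features[i], features[j])
        for i, j in combinations(range(len(features)), 2)
    }


# Hypothetical domain features of three sample images.
domain_features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
first_difference = pairwise_differences(domain_features)
```

The same helper could be applied to the classification task features to obtain the second difference information; for n sample images it yields n(n − 1)/2 values.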
[0082] The technical solution provided in this application encodes image features of sample images based on an attention mechanism during image classification model training, thereby obtaining first weights. These first weights are then used to process the image features, yielding domain features and classification task features of the sample images. When training the image classification model based on the first difference information between domain features and the second difference information between classification task features, it does not rely on labels, thus reducing the cost of training the image classification model and improving training efficiency.
[0083] It should be noted that steps 201-203 above are a brief introduction to the technical solutions provided in the embodiments of this application. These technical solutions are described in more detail below with reference to some examples. Referring to Figure 3, and taking the server as the executing entity as an example, the method includes:
[0084] 301. The server extracts features from multiple sample images to obtain the image features of these multiple sample images.
[0085] Image features are also known as floating-point features or floating-point embeddings.
[0086] In one possible implementation, the server inputs the multiple sample images into an image classification model, and extracts features from the multiple sample images using the image classification model to obtain the image features of the multiple sample images. In some embodiments, the image classification model includes a feature extraction unit, and the server uses the feature extraction unit of the image classification model to extract features from the multiple sample images, thereby obtaining the image features of the multiple sample images.
[0087] In this implementation, feature extraction is performed on multiple sample images using an image classification model to obtain the image features of these multiple sample images, thereby achieving an abstract representation of these multiple sample images and improving the efficiency of subsequent calculations.
[0088] To illustrate the above implementation methods, four examples are provided below.
[0089] Example 1: The server inputs the multiple sample images into an image classification model, and the image classification model performs convolution and pooling on the multiple sample images to obtain the image features of the multiple sample images.
[0090] For example, the server inputs multiple sample images into an image classification model. The model's convolutional layers convolve the sample images to obtain feature maps. The server then uses the model's pooling layers to perform either max pooling or average pooling on these feature maps to obtain the image features of the sample images. In some embodiments, the server represents the sample images as matrices and the image features as vectors, using a sliding convolution kernel across the sample images during the convolution process.
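The convolution and pooling in Example 1 can be illustrated with a minimal pure-Python sketch. The function names (`conv2d_valid`, `global_pool`, `extract_features`), the stride of 1, the absence of padding, and the use of one pooled value per kernel are assumptions made for illustration only; they are not taken from this application, and a practical feature extractor such as ResNet stacks many such layers.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` across `image` (lists of lists), stride 1, no padding.

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


def global_pool(feature_map, mode="max"):
    """Reduce a feature map to one value by max pooling or average pooling."""
    values = [v for row in feature_map for v in row]
    if mode == "max":
        return max(values)
    if mode == "avg":
        return sum(values) / len(values)
    raise ValueError("mode must be 'max' or 'avg'")


def extract_features(image, kernels, mode="max"):
    """Return a feature vector with one pooled value per kernel (hypothetical helper)."""
    return [global_pool(conv2d_valid(image, k), mode) for k in kernels]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[1, -1], [0, 0]]]
features = extract_features(image, kernels, mode="avg")
```

Here the sample image is a matrix and the resulting image feature is a vector, matching the representation described in this example.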
[0091] In some embodiments, the image classification model is a feature extractor based on convolutional neural networks (CNNs), such as a neural network ResNet-101 (Residual Network 101) or ResNet-50 pre-trained on the large-scale open-source dataset ImageNet. This application does not limit this.
[0092] Example 2: The server inputs the multiple sample images into an image classification model, and the image classification model performs fully connected and pooling operations on the multiple sample images to obtain the image features of the multiple sample images.
[0093] For example, the server inputs multiple sample images into an image classification model. The model's fully connected layer processes the sample images to obtain fully connected features. The server then uses the pooling layer of the image classification model to perform either max pooling or average pooling on these fully connected features to obtain image features, also known as deep features or low-level features. In some embodiments, the server represents the sample images as matrices and the image features as vectors, and the fully connected operation multiplies a fully connected matrix by the matrix of each sample image. In some embodiments, the image classification model is a feature extractor based on deep neural networks (DNNs).
[0094] The feature extraction processes above are based on convolution and fully connected layers. The resulting image features express the deep features of the sample image and are also called the low-level features of the sample image. In other possible implementations, the image classification model can also extract the semantic features of the sample image, in which case the resulting image features reflect the semantics of the sample image. The following describes how the server extracts the semantic features of the sample image using the image classification model.
[0095] Example 3: The server inputs multiple sample images into an image classification model. The model encodes these images using an attention mechanism to obtain their image features. These image features, obtained through the image classification model, are also the semantic features of the corresponding images. In this implementation, the image classification model is a semantic feature encoder, such as a Transformer encoder.
[0096] For any one of the multiple sample images, the server inputs it into the image classification model. The model then embeds and encodes multiple parts of the sample image, resulting in multiple embedding vectors. Each embedding vector corresponds to a part of the sample image and represents the position and content of that part. The server inputs these embedding vectors into the image classification model and performs linear transformations using three linear transformation matrices to obtain the query vector, key vector, and value vector corresponding to each part of the sample image. Based on the query and key vectors corresponding to the multiple parts of the sample image, the server uses the image classification model to obtain the attention weights for each part. Finally, based on the attention weights and value vectors of each part of the sample image, the server uses the image classification model to obtain the attention encoding vector, which is essentially the image feature of the sample image.
[0097] For example, the server uses the image classification model to multiply each embedding vector by three linear transformation matrices to obtain the query vector, key vector, and value vector corresponding to each part of the sample image. For the first part of the sample image, the server uses the image classification model to determine the attention weights between the other parts and the first part, based on the query vector of the first part and the key vectors of the other parts of the sample image. What the multiple parts are depends on the type of the sample: if the sample is an image, the parts are different image patches; if the sample is audio, the parts are different segments of the audio; if the sample is text, the parts are different sentences of the text. The server then uses the image classification model to compute a weighted sum of the value vectors of the other parts, weighted by the attention weights of those parts with respect to the first part, to obtain the attention encoding vector of the first part. The above description takes the encoding of the first part of the sample image as an example. The server encodes the other parts of the sample image in the same way, and the implementation process is not repeated here.
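The query, key, and value computation described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the softmax normalization, the scaling by the square root of the key dimension, and letting each part also attend to itself are assumptions borrowed from the standard Transformer formulation, and the names `matvec`, `softmax`, and `self_attention` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings, w_q, w_k, w_v):
    """Return one attention encoding vector per part of the sample."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))  # assumed scaling factor
    dim = len(values[0])
    encoded = []
    for q in queries:
        # Attention weights from this part's query and every part's key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        encoded.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return encoded


identity = [[1.0, 0.0], [0.0, 1.0]]
parts = [[1.0, 0.0], [1.0, 0.0]]  # two identical patch embeddings
encoded = self_attention(parts, identity, identity, identity)
```

With identical parts, every attention weight is equal, so each encoding vector is simply the shared value vector.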
[0098] Examples 1, 2, and 3 above illustrate the extraction of low-level features and semantic features of an image using the image classification model. In other possible implementations, the server can also obtain both low-level features and semantic features of an image through the image classification model, as illustrated in Example 4 below.
[0099] Example 4: The server inputs multiple sample images into an image classification model. The model performs convolution and pooling on the sample images to obtain their low-level features. The server then encodes these images using an attention mechanism within the same classification model to obtain their semantic features. Finally, the server fuses the low-level and semantic features to obtain the final image features of the sample images.
[0100] For example, the image classification model includes a first feature extraction unit and a second feature extraction unit. The first feature extraction unit is used to extract the low-level features of the image, and the second feature extraction unit is used to extract the semantic features of the image. After the server inputs the multiple sample images into the image classification model, it obtains the low-level features of the multiple sample images through the first feature extraction unit and the semantic features of the multiple sample images through the second feature extraction unit. When the server fuses the low-level features and semantic features of the multiple sample images, it can use weighted summation. The weights of the weighted summation are set by technicians according to the actual situation, for example 0.3, 0.5, or 0.8, and the embodiments of this application do not limit this. The methods by which the server obtains the low-level features through the first feature extraction unit and the semantic features through the second feature extraction unit are the same as in Examples 1 and 3 above, and the implementation process is not described in detail here.
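The weighted summation used for fusion can be sketched as below. The function name `fuse_features`, the assumption that both feature vectors have the same dimension, and the choice to give the semantic features the complementary weight `1 - weight` are illustrative assumptions; the application only states that the weights are set by technicians.

```python
def fuse_features(low_level, semantic, weight=0.5):
    """Fuse two equally sized feature vectors by weighted summation.

    `weight` is applied to the low-level features and `1 - weight`
    to the semantic features (an assumed convention).
    """
    if len(low_level) != len(semantic):
        raise ValueError("feature vectors must have the same dimension")
    return [weight * lo + (1.0 - weight) * se for lo, se in zip(low_level, semantic)]


fused = fuse_features([2.0, 4.0], [4.0, 8.0], weight=0.5)
```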
[0101] It should be noted that the above explanation uses the extraction of low-level features and semantic features of an image by an image classification model as an example. With the development of science and technology, the server can also use image classification models with other structures to obtain image features, and this application does not limit this.
[0102] In some embodiments, the image classification model includes a first image classification sub-model and a second image classification sub-model, each of which includes a feature extraction unit. When the server extracts image features from the multiple sample images using the image classification model, it extracts features with the first image classification sub-model and the second image classification sub-model separately, obtaining the first image features output by the first image classification sub-model and the second image features output by the second image classification sub-model. For the same sample image, the first image features and the second image features may be different. In subsequent training, the server trains the first image classification sub-model using the first image features and trains the second image classification sub-model using the second image features, thereby completing the training of the image classification model.
[0103] The process by which the first image classification sub-model and the second image classification sub-model extract image features from the multiple sample images belongs to the same inventive concept as described above, and the implementation process will not be repeated here.
[0104] In some embodiments, the server employs different update methods to train the first image classification sub-model and the second image classification sub-model. For example, the server uses gradient update to train the first image classification sub-model and momentum update to train the second image classification sub-model. In this case, the first image classification sub-model is also referred to as the gradient update network, and the second image classification sub-model is also referred to as the momentum update network.
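The difference between the two update methods can be illustrated with the sketch below. The application does not give the update rules, so both functions are assumptions: `gradient_update` is a plain gradient descent step, and `momentum_update` is the exponential moving average commonly used for momentum networks in contrastive learning, where the second sub-model's parameters slowly track those of the first. The parameter names `lr` and `m` are hypothetical.

```python
def gradient_update(params, grads, lr=0.01):
    """Gradient descent step for the gradient update network (assumed rule)."""
    return [p - lr * g for p, g in zip(params, grads)]


def momentum_update(momentum_params, gradient_params, m=0.999):
    """Move the momentum update network toward the gradient update network.

    Assumed rule: theta_2 <- m * theta_2 + (1 - m) * theta_1.
    """
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(momentum_params, gradient_params)]


first_params = gradient_update([1.0], [2.0], lr=0.5)
second_params = momentum_update([1.0], first_params, m=0.5)
```

Under this assumed rule, a value of `m` close to 1 makes the second sub-model change slowly, which keeps the features it produces consistent across training steps.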
[0105] 302. The server inputs the image features of multiple sample images into the image classification model. The image classification model encodes the image features of the multiple sample images based on the attention mechanism to obtain the first weight of the multiple sample images.
[0106] The first weight is used to decompose image features, and each sample image corresponds to a first weight.
[0107] In one possible implementation, the server inputs the image features of the multiple sample images into an image classification model. The image classification model then performs pooling, fully connected, residual connection, and normalization operations on the image features of the multiple sample images to obtain the first weights of the multiple sample images.
[0108] In some embodiments, the image classification model includes a feature decoupling unit (DE), and the server encodes the image features of the multiple sample images based on the attention mechanism through this unit. For any sample image among the multiple sample images, the server inputs the image features of the sample image into the feature decoupling unit. The feature decoupling unit average-pools the image features to obtain the pooled features of the sample image. The unit then multiplies the pooled features by a first fully connected matrix to obtain the first fully connected features of the sample image; the parameters of the first fully connected matrix are learned during training of the image classification model. Next, the unit applies a residual connection to the first fully connected features to obtain the residual features of the sample image. The residual connection essentially adds the image features of the sample image to the first fully connected features, so that the information in the image features is retained to the greatest extent. The unit multiplies the residual features by a second fully connected matrix to obtain the second fully connected features of the sample image; the parameters of the second fully connected matrix are likewise learned during training. Finally, the unit normalizes the second fully connected features to obtain the first weight of the sample image. In some embodiments, the feature decoupling unit uses a sigmoid function (an S-shaped curve) to normalize the second fully connected features.
[0109] For example, the server uses the feature decoupling unit to encode the image features of the sample image using the following formula (1) to obtain the first weight of the sample image.
[0110] $$a = \mathrm{sigmoid}\big(W_2\,\mathrm{ReLU}(W_1\,\mathrm{avgpool}(F))\big) \tag{1}$$
[0111] Where $a$ is the first weight, also known as the attention weight; sigmoid() is the normalization function; $F$ is the image feature; avgpool() is the average pooling function; $W_1$ is the first fully connected weight; $W_2$ is the second fully connected weight; $C$ represents the dimension of the image features; $r$ represents the dimensionality reduction ratio, which is set by technicians according to the actual situation; and ReLU() (the rectified linear unit) is the function used for the residual connection.
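Formula (1) can be traced step by step with the following sketch. The shapes are assumptions for illustration: the image feature is taken as C channels of spatial values, the first fully connected weight as a (C/r) x C matrix, and the second as a C x (C/r) matrix, so that one weight is produced per channel. The function names are hypothetical, and ReLU is implemented as the usual max(0, x).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def avgpool(feature):
    """Average each channel (a list of rows) down to a single value."""
    pooled = []
    for channel in feature:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def first_weight(feature, w1, w2):
    """Compute a = sigmoid(W2 ReLU(W1 avgpool(F))), one weight per channel."""
    pooled = avgpool(feature)                              # length C
    hidden = [max(0.0, h) for h in matvec(w1, pooled)]     # length C / r (assumed)
    logits = matvec(w2, hidden)                            # length C
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# C = 2 channels, reduction ratio r = 2 (illustrative values).
feature = [[[1.0, 3.0]], [[2.0, 2.0]]]
w1 = [[1.0, 1.0]]
w2 = [[1.0], [-1.0]]
a = first_weight(feature, w1, w2)
```

Because of the sigmoid normalization, every component of the first weight lies strictly between 0 and 1.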
[0112] It should be noted that the above description is based on the example of the server processing the image features of any one of the multiple sample images through the feature decoupling unit to obtain the first weight of that sample image. The process of processing the image features of other sample images among the multiple sample images to obtain the first weight of the other sample images belongs to the same inventive concept as described above, and the implementation process will not be repeated.
[0113] In some embodiments, the image classification model includes a first image classification sub-model and a second image classification sub-model, each including a feature decoupling unit. During training of the image classification model, the server processes the first image features extracted by the feature extraction module of the first image classification sub-model using the feature decoupling unit, outputting multiple first weights corresponding to the first image classification sub-model, each first weight corresponding to a sample image. Similarly, the server processes the second image features extracted by the feature extraction module of the second image classification sub-model using the feature decoupling unit, outputting multiple first weights corresponding to the second image classification sub-model, each first weight corresponding to a sample image. The multiple first weights output by the first classification sub-model are used for training the first classification sub-model, and the multiple first weights output by the second classification sub-model are used for training the second classification sub-model.
[0114] It should be noted that the process by which the server processes image features through the feature decoupling unit of the first image classification sub-model or the second image classification sub-model belongs to the same inventive concept as described above, and the implementation process is described above and will not be repeated here.
[0115] 303. The server uses the image classification model to process the image features of the multiple sample images based on the first weight of the multiple sample images, and obtains the domain features and classification task features of the multiple sample images. The domain features are used to represent the environment in which the sample images were captured, and the classification task features are used to represent the type of the sample images.
[0116] In one possible implementation, for any one of the plurality of sample images, the server uses the image classification model to multiply the first weight of the sample image with the image features of the sample image to obtain the domain features of the sample image. The server then multiplies the second weight of the sample image with the image features of the sample image to obtain the classification task features of the sample image, and the sum of the second weight and the first weight is the target value.
[0117] Multiplying the image features by the first weight enhances some channels of the image features and weakens others. The enhanced channels are those related to the environment, and the weakened channels are those related to classification, so the resulting domain features reflect the environment in which the sample image was captured. Multiplying the image features by the second weight likewise enhances some channels and weakens others. In this case the enhanced channels are those related to classification, and the weakened channels are those related to the environment, so the resulting classification task features reflect the type of the sample image.
[0118] In this implementation, the server can decompose the image features of the sample image into domain features and classification task features using a first weight. Subsequently, the image classification model can be trained based on the domain features and classification task features, which can improve the efficiency of training the image classification model.
[0119] In some embodiments, the process of obtaining the domain features and classification task features of the sample image through the image classification model is performed by the feature decoupling unit of the image classification model.
[0120] For example, for any one of the multiple sample images, the server uses the image classification model and the following formula (2) to obtain the domain features and classification task features of the sample image based on the image features and the first weight of the sample image.
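[0121] $$F_d = a \odot F, \qquad F_t = (1 - a) \odot F \tag{2}$$
[0122] Where $F$ represents the image feature, $a$ is the first weight, $1-a$ is the second weight, and 1 is the target value. $F_d$ is the domain feature, $F_t$ is the classification task feature, and $\odot$ denotes channel-wise multiplication.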
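The decomposition in step 303 can be sketched as follows: each channel of the image feature is multiplied by the first weight to give the domain feature and by the second weight to give the classification task feature. Representing the feature as a list of flattened channels and the function name `decompose` are assumptions for illustration.

```python
def decompose(feature, first_weight, target=1.0):
    """Split an image feature into domain and classification task features.

    feature: list of C channels, each a flat list of values.
    first_weight: list of C weights; the second weight is target - first_weight.
    """
    domain = [[a * v for v in ch] for a, ch in zip(first_weight, feature)]
    task = [[(target - a) * v for v in ch] for a, ch in zip(first_weight, feature)]
    return domain, task


F = [[4.0, 8.0], [2.0, 6.0]]
a = [0.25, 0.5]
F_d, F_t = decompose(F, a)
```

Since the two weights sum to the target value 1, the domain feature and the classification task feature add back up to the original image feature, so no information is discarded by the split.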
[0123] It should be noted that the above description is based on the example of the server processing the image features of any one of the multiple sample images to obtain the domain features and classification task features of that sample image. The process of processing the image features of other sample images to obtain the domain features and classification task features of other sample images belongs to the same inventive concept as described above, and the implementation process will not be repeated.
[0124] In some embodiments, the image classification model includes a first image classification sub-model and a second image classification sub-model, each including a feature decoupling unit. For any sample image among the plurality of sample images, the server proceeds as follows. Through the feature decoupling unit of the first image classification sub-model, the server multiplies the first image feature of the sample image by the first weight of the sample image to obtain a first domain feature of the sample image, where the first weight is the one output by the feature decoupling unit of the first image classification sub-model. Through the same unit, the server multiplies the first image feature by the second weight of the sample image to obtain a first classification task feature of the sample image, where the second weight is determined from that first weight, the sum of the two being the target value. Through the feature decoupling unit of the second image classification sub-model, the server multiplies the second image feature of the sample image by the first weight of the sample image to obtain a second domain feature of the sample image, where the first weight is the one output by the feature decoupling unit of the second image classification sub-model. Through the same unit, the server multiplies the second image feature by the second weight of the sample image to obtain a second classification task feature of the sample image, where the second weight is determined from the first weight output by the feature decoupling unit of the second image classification sub-model.
[0125] It should be noted that the process by which the server obtains the domain features and classification task features of the sample image through the feature decoupling unit of the first image classification sub-model or the second image classification sub-model belongs to the same inventive concept as described above. The implementation process is described above and will not be repeated here.
[0126] 304. The server trains the image classification model based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images.
[0127] In one possible implementation, the server trains the image classification model based on first difference information between the domain features of every two positive sample images in multiple positive sample image groups, each group comprising three positive sample images, which are sample images of the target type from the multiple sample images. The server also trains the image classification model based on second difference information between the classification task features of every two sample images in the multiple sample image groups, each group comprising three sample images from the multiple sample images.
[0128] In this context, positive and negative sample images are different types of sample images. In scenarios where real and non-real images need to be distinguished, the target type is a real image. Accordingly, the positive sample image is a real image, and the negative sample image is a non-real image. In some embodiments, real images are also referred to as liveness images, which are images obtained by photographing real people. The process of classifying the input image using an image classification model is also called liveness detection.
[0129] To provide a clearer explanation of the above embodiments, the following description will be divided into two parts.
[0130] The first part involves the server training the image classification model based on the first difference information between the domain features of every two positive sample images in multiple positive sample image groups.
[0131] In one possible implementation, the image classification model includes a first image classification sub-model and a second image classification sub-model. For any positive sample image group among the plurality of positive sample image groups, the server trains the first image classification sub-model based on the first difference information between the domain features of the first sample image and the domain features of the second sample image, and the first difference information between the domain features of the first sample image and the domain features of the third sample image. The first sample image, the second sample image, and the third sample image all belong to the positive sample image group.
[0132] For example, the first sample image and the second sample image belong to the same domain, while the first sample image and the third sample image belong to different domains. That is, the first sample image and the second sample image are sample images taken in the same environment, while the first sample image and the third sample image are sample images taken in different environments. The server determines first training parameters based on the first difference information between the first domain features of the first sample image and the second domain features of the second sample image. The first domain features of the first sample image are the domain features extracted by the first image classification sub-model, and the second domain features of the second sample image are the domain features extracted by the second image classification sub-model. The server determines second training parameters based on the first difference information between the first domain features of the first sample image and the third domain features of the third sample image. The third domain features of the third sample image are the domain features extracted by the second image classification sub-model. The server determines the gradient of the loss function of the first image classification sub-model based on the first training parameters and the second training parameters. The server trains the first image classification sub-model based on the gradient of the loss function of the first image classification sub-model. 
The first training parameter is used to represent the similarity between the first sample image and the second sample image, which is the first difference information between the domain features of the first sample image and the domain features of the second sample image; the second training parameter is used to represent the similarity between the first sample image and the third sample image, which is the first difference information between the domain features of the first sample image and the domain features of the third sample image.
[0133] The purpose of training the first image classification sub-model based on the first difference information between domain features is to make the domain features of sample images belonging to the same domain (environment) as close as possible, and the domain features of sample images belonging to different domains as far apart as possible. That is, the domain features of the first sample image should be as close as possible to the domain features of the second sample image and as far as possible from the domain features of the third sample image, thereby reducing the influence of the domain on image classification. Referring to Figure 4, for the first sample image among the multiple sample images, the training objective is to minimize the distance between the domain features 401 of the first sample image and the domain features 402 of the multiple second sample images, and to maximize the distance between the domain features 401 of the first sample image and the domain features 403 of the multiple third sample images.
[0134] In some embodiments, the server trains the first image classification sub-model using gradient descent.
[0135] In some embodiments, the number of the aforementioned third sample images is multiple. Training the first image classification sub-model using multiple third sample images can achieve better training results.
[0136] In some embodiments, sample images from different domains are stored in different image queues, and the server can determine the domain to which a sample image belongs by looking up the image queue that holds it. Correspondingly, after the server extracts features from the multiple sample images using the image classification model, it can also store the image features of sample images from different domains in different feature queues for subsequent retrieval. For example, if the image classification model includes a first image classification sub-model and a second image classification sub-model, the server stores the domain features output by the first image classification sub-model in feature queue $D_{i1}$ and the domain features output by the second image classification sub-model in feature queue $D_{i2}$. Here, $i$ represents the domain identifier, and 1 and 2 distinguish the first image classification sub-model from the second image classification sub-model. In this case, the server can directly retrieve the first, second, and third sample images from the feature queues corresponding to different domains, which is highly efficient.
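One way to organize such per-domain feature queues is sketched below. The class name `DomainFeatureQueues`, its methods, and the fixed maximum queue length are hypothetical choices for illustration; the application only states that features from different domains and different sub-models are kept in different queues.

```python
from collections import defaultdict, deque


class DomainFeatureQueues:
    """Feature queues keyed by (domain identifier, sub-model index)."""

    def __init__(self, max_len=1024):
        # Each queue drops its oldest feature once max_len is reached (assumed policy).
        self._queues = defaultdict(lambda: deque(maxlen=max_len))

    def push(self, domain_id, sub_model, feature):
        """Store a domain feature output by sub-model 1 or 2 for the given domain."""
        self._queues[(domain_id, sub_model)].append(feature)

    def same_domain(self, domain_id, sub_model):
        """Features usable as second sample images (same domain)."""
        return list(self._queues.get((domain_id, sub_model), ()))

    def other_domains(self, domain_id, sub_model):
        """Features usable as third sample images (different domains)."""
        return [
            f
            for (d, s), q in self._queues.items()
            if s == sub_model and d != domain_id
            for f in q
        ]


queues = DomainFeatureQueues(max_len=2)
for value in (1.0, 2.0, 3.0):
    queues.push("A", 2, [value])
queues.push("B", 2, [9.0])
queues.push("B", 1, [7.0])
```

Keeping the domain identifier in the queue key means that same-domain and different-domain features can be retrieved without inspecting the features themselves.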
[0137] For example, the server obtains the first training parameters and the second training parameters using the following formula (3), and determines the loss function of the first image classification sub-model based on the first training parameters and the second training parameters.
[0138]
[0139] Here, formula (3) expresses the loss function of the first image classification sub-model in terms of the loss function for each first sample image $i$; $I$ is the number of first sample images; $P(i)$ is the domain identifier; $q_i$ is the domain feature of the first sample image; $k_p$ is the domain feature of the second sample image; $k_a$ is the domain feature of the third sample image; $A(i)$ is the set of third sample images; and $\tau$ is a hyperparameter set by technicians according to the actual situation. $\exp(q_i \cdot k_p / \tau)$ is the first training parameter and, correspondingly, $\exp(q_i \cdot k_a / \tau)$ is the second training parameter.
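The following sketch shows how the first and second training parameters could enter a contrastive loss for a single first sample image. It assumes an InfoNCE-style form, in which the first training parameter is divided by the sum of itself and all second training parameters; that form, the function name `domain_contrastive_loss`, and the use of a single second sample image are assumptions for illustration and are not a restatement of formula (3).

```python
import math


def dot(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def domain_contrastive_loss(q_i, k_p, k_as, tau=0.07):
    """Contrastive loss for one first sample image (assumed InfoNCE-style form).

    q_i: domain feature of the first sample image.
    k_p: domain feature of a second sample image (same domain).
    k_as: domain features of the third sample images (different domains).
    """
    first_param = math.exp(dot(q_i, k_p) / tau)
    second_params = [math.exp(dot(q_i, k_a) / tau) for k_a in k_as]
    return -math.log(first_param / (first_param + sum(second_params)))


q = [1.0, 0.0]
aligned = domain_contrastive_loss(q, [1.0, 0.0], [[-1.0, 0.0]], tau=1.0)
misaligned = domain_contrastive_loss(q, [-1.0, 0.0], [[1.0, 0.0]], tau=1.0)
```

Under this assumed form, the loss is small when the first sample image is similar to the second sample image and dissimilar to the third sample images, which matches the stated training objective of pulling same-domain features together and pushing different-domain features apart.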
[0140] In some embodiments, the process of training the first image classification sub-model based on the first sample image, the second sample image, and the third sample image is implemented by a first contrastive learning unit. Since the training process is based on the difference information between domain features, this first contrastive learning unit is also called a Domain Compare Learning (DCL) unit.
[0141] The second part involves the server training the image classification model based on the second difference information between the classification task features of every two sample images in multiple sample image groups.
[0142] In one possible implementation, the image classification model includes a first image classification sub-model and a second image classification sub-model. For any sample image group among the plurality of sample image groups, the server trains the second image classification sub-model based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample image, the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image, and the gradient of the loss function of the first image classification sub-model. The similar sample image and the target sample image are sample images of the same type, and the relative sample image and the target sample image are sample images of different types. The target sample image, the similar sample image, and the relative sample image all belong to the plurality of sample images.
[0143] For example, the server determines the gradient of the loss function of the second image classification sub-model based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample image, and the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image. The server trains the second image classification sub-model based on the gradients of the loss function of the first image classification sub-model and the second image classification sub-model. The third training parameter represents the similarity between the target sample image and the similar sample image, i.e., the second difference information between the classification task features of the target sample image and the similar sample image; the fourth training parameter represents the similarity between the target sample image and the relative sample image, i.e., the second difference information between the classification task features of the target sample image and the relative sample image. For example, Figure 5 shows a framework for contrastive learning based on the second difference information. In Figure 5, contrastive learning is performed by determining the similarity between the classification task features of the target sample image and the classification task features of the similar sample image, as well as the similarity between the classification task features of the target sample image and the classification task features of the relative sample image.
[0144] The purpose of training the first image classification sub-model based on the second difference information between classification task features is to make the classification task features of sample images of the same type as close as possible, and the classification task features of sample images of different types as far apart as possible. In other words, it aims to make the classification task features of the target sample image as close as possible to the classification task features of the similar sample images, and as far as possible from the classification task features of the relative sample images, thereby improving the image classification model's ability to classify images. Referring to Figure 6, for a target sample image among multiple sample images, the training aims to minimize the distance between the domain feature 601 of the target sample image and the domain feature 602 of multiple similar sample images, and maximize the distance between the domain feature 601 of the target sample image and the domain feature 603 of multiple relative sample images.
[0145] In some embodiments, the server uses the momentum update method to train the second image classification sub-model. The momentum update method is the process of training the second image classification sub-model by combining the gradient of the loss function of the first image classification sub-model and the gradient of the loss function of the second image classification sub-model.
[0146] The following describes a method for determining the gradient of the loss function of the second image classification sub-model based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample image, and the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image.
[0147] In one possible implementation, the server determines a third training parameter based on a second difference information between the first classification task features of the target sample image and the second classification task features of the similar sample image. The first classification task features of the target sample image are the classification task features extracted by the first image classification sub-model, and the second classification task features of the similar sample image are the classification task features extracted by the second image classification sub-model. The server then determines a fourth training parameter based on the second difference information between the first classification task features of the target sample image and the third classification task features of the relative sample image. The third classification task features of the relative sample image are the classification task features extracted by the second image classification sub-model. Finally, the server determines the gradient of the loss function of the second image classification sub-model based on the third and fourth training parameters.
[0148] For example, the server obtains the third and fourth training parameters using the following formula (4), and determines the loss function of the second image classification sub-model based on the third and fourth training parameters.
[0149]
[0150] In the formula above, the two loss symbols denote, respectively, the loss function of the second image classification sub-model and the loss function for the target sample image i. I is the number of target sample images, M(i) is the type identifier, $q_i$ is the classification task feature of the target sample image, $k_c$ is the classification task feature of the similar sample image, $k_b$ is the classification task feature of the relative sample image, B(i) is the set of relative sample images, and $\tau$ is a hyperparameter set by technicians according to the actual situation. $\exp(q_i \cdot k_c / \tau)$ is the third training parameter, and the corresponding term formed from $q_i$ and $k_b$ is the fourth training parameter.
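As a rough illustration of how the third and fourth training parameters could enter a loss, the sketch below assumes a common InfoNCE-style contrastive form. Formula (4) is not reproduced in this text, so the exact expression, including the choice to sum the fourth training parameter over B(i), is an assumption. The function name `task_contrastive_loss`, the default temperature, and the toy feature vectors are hypothetical.

```python
import numpy as np

def task_contrastive_loss(q, k_c, k_b, tau=0.07):
    """Loss for one target sample image (assumed InfoNCE-style form).

    q   : (d,)   classification task feature of the target sample image
    k_c : (d,)   classification task feature of a similar sample image
    k_b : (n, d) classification task features of the relative sample images
    tau : temperature hyperparameter
    """
    pos = np.exp(np.dot(q, k_c) / tau)   # third training parameter
    neg = np.exp(k_b @ q / tau).sum()    # fourth training parameter (assumed: summed over B(i))
    return -np.log(pos / (pos + neg))

q = np.array([1.0, 0.0])
# Similar sample aligned with the target, relative samples pointing elsewhere.
near = task_contrastive_loss(q, np.array([1.0, 0.0]),
                             np.array([[0.0, 1.0], [-1.0, 0.0]]))
# Similar sample far from the target, one relative sample aligned with it.
far = task_contrastive_loss(q, np.array([0.0, 1.0]),
                            np.array([[1.0, 0.0], [-1.0, 0.0]]))
```

Under this assumed form, the loss is small when the target feature is close to the similar sample and far from the relative samples, which matches the stated training goal.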
[0151] In some embodiments, the server uses a momentum update method to train the second image classification sub-model. The momentum update formula is shown in formula (5) below.
[0152] $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$ (5)
[0153] Where m is the momentum update coefficient, $m \in (0, 1]$, $\theta_q$ is the gradient of the first image classification sub-model, and $\theta_k$ is the gradient of the second image classification sub-model.
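The update in formula (5) can be sketched in a few lines. Here the quantities $\theta_q$ and $\theta_k$ are represented as NumPy arrays; the function name `momentum_update` and the example values are hypothetical.

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.9):
    """Formula (5): theta_k <- m * theta_k + (1 - m) * theta_q."""
    return m * theta_k + (1.0 - m) * theta_q

theta_q = np.array([1.0, 2.0])   # quantity from the first sub-model
theta_k = np.array([0.0, 0.0])   # quantity from the second sub-model
updated = momentum_update(theta_k, theta_q, m=0.9)
frozen = momentum_update(theta_k, theta_q, m=1.0)
```

With m close to 1 the second sub-model changes slowly, and at m = 1 it is not updated at all.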
[0154] In some embodiments, the process of training the second image classification sub-model based on the target sample image, similar sample images, and relative sample images is implemented by a second contrastive learning unit. Since the training process is based on the difference information between features of the classification task, this second contrastive learning unit is also called a Task Compare Learning (TCL) unit. In some embodiments, the first contrastive learning unit and the second contrastive learning unit are collectively referred to as a Domain-Task Compare Learning (DTCL) unit. In this way, a multi-example loss function designed to improve generalization is implemented, thus ensuring that the influence of the overall sample image is considered when determining the loss for each sample image. The resulting gradient direction is more global and avoids conflicting situations, thereby constructing the feature space in a globally favorable direction, making the feature space more compact and generalizable.
[0155] In one possible implementation, the server trains the image classification model based on first difference information between the domain features of every four positive sample images in the plurality of sample images, where the positive sample images are sample images of the target type. The server then trains the image classification model based on second difference information between the classification task features of every four sample images in the plurality of sample images.
[0156] In this model, every four positive sample images in the multiple samples include two positive sample images belonging to the first domain and two positive sample images belonging to the second domain, which are different domains. During training, the image classification model is trained based on the first difference information between the domain features of the two positive sample images belonging to the first domain and the two positive sample images belonging to the second domain. The goal of the training is to minimize the first difference information between the domain features of the two positive sample images in the first domain and the two positive sample images in the second domain.
[0157] In some embodiments, before performing step 304 above, the server may further process the classification task features of the sample images to enhance the expressive power of the classification task features.
[0158] In one possible implementation, the server encodes the classification task features of the multiple sample images based on an attention mechanism to obtain the target classification task features of the multiple sample images. After obtaining the target classification task features of the multiple sample images, the server trains the image classification model based on the second difference information between the target classification task features of every three sample images. The training method belongs to the same inventive concept as described in the second part above, and the implementation process will not be repeated.
[0159] In this implementation, the server can further process the classification task features of the sample images to obtain more expressive target classification task features. Training the image classification model based on these target classification task features can achieve better training results.
[0160] For example, the server encodes the classification task features of the multiple sample images based on the attention mechanism using the following formula (6) to obtain the target classification task features of the multiple sample images.
[0161]
[0162] Where F is the self-attention function, which is applied to the classification task features of the sample image numbered t to obtain the target classification task features of the sample image numbered t.
[0163] Figure 7 is a schematic diagram of the decoupling unit. Referring to Figure 7, in the decoupling unit 700, the image features 702 are processed by the feature compression subunit 701 to obtain a first weight a and a second weight 1-a. Based on the first weight a and the image features 702, the decoupling unit 700 obtains a domain feature 703; based on the second weight 1-a and the image features 702, it obtains a classification task feature 704. The classification task feature 704 is then passed through the self-attention subunit 705 to obtain a target classification task feature 706.
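The decoupling described for Figure 7 can be sketched as follows. This is a minimal illustration under several assumptions not fixed by the text: the weight is computed per channel, a sigmoid serves as the normalization, the target value for the sum of the two weights is 1, and the self-attention has no learned projections. All function names, shapes, and parameters (`w`, `b`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_weight(feat, w, b):
    """Compress a (C, H, W) feature map into a per-channel weight a in (0, 1)."""
    pooled = feat.mean(axis=(1, 2))        # global average pooling -> (C,)
    hidden = w @ pooled + b + pooled       # fully connected layer plus residual connection
    return 1.0 / (1.0 + np.exp(-hidden))   # sigmoid normalization (assumed)

def decouple(feat, a):
    """Split the features using the first weight a and the second weight 1 - a."""
    domain = a[:, None, None] * feat
    task = (1.0 - a)[:, None, None] * feat
    return domain, task

def self_attention(task):
    """Parameter-free scaled dot-product self-attention over spatial positions."""
    c, h, w = task.shape
    tokens = task.reshape(c, h * w).T      # one token per spatial position: (H*W, C)
    scores = tokens @ tokens.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn @ tokens).T.reshape(c, h, w)

feat = rng.normal(size=(4, 3, 3))          # toy image features
w, b = rng.normal(size=(4, 4)) * 0.1, np.zeros(4)
a = first_weight(feat, w, b)
domain, task = decouple(feat, a)
target_task = self_attention(task)
```

Because the two weights sum to 1 in this sketch, the domain features and classification task features add back up to the original image features.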
[0164] Through steps 301-304 above, the server trains the image classification model through contrastive learning. After training, the image classification model extracts more accurate image features, which improves the accuracy of image classification based on those features.
[0165] In some embodiments, when classifying images using the image classification model, a classifier can be connected after the second image classification sub-model. That is, during use, only the second image classification sub-model is used for prediction, and the first image classification sub-model is not used for prediction.
[0166] In some embodiments, in addition to training the image classification model using the contrastive learning method described in step 304 above, the server can also train the image classification model in a supervised manner.
[0167] In one possible implementation, the server trains the image classification model based on the classification task features of multiple sample images and the target labels of those multiple sample images.
[0168] The target label is a label with which technicians have annotated the sample image, and it indicates the actual type of the sample image.
[0169] In this implementation, the image classification model can be fine-tuned using the classification task features and the target labels, thereby further improving the image classification model's ability to classify images.
[0170] For example, the server uses the image classification model to predict the classification task features of the multiple sample images and outputs the predicted labels for those sample images. The server then trains the image classification model based on the third difference information between the predicted labels and the target labels of the multiple sample images.
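One way to quantify the third difference information is a cross-entropy between predicted class scores and target labels. The sketch below assumes that choice, which this paragraph does not specify; the function name `cross_entropy` and the toy scores are hypothetical.

```python
import numpy as np

def cross_entropy(logits, target_labels):
    """Mean cross-entropy between predicted class scores and target labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_labels)), target_labels].mean()

logits = np.array([[2.0, 0.0], [0.0, 2.0]])         # predicted scores for two sample images
good = cross_entropy(logits, np.array([0, 1]))      # predictions agree with target labels
bad = cross_entropy(logits, np.array([1, 0]))       # predictions disagree with target labels
```

The value is lower when the predicted labels agree with the target labels, so minimizing it pulls the predictions toward the annotations.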
[0171] The technical solution provided in the embodiments of this application is explained below with reference to Figure 8 and steps 301-304 above.
[0172] Referring to Figure 8, the image classification model includes a first image classification sub-model and a second image classification sub-model. The server extracts features from the multiple sample images using the first feature extraction unit 801 of the first image classification sub-model, obtaining first image features 803 for the multiple sample images. The server extracts features from the multiple sample images using the second feature extraction unit 802 of the second image classification sub-model, obtaining second image features 804 for the multiple sample images. The server inputs the first image features 803 of the multiple sample images into a first feature decoupling sub-unit 805, which outputs first domain features 8051 and first classification task features 8052 for the multiple sample images. The server inputs the second image features 804 of the multiple sample images into a second feature decoupling sub-unit 806, which outputs second domain features 8061 and second classification task features 8062 for the multiple sample images. The server trains the image classification model based on the first difference information between the first domain features 8051 and the second domain features 8061 of the multiple sample images, and the second difference information between the first classification task features 8052 and the second classification task features 8062. During training, the image classification model is trained based on cross-entropy through the Domain Compare Learning (DCL) unit and the Task Compare Learning (TCL) unit.
[0173] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.
[0174] The technical solution provided in this application encodes image features of sample images based on an attention mechanism during image classification model training, thereby obtaining first weights. These first weights are then used to process the image features, yielding domain features and classification task features of the sample images. When training the image classification model based on the first difference information between domain features and the second difference information between classification task features, it does not rely on labels, thus reducing the cost of training the image classification model and improving training efficiency.
[0175] In training the image classification model, the idea of contrastive learning in unsupervised training was utilized, and a contrastive-based supervised loss was designed to improve the network convergence speed. Less manually labeled data is required to achieve the same image classification ability.
[0176] In the context of liveness detection, firstly, domain-related information and task-related information are decoupled at the feature space level, removing domain information from image features that affects model generalization. Secondly, based on the idea of contrastive learning, contrast queues between domains and between tasks are designed for the decoupled dual spaces, further supervising the network to learn features with discriminative and transferable characteristics in the liveness task. The technical solution provided in this application improves the recognition accuracy of liveness detection by designing contrast queues on the basis of the traditional binary classification supervised deep learning framework for liveness detection.
[0177] Figure 9 is a flowchart of an image classification method provided in an embodiment of this application. Referring to Figure 9, and taking the server as the executing entity as an example, the method includes:
[0178] 901. The server inputs the target image into the image classification model, extracts features from the target image through the image classification model, and outputs the image features of the target image.
[0179] The image classification model is the one trained through steps 301-304 described above. The method for extracting features from the target image using this image classification model to obtain the image features is based on the same inventive concept as step 301 described above, and the implementation process is described in step 301 above, and will not be repeated here.
[0180] 902. The server processes the image features of the target image using the image classification model to obtain the classification task features of the target image, which are used to represent the type of the target image.
[0181] In one possible implementation, the server inputs the image features of the target image into an image classification model. The model performs pooling, fully connected layers, residual connections, and normalization on the image features to obtain a first weight for the target image. The server then multiplies the first weight by the image features of the target image using the image classification model to obtain the domain features of the target image. Finally, the server multiplies the second weight by the image features of the target image to obtain the classification task features of the target image. The sum of the second weight and the first weight is the target value.
[0182] The process described in the above embodiments belongs to the same inventive concept as steps 302 and 303 above. The implementation process is described in the relevant descriptions of steps 302 and 303 above, and will not be repeated here.
[0183] 903. The server uses the image classification model to predict the target image based on its classification task features and outputs the label of the target image.
[0184] The label of the target image is used to indicate the type of the target image.
[0185] In some embodiments, after the image classification model outputs the label of the target image, it can also perform the following steps based on the type of the target image.
[0186] In one possible approach, if the label of the target image indicates that the target image is a target type, the server identifies the target image. If the label of the target image indicates that the target image is not a target type, the server does not identify the target image.
[0187] In scenarios that distinguish real images from non-real images, the target type is the real image. Therefore, if the target image is a real image, the server performs object recognition to determine the identity of the objects within it. If the target image is a non-real image, the server does not perform object recognition. For example, during liveness detection, if the server determines through the image classification model that the target image depicts a live person (e.g., a photograph of a real person), then object recognition is performed to identify the objects within it. If the server determines through the image classification model that the target image does not depict a live person (e.g., a hacked image or a photographic image), then object recognition is not performed.
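The gating described above can be sketched as a simple conditional. The function name `recognize_if_target_type`, the label strings, and the stand-in recognizer are hypothetical and serve only to illustrate the control flow.

```python
def recognize_if_target_type(label, target_type, recognize):
    """Run object recognition only when the predicted label is the target type."""
    if label == target_type:
        return recognize()
    return None   # non-target images are not passed to recognition

# Hypothetical labels and a stand-in recognizer returning an identity string.
live_result = recognize_if_target_type("live", "live", lambda: "identity-001")
attack_result = recognize_if_target_type("non-live", "live", lambda: "identity-001")
```

In a liveness detection setting, this keeps the recognition step from running on images that the image classification model has labeled as non-live.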
[0188] In some embodiments, the above-mentioned liveness detection can also be applied to facial recognition scenarios in smart access control systems, real-name authentication processes based on facial recognition, account unblocking processes based on facial recognition, and other liveness detection scenarios.
[0189] The technical solution provided in this application provides an image classification model that classifies a target image. During the classification process, image features of the target image are extracted and processed to obtain the classification task features of the target image. Classifying the target image using these classification task features achieves a relatively accurate classification result.
[0190] Figure 10 is a schematic diagram of the structure of a training device for an image classification model provided in an embodiment of this application. Referring to Figure 10, the device includes: a weight acquisition module 1001, a first feature processing module 1002, and a training module 1003.
[0191] The weight acquisition module 1001 is used to input the image features of multiple sample images into the image classification model. Through the image classification model, the image features of the multiple sample images are encoded based on the attention mechanism to obtain the first weight of the multiple sample images.
[0192] The first feature processing module 1002 is used to process the image features of the multiple sample images based on the first weight of the multiple sample images through the image classification model, so as to obtain the domain features and classification task features of the multiple sample images. The domain features are used to represent the environment in which the sample images are captured, and the classification task features are used to represent the type of the sample images.
[0193] Training module 1003 is used to train the image classification model based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images.
[0194] In one possible implementation, the first feature processing module 1002 is used to, for any sample image among the plurality of sample images, multiply the first weight of the sample image by the image features of the sample image using the image classification model to obtain the domain features of the sample image. Then, it multiplies the second weight of the sample image by the image features of the sample image to obtain the classification task features of the sample image, where the sum of the second weight and the first weight is the target value.
[0195] In one possible implementation, the training module 1003 is used to train the image classification model based on the first difference information between the domain features of every two positive sample images in a plurality of positive sample image groups, each of the positive sample image groups including three positive sample images, the positive sample images being sample images of the target type among the plurality of sample images.
[0196] The image classification model is trained based on the second difference information between the classification task features of every two sample images in multiple sample image groups, where each sample image group includes three sample images from the multiple sample images.
[0197] In one possible implementation, the image classification model includes a first image classification sub-model and a second image classification sub-model. The training module 1003 is used to train the first image classification sub-model for any positive sample image group among the plurality of positive sample image groups, based on the first difference information between the domain features of the first sample image and the domain features of the second sample image, and the first difference information between the domain features of the first sample image and the domain features of the third sample image. The first sample image, the second sample image, and the third sample image all belong to the positive sample image group.
[0198] For any of the multiple sample image groups, the second image classification sub-model is trained based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample images, the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image, and the gradient of the loss function of the first image classification sub-model. The similar sample image and the target sample image are sample images of the same type, and the relative sample image and the target sample image are sample images of different types. The target sample image, the similar sample image, and the relative sample image all belong to the multiple sample images.
[0199] In one possible implementation, the training module 1003 is used to determine first training parameters based on first difference information between first domain features of the first sample image and second domain features of the second sample image, wherein the first domain features of the first sample image are domain features extracted by the first image classification sub-model, and the second domain features of the second sample image are domain features extracted by the second image classification sub-model. Based on the first difference information between the first domain features of the first sample image and third domain features of the third sample image, the module determines second training parameters, wherein the third domain features of the third sample image are domain features extracted by the second image classification sub-model. Based on the first and second training parameters, the module determines the gradient of the loss function of the first image classification sub-model. Based on the gradient of the loss function of the first image classification sub-model, the module trains the first image classification sub-model.
[0200] In one possible implementation, the training module 1003 is used to determine the gradient of the loss function of the second image classification sub-model based on second difference information between the classification task features of the target sample image and the classification task features of the similar sample image, and second difference information between the classification task features of the target sample image and the classification task features of the relative sample image. The second image classification sub-model is then trained based on the gradients of the loss function of the first image classification sub-model and the second image classification sub-model.
[0201] In one possible implementation, the training module 1003 is used to determine a third training parameter based on the second difference information between a first classification task feature of the target sample image and a second classification task feature of the similar sample image, wherein the first classification task feature of the target sample image is a classification task feature extracted by the first image classification sub-model, and the second classification task feature of the similar sample image is a classification task feature extracted by the second image classification sub-model. Based on the second difference information between the first classification task feature of the target sample image and a third classification task feature of the relative sample image, a fourth training parameter is determined, wherein the third classification task feature of the relative sample image is a classification task feature extracted by the second image classification sub-model. Based on the third and fourth training parameters, the gradient of the loss function of the second image classification sub-model is determined.
[0202] In one possible implementation, the device further includes:
[0203] The encoding module is used to encode the classification task features of the multiple sample images based on the attention mechanism, so as to obtain the target classification task features of the multiple sample images.
[0204] The training module 1003 is used to train the image classification model based on the first difference information between the domain features of the multiple sample images and the second difference information between the target classification task features of the multiple sample images.
[0205] In one possible implementation, the training module 1003 is also used to train the image classification model based on the classification task features of multiple sample images and the target labels of the multiple sample images.
[0206] In one possible implementation, the training module 1003 is further configured to use the image classification model to predict the classification task features of the plurality of sample images and output the predicted labels of the plurality of sample images. The image classification model is then trained based on third difference information between the predicted labels and the target labels of the plurality of sample images.
[0207] It should be noted that the image classification model training device provided in the above embodiments is only illustrated by the division of the above functional modules when training the image classification model. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the computer device can be divided into different functional modules to complete all or part of the functions described above. In addition, the image classification model training device and the image classification model training method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.
[0208] The technical solution provided in this application encodes image features of sample images based on an attention mechanism during image classification model training, thereby obtaining first weights. These first weights are then used to process the image features, yielding domain features and classification task features of the sample images. When training the image classification model based on the first difference information between domain features and the second difference information between classification task features, it does not rely on labels, thus reducing the cost of training the image classification model and improving training efficiency.
[0209] Figure 11 is a schematic diagram of the structure of an image classification device provided in an embodiment of this application. Referring to Figure 11, the device includes: a feature extraction module 1101, a second feature processing module 1102, and a prediction module 1103.
[0210] Feature extraction module 1101 is used to input the target image into the image classification model, extract features from the target image through the image classification model, and output the image features of the target image;
[0211] The second feature processing module 1102 is used to process the image features of the target image through the image classification model to obtain the classification task features of the target image, wherein the classification task features are used to represent the type of the target image;
[0212] The prediction module 1103 is used to predict the target image based on the classification task features of the target image using the image classification model, and output the label of the target image.
[0213] It should be noted that the image classification device provided in the above embodiments is only illustrated by the division of the above functional modules when performing image classification. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the computer device can be divided into different functional modules to complete all or part of the functions described above. In addition, the image classification device and the image classification method embodiments provided above belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.
[0214] The technical solution provided in this application provides an image classification model that classifies a target image. During the classification process, image features of the target image are extracted and processed to obtain the classification task features of the target image. Classifying the target image using these classification task features achieves a relatively accurate classification result.
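The three stages performed by modules 1101 to 1103 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the model of this application: the linear feature extractor, the fixed per-channel second weight, the linear classifier, and every name and parameter value below are hypothetical.

```python
def extract_features(image, w_extract):
    """Feature extraction (module 1101): an assumed linear map from a flat
    list of pixel values to image features."""
    return [sum(w * p for w, p in zip(row, image)) for row in w_extract]


def classification_task_features(features, second_weight):
    """Second feature processing (module 1102): weight the image features
    channel by channel to keep the type-related part."""
    return [w * x for w, x in zip(second_weight, features)]


def predict_label(task_feats, w_cls, labels):
    """Prediction (module 1103): score each label with an assumed linear
    classifier and return the highest-scoring label."""
    scores = [sum(w * x for w, x in zip(row, task_feats)) for row in w_cls]
    return labels[scores.index(max(scores))]


def classify(image, w_extract, second_weight, w_cls, labels):
    feats = extract_features(image, w_extract)
    task_feats = classification_task_features(feats, second_weight)
    return predict_label(task_feats, w_cls, labels)


# Hypothetical parameters, chosen only so the example can be followed by hand.
image = [1.0, 0.0, 2.0, 1.0]
w_extract = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
second_weight = [0.5, 1.0]
w_cls = [[1.0, 0.0], [0.0, 1.0]]
label = classify(image, w_extract, second_weight, w_cls, ["cat", "dog"])
```

Here the image features are `[1.0, 2.0]`, the classification task features are `[0.5, 2.0]`, and the second label scores highest, so `label` is `"dog"`.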
[0215] This application provides a computer device for performing the above-described method. This computer device can be implemented as a terminal or a server. The structure of the terminal will be described below:
[0216] Figure 12 is a schematic diagram of the structure of a terminal provided in an embodiment of this application. The terminal 1200 can be a smartphone, tablet computer, laptop computer, or desktop computer. The terminal 1200 may also be referred to as user equipment, portable terminal, laptop terminal, desktop terminal, or other names.
[0217] Typically, terminal 1200 includes one or more processors 1201 and one or more memories 1202.
[0218] The processor 1201 may include one or more processing cores, such as a quad-core processor or an octa-core processor. The processor 1201 may be implemented using at least one hardware form selected from DSP (Digital Signal Processor), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 1201 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, the processor 1201 may also include an AI (Artificial Intelligence) processor, which handles computational operations related to machine learning.
[0219] The memory 1202 may include one or more computer-readable storage media, which may be non-transitory. The memory 1202 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage media in the memory 1202 are used to store at least one computer program, which is executed by the processor 1201 to implement the image classification model training method or image classification method provided in the method embodiments of this application.
[0220] In some embodiments, the terminal 1200 may also optionally include a peripheral device interface 1203 and at least one peripheral device. The processor 1201, memory 1202, and peripheral device interface 1203 can be connected via a bus or signal line. Each peripheral device can be connected to the peripheral device interface 1203 via a bus, signal line, or circuit board. Specifically, the peripheral device includes at least one of the following: radio frequency circuitry 1204, display screen 1205, camera assembly 1206, audio circuitry 1207, and power supply 1208.
[0221] Peripheral device interface 1203 can be used to connect at least one I / O (Input / Output) related peripheral device to processor 1201 and memory 1202. In some embodiments, processor 1201, memory 1202 and peripheral device interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 1201, memory 1202 and peripheral device interface 1203 can be implemented on separate chips or circuit boards, which is not limited in this embodiment.
[0222] The radio frequency (RF) circuit 1204 is used to receive and transmit RF signals, also known as electromagnetic signals. The RF circuit 1204 communicates with communication networks and other communication devices via electromagnetic signals. The RF circuit 1204 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. Optionally, the RF circuit 1204 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module (SIM) card, etc.
[0223] Display screen 1205 is used to display a user interface (UI). This UI may include graphics, text, icons, video, and any combination thereof. When display screen 1205 is a touch display screen, it also has the ability to collect touch signals on or above its surface. These touch signals can be input as control signals to processor 1201 for processing. In this case, display screen 1205 can also be used to provide virtual buttons and / or a virtual keyboard, also known as soft buttons and / or a soft keyboard.
[0224] The camera assembly 1206 is used to capture images or videos. Optionally, the camera assembly 1206 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal.
[0225] The audio circuit 1207 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals that are input to the processor 1201 for processing, or input to the radio frequency circuit 1204 to realize voice communication.
[0226] The power supply 1208 is used to supply power to the various components in the terminal 1200. The power supply 1208 can be AC power, DC power, a disposable battery, or a rechargeable battery.
[0227] In some embodiments, the terminal 1200 further includes one or more sensors 1209. The one or more sensors 1209 include, but are not limited to: an acceleration sensor 1210, a gyroscope sensor 1211, a pressure sensor 1212, an optical sensor 1213, and a proximity sensor 1214.
[0228] The acceleration sensor 1210 can detect the magnitude of acceleration along the three coordinate axes of a coordinate system established with respect to the terminal 1200.
[0229] The gyroscope sensor 1211 can detect the orientation and rotation angle of the terminal 1200. The gyroscope sensor 1211 can work in conjunction with the acceleration sensor 1210 to collect the user's 3D movements applied to the terminal 1200.
[0230] The pressure sensor 1212 can be installed on the side bezel of the terminal 1200 and / or on the lower layer of the display screen 1205. When the pressure sensor 1212 is installed on the side bezel of the terminal 1200, it can detect the user's grip signal on the terminal 1200, and the processor 1201 can perform left / right hand recognition or quick operations based on the grip signal collected by the pressure sensor 1212. When the pressure sensor 1212 is installed on the lower layer of the display screen 1205, the processor 1201 can control the operable controls on the UI based on the user's pressure operation on the display screen 1205.
[0231] The optical sensor 1213 is used to collect ambient light intensity. In one embodiment, the processor 1201 can control the display brightness of the display screen 1205 based on the ambient light intensity collected by the optical sensor 1213.
[0232] The proximity sensor 1214 is used to detect the distance between the user and the front of the terminal 1200.
[0233] Those skilled in the art will understand that the structure shown in Figure 12 does not constitute a limitation on the terminal 1200, which may include more or fewer components than shown, combine certain components, or use a different component arrangement.
[0234] The aforementioned computer equipment can also be implemented as a server. The structure of a server is described below:
[0235] Figure 13 is a schematic diagram of the structure of a server provided in an embodiment of this application. The server 1300 can vary significantly depending on its configuration or performance. It may include one or more processors (Central Processing Units, CPUs) 1301 and one or more memories 1302. The one or more memories 1302 store at least one computer program, which is loaded and executed by the one or more processors 1301 to implement the methods provided in the above-described method embodiments. Of course, the server 1300 may also have wired or wireless network interfaces, a keyboard, and input / output interfaces for input and output. The server 1300 may also include other components for implementing device functions, which will not be elaborated upon here.
[0236] In an exemplary embodiment, a computer-readable storage medium is also provided, which stores at least one computer program. This computer program is loaded and executed by a processor to implement the training method or image classification method of the image classification model described above. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), magnetic tape, floppy disk, or optical data storage device, etc.
[0237] In an exemplary embodiment, a computer program product is also provided, including a computer program that, when executed by a processor, implements the training method or image classification method of the above-described image classification model.
[0238] In some embodiments, the computer program involved in the present application embodiments may be deployed and executed on a computer device, or executed on multiple computer devices located in one location, or executed on multiple computer devices distributed in multiple locations and interconnected through a communication network. Multiple computer devices distributed in multiple locations and interconnected through a communication network may constitute a blockchain system.
[0239] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.
[0240] The above are merely optional embodiments of this application and are not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A training method for an image classification model, characterized in that the method includes: inputting the image features of multiple sample images into an image classification model, and encoding, by the image classification model, the image features of the multiple sample images based on an attention mechanism to obtain the first weights of the multiple sample images; for any sample image among the multiple sample images, multiplying, by the image classification model, the first weight of the sample image with the image features of the sample image to obtain the domain features of the sample image, and multiplying the second weight of the sample image with the image features of the sample image to obtain the classification task features of the sample image, wherein the domain features are used to represent the environment in which the sample image was captured, the classification task features are used to represent the type of the sample image, and the sum of the second weight and the first weight is the target value; and training the image classification model based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images.
2. The method according to claim 1, characterized in that, The training of the image classification model based on the first difference information between each pair of domain features of the multiple sample images and the second difference information between each pair of classification task features of the multiple sample images includes: The image classification model is trained based on the first difference information between the domain features of every two positive sample images in multiple positive sample image groups. Each positive sample image group includes three positive sample images, and the positive sample images are sample images of the target type among the multiple sample images. The image classification model is trained based on the second difference information between the classification task features of every two sample images in multiple sample image groups, where each sample image group includes three sample images from the multiple sample images.
3. The method according to claim 2, characterized in that, The image classification model includes a first image classification sub-model and a second image classification sub-model. Training the image classification model based on the first difference information between the domain features of every two positive sample images in multiple positive sample image groups includes: For any positive sample image group among the plurality of positive sample image groups, the first image classification sub-model is trained based on the first difference information between the domain features of the first sample image and the domain features of the second sample image, and the first difference information between the domain features of the first sample image and the domain features of the third sample image. The first sample image, the second sample image and the third sample image all belong to the positive sample image group. The step of training the image classification model based on the second difference information between the classification task features of every two sample images in multiple sample image groups includes: For any of the plurality of sample image groups, the second image classification sub-model is trained based on the second difference information between the classification task features of the target sample image and the classification task features of similar sample images, the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image, and the gradient of the loss function of the first image classification sub-model. The similar sample images and the target sample image are sample images of the same type, and the relative sample images and the target sample image are sample images of different types. The target sample image, the similar sample image, and the relative sample image all belong to the plurality of sample images.
4. The method according to claim 3, characterized in that, The training of the first image classification sub-model based on the first difference information between the domain features of the first sample image and the domain features of the second sample image, and the first difference information between the domain features of the first sample image and the domain features of the third sample image, includes: Based on the first difference information between the first domain features of the first sample image and the second domain features of the second sample image, the first training parameters are determined, wherein the first domain features of the first sample image are the domain features extracted by the first image classification sub-model, and the second domain features of the second sample image are the domain features extracted by the second image classification sub-model. Based on the first difference information between the first domain features of the first sample image and the third domain features of the third sample image, the second training parameters are determined, wherein the third domain features of the third sample image are the domain features extracted by the second image classification sub-model. Based on the first training parameters and the second training parameters, the gradient of the loss function of the first image classification sub-model is determined; The first image classification sub-model is trained based on the gradient of the loss function of the first image classification sub-model.
5. The method according to claim 3, characterized in that, The training of the second image classification sub-model based on the second difference information between the classification task features of the target sample image and the classification task features of similar sample images, the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image, and the gradient of the loss function of the first image classification sub-model includes: Based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample image, and the second difference information between the classification task features of the target sample image and the classification task features of the relative sample image, the gradient of the loss function of the second image classification sub-model is determined. The second image classification sub-model is trained based on the gradient of the loss function of the first image classification sub-model and the gradient of the loss function of the second image classification sub-model.
6. The method according to claim 5, characterized in that, The determination of the gradient of the loss function of the second image classification sub-model based on the second difference information between the classification task features of the target sample image and the classification task features of the similar sample images, and the second difference information between the classification task features of the target sample image and the classification task features of the relative sample images, includes: Based on the second difference information between the first classification task features of the target sample image and the second classification task features of the similar sample images, a third training parameter is determined. The first classification task features of the target sample image are the classification task features extracted by the first image classification sub-model, and the second classification task features of the similar sample images are the classification task features extracted by the second image classification sub-model. Based on the second difference information between the first classification task features of the target sample image and the third classification task features of the relative sample image, a fourth training parameter is determined, wherein the third classification task features of the relative sample image are the classification task features extracted by the second image classification sub-model. Based on the third and fourth training parameters, the gradient of the loss function of the second image classification sub-model is determined.
7. The method according to claim 1, characterized in that, Before training the image classification model based on the first difference information between the domain features of the multiple sample images and the second difference information between the classification task features of the multiple sample images, the method further includes: The classification task features of the multiple sample images are encoded based on the attention mechanism to obtain the target classification task features of the multiple sample images. Training the image classification model based on the first difference information between the domain features of the multiple sample images and the second difference information between the classification task features of the multiple sample images includes: The image classification model is trained based on the first difference information between the domain features of the multiple sample images and the second difference information between the target classification task features of the multiple sample images.
8. The method according to claim 1, characterized in that the method further includes: training the image classification model based on the classification task features of the multiple sample images and the target labels of the multiple sample images.
9. The method according to claim 8, characterized in that, The training of the image classification model based on the classification task features of multiple sample images and the target labels of the multiple sample images includes: The image classification model makes predictions based on the classification task features of the multiple sample images and outputs the predicted labels of the multiple sample images. The image classification model is trained based on the third difference information between the predicted labels and the target labels of the multiple sample images.
10. An image classification method, characterized in that the method includes: inputting a target image into an image classification model, extracting features from the target image through the image classification model, and outputting the image features of the target image, wherein the image classification model is trained by the training method according to any one of claims 1 to 9; processing the image features of the target image through the image classification model to obtain the classification task features of the target image, wherein the classification task features are used to represent the type of the target image; and predicting, by the image classification model, based on the classification task features of the target image, and outputting the label of the target image.
11. A training device for an image classification model, characterized in that, The device includes: The weight acquisition module is used to input the image features of multiple sample images into the image classification model, and through the image classification model, encode the image features of the multiple sample images based on the attention mechanism to obtain the first weight of the multiple sample images; The first feature processing module is used to, for any sample image among the plurality of sample images, multiply the first weight of the sample image with the image features of the sample image through the image classification model to obtain the domain features of the sample image; and multiply the second weight of the sample image with the image features of the sample image to obtain the classification task features of the sample image. The domain features are used to represent the environment in which the sample image was captured, and the classification task features are used to represent the type of the sample image. The sum of the second weight and the first weight is the target value. The training module is used to train the image classification model based on the first difference information between each pair of domain features of the plurality of sample images and the second difference information between each pair of classification task features of the plurality of sample images.
12. An image classification device, characterized in that the device includes: a feature extraction module, used to input a target image into an image classification model, extract features from the target image through the image classification model, and output the image features of the target image, wherein the image classification model is trained by the training method according to any one of claims 1 to 9; a second feature processing module, used to process the image features of the target image through the image classification model to obtain the classification task features of the target image, wherein the classification task features are used to represent the type of the target image; and a prediction module, used to predict the label of the target image based on the classification task features of the target image through the image classification model.
13. A computer device, characterized in that the computer device includes one or more processors and one or more memories, wherein at least one computer program is stored in the one or more memories, and the computer program is loaded and executed by the one or more processors to implement the training method of the image classification model as described in any one of claims 1 to 10.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores at least one computer program, which is loaded and executed by a processor to implement the training method for the image classification model as described in any one of claims 1 to 10.
15. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the training method for the image classification model according to any one of claims 1 to 10.
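For readers of claims 2 to 6, the following sketch shows one possible way to turn difference information over groups of three sample images into training signals. The claims do not specify a distance measure or a loss form, so the Euclidean distance, the margin-based triplet form, the margin value, and the names `difference`, `task_triplet_loss` and `domain_pair_differences` are assumptions made purely for illustration.

```python
import math


def difference(a, b):
    """Assumed form of 'difference information': Euclidean distance
    between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def task_triplet_loss(target, similar, relative, margin=1.0):
    """Assumed margin loss over classification task features: pull the
    target toward the similar (same-type) sample and push it away from
    the relative (different-type) sample."""
    return max(0.0, difference(target, similar) - difference(target, relative) + margin)


def domain_pair_differences(first, second, third):
    """First difference information within a positive sample image group:
    the first sample image against the second and against the third."""
    return difference(first, second), difference(first, third)


# Hypothetical classification task features for one sample image group.
target, similar, relative = [0.0, 0.0], [0.0, 1.0], [0.0, 1.5]
loss = task_triplet_loss(target, similar, relative)  # 1.0 - 1.5 + 1.0 = 0.5
```

In this example the relative sample is only 0.5 farther from the target than the similar sample, which is less than the margin, so the loss is positive. Once the relative sample is at least one margin farther away, the loss drops to zero.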