Fine-grained image classification model training method based on category-level soft target supervision

By employing a category-level soft target supervision method, and utilizing the EMA model and cross-entropy loss to optimize the target model, the problems of category discrepancies and training errors in fine-grained image classification are solved, achieving high-accuracy fine-grained image classification.

CN116563602B · Active · Publication Date: 2026-06-12 · ZHEJIANG UNIV OF TECH

Cites: 0 · Cited by: 0

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG UNIV OF TECH
Filing Date: 2023-04-04
Publication Date: 2026-06-12

Application Information

Patent Timeline

04 Apr 2023: Application
12 Jun 2026: Publication

CN116563602B

IPC: G06V10/764; G06V10/74; G06V10/40; G06N3/098

CPC: G06V10/765; G06V10/761; G06V10/40; G06N3/098; Y02T10/40

AI Tagging

Application Domain

Internal combustion piston engines; Biological models

AI Technical Summary

⚠Technical Problem

Existing deep neural networks perform poorly in fine-grained image classification: differences between categories are subtle, variation within a category is large, and factors such as viewpoint, background, and occlusion interfere, making it difficult for models to distinguish fine-grained images effectively.

⚗Method used

A category-level soft target supervision method is adopted. The EMA model is initialized from the target model, and a similarity matrix is calculated to obtain the category-level soft labels. Model training is optimized by combining cross-entropy loss and KL divergence, and the EMA model is used to constrain the training direction of the target model and reduce training error.

🎯Benefits of technology

It improves the accuracy of fine-grained image classification, reduces the complexity of learning class relationships and the additional space requirement, and achieves efficient fine-grained image classification.

✦ Generated by Eureka AI based on patent content.

Abstract figure: CN116563602B_ABST

Abstract

The application relates to a fine-grained image classification model training method based on category-level soft target supervision, wherein a target model is pre-trained with labeled data; an EMA model is initialized with the parameters of the target model, a similarity matrix is calculated from the parameters of the fully connected layer in the EMA model, and category-level soft labels are obtained from the similarity matrix and associated with the images; the total training loss is constructed from the target model and the EMA model, the target model is updated, the EMA model is updated with the new target model, and new category-level soft labels are calculated with the new EMA model; this process is repeated while minimizing the total loss, thereby training the fine-grained image classification model. The application performs well on fine-grained image classification problems, retains the relationships between categories, does not require additional space to store a pre-trained model, needs neither a complex clustering process nor an additional pre-trained model to obtain soft labels, and achieves high accuracy.

Description

Technical Field

[0001] This invention relates to the technical field of computation, calculation, or counting, and particularly to a method for training a fine-grained image classification model based on category-level soft target supervision.

Background Technology

[0002] Image classification is a classic problem in computer vision that aims to assign images to different classes. In recent years, deep neural networks have achieved remarkable results in visual classification and have become the preferred modeling tool for many machine learning tasks in computer vision. In particular, large-scale neural networks trained with supervised learning generalize significantly better than traditional models on image classification tasks. Although deep neural networks have driven significant progress in image classification over the past few years, the categories in common image classification datasets remain relatively coarse. For example, the category "dog" can be further subdivided into Labrador Retrievers, Golden Retrievers, Border Collies, and so on, and some networks classify such images poorly. Coarse-grained classification increasingly fails to meet the needs of practical production and daily life, and fine-grained image classification is the research area that addresses this problem.

[0003] In recent years, fine-grained image classification has attracted broad research interest and application demand in both industry and academia. Compared with ordinary image classification, fine-grained classification deals with images whose appearance is much more similar. Because the categories are so fine, fine-grained image classification is extremely difficult, even for experts in some categories. There are three main reasons: 1. differences between subclasses are subtle, existing only in certain local regions; 2. variation within a subclass is large; 3. factors such as viewpoint, background, and occlusion have a strong influence. These difficulties make fine-grained image classification a highly challenging research task. In real life there is great demand for identifying different subclasses; in ecological conservation, for example, effectively identifying different species is a crucial prerequisite for ecological research. Achieving low-cost fine-grained image recognition with computer vision would therefore be of great significance to both academia and industry.

[0004] To address the above issues, the relationships between categories need to be modeled during training. However, because hard labels are one-hot, they are poorly suited to distinguishing fine-grained images. Soft labels were therefore proposed. A hard label assigns 1 to the true class and 0 to all others, whereas a soft label distributes probability across the categories, which encodes more correlation and information between categories and lets the model learn more. Label smoothing is one way of obtaining soft labels, but its effect is insufficient: it simply spreads a fixed amount of probability mass uniformly over all categories and does not reflect the relationships between labels, so its improvement to the model is limited and it can even risk underfitting. Another approach obtains soft labels from an additional pre-trained model, but this requires extra space to store that model, which wastes space, and its effectiveness is still insufficient.

Summary of the Invention

[0005] This invention addresses the problems existing in the prior art and provides a method for training fine-grained image classification models based on category-level soft target supervision.

[0006] The technical solution adopted in this invention is a fine-grained image classification model training method based on category-level soft target supervision. The method pre-trains a target model with labeled data; initializes an EMA model with the parameters of the target model; calculates a similarity matrix from the parameters of the fully connected layer in the EMA model; and obtains category-level soft labels $t_{Cla}$ from the similarity matrix, which are associated with the images;

[0007] An image is input, and the target model is updated using the total training loss constructed from the target model and the EMA model. The EMA model is then updated with the new target model, and new category-level soft labels are calculated with the new EMA model. This process is repeated while minimizing the total loss, thereby training the fine-grained image classification model.

[0008] In this invention, the target model is typically pre-trained for 40 epochs with labeled samples. During the subsequent training of the target model and the EMA model, the EMA model is updated from the parameters of the target model every epoch until epoch 150, and every 3 epochs after epoch 150. After each EMA update, the category-level soft labels $t_{Cla}$ are updated. By minimizing the total loss, the network parameters are trained to complete fine-grained image classification learning and achieve high fine-grained classification accuracy.

[0009] Preferably, the model is pre-trained with the cross-entropy loss, and the objective function is:

[0010] $L_{Hard} = -\sum_{c=1}^{C} y_c \log p_c$

[0011] $p_c = \frac{\exp(g_c / T)}{\sum_{j=1}^{C} \exp(g_j / T)}$

[0012] $g_c = f(x_c; \varphi)$

[0013] where f denotes the target model, C is the total number of categories, c indexes the categories, $y_c$ is the label of sample $x_c$, $p_c$ is the normalized probability that the model predicts the c-th class, T is the temperature coefficient with 0 < T < 5, $g_c$ is the raw output of the target model f for sample $x_c$ before normalization, and φ denotes the parameters of the target model.
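A minimal pure-Python sketch of this pre-training objective follows, assuming the temperature enters as $p_c = \exp(g_c/T) / \sum_j \exp(g_j/T)$ and the label is one-hot. The function names are hypothetical; a real implementation would use a framework such as PyTorch, as paragraph [0044] suggests.

```python
import math


def softmax_with_temperature(logits, T):
    """p_c = exp(g_c / T) / sum_j exp(g_j / T), computed with a max shift for stability."""
    if not 0 < T < 5:
        raise ValueError("the description restricts T to 0 < T < 5")
    m = max(logits)
    exps = [math.exp((g - m) / T) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]


def hard_label_cross_entropy(logits, label, T):
    """L_Hard = -sum_c y_c log p_c; with a one-hot y this reduces to -log p_label."""
    p = softmax_with_temperature(logits, T)
    return -math.log(p[label])


# Toy logits g for C = 3 categories, with the embodiment's T = 3.
g = [2.0, 1.0, 0.5]
print(softmax_with_temperature(g, T=3.0))
print(hard_label_cross_entropy(g, label=0, T=3.0))
```

A larger T flattens p, so the loss penalizes confident mistakes less sharply; the embodiment uses T = 3.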

[0014] Preferably, the target model includes a feature extractor and a classifier, and the feature extractor and classifier of the target model are updated in each round (each epoch) based on gradient backpropagation of the total loss; the EMA model includes a feature extractor and a classifier, and the feature extractor and classifier of the EMA model are initialized to the feature extractor and classifier in the target model, and during training, the feature extractor and classifier of the EMA model are updated based on the current target model through exponential moving average.

[0015] Preferably, the fully connected layer $V \in \mathbb{R}^{K \times C}$ of the classifier in the EMA model contains the weights of each category, where K is the feature dimension and C is the number of categories. This information can be used to approximate the relationships between categories through the similarity matrix $S = V^{\top} V$, a C×C matrix in which each row and column contains the similarity of one category to all categories. Each column of S is normalized with softmax to obtain the category-level soft labels $t_{Cla}$.
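A minimal pure-Python sketch of the soft-label construction, assuming V is given as a K×C nested list whose column c holds the weights of category c, and using a plain softmax without temperature because the text only says "softmax". The name `category_soft_labels` is hypothetical.

```python
import math


def category_soft_labels(V):
    """Return t_cla (C x C): column c is softmax over S[:, c], where S = V^T V."""
    K, C = len(V), len(V[0])
    # S[i][j] is the dot product of the weight vectors of categories i and j.
    S = [[sum(V[k][i] * V[k][j] for k in range(K)) for j in range(C)]
         for i in range(C)]
    t_cla = [[0.0] * C for _ in range(C)]
    for c in range(C):
        col = [S[i][c] for i in range(C)]
        m = max(col)                      # max shift for numerical stability
        exps = [math.exp(s - m) for s in col]
        z = sum(exps)
        for i in range(C):
            t_cla[i][c] = exps[i] / z     # each column now sums to 1
    return t_cla


# Toy weights: K = 2 features, C = 3 categories; categories 0 and 1 are similar.
V = [[1.0, 0.9, -1.0],
     [0.0, 0.1, 0.5]]
t_cla = category_soft_labels(V)
print([row[0] for row in t_cla])  # soft label for images of category 0
```

In PyTorch, `torch.nn.Linear` stores its weight as an (out_features, in_features) = C×K tensor W, so V corresponds to `W.T`, S to `W @ W.T`, and `torch.softmax(S, dim=0)` would normalize each column.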

[0016] Preferably, the total loss L for model training based on the input image, constructed from the target model and the EMA model, includes:

[0017] From the probability distribution predicted by the target model, the cross-entropy loss $L_{Hard}$ with respect to the sample's true hard label is calculated. Because the target model may classify categories inaccurately, which can easily lead to low final classification accuracy, this cross-entropy loss between the model's predictions and the true labels is introduced to optimize the model.

[0018] From the probability distribution predicted by the target model, the KL divergence $L_{Cla}$ with respect to the category-level soft label is calculated. Although true labels help the model optimize, they carry very little information because they are one-hot, which makes it difficult to capture enough information about fine-grained images; relying on true labels alone therefore makes such images hard to classify. To let the model learn more relationships between categories, category-level soft labels $t_{Cla}$ are introduced, allowing the model to learn more knowledge and ultimately improving its accuracy in fine-grained image classification. The KL divergence $L_{Cla}$ is calculated between the category-level soft label $t_{Cla}$ of a sample and the output soft label the target model produces for that sample;

[0019] From the probability distribution predicted by the target model, the KL divergence $L_{EMA}$ with respect to the output probability distribution of the EMA model is calculated. Although true labels and category-level soft labels help optimize the model, the target model may still drift in the wrong direction during training. An EMA (exponential moving average) model is therefore used to constrain the target model, keeping it from deviating too far from its earlier training and pulling it closer to the EMA model. The KL divergence $L_{EMA}$ is calculated between the soft labels output by the EMA model and those output by the target model.

[0020] Preferably, $L = L_{Hard} + \lambda_1 L_{Cla} + \lambda_2 L_{EMA}$, where λ1 and λ2 are the weighting coefficients of the corresponding losses, λ1 > 0, λ2 > 0; generally, λ1 and λ2 are both set to 1.

[0021] Preferably, the KL divergence $L_{Cla}$ satisfies

[0022] $L_{Cla} = KL(t_{Cla}, p)$

[0023] where KL(t, p) is the KL divergence, $KL(t, p) = \sum_{c=1}^{C} t_c \log \frac{t_c}{p_c}$; C is the total number of categories, c indexes the categories, $t_{Cla}$ is the category-level soft label, and p is the normalized probability distribution predicted by the model.
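A minimal sketch of $L_{Cla}$ for one sample, assuming the soft target is the column of $t_{Cla}$ indexed by the sample's true label (as in paragraph [0041]) and that every $p_c$ is strictly positive, which holds for softmax outputs. The function names are hypothetical.

```python
import math


def kl_divergence(t, p):
    """KL(t, p) = sum_c t_c * log(t_c / p_c); terms with t_c = 0 contribute nothing."""
    return sum(tc * math.log(tc / pc) for tc, pc in zip(t, p) if tc > 0)


def class_level_loss(t_cla, label, p):
    """L_Cla = KL(t_Cla, p), where the soft target is column `label` of t_Cla."""
    target = [row[label] for row in t_cla]
    return kl_divergence(target, p)


# Toy 2-category soft-label matrix (columns sum to 1) and a target-model prediction p.
t_cla = [[0.7, 0.2],
         [0.3, 0.8]]
p = [0.6, 0.4]
print(class_level_loss(t_cla, label=0, p=p))
```

The loss is zero exactly when the prediction matches the category's soft label, so the model is pulled toward the inter-category structure rather than toward a one-hot target.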

[0024] Preferably, the KL divergence $L_{EMA}$ satisfies

[0025] $L_{EMA} = KL(t_{EMA}, p)$

[0026] where KL(·) is the KL divergence, $t_{EMA}$ is the output of the EMA model for the sample, and p is the probability distribution produced by the target model for the sample.
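The three terms can be combined into the total loss L for a single sample. The sketch below assumes that $t_{EMA}$ is the EMA model's temperature-scaled softmax output with the same T as the target model (the text does not say), that $t_{EMA}$ is treated as a constant target, and that λ1 = λ2 = 1 by default; all names are hypothetical.

```python
import math


def softmax_t(logits, T):
    """Temperature-scaled softmax with a max shift for numerical stability."""
    m = max(logits)
    exps = [math.exp((g - m) / T) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(t, p):
    """KL(t, p) = sum_c t_c log(t_c / p_c)."""
    return sum(a * math.log(a / b) for a, b in zip(t, p) if a > 0)


def total_loss(target_logits, ema_logits, label, t_cla, T=3.0, lam1=1.0, lam2=1.0):
    """L = L_Hard + lam1 * L_Cla + lam2 * L_EMA for a single sample."""
    p = softmax_t(target_logits, T)               # target-model distribution
    t_ema = softmax_t(ema_logits, T)              # EMA-model distribution (assumed same T)
    l_hard = -math.log(p[label])                  # cross-entropy with the one-hot label
    l_cla = kl([row[label] for row in t_cla], p)  # KL to the category-level soft label
    l_ema = kl(t_ema, p)                          # KL to the EMA model's output
    return l_hard + lam1 * l_cla + lam2 * l_ema


t_cla = [[0.7, 0.2],
         [0.3, 0.8]]
print(total_loss([1.0, -0.5], [0.8, -0.3], label=0, t_cla=t_cla))
```

In a batched PyTorch version, `torch.nn.functional.kl_div` expects log-probabilities as its first argument, so `kl_div(log_p, t, reduction="batchmean")` corresponds to KL(t, p) here.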

[0027] Preferably, since the EMA model is frozen and not trained by gradient descent, it is updated from the parameters of the target model: the new EMA parameters combine a certain proportion of the current EMA model's parameters with the remaining proportion of the target model's parameters, following a preset update-frequency rule. New category-level soft labels are then calculated from the new EMA model.

[0028] $\varphi' \leftarrow \alpha \varphi' + (1 - \alpha)\varphi$

[0029] Where φ′ is the EMA model parameter, φ is the current target model parameter, and α is the corresponding weight coefficient, α∈(0,1); generally, α=0.95.
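A minimal sketch of the EMA rule, with parameters represented as a dictionary of named scalars standing in for tensors (an illustrative simplification); `ema_update` is a hypothetical name and α = 0.95 follows the text.

```python
def ema_update(ema_params, target_params, alpha=0.95):
    """phi' <- alpha * phi' + (1 - alpha) * phi, applied to every named parameter."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    return {name: alpha * value + (1 - alpha) * target_params[name]
            for name, value in ema_params.items()}


# Toy scalar "parameters"; after n updates toward a fixed target, the
# remaining gap to the target shrinks by a factor of alpha**n.
ema = {"fc.weight": 1.0, "fc.bias": 0.0}
target = {"fc.weight": 0.0, "fc.bias": 1.0}
for _ in range(3):
    ema = ema_update(ema, target)
print(ema)
```

In PyTorch the same rule is commonly written in place under `torch.no_grad()`, for example `ema_p.mul_(alpha).add_(p, alpha=1 - alpha)` for each pair of corresponding parameters.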

[0030] Preferably, given an image that needs to be classified, the trained target model uses the correlation between categories learned during training to output the category corresponding to the image.

[0031] This invention relates to a fine-grained image classification model training method based on category-level soft target supervision. The method pre-trains a target model with labeled data; initializes an EMA model with the parameters of the target model; calculates a similarity matrix from the parameters of the fully connected layer in the EMA model; and obtains category-level soft labels $t_{Cla}$ from the similarity matrix, which are associated with the images. An image is input, and the total training loss is constructed from the target model and the EMA model to update the target model; the EMA model is updated with the new target model, and new category-level soft labels are calculated with the new EMA model; the process is repeated while minimizing the total loss, thereby training the fine-grained image classification model.

[0032] The beneficial effects of this invention are as follows:

[0033] (1) Other classification methods do not specifically address fine-grained images; although they can perform well on some coarse-grained classification tasks, they are clearly insufficient for fine-grained classification. This invention learns the relationships between similar classes and performs well on fine-grained image classification.

[0034] (2) Compared with hard-label-based methods, soft labels contain more knowledge and account for the relationships between categories, which lets the model learn more and greatly helps improve classification accuracy. The present invention derives the relationships between categories from the weights of the classifier's fully connected layer and computes from them the matrix from which the soft labels are obtained, which preserves the relationships between categories without requiring additional space to store a pre-trained model.

[0035] (3) Compared with other soft-label-based methods, this invention requires neither a complex clustering process nor an additional pre-trained model to obtain soft labels.

[0036] (4) The effectiveness was validated on the CUB-200-2011, Stanford Dogs, and MIT67 datasets. The model achieved an accuracy of 71.2% on CUB-200-2011, 69.3% on Stanford Dogs, and 70% on MIT67.

Attached Figure Description

[0037] Figure 1 is a flowchart of the present invention;

[0038] Figure 2 is a schematic diagram of the model of the present invention;

[0039] Figure 3 is a schematic diagram of the category-level soft labels of the present invention.

Detailed Implementation

[0040] The present invention will be further described in detail below with reference to embodiments, but the scope of protection of the present invention is not limited thereto.

[0041] This invention relates to a fine-grained image classification model training method based on category-level soft target supervision. The method involves pre-training a target model using labeled data; initializing an EMA model using the parameters of the target model; calculating a similarity matrix from the parameters of the fully connected layer in the EMA model; performing softmax normalization on each column of the similarity matrix to obtain category-level soft labels, with one column serving as the soft label for the images of a single category; calculating the cross-entropy loss between the target model's prediction for the input and its true label, the KL divergence between the target model's prediction and the input's category-level soft label, and the KL divergence between the target model's prediction and the EMA model's prediction for the input; summing the cross-entropy loss and the two KL divergences as the total training loss; and updating the target model, then updating the EMA model from the new target model and using the new EMA model to calculate new category-level soft labels. Minimizing the total loss achieves consistency and decorrelation between fine-grained elements, ultimately enabling effective fine-grained image classification.

[0042] In conjunction with the embodiments, the method includes the following steps:

[0043] Step 1: Select a dataset and pre-train the target model with labeled samples to obtain a target model trained for 40 epochs. Here, a public dataset is selected, such as the CUB-200-2011 public dataset, which is a dataset containing 200 bird subclasses and a total of 11,788 bird images. The training dataset has 5,994 images, and the test set has 5,794 images. Each image provides image class labeling information.

[0044] The target model was trained using CUB-200-2011, with a ResNet18 network as the backbone and a 200-class classifier layer as the final layer. When training the target model on a labeled training set, cross-entropy loss was used to maintain model efficiency. The target model was trained for 40 epochs. The temperature parameter T was set to 3, and the batch size was 64. The above operations can be performed on the dataset using deep learning frameworks such as PyTorch. Images are input into a DataLoader, the data in the DataLoader is iterated, and the data is input into the encoder to obtain the model outputs. The loss is calculated, and the model is optimized using the SGD optimizer.

[0045] When training the target model with a labeled dataset, standard cross-entropy loss is used to increase robustness. The target loss is calculated as follows:

[0046] $L_{Hard} = -\sum_{c=1}^{C} y_c \log p_c$ (1)

[0047] $p_c = \frac{\exp(g_c / T)}{\sum_{j=1}^{C} \exp(g_j / T)}$

[0048] $g_c = f(x_c; \varphi)$

[0049] where f denotes the target model, C is the total number of categories, c indexes the categories, $y_c$ is the true label of sample $x_c$, $p_c$ is the probability of the c-th class after softmax normalization, T is the temperature parameter with 0 < T < 5, $g_c$ is the raw output of the target model f for sample $x_c$ before normalization, and φ denotes the parameters of the target model.

[0050] Step 2: The target model includes a feature extractor and a classifier. The feature extractor and classifier of the target model are updated in each round based on gradient backpropagation of the total loss. The EMA model includes a feature extractor and a classifier. The feature extractor and classifier of the EMA model are initialized to the feature extractor and classifier in the target model. During training, the feature extractor and classifier of the EMA model are updated based on the current target model through exponential moving average.

[0051] The EMA model is initialized with the target model at the 40th epoch, and its feature extractor $F_t$ and classifier $C_t$ are frozen.

[0052] Step 3: Calculate a similarity matrix from the parameters of the fully connected classification layer in the EMA model, perform softmax normalization on each column of the similarity matrix to obtain category-level soft labels, and use one column as the soft label for the images of one category. The fully connected layer $V \in \mathbb{R}^{K \times C}$ of the EMA classification layer contains the weights for each category, where K is the dimension of the input vector and C is the total number of categories; this information can be used to approximate the relationships between categories. Specifically, the similarity matrix is obtained from these weights as $S = V^{\top} V$, and each row and column of S contains the similarity of one category to all categories. Softmax normalization is then performed on each column of S to obtain the category-level soft labels $t_{Cla}$.

[0053] In this embodiment, the fully connected classification layer of the initialized EMA model has weights $V \in \mathbb{R}^{512 \times 200}$, that is, one 512-dimensional weight vector for each of the 200 categories. Extracting these weights and computing $S = V^{\top} V$ yields a 200×200 similarity matrix S, each row and column of which contains the similarity of one category to all categories. Softmax normalization is then performed on each column of S to obtain the category-level soft labels $t_{Cla}$.

[0054] Step 4: Using the classifier of the target model, obtain the output soft labels (predicted distribution) for the samples, and calculate the cross-entropy loss between them and the samples' true hard labels. Using the same target-model predictions, obtain each sample's category-level soft label according to its true label, and calculate the KL divergence between that category-level soft label and the output soft labels. Using the classifier of the EMA model, obtain the EMA model's output soft labels for the samples, and calculate the KL divergence between the output soft labels of the EMA model and those of the target model.

[0055] Step 4 includes the following steps:

[0056] Step 4.1: Because the target model may classify categories inaccurately, which can easily lead to low final classification accuracy, the cross-entropy loss between the model's prediction and the true label, shown in Formula (1), is introduced to optimize the model.

[0057] Step 4.2: Although ground-truth labels help the model optimize, they carry very little information because they are one-hot, which makes it difficult to capture enough information about fine-grained images; relying solely on ground-truth labels makes such images hard to classify. To let the model learn more relationships between categories, category-level soft labels $t_{Cla}$ are introduced, allowing the model to learn more knowledge and ultimately improving its accuracy in classifying fine-grained images. A KL divergence is calculated between the category-level soft label $t_{Cla}$ of a sample and the output soft label the target model produces for that sample.

[0058] $KL(t, p) = \sum_{c=1}^{C} t_c \log \frac{t_c}{p_c}$

[0059] $L_{Cla} = KL(t_{Cla}, p)$ (3)

[0060] where KL denotes the KL divergence, c indexes the categories, C is the total number of categories, $t_c$ is the category-level soft label value for the c-th category, $p_c$ is the c-th entry of the softmax-normalized model output, $t_{Cla}$ denotes the category-level soft labels over all categories, and p is the normalized probability distribution predicted by the model.

[0061] Step 4.3: Although true labels and class-level soft labels can help optimize the model, the target model may still train in the wrong direction during training. This necessitates using an EMA model to guide the target model, ensuring it doesn't deviate too much from previous training and aligns it closer to the EMA model. A KL divergence is calculated based on the output soft labels obtained from the samples processed by the EMA model and the output soft labels obtained from the samples processed by the target model.

[0062] $L_{EMA} = KL(t_{EMA}, p)$ (4)

[0063] where KL denotes the KL divergence, $t_{EMA}$ is the output of the EMA model for the sample, and p is the probability distribution produced by the target model for the sample.

[0064] Step 5: After the target model is updated, the EMA model is updated with certain weights and a preset update frequency rule, and new class-level soft labels are obtained based on the new EMA model.

[0065] In step 5, since the EMA model is frozen and not trained by gradient descent, it is updated from the parameters of the target model: a new EMA model is obtained by adding a certain proportion of the EMA model's parameters to the remaining proportion of the target model's parameters.

[0066] $\varphi' \leftarrow \alpha \varphi' + (1 - \alpha)\varphi$ (5)

[0067] Where φ′ is the EMA model parameter, φ is the current target model parameter, and α is the corresponding weight coefficient, α=0.95.

[0068] After updating the EMA model, new class-level soft labels are calculated based on the new EMA model.

[0069] Step 6: Sum the above losses to obtain the total loss L. Train the target model by minimizing all loss functions to complete fine-grained image classification learning and achieve a high fine-grained image classification accuracy.

[0070] In step 6, the complete objective function L is:

[0071] $L = L_{Hard} + \lambda_1 L_{Cla} + \lambda_2 L_{EMA}$ (7)

[0072] Where λ1 and λ2 are the corresponding hyperparameters, λ1 = 1 and λ2 = 1.
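The text states only that L is minimized by gradient backpropagation. As a hedged illustration, not given in the source: if all three terms use $p = \mathrm{softmax}(g/T)$, the label is one-hot, and $t_{Cla}$ and $t_{EMA}$ are held constant (the EMA model is frozen), the gradient of L with respect to the target model's logits g has the closed form implemented below. The helper names are hypothetical.

```python
import math


def softmax_t(logits, T):
    """Temperature-scaled softmax with a max shift for numerical stability."""
    m = max(logits)
    exps = [math.exp((g - m) / T) for g in logits]
    z = sum(exps)
    return [e / z for e in exps]


def total_loss(g, label, t_cla_col, t_ema, T=3.0, lam1=1.0, lam2=1.0):
    """L = L_Hard + lam1 * KL(t_cla_col, p) + lam2 * KL(t_ema, p), p = softmax(g / T)."""
    p = softmax_t(g, T)

    def kl(t):
        return sum(a * math.log(a / b) for a, b in zip(t, p) if a > 0)

    return -math.log(p[label]) + lam1 * kl(t_cla_col) + lam2 * kl(t_ema)


def total_loss_grad(g, label, t_cla_col, t_ema, T=3.0, lam1=1.0, lam2=1.0):
    """dL/dg_c = [(p_c - y_c) + lam1 (p_c - t_cla_c) + lam2 (p_c - t_ema_c)] / T."""
    p = softmax_t(g, T)
    y = [1.0 if c == label else 0.0 for c in range(len(g))]
    return [((pc - yc) + lam1 * (pc - tc) + lam2 * (pc - ec)) / T
            for pc, yc, tc, ec in zip(p, y, t_cla_col, t_ema)]


g = [0.3, -0.2, 1.1]            # toy target-model logits, C = 3
t_cla_col = [0.2, 0.3, 0.5]     # soft label of the true class (label = 2)
t_ema = [0.1, 0.2, 0.7]         # EMA-model distribution, treated as a constant
print(total_loss_grad(g, 2, t_cla_col, t_ema))
```

Each term pulls p toward a different target (the one-hot label, the category's soft label, and the EMA prediction), and with λ1 = λ2 = 1 the three pulls are weighted equally.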

[0073] The target model is optimized with the complete objective function L using the SGD optimizer with a momentum of 0.9, a weight decay of $5 \times 10^{-4}$, a batch size of 64, and a learning rate of 0.1. During training, the EMA model is updated once every epoch between the 40th and 150th epochs, and once every three epochs after the 150th epoch. The hyperparameters are λ1 = 1.0 and λ2 = 1.0, and the number of epochs is set to 200.
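The hyperparameters of the embodiment and the EMA refresh schedule can be collected into a small configuration sketch. The boundary handling is an assumption, since the text gives only "between the 40th and 150th epochs" and "once every three epochs after the 150th epoch": here epochs are counted from 1, the EMA is refreshed after every epoch from 41 to 150 and then at epochs 153, 156, and so on, and the 200 epochs are assumed to include pre-training. `TrainConfig` and `should_update_ema` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the embodiment; field names are hypothetical."""
    num_classes: int = 200        # CUB-200-2011
    temperature: float = 3.0      # T
    batch_size: int = 64
    pretrain_epochs: int = 40     # cross-entropy-only pre-training
    total_epochs: int = 200       # assumed to include the pre-training epochs
    lr: float = 0.1
    momentum: float = 0.9
    weight_decay: float = 5e-4
    ema_alpha: float = 0.95
    lambda1: float = 1.0
    lambda2: float = 1.0
    dense_ema_until: int = 150    # EMA refreshed every epoch up to this epoch
    sparse_ema_every: int = 3     # afterwards, every third epoch


def should_update_ema(epoch, cfg=TrainConfig()):
    """Return True if the EMA model (and t_Cla) is refreshed after this 1-based epoch."""
    if epoch <= cfg.pretrain_epochs:
        return False                          # pre-training phase: no EMA update
    if epoch <= cfg.dense_ema_until:
        return True                           # every epoch in 41..150
    return (epoch - cfg.dense_ema_until) % cfg.sparse_ema_every == 0  # 153, 156, ...


cfg = TrainConfig()
updates = [e for e in range(1, cfg.total_epochs + 1) if should_update_ema(e, cfg)]
print(len(updates), updates[:2], updates[-2:])
```

Thinning the EMA refreshes late in training keeps the soft labels $t_{Cla}$ more stable once the target model has largely converged.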

[0074] With the present invention, the trained target model can classify fine-grained images and reduces the adverse effect of fine-grained differences on classification performance; given an image to be classified, the trained target model uses the inter-category correlations learned during training to output the category corresponding to the image.

[0075] Based on this method, corresponding computer media, programs, and devices can be developed.

[0076] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0077] This invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0078] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0079] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0080] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.

[0081] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

Claims

1. A fine-grained image classification model training method based on category-level soft target supervision, characterized in that: a target model is pre-trained with labeled data; an EMA model is initialized with the parameters of the target model, and the fully connected layer in the classifier of the EMA model contains a weight vector for each category; with $K$ denoting the number of categories, a $K \times K$ similarity matrix $S$ is obtained from these category weight vectors, and each column of $S$ is normalized to yield the category-level soft labels, the soft label associated with an image being the column corresponding to its category; images are input, and a total training loss $\mathcal{L}$ is constructed from the target model and the EMA model to update the target model. The total loss $\mathcal{L}$ comprises: the cross-entropy loss $\mathcal{L}_{CE}$ between the probability distribution predicted by the target model and the true hard labels of the samples; the KL divergence $\mathcal{L}_{soft}$ between the probability distribution predicted by the target model and the category-level soft label; and the KL divergence $\mathcal{L}_{EMA}$ between the probability distribution predicted by the target model and the output probability distribution of the EMA model. The EMA model is then updated with the new target model, and new category-level soft labels are calculated from the updated EMA model; training of the fine-grained image classification model is achieved by repeating these steps to minimize the total loss.
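The soft-label step of claim 1 can be illustrated with a short NumPy sketch. The claim does not fix the similarity measure or the column normalization, so this sketch assumes cosine similarity between the category weight vectors and a temperature-scaled softmax over each column; the names `category_soft_labels` and `soft_label_for` and the `temperature` parameter are hypothetical.

```python
import numpy as np

def category_soft_labels(fc_weight: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Return a K x K matrix whose column j is the soft label for category j.

    fc_weight: (K, D) weights of the EMA classifier's fully connected layer,
    one row per category.
    Assumptions (not stated in the claim): cosine similarity as the similarity
    measure and a temperature-scaled softmax as the column-wise normalization.
    """
    # L2-normalize each category weight vector so the dot product is a cosine similarity.
    w = fc_weight / np.linalg.norm(fc_weight, axis=1, keepdims=True)
    sim = w @ w.T  # (K, K) similarity matrix S
    # Column-wise softmax: every column sums to 1 and becomes one class-level soft label.
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)

def soft_label_for(labels: np.ndarray, soft: np.ndarray) -> np.ndarray:
    """Associate each image with the soft-label column of its class; returns (N, K)."""
    return soft[:, labels].T
```

Because each category is most similar to itself, the largest entry of each column stays on the true class, while the remaining mass is spread over the categories whose weight vectors are closest to it.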

2. The method for training a fine-grained image classification model based on category-level soft target supervision according to claim 1, characterized in that: the target model is pre-trained using labeled data with the objective function $\min_{\theta} \; -\sum_{c=1}^{K} y_c \log p_c$, where $p_c = \frac{\exp(z_c / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$ and $z = f_{\theta}(x)$. Here $f_{\theta}$ represents the target model, $K$ is the total number of categories, $y_c$ is the label of sample $x$ for category $c$, $p_c$ is the normalized probability that the model predicts the $c$-th class, $T$ is the temperature coefficient, $z$ is the output obtained by passing sample $x$ through the target model, which is then normalized to give $p$, and $\theta$ denotes the parameters of the target model.
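The pre-training objective of claim 2 is a cross-entropy over a temperature-scaled softmax. A minimal NumPy sketch, assuming integer class labels (so the one-hot sum reduces to the log-probability of the true class) and batch averaging; the function names are hypothetical.

```python
import numpy as np

def softmax_with_temperature(z: np.ndarray, T: float) -> np.ndarray:
    """Row-wise p_c = exp(z_c / T) / sum_j exp(z_j / T)."""
    s = z / T
    s = s - s.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def pretrain_ce_loss(z: np.ndarray, labels: np.ndarray, T: float = 1.0) -> float:
    """Mean of -sum_c y_c log p_c over a batch.

    z: (N, K) target-model outputs; labels: (N,) integer class indices.
    With one-hot y, the sum reduces to -log p_{label}.
    """
    p = softmax_with_temperature(z, T)
    n = z.shape[0]
    return float(-np.log(p[np.arange(n), labels] + 1e-12).mean())
```

A larger `T` flattens the predicted distribution. With all-zero outputs over four classes, the loss equals $\log 4$ regardless of `T`.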

3. The method for training a fine-grained image classification model based on category-level soft target supervision according to claim 1, characterized in that: the target model includes a feature extractor and a classifier, both of which are updated in each round of gradient backpropagation based on the total loss; the EMA model also includes a feature extractor and a classifier, which are initialized from the feature extractor and classifier of the target model and, during training, are updated from the current target model by exponential moving average.

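4. The fine-grained image classification model training method based on category-level soft target supervision according to claim 1, characterized in that: the total loss is $\mathcal{L} = \mathcal{L}_{CE} + \alpha\,\mathcal{L}_{soft} + \beta\,\mathcal{L}_{EMA}$, where $\mathcal{L}_{CE}$ is the cross-entropy loss, $\mathcal{L}_{soft}$ and $\mathcal{L}_{EMA}$ are the KL divergences with respect to the category-level soft label and the EMA model output, respectively, and $\alpha$ and $\beta$ are the weighting coefficients of the corresponding losses.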

5. The method for training a fine-grained image classification model based on category-level soft target supervision according to claim 1, characterized in that: the KL divergence loss satisfies $\mathcal{L}_{soft} = \mathrm{KL}(q \,\|\, p) = \sum_{c=1}^{K} q_c \log \frac{q_c}{p_c}$, where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence, $K$ is the total number of categories, $q_c$ is the class-level soft label for category $c$, and $p_c$ is the normalized probability predicted by the model for category $c$.
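Claim 5's soft-label term is a KL divergence between two distributions over the $K$ categories. The sketch below computes $\mathrm{KL}(q \,\|\, p)$ row-wise in NumPy. It assumes the soft label is the first argument (the usual knowledge-distillation direction), treats $0 \log 0$ as 0, and clips with a small `eps`; the function name is hypothetical.

```python
import numpy as np

def kl_divergence(q: np.ndarray, p: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise KL(q || p) = sum_c q_c * log(q_c / p_c).

    q: soft-label distribution(s); p: predicted distribution(s); same shape.
    Entries with q_c = 0 contribute 0 (the convention 0 * log 0 = 0).
    """
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Clip before taking logs so zeros never produce -inf.
    log_ratio = np.log(np.clip(q, eps, None)) - np.log(np.clip(p, eps, None))
    return np.sum(np.where(q > 0, q * log_ratio, 0.0), axis=-1)
```

For example, a one-hot `q = [1, 0]` against a uniform `p = [0.5, 0.5]` gives $\log 2$, and any distribution compared with itself gives 0.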

6. The method for training a fine-grained image classification model based on category-level soft target supervision according to claim 1, characterized in that: the KL divergence loss satisfies $\mathcal{L}_{EMA} = \mathrm{KL}(p^{EMA} \,\|\, p) = \sum_{c=1}^{K} p^{EMA}_c \log \frac{p^{EMA}_c}{p_c}$, where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the KL divergence, $p^{EMA}$ is the output probability distribution obtained by processing the sample with the EMA model, and $p$ is the probability distribution obtained by passing the sample through the target model.

7. The method for training a fine-grained image classification model based on category-level soft target supervision according to claim 1, characterized in that: after the target model is updated, the EMA model is updated with a set weight according to a preset update-frequency rule: $\theta_{EMA} \leftarrow m\,\theta_{EMA} + (1 - m)\,\theta$, where $\theta_{EMA}$ denotes the EMA model parameters, $\theta$ denotes the parameters of the current target model, and $m$ is the corresponding weighting coefficient.
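The EMA update of claim 7 can be sketched over a dictionary of NumPy parameter arrays. The claim does not specify the "preset update frequency rule", so this sketch assumes a simple rule of updating every `every` optimizer steps; `ema_update`, `maybe_update_ema`, `m`, and `every` are hypothetical names.

```python
import numpy as np

def ema_update(ema_params: dict, target_params: dict, m: float) -> None:
    """In place: theta_ema <- m * theta_ema + (1 - m) * theta for every parameter array."""
    for name, theta in target_params.items():
        ema_params[name] *= m
        ema_params[name] += (1.0 - m) * theta

def maybe_update_ema(step: int, ema_params: dict, target_params: dict,
                     m: float = 0.999, every: int = 1) -> bool:
    """Apply the EMA update only every `every` steps (assumed frequency rule).

    Returns True if an update was applied at this step.
    """
    if step % every == 0:
        ema_update(ema_params, target_params, m)
        return True
    return False
```

With `m = 0.9`, an EMA parameter starting at 0 and tracking a target fixed at 1 moves to 0.1 after one update and 0.19 after two. A value of `m` close to 1 makes the EMA model, and therefore the soft labels derived from its classifier, change slowly and smoothly.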

8. The method for training a fine-grained image classification model based on category-level soft target supervision according to claim 1, characterized in that: given an image to be classified, the trained target model, drawing on the inter-category correlations learned during training, outputs the category of the image.