A fine-grained recognition-oriented class-incremental learning method

By employing unsupervised localization of image patches, improved generative adversarial networks, and dual-branch networks, the problems of fine-grained classification accuracy and catastrophic forgetting in incremental learning are solved, achieving efficient fine-grained recognition and knowledge retention.

CN115423090B · Active · Publication Date: 2026-06-16 · NANJING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF SCI & TECH
Filing Date
2022-08-21
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing incremental learning methods struggle to effectively address the classification accuracy problem at the fine-grained level, especially when faced with an imbalance between the number of known class samples and new class samples. This leads to the model biasing towards the new class and catastrophic forgetting, and there is a lack of effective fine-grained improvement schemes.

Method used

We employ unsupervised image depth descriptors to locate object-level image patches, combine them with weakly supervised part-level detectors, improve fully convolutional generative adversarial networks to generate old-class images, and retain old knowledge while learning new knowledge through a dual-branch network. We then fine-tune the classifier using a balanced salient sample set.

Benefits of technology

It improves the classification accuracy of fine-grained recognition, alleviates the long-tail problem and catastrophic forgetting, and enables effective augmentation of known class samples without increasing storage space.

✦ Generated by Eureka AI based on patent content.

Figure CN115423090B_ABST

Abstract

This application discloses a class-incremental learning method for fine-grained recognition. The method uses object-level and part-level detectors to crop image patches from the original image, which removes background noise while capturing the most discriminative parts. A dual-branch network is then trained to extract features from these patches, so that learned knowledge is retained while new features are still learned well. The patch features are concatenated into fine-grained features for classification, and an improved generative adversarial network generates images carrying old-class features to counter catastrophic forgetting. Classification accuracy is greatly improved compared with the original coarse-grained incremental learning method.

Description

Technical Field

[0001] This invention belongs to the field of incremental learning in machine learning, and specifically relates to a class-incremental learning method for fine-grained recognition.

Background Technology

[0002] Incremental learning refers to continuously processing the stream of information from the real world, absorbing new knowledge while retaining, and even integrating and optimizing, old knowledge. Humans have always been the reference model for artificial intelligence, and the ability to learn incrementally throughout life is one of the most important human capabilities. If robots can learn incrementally from their environment and tasks as humans do, lifelong learning becomes possible. With the rapid development and widespread application of databases and internet technologies, various sectors of society have accumulated massive amounts of data, and the volume grows rapidly every day. Extracting useful information from this data and analyzing and processing it is a challenging task. Traditional batch learning methods cannot meet this demand, whereas incremental learning can address it effectively. Research on incremental learning models also allows us to better understand and mimic the learning methods of the human brain and the structural mechanisms of biological neural networks at the system level, providing a technical foundation for developing new computational models and effective learning algorithms.

[0003] In traditional computer vision research, image analysis typically focuses on classification and retrieval within conventional categories such as "dog," "car," and "bird." However, in many practical applications, image objects often originate from finer-grained subcategories within a single conventional category, such as different breeds of "dog"—Husky, Alaskan Malamute, Bichon Frise, etc.; or different breeds of "car"—Audi, BMW, Mercedes-Benz, etc. Fine-grained image analysis is a popular research topic in computer vision addressing these issues. Its goal is to study several visual analysis tasks, including localization, recognition, and retrieval of object subcategories within these fine-grained images, offering broad application value in real-world scenarios. However, the smaller inter-class differences and larger intra-class differences between fine-grained subcategories make it a more challenging research topic compared to traditional image analysis.

[0004] Current research on incremental learning lacks a good solution for the fine-grained problem. However, many digital images in reality are fine-grained, which significantly impacts classification accuracy. Therefore, fine-grained improvements are needed to refine the coarse-grained models in incremental learning. Furthermore, during incremental learning, there is a significant gap between the number of known class samples and the number of new class samples stored in memory, creating a long-tail phenomenon that causes the model to favor new classes. Existing methods cannot effectively address this issue. Meanwhile, to mitigate catastrophic forgetting, data augmentation to broaden the known class samples is a good approach. New methods should perform augmentation without significantly increasing the number of samples.

Summary of the Invention

[0005] The purpose of this invention is to provide a class incremental learning method for fine-grained recognition.

[0006] The technical solution to achieve the objective of this invention is as follows. In a first aspect, this invention provides a class incremental learning method for fine-grained recognition, comprising the following steps:

[0007] Step 1: Use image depth descriptors to locate object-level image patches from the original image in an unsupervised manner, while training part-level detectors in a weakly supervised manner to locate the most discriminative fine-grained image patches from the original image.

[0008] Step 2: Improve the fully convolutional generative adversarial network and train it together with the incremental model. In each incremental stage, use Gaussian noise to generate images of the old incremental categories, classify them using the classifier in the part detector to obtain a score, and if it is greater than the set threshold, add it to the dataset for training together.

[0009] Step 3: Train a dual-branch network using the captured object and component-level image patches. This network consists of a stable branch and a flexible branch. The former is used to retain incremental learning knowledge of old categories, while the latter is used to learn features of new categories. Finally, the features of the image patches are concatenated to form fine-grained features for classification. After the network is trained, the classifier is fine-tuned using a balanced set of salient samples to address the long-tail problem.

[0010] Furthermore, in step 1, the object of interest in the original image is located in unsupervised mode using a pre-trained model on the large-scale dataset ImageNet through the image depth descriptor; the part-level detector is trained using the image labels as supervision information to locate the most discriminative fine-grained image patches from the original image.

[0011] In a given incremental learning stage, suppose the original training dataset contains a number of images. Each image is fed into a pre-trained deep feature extractor to obtain its feature map, whose size is described by its height, width, and depth (number of channels). Putting the feature maps of all images together yields the complete descriptor set. Applying principal component analysis along the depth direction yields the eigenvector corresponding to the largest eigenvalue. A heatmap is then obtained by using this eigenvector to weight and sum the channel values at each spatial location. The heatmap is upsampled back to the original image size, and the object-level region of interest is obtained by thresholding at zero and keeping the largest connected component.

[0012] After an image is fed through several convolutional layers to extract features, each vector of the resulting feature map can be regarded as an image patch of the original image. Small convolution filters that are sensitive to specific regions are then applied to the feature map to extract further features and produce a heatmap. The regions with the highest response values in the heatmap mark where the most discriminative fine-grained features lie in the original image. These small convolutions serve as part detectors.

[0013] The first 10 layers of the VGG-16 network are used as the low-level feature extractor and the remaining layers as the high-level feature extractor. A two-branch structure is designed, in which the local branch adds convolutional layers, a global pooling layer, and a classification layer after the low-level feature extractor. With separate symbols denoting the feature extractors of the global and local branches and the classifier of each branch, the training loss function for the two branches is the following cross-entropy loss:

[0014] (1)

[0015] (2)

[0016] The training loss function of the part detector of this invention is as follows:

[0017] (3)

[0018] Furthermore, step 2 improves the fully convolutional generative adversarial network to achieve a resolution that meets the requirements of the new increment stage. The old category images are then classified using the global branch in the component detector. If the classification score is greater than the set threshold, it is considered a reliable sample and added to the training set to train the model.

[0019] Let the generator and the discriminator of the generative adversarial network in a given incremental stage each be denoted by their own symbol. Assume that the training data follow the real data distribution and that the noise input follows a Gaussian distribution. The generator's output on this noise then represents the generated old-category samples, and the discriminator's output is the probability that it considers an image to be real. The losses for the generator and discriminator are as follows:

[0020] (4)

[0021] (5)

[0022] The training objective for each incremental stage of the improved generative adversarial network is as follows:

[0023] (6)

[0024] Furthermore, in step 3, the object-level and component-level detectors from step 1 are used to extract fine-grained object and component image patches from the training data of the current incremental stage, the salient sample set, and the reliable samples generated in step 2, and input them into the dual-branch network for training. The dual-branch network consists of a stable branch and a flexible branch. Each branch consists of several residual blocks of ResNet-18. After the two branches extract features from their respective residual blocks, they are added together with certain weights and used as the input to the next layer. This process is repeated to obtain the final feature map.

[0025] Let two coefficients denote the aggregation weights of the stable and flexible branches at the k-th layer. The output feature of this layer can then be represented by the following formula:

[0026] (7)

[0027] Here the two feature terms represent the stable and flexible features, respectively. When an image of incremental stage t is input into the dual-branch network, the corresponding features at layer k are obtained:

[0028] (8)

[0029] As the incremental phase progresses, the weights of stable and flexible branches will also change:

[0030] (9)

[0031] Here the two stage indices represent the current incremental stage number and the maximum incremental stage number. The dual-branch network extracts features from the object-level image and the part-level image respectively, and these features are then flattened and concatenated to form the fine-grained feature representation:

[0032] (10)

[0033] The loss function of a dual-branch network consists of two parts: first, knowledge distillation loss, which is calculated by determining the Euclidean distance between the features extracted by the old and new models from the dataset in the current incremental stage; and second, cross-entropy multi-classification loss.

[0034] (11)

[0035] (12)

[0036] The overall loss function is a weighted sum of the two:

[0037] (13)

[0038] After training the incremental dual-branch network, the subset of samples whose fine-grained features are closest to the class center features is selected as the salient sample set for the new class.

[0039] After training with all the data, the classifier is fine-tuned using a balanced set of salient samples; during the prediction phase, the adjusted model is used for classification.

[0040] In a second aspect, the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the method described in the first aspect.

[0041] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in the first aspect.

[0042] In a fourth aspect, the present invention provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in the first aspect.

[0043] Compared with the prior art, the present invention has the following significant advantages: (1) It introduces fine-grained recognition into incremental learning. In an unsupervised manner, image depth descriptors locate object-level image patches of interest in the original image, effectively removing noise that may exist in the image background. In a weakly supervised manner, a series of small convolutional filters are trained as part-level detectors to find the part of the original image with the largest heatmap response, which is the most discriminative fine-grained image patch. No strong supervision is required; only category labels are needed for training. (2) It improves the existing DCGAN and introduces it into the incremental learning process. In each incremental stage, Gaussian noise is used to generate images of the old categories, and the classifier in the part detector scores them; an image whose score exceeds the set threshold is added to the dataset for training. No additional storage space is required, because the images generated in a stage can be discarded after training. (3) The dual-branch network both retains learned knowledge and learns new knowledge well, achieving a good balance between the two. The concatenated features carry highly discriminative fine-grained information, which greatly improves classification accuracy. Fine-tuning the classifier with a balanced salient sample set effectively alleviates the long-tail problem caused by uneven sample counts.

Description of the Drawings

[0044] Figure 1 is a flowchart of the class incremental learning method for fine-grained recognition according to the present invention.

[0045] Figure 2 is a flowchart of the incremental learning method for fine-grained recognition proposed in this invention.

[0046] Figure 3 is a visualization of the fine-grained image patches obtained by the incremental learning method for fine-grained recognition according to the present invention.

Detailed Implementation

[0047] With reference to Figure 1, Figure 2, and Figure 3, this invention proposes a class-incremental learning method for fine-grained recognition, comprising the following steps:

[0048] Step 1: Using the image depth descriptor (DDT), object-level image patches are located from the original image in an unsupervised manner. Simultaneously, a part-level detector is trained in a weakly supervised manner to locate a series of the most discriminative fine-grained image patches from the original image. The structure is shown in Figure 2.

[0049] In a given incremental learning stage, suppose the original training dataset contains a number of images. Each image is fed into a pre-trained deep feature extractor to obtain its feature map, whose size is described by its height, width, and depth (number of channels). Combining the feature maps of all images yields the complete descriptor set. Applying principal component analysis along the depth direction yields the eigenvector corresponding to the largest eigenvalue. A heatmap is then obtained by using this eigenvector to weight and sum the channel values at each spatial location. The heatmap is upsampled back to the original image size, and the object-level region of interest is obtained by thresholding at zero and keeping the largest connected component.
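The localization procedure in [0049] can be sketched roughly as follows. This is a minimal pure-Python illustration under stated assumptions, not the patent's implementation: the function names (`first_principal_component`, `ddt_heatmap`, `largest_component_box`) are hypothetical, power iteration stands in for a full PCA solver, feature maps are plain nested lists, the sign of the eigenvector follows the starting vector by convention, and the upsampling step is omitted.

```python
import math
from collections import deque

def first_principal_component(descriptors, iters=200):
    """Return the descriptor mean and the leading covariance eigenvector.

    Power iteration is used here as a stand-in for PCA (assumption)."""
    n, d = len(descriptors), len(descriptors[0])
    mean = [sum(x[j] for x in descriptors) / n for j in range(d)]
    centered = [[x[j] - mean[j] for j in range(d)] for x in descriptors]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # w = C v, with C = (1/n) * sum_i x_i x_i^T over centered descriptors
        w = [0.0] * d
        for x in centered:
            dot = sum(a * b for a, b in zip(x, v))
            for j in range(d):
                w[j] += dot * x[j] / n
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    return mean, v

def ddt_heatmap(feature_map, mean, v):
    """Weight and sum the channels at each spatial location using v."""
    return [[sum((x - m) * c for x, m, c in zip(cell, mean, v)) for cell in row]
            for row in feature_map]

def largest_component_box(heatmap):
    """Bounding box (top, left, bottom, right) of the largest 4-connected
    region with heat above zero, or None if there is no such region."""
    h, w = len(heatmap), len(heatmap[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for i in range(h):
        for j in range(w):
            if heatmap[i][j] <= 0 or seen[i][j]:
                continue
            comp, queue = [], deque([(i, j)])
            seen[i][j] = True
            while queue:
                r, c = queue.popleft()
                comp.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not seen[nr][nc] and heatmap[nr][nc] > 0):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None
    rows = [r for r, _ in best]
    cols = [c for _, c in best]
    return min(rows), min(cols), max(rows), max(cols)

# Toy data: two 3x3 feature maps with 2-channel descriptors.
bg, obj = [0.0, 0.0], [2.0, 2.0]
maps = [
    [[bg, bg, bg], [bg, obj, obj], [bg, obj, obj]],
    [[obj, bg, bg], [bg, bg, bg], [bg, bg, obj]],
]
all_desc = [cell for fmap in maps for row in fmap for cell in row]
mean, v = first_principal_component(all_desc)
box = largest_component_box(ddt_heatmap(maps[0], mean, v))
```

In the toy data the object descriptors project positively onto the leading direction, so the box for the first map covers its lower-right 2x2 block of feature-map cells. A real pipeline would scale that box back to image coordinates after upsampling.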

[0050] After an image is fed through several convolutional layers to extract features, each vector of the resulting feature map can be regarded as a patch of the original image. Small convolution filters that are sensitive to specific regions are then applied to the feature map to extract further features and produce a heatmap. The regions with the highest response values in the heatmap mark where the most discriminative fine-grained features lie in the original image. This invention uses these small convolutions as part detectors. Specifically, this invention uses the first 10 layers of the VGG-16 network as the low-level feature extractor and the later layers as the high-level feature extractor. A two-branch design is employed: the global branch is a standard classification model, while the local branch adds convolutional layers, a global pooling layer, and a classification layer after the low-level feature extractor to capture fine-grained local features of the image. With separate symbols denoting the feature extractors of the global and local branches and the classifier of each branch, the training loss function for the two branches is the following cross-entropy loss:

[0051] (1)

[0052] (2)

[0053] The training loss function of the part detector of this invention is as follows:

[0054] (3)
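How a small convolution can act as a part detector may be easier to see in code. The sketch below assumes a filter that covers a single spatial position (a 1x1 convolution), global max pooling, and a uniform stride when mapping back to the image. These choices, the function names, and the `stride` and `patch_size` values are illustrative assumptions rather than details taken from the patent.

```python
def part_response_map(feature_map, part_filter):
    """Apply one 1x1 convolution filter: a dot product at every location."""
    return [[sum(f * w for f, w in zip(cell, part_filter)) for cell in row]
            for row in feature_map]

def strongest_location(response):
    """Global max pooling that also reports where the maximum occurs."""
    best_val, best_pos = None, None
    for i, row in enumerate(response):
        for j, val in enumerate(row):
            if best_val is None or val > best_val:
                best_val, best_pos = val, (i, j)
    return best_val, best_pos

def patch_box(pos, stride, patch_size):
    """Map a feature-map location to an image box (top, left, bottom, right).

    Assumes a uniform stride and a fixed patch size anchored at the
    top-left corner of the corresponding image region (hypothetical)."""
    top, left = pos[0] * stride, pos[1] * stride
    return top, left, top + patch_size, left + patch_size

# Toy 2x3 feature map with 2 channels and one part filter.
feature_map = [
    [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]],
    [[3.0, 1.0], [2.0, 2.0], [0.0, 5.0]],
]
part_filter = [1.0, -1.0]
score, pos = strongest_location(part_response_map(feature_map, part_filter))
box = patch_box(pos, stride=8, patch_size=16)
```

Because only the pooled score feeds the classification loss, a filter of this kind can be trained from category labels alone, which matches the weakly supervised setting described above.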

[0055] The fine-grained image patches obtained from the original image in step 1 are shown in Figure 3, where the red box marks the object-level image patch and the green box marks the part-level image patch.

[0056] Step 2: Improve the traditional fully convolutional generative adversarial network (DCGAN) and train it together with the incremental model. In each incremental stage, use Gaussian noise to generate images of the old incremental categories, classify them using the classifier in the part detector to obtain a score, and if it is greater than the set threshold, add it to the dataset for training together. This can enrich the salient sample set of the old categories, thereby mitigating catastrophic forgetting.
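The screening of generated images described in Step 2 can be sketched as a simple filter. This is an illustrative sketch only: it assumes the score is the softmax probability of the top class, that the top class must be one of the old categories, and a threshold of 0.9. The names `select_reliable` and `classify` and the threshold value are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_reliable(generated, classify, old_classes, threshold=0.9):
    """Keep generated samples whose top class is an old class and whose
    classification score exceeds the threshold."""
    reliable = []
    for sample in generated:
        probs = softmax(classify(sample))
        label = max(range(len(probs)), key=probs.__getitem__)
        if label in old_classes and probs[label] > threshold:
            reliable.append((sample, label))
    return reliable

# Toy example: each "image" is already a logit vector, so the classifier
# is the identity function. Classes 0 and 1 are old, class 2 is new.
generated = [[5.0, 0.0, 0.0], [1.0, 1.1, 0.9], [0.0, 0.0, 6.0]]
reliable = select_reliable(generated, classify=lambda s: s, old_classes={0, 1})
```

Only the first toy sample survives: the second is too uncertain and the third is confidently assigned to a new class. Since accepted samples are regenerated at each stage, nothing here needs to be stored between stages.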

[0057] Let the generator and the discriminator of the generative adversarial network in a given incremental stage each be denoted by their own symbol. Assume that the training data follow the real data distribution and that the noise input follows a Gaussian distribution. The generator's output on this noise then represents the generated old-category samples, and the discriminator's output is the probability that it considers an image to be real. The losses for the generator and discriminator are as follows:

[0058] (4)

[0059] (5)

[0060] The training objective for each incremental stage of the improved generative adversarial network is as follows, and its structure is shown in Figure 2:

[0061] (6)
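As a point of reference, the textbook GAN losses that match the quantities named in [0057] can be written as below. The notation ($G_t$ and $D_t$ for the stage-$t$ generator and discriminator, $p_{\mathrm{data}}$ for the data distribution, $p_z$ for the Gaussian noise distribution) is assumed for illustration, and this is the standard formulation rather than the improved objective of this invention.

```latex
\begin{aligned}
\mathcal{L}_{D_t} &= -\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D_t(x)\big]
                     - \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_t(G_t(z))\big)\big] \\
\mathcal{L}_{G_t} &= \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_t(G_t(z))\big)\big] \\
\min_{G_t}\max_{D_t} V(D_t, G_t) &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D_t(x)\big]
                     + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D_t(G_t(z))\big)\big]
\end{aligned}
```

Here $G_t(z)$ plays the role of the generated old-category sample and $D_t(\cdot)$ the probability that an image is real.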

[0062] Step 3: Using the object-level and part-level detectors from Step 1, extract fine-grained object and part image patches from the training data of the current incremental stage, the salient sample set, and the reliable samples generated in Step 2, and input them into the dual-branch network for training.

[0063] A two-branch network is trained using image patches at the object and component level. This network consists of a stable branch and a flexible branch. The former is used to retain as much knowledge of the old categories as possible during incremental learning, while the latter is used to learn as many features of the new categories as possible. Finally, the features of the image patches are concatenated to form fine-grained features for classification. After the network is trained, the classifier is fine-tuned with a balanced set of salient samples to address the long-tail problem.

[0064] The dual-branch network consists of a stable branch and a flexible branch. Each branch is composed of several residual blocks from ResNet-18. The stable branch is trained with a smaller learning rate to preserve learned old knowledge as much as possible, while the flexible branch is updated with a normal learning rate to better learn new knowledge. After extracting features from their respective residual blocks, the two branches are summed with certain weights and used as the input to the next layer. This process continues to obtain the final feature map, as shown in Figure 2.

[0065] Let two coefficients denote the aggregation weights of the stable and flexible branches at the k-th layer. The output feature of this layer can then be represented by the following formula:

[0066] (7)

[0067] Here the two feature terms represent the stable and flexible features, respectively. When an image of incremental stage t is input into the dual-branch network, the corresponding features at layer k are obtained:

[0068] (8)

[0069] As the incremental phase progresses, the weights of stable and flexible branches will change. In the initial phase, flexible branches are favored to facilitate faster learning of new knowledge, while in later phases, stable branches are favored to better retain learned knowledge and mitigate catastrophic forgetting.

[0070] (9)
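The layer-wise fusion of the two branches and the shift of weight from the flexible branch to the stable branch can be sketched as follows. The linear schedule in `branch_weights` is a hypothetical stand-in chosen only to show the described trend, and the sketch uses the same pair of weights at every layer for brevity. All names are illustrative.

```python
def branch_weights(stage, max_stage):
    """Hypothetical linear schedule (an assumption, not the patent's formula):
    early stages favor the flexible branch, later stages the stable branch."""
    if max_stage < 1 or not 1 <= stage <= max_stage:
        raise ValueError("stage must satisfy 1 <= stage <= max_stage")
    stable = stage / (max_stage + 1)
    return stable, 1.0 - stable

def aggregate(stable_feat, flexible_feat, w_stable, w_flexible):
    """Weighted element-wise sum of the two branches' features at one layer."""
    if len(stable_feat) != len(flexible_feat):
        raise ValueError("branch features must have the same length")
    return [w_stable * s + w_flexible * f
            for s, f in zip(stable_feat, flexible_feat)]

def forward(x, stable_blocks, flexible_blocks, stage, max_stage):
    """Run both branches block by block; the fused output of each layer
    is the input to both branches at the next layer."""
    w_s, w_f = branch_weights(stage, max_stage)
    for s_block, f_block in zip(stable_blocks, flexible_blocks):
        x = aggregate(s_block(x), f_block(x), w_s, w_f)
    return x

# Toy blocks standing in for residual blocks: identity vs. doubling.
stable_blocks = [lambda v: list(v), lambda v: list(v)]
flexible_blocks = [lambda v: [2.0 * a for a in v], lambda v: [2.0 * a for a in v]]
early = forward([4.0], stable_blocks, flexible_blocks, stage=1, max_stage=3)
late = forward([4.0], stable_blocks, flexible_blocks, stage=3, max_stage=3)
```

With these toy blocks the early-stage output is dominated by the flexible (doubling) branch and the late-stage output by the stable (identity) branch, mirroring the behavior described in [0069].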

[0071] Here the two stage indices represent the current incremental stage number and the maximum incremental stage number. The dual-branch network extracts features from the object-level image and the part-level image respectively, and these features are then flattened and concatenated to form the fine-grained feature representation.

[0072] (10)

[0073] The loss function of a dual-branch network consists of two parts: first, knowledge distillation loss, which is calculated by determining the Euclidean distance between the features extracted by the old and new models from the dataset in the current incremental stage; and second, cross-entropy multi-classification loss.

[0074] (11)

[0075] (12)

[0076] The overall loss function is a weighted sum of the two:

[0077] (13)
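The two-part loss described in [0073] to [0076] can be sketched numerically as below. This is a minimal sketch under assumptions: the distillation term is taken as the mean (unsquared) Euclidean distance between old-model and new-model features, the weighted sum uses a single hypothetical coefficient `lam` on the distillation term, and all function names are illustrative.

```python
import math

def distillation_loss(old_feats, new_feats):
    """Mean Euclidean distance between features from the old and new models."""
    dists = [math.sqrt(sum((o - n) ** 2 for o, n in zip(f_old, f_new)))
             for f_old, f_new in zip(old_feats, new_feats)]
    return sum(dists) / len(dists)

def cross_entropy(logits_batch, labels):
    """Mean multi-class cross-entropy computed with a stable log-sum-exp."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[y]
    return total / len(labels)

def total_loss(old_feats, new_feats, logits_batch, labels, lam=1.0):
    """Weighted sum of the classification and distillation terms.
    The single weight `lam` is an assumed form of the weighting."""
    return (cross_entropy(logits_batch, labels)
            + lam * distillation_loss(old_feats, new_feats))

# Toy batch of two samples with 2-D features and 2-class logits.
old_feats = [[0.0, 0.0], [1.0, 1.0]]
new_feats = [[3.0, 4.0], [1.0, 1.0]]
logits_batch = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
kd = distillation_loss(old_feats, new_feats)
loss = total_loss(old_feats, new_feats, logits_batch, labels, lam=0.5)
```

The distillation term penalizes drift of the new model's features away from the old model's on current-stage data, while the cross-entropy term drives classification of both old and new categories.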

[0078] After training the incremental dual-branch network, the subset of samples whose fine-grained features are closest to the class center feature is selected as the salient sample set for the new class. Since the number of samples in the new class is significantly greater than that of the old class, this invention fine-tunes the classifier using a balanced salient sample set after training with all data. This mitigates catastrophic forgetting and addresses the long-tail problem. The adjusted model is then used for classification during the prediction phase.
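Selecting salient samples near the class center and building a balanced set for classifier fine-tuning can be sketched as follows. The names and the `per_class` budget are hypothetical, squared Euclidean distance is assumed as the closeness measure, and the class center is assumed to be the mean feature.

```python
def class_center(features):
    """Mean feature vector of one class."""
    d = len(features[0])
    return [sum(f[j] for f in features) / len(features) for j in range(d)]

def select_salient(features, k):
    """Indices of the k samples whose features are closest to the class center."""
    center = class_center(features)

    def sq_dist(f):
        return sum((a - b) ** 2 for a, b in zip(f, center))

    order = sorted(range(len(features)), key=lambda i: sq_dist(features[i]))
    return order[:k]

def balanced_salient_set(features_by_class, per_class):
    """Pick the same number of salient samples for every class, old or new,
    so that fine-tuning the classifier is not dominated by new classes."""
    k = min(per_class, min(len(f) for f in features_by_class.values()))
    return {c: select_salient(f, k) for c, f in features_by_class.items()}

# Toy features: one old class with 3 samples, one new class with 4 samples.
features_by_class = {
    "old": [[1.0, 0.0], [0.0, 0.0], [2.0, 0.0]],
    "new": [[0.0, 0.0], [10.0, 10.0], [4.0, 4.0], [6.0, 6.0]],
}
balanced = balanced_salient_set(features_by_class, per_class=2)
```

Each class contributes the same number of indices, so the fine-tuning set stays balanced even when a new class has many more raw samples than an old one.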

Claims

1. A class incremental learning method for fine-grained recognition, characterized in that it includes the following steps:
Step 1: Use image depth descriptors to locate object-level image patches from the original image in an unsupervised manner, while training part-level detectors in a weakly supervised manner to locate the most discriminative fine-grained image patches from the original image.
Step 2: Improve the fully convolutional generative adversarial network and train it together with the incremental model. In each incremental stage, use Gaussian noise to generate images of the old incremental categories, classify them using the classifier in the part detector to obtain a score, and if it is greater than the set threshold, add it to the dataset for training together.
Step 3: Train a dual-branch network using the captured object and component-level image patches. This network consists of a stable branch and a flexible branch. The former is used to retain incremental learning knowledge of old categories, while the latter is used to learn features of new categories. Finally, the features of the image patches are concatenated to form fine-grained features for classification. After the network is trained, the classifier is fine-tuned using a balanced set of salient samples to address the long-tail problem.

2. The class incremental learning method for fine-grained recognition according to claim 1, characterized in that Step 1 uses image depth descriptors and a model pre-trained on the large-scale ImageNet dataset to locate objects of interest in the original image without supervision; image labels are then used as supervisory information to train part-level detectors that locate the most discriminative fine-grained image patches in the original image. In a given incremental learning stage, suppose the original training dataset contains a number of images. Each image is fed into a pre-trained deep feature extractor to obtain its feature map, whose size is described by its height, width, and depth (number of channels); putting the feature maps of all images together yields the complete descriptor set. Applying principal component analysis along the depth direction yields the eigenvector corresponding to the largest eigenvalue; a heatmap is then obtained by using this eigenvector to weight and sum the channel values at each spatial location. The heatmap is upsampled back to the original image size, and the object-level region of interest is obtained by thresholding at zero and keeping the largest connected component. After an image is fed through several convolutional layers to extract features, each vector of the resulting feature map can be regarded as an image patch of the original image; small convolution filters that are sensitive to specific regions are applied to the feature map to extract further features and produce a heatmap. The regions with the highest response values in the heatmap mark where the most discriminative fine-grained features lie in the original image. These small convolutions serve as part detectors.

3. The class incremental learning method for fine-grained recognition according to claim 2, characterized in that the first 10 layers of the VGG-16 network are used as the low-level feature extractor and the remaining layers as the high-level feature extractor. A two-branch structure is designed, in which the local branch adds convolutional layers, a global pooling layer, and a classification layer after the low-level feature extractor. With separate symbols denoting the feature extractors of the global and local branches and the classifier of each branch, the training loss function for the two branches is the following cross-entropy loss: (1) (2) The training loss function of the part detector is as follows: (3).

4. The class incremental learning method for fine-grained recognition according to claim 3, characterized in that Step 2 improves the fully convolutional generative adversarial network to achieve a higher resolution in the new increment stage. The old category images are then classified using the global branch in the part detector. If the classification score is greater than the set threshold, the image is considered a reliable sample and added to the training set to train the model. Let the generator and the discriminator of the generative adversarial network in a given incremental stage each be denoted by their own symbol. Assume that the training data follow the real data distribution and that the noise input follows a Gaussian distribution. The generator's output on this noise then represents the generated old-category samples, and the discriminator's output is the probability that it considers an image to be real. The losses for the generator and discriminator are as follows: (4) (5) The training objective for each incremental stage of the improved generative adversarial network is as follows: (6).

5. The class incremental learning method for fine-grained recognition according to claim 4, characterized in that, in step 3, using the object-level and part-level detectors from step 1, fine-grained object and part image patches are extracted from the training data of the current incremental stage, the salient sample set, and the reliable samples generated in step 2, and input into the dual-branch network for training. The dual-branch network consists of a stable branch and a flexible branch. Each branch consists of several residual blocks of ResNet-18. After the two branches extract features from their respective residual blocks, the features are added together with certain weights and used as the input to the next layer. This process is repeated to obtain the final feature map. Let two coefficients denote the aggregation weights of the stable and flexible branches at the k-th layer. The output feature of this layer can then be represented by the following formula: (7) Here the two feature terms represent the stable and flexible features, respectively. When an image of incremental stage t is input into the dual-branch network, the corresponding features at layer k are obtained: (8) As the incremental phase progresses, the weights of the stable and flexible branches also change: (9) Here the two stage indices represent the current incremental stage number and the maximum incremental stage number; the dual-branch network extracts features from the object-level image and the part-level image respectively, and these features are then flattened and concatenated to form the fine-grained feature representation: (10) The loss function of the dual-branch network consists of two parts: first, a knowledge distillation loss, calculated as the Euclidean distance between the features extracted by the old and new models from the dataset in the current incremental stage; and second, a cross-entropy multi-classification loss. (11) (12) The overall loss function is a weighted sum of the two: (13) After training the incremental dual-branch network, the subset of samples whose fine-grained features are closest to the class center features is selected as the salient sample set for the new class.

6. The class incremental learning method for fine-grained recognition according to claim 5, characterized in that, after training with all the data, the classifier is fine-tuned using a balanced set of salient samples; during the prediction phase, the adjusted model is used for classification.

7. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method described in any one of claims 1-6.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the steps of the method described in any one of claims 1-6.

9. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the steps of the method described in any one of claims 1-6.