Continuous learning image classification method and device based on deep learning
A technology relating to deep learning and classification methods, applied in the field of continuous-learning image classification based on deep learning
Active Publication Date: 2022-05-10
SUN YAT SEN UNIV
Problems solved by technology
But this extension-based approach inevitably changes the structure of the classifier, especially the feature extractor part, and requires more and more...
Abstract
The invention discloses a continuous learning image classification method and device based on deep learning. The method comprises the following steps: constructing a task continuous learning model with task-specific batch normalization, in which the parameters of all convolution kernels in the feature extractor are fixed across all tasks and, when each new task is learned, the parameters of the batch normalization (BN) layer corresponding to each convolution kernel are learned together with the task-specific classification head; performing incremental training on the task continuous learning model, adding a task-specific batch normalization layer and classification head whenever a new task arrives; and, after incremental training is completed, obtaining a trained task continuous learning model, into which the images to be classified are input to complete the classification task. The method effectively alleviates the catastrophic forgetting problem by exploiting the batch normalization (BN) layers that already exist in the task continuous learning model.
Application Domain: Character and pattern recognition; Neural architectures
Technology Topic: Continual learning; Convolution
Examples
Detailed Description of Embodiments
[0042] In order to enable those skilled in the art to better understand the solution of the present application, the technical solution of the present invention will be described clearly and completely below with reference to the embodiments of the present application and the accompanying drawings. It should be understood that the accompanying drawings are only for exemplary illustration and should not be construed as limiting this patent. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative efforts shall fall within the protection scope of this application.
[0043] Reference in this application to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described in this application may be combined with other embodiments.
[0044] A CNN classifier consists of a feature extractor and a classification head. In CNN models, the feature extractor usually consists of multiple convolutional layers, and the classification head h(·) usually consists of one or two fully connected layers with a final softmax output. Each convolutional layer in the feature extractor includes multiple convolution kernels, a batch normalization (BN) layer corresponding to each convolution kernel, and a nonlinear activation function.
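For clarity, the following Python (PyTorch) sketch illustrates the structure just described: convolution kernels, a BN layer per group of kernels, a nonlinear activation, and a fully connected classification head. The class names, layer sizes and channel counts are illustrative assumptions, not the patent's architecture.

```python
# Minimal sketch of a CNN classifier = feature extractor + classification head.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        # Each convolutional layer: convolution kernels -> BN -> nonlinearity.
        self.blocks = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.blocks(x).flatten(1)          # feature vector

class ClassificationHead(nn.Module):
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)  # one fully connected layer

    def forward(self, feats):
        return self.fc(feats)   # logits; softmax is applied in the loss / at inference
```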
[0045] In traditional task-incremental learning strategies, different tasks share the feature extractor part of the classifier, and each task has its own task-specific classification head. At test time, the task identity of any new test image is known, so the shared feature extractor and the corresponding task-specific classification head are known and are combined to predict the class. This means that the parameters of all convolution kernels and their corresponding BN parameters are incrementally updated as tasks arrive. Changes in the parameters of the feature extractor (especially the convolution kernel part) are the origin of catastrophic forgetting, which degrades the performance on previously learned tasks.
[0046] Deep learning techniques, especially convolutional neural networks, have been widely used in intelligent diagnosis of various diseases based on medical images. However, current intelligent diagnostic systems can only help diagnose a specific set of diseases, in part because it is difficult to collect training data for all disease diagnosis tasks simultaneously. This leads to more and more independent intelligent diagnostic systems, making it difficult for medical centers to manage multiple disparate systems and causing medical staff to spend more time learning to use the various systems. One solution to this problem is to develop a single intelligent system that progressively learns more and more tasks, each for diagnosing a specific set of diseases. In order to prevent the rapid expansion of the system scale, it is usually assumed that the underlying classifier has a feature extractor shared by all tasks but multiple task-specific classification heads; that is, multiple tasks use the same feature extractor to extract feature vectors from the data, and each task classifies the data using its own classification head. It is further assumed that during testing the task to which the data belongs is known, i.e., the user knows which classification head should be applied to the test data. This problem is called the Task Incremental Learning (TIL) problem.
[0047] The premise of continuous learning is that over time, a classifier is constantly learning more and more tasks. Each time the classifier learns a new task, it classifies a new set of categories, and the categories learned by different tasks are generally considered to be non-overlapping. During the process of learning a new task, no data of any previously learned task is retained, only the training data for the new task is available.
[0048] When a CNN classifier is incrementally updated to handle multiple classification tasks, most existing TIL strategies assume that all these tasks share the same feature extractor but have task-specific classification heads. This means that the parameters of all convolution kernels and the BN parameters in the feature extractor are incrementally updated across tasks, and all tasks share the same set of updated kernel and BN parameters. The parameters in the feature extractor are the source of catastrophic forgetting in TIL: in order to better adapt to the current task, the parameters related to old tasks are inevitably changed, and since the data of the new task and the old tasks do not overlap and differ greatly, the performance on previously learned tasks degrades. To mitigate or ideally avoid catastrophic forgetting, the feature extractor should be updated as little as possible, or not at all.
[0049] Referring to Figure 1, the continuous learning image classification method based on deep learning in this embodiment includes the following steps:
[0050] S1. Construct a task continuous learning model with task-specific batch normalization. The parameters of all convolution kernels in the feature extractor of the task continuous learning model are fixed across all tasks. When learning each new task, the parameters of the batch normalization layer (BN) corresponding to each convolution kernel are learned together with the task-specific classification head. The convolution kernels are pre-trained in the first stage, so that visually similar inputs produce similar feature vectors from the feature extractor; the well-trained convolution kernels obtained in the initial stage are then reused in the subsequent incremental stages.
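A minimal PyTorch sketch of how step S1 could be organized, assuming one BN layer per task is kept for every convolutional layer and one classification head per task. The class and attribute names (TaskConvBlock, bns, heads) and the layer sizes are illustrative assumptions; the patent's experiments actually use a ResNet18 backbone.

```python
# Sketch of a task-continual model with shared convolution kernels and
# task-specific BN layers and classification heads.
import torch.nn as nn

class TaskConvBlock(nn.Module):
    """One convolutional layer whose kernels are shared by all tasks,
    with one BN layer per task."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bns = nn.ModuleList()          # one BN per task, added incrementally
        self.act = nn.ReLU()

    def add_task(self):
        self.bns.append(nn.BatchNorm2d(self.conv.out_channels))

    def forward(self, x, task_id):
        return self.act(self.bns[task_id](self.conv(x)))

class TaskContinualModel(nn.Module):
    def __init__(self, in_channels=3, feat_dim=64):
        super().__init__()
        self.block1 = TaskConvBlock(in_channels, 32)
        self.block2 = TaskConvBlock(32, feat_dim)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.heads = nn.ModuleList()        # one classification head per task

    def add_task(self, num_classes, feat_dim=64):
        for blk in (self.block1, self.block2):
            blk.add_task()
        self.heads.append(nn.Linear(feat_dim, num_classes))

    def forward(self, x, task_id):
        x = self.block2(self.block1(x, task_id), task_id)
        feats = self.pool(x).flatten(1)
        return self.heads[task_id](feats)
```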
[0051] Understandably, since the parameters of all convolution kernels in the feature extractor are fixed across all tasks, it is crucial to obtain well-trained convolution kernels. In general, if two inputs are visually different, an ideal feature extractor should output two different feature vectors, while two visually similar inputs should produce two similar feature vectors; that is, images with large differences should be far apart in the feature space, and the feature vectors of similar images should be close together in the feature space. Well-trained convolution kernels can be obtained in the following ways: (Strategy 1) train a feature extractor from scratch using the training set of the first task; (Strategy 2) use a fixed pre-trained feature extractor, especially when the pre-trained feature extractor has been trained on large-scale public datasets (such as ImageNet) or other relevant datasets; (Strategy 3) use the training set of the first task to fine-tune a pre-trained feature extractor.
[0052] Further, the pre-trained convolution kernels are obtained in the following manner:
[0053] The training set of the first task is used to train the feature extractor from scratch; that is, in the first task, all of the current training data are used to train the parameters of the entire neural network. When a new task arrives, a new set of batch normalization layers and a new classification head are added, the parameters of the convolution kernels trained in the first stage are fixed, and the newly added batch normalization layers and classification head are trained using the newly arriving data.
[0054] In an embodiment of the present application, the above strategy 1 is used to train the convolution kernel, specifically:
[0055] Let $\mathcal{D}=\{\mathcal{D}_1,\mathcal{D}_2,\dots\}$ denote the collection of training data at the different incremental stages,
[0056] where $\mathcal{D}_t=\{(x_i^t, y_i^t)\}_{i=1}^{n_t}$ represents the training data corresponding to task $t$, $x_i^t$ represents the image of the $i$-th sample, $y_i^t$ represents the one-hot label of the $i$-th sample, and $n_t$ represents the number of training samples of task $t$. Since the classifier proposed by this method is multi-headed, the classifier at each stage $t$ is composed of a feature extractor $f(\cdot)$ and a task-specific classification head $h_t(\cdot)$. Each input sample image $x_i^t$, after passing through the feature extractor $f(\cdot)$ and the classification head $h_t(\cdot)$, yields a vector $o_i^t \in \mathbb{R}^{C_t}$, where $C_t$ represents the number of categories of the training data in task $t$. The vector $o_i^t$ is finally fed into the softmax function to obtain the probability vector $p_i^t$, where $p_{i,c}^t$ indicates the probability that the input $x_i^t$ belongs to class $c$:
[0057] $$p_{i,c}^t = \frac{\exp(o_{i,c}^t)}{\sum_{j=1}^{C_t}\exp(o_{i,j}^t)}$$
[0058] Therefore, the cross-entropy loss used for model optimization is:
[0059] $$\mathcal{L}(\theta_t) = -\frac{1}{n_t}\sum_{i=1}^{n_t}\sum_{c=1}^{C_t} y_{i,c}^t \log p_{i,c}^t$$
[0060] where $\theta_t$ represents the model parameters that need to be optimized for task $t$. When $t=1$, i.e., in the initial stage, the model parameters that need to be optimized are $\theta_1=\{\theta_{conv}, \theta_{bn}^1, \theta_{head}^1\}$, where $\theta_{conv}$ represents the parameters of the convolutional layers, $\theta_{bn}^1$ represents the parameters of the batch normalization layers, and $\theta_{head}^1$ represents the parameters of the classification head.
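A hedged sketch of the initial-stage ($t=1$) optimization with the cross-entropy loss above, in which all parameters (convolution kernels, the BN layers of task 1, and the head of task 1) are trained. It reuses the illustrative TaskContinualModel sketched earlier; the function name and data-loader interface are assumptions.

```python
# First-stage training: optimize conv kernels, BN of task 1 and head of task 1.
import torch
import torch.nn.functional as F

def train_first_stage(model, loader, num_classes, epochs=70, lr=1e-3):
    model.add_task(num_classes)           # register the BN set and head of task 1
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=2e-4)
    for _ in range(epochs):
        for x, y in loader:
            logits = model(x, task_id=0)
            # F.cross_entropy combines softmax and cross-entropy:
            # -(1/n) sum_i log p_{i, y_i}
            loss = F.cross_entropy(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```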
[0061] S2. Perform incremental training on the task continuous learning model;
[0062] In a convolutional neural network, each convolutional layer is usually followed by a batch normalization layer, and each batch normalization layer corresponds to a convolution kernel. In order to retain knowledge about old tasks while enabling the model to learn new knowledge from new data, after the initial-stage training, when new training data $\mathcal{D}_t$ (with $t>1$) arrives, the model adds a new task-specific set of batch normalization layers and a new classification head. Referring to Figure 2, when training progresses to stage $t$, there are $t$ batch normalization layers after each convolutional layer, and the model decides which batch normalization layer and classification head the data flows to according to the current task. That is, when $t>1$, the model parameters to be optimized are $\theta_t=\{\theta_{bn}^t, \theta_{head}^t\}$; the parameters of the convolution kernels in the feature extractor are not updated, and only the parameters of the batch normalization layers and classification head corresponding to the newly added task $t$ are updated.
[0063] After training at stage $t$, since the task label corresponding to the input data is known in the task-incremental setting, for a test sample $x$ the task continuous learning model determines its flow direction according to the task label of the sample and obtains the vector $o = h_t(f(x))$, where the parameters corresponding to the feature extractor $f(\cdot)$ are $\{\theta_{conv}, \theta_{bn}^t\}$. The predicted label for the sample $x$ is:
[0064] $$\hat{y} = \arg\max_{c}\; p_c$$
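A hedged sketch of the incremental-stage ($t>1$) update and of task-aware prediction, again reusing the illustrative model above: the convolution kernels stay frozen and only the new task's BN layers and classification head are passed to the optimizer. The freezing logic and function names are assumptions, not the patent's code.

```python
# Incremental-stage training and task-aware prediction.
import torch
import torch.nn.functional as F

def train_new_task(model, loader, num_classes, task_id, epochs=70, lr=1e-3):
    # task_id is assumed to be the 0-based index of the task being added (t - 1).
    model.add_task(num_classes)                 # new BN set + new head for task t
    for p in model.parameters():                # freeze the shared conv kernels
        p.requires_grad_(False)                 # (and everything else) first
    new_params = []
    for blk in (model.block1, model.block2):    # this task's BN layers
        new_params += list(blk.bns[task_id].parameters())
    new_params += list(model.heads[task_id].parameters())  # this task's head
    for p in new_params:
        p.requires_grad_(True)
    opt = torch.optim.SGD(new_params, lr=lr, momentum=0.9, weight_decay=2e-4)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(model(x, task_id), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

@torch.no_grad()
def predict(model, x, task_id):
    # The task identity is known at test time, so route the sample through that
    # task's BN layers and head and take the argmax of the softmax probabilities.
    return model(x, task_id).softmax(dim=1).argmax(dim=1)
```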
[0065] Further, the definition of the batch normalization layer BN is as follows:
[0066] During classifier training, given a batch of input images, for the $j$-th group of convolution kernels in the $l$-th convolutional layer, let $z_{l,j}$ denote the convolution output of one image, $\mathcal{B}_{l,j}$ denote the set of convolution outputs of all images in the batch, and $\hat{z}_{l,j}$ denote the output of the batch normalization layer BN. Then, for $z_{l,j} \in \mathcal{B}_{l,j}$, the BN operation is defined as:
[0067] $$\hat{z}_{l,j} = \gamma_{l,j}\,\frac{z_{l,j} - \mu_{l,j}}{\sigma_{l,j}} + \beta_{l,j}$$
[0068] where $\mu_{l,j}$ and $\sigma_{l,j}$ respectively represent the mean and standard deviation of all elements in $\mathcal{B}_{l,j}$, and $\gamma_{l,j}$ and $\beta_{l,j}$ are the parameters that need to be learned in BN. BN is specific to each convolution kernel: each convolution kernel has its own BN with its own parameters.
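The same BN operation, written as a small PyTorch function for one group of convolution outputs; the eps term is the standard numerical-stability constant used in practice and is an assumption beyond the formula above.

```python
# Per-kernel batch normalization over all elements of a batch.
import torch

def batch_norm(z, gamma, beta, eps=1e-5):
    """z: outputs of one convolution kernel over a batch, shape (N, H, W)."""
    mu = z.mean()                       # mean over all elements in the batch
    var = z.var(unbiased=False)         # (biased) variance over all elements
    return gamma * (z - mu) / torch.sqrt(var + eps) + beta
```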
[0069] S3. After completing the incremental training, a trained task continuous learning model is obtained, and the image task to be classified is input into the trained task continuous learning model to complete the classification task.
[0070] Further, in order to eliminate the influence of class imbalance, the metric used in this embodiment to compare the performance of the methods is the Mean Class Recall (MCR). Regardless of the number of samples in each class, MCR treats every class equally, which better reflects the classification performance of the model. From the change of each model's MCR across stages, the degree to which the model forgets old knowledge can be observed in this embodiment. For a class $c$, the recall is calculated as:
[0071] $$R_c = \frac{1}{N_c}\sum_{i=1}^{N_c} s_i$$
[0072] where $N_c$ represents the number of samples of class $c$ and $s_i$ indicates whether the $i$-th sample of class $c$ is correctly classified: $s_i=1$ means the sample is correctly classified into class $c$, and $s_i=0$ means the sample is wrongly classified into another class. After model training at incremental stage $t$, the MCR of the model is expressed as:
[0073] $$\mathrm{MCR}_t = \frac{1}{K_t}\sum_{c=1}^{K_t} R_c$$
[0074] where $K_t$ represents the number of all known classes up to stage $t$.
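A small PyTorch sketch of the MCR computation described above; the function name and the assumption that predictions and labels are given as 1-D integer tensors are illustrative.

```python
# Mean Class Recall: average the per-class recall over all known classes.
import torch

def mean_class_recall(preds, labels):
    classes = labels.unique()
    recalls = []
    for c in classes:
        mask = labels == c
        # recall of class c = fraction of its samples that are correctly classified
        recalls.append((preds[mask] == c).float().mean())
    return torch.stack(recalls).mean()
```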
[0075] In a specific embodiment of this application, the above method is applied to the medical image data MedMNIST-v2, and the effectiveness of the method proposed in this application is demonstrated by comparing with other classic continuous learning methods.
[0076] (1) Dataset introduction
[0077] MedMNIST-v2 is composed of several small datasets, including 12 2D medical image datasets and 6 3D medical image datasets. The experiments are mainly performed on the 2D image datasets to evaluate the method of this application, and each small 2D dataset in MedMNIST-v2 is treated as one task. After excluding one multi-label task (ChestMNIST), one ordinal regression task (RetinaMNIST) and two tasks that share the same raw patient data as the task OrganAMNIST (OrganCMNIST and OrganSMNIST), this embodiment selects 8 tasks, and the size of each image is 28 × 28 pixels. The eight selected tasks involve multiple image modalities, including pathology, dermoscopy, X-ray, CT, ultrasound, and microscopy images. Among them, PathMNIST, OCTMNIST and BloodMNIST are three-channel images, and the remaining five tasks are grayscale images, for which this embodiment generates the corresponding three-channel images by copying the value of the single channel to the other two channels. Also, PathMNIST, OCTMNIST, TissueMNIST, and OrganAMNIST contain orders of magnitude more images per class than the other datasets, so only 20% of the data in these datasets is randomly sampled for model training and evaluation. The statistics of each sub-dataset used in the experiments are shown in Table 1. It can be seen that the data from different sub-datasets are very different, and there are also certain differences in the number of categories between the datasets. Therefore, solving task continuous learning on the MedMNIST-v2 dataset is quite challenging.
[0078] Table 1 MedMNIST-v2 dataset statistics
[0079]
[0080] (2) Experimental setup:
[0081] Incremental setting: For the MedMNIST-v2 dataset, the training data of each incremental stage is one sub-dataset. This embodiment simulates multiple incremental scenarios by randomly shuffling the order in which the different sub-datasets appear. The setting is challenging due to the large variability among the sub-datasets and the presence of multiple image modalities. It is worth mentioning that this embodiment is the first work to perform task-incremental learning on this dataset. In order to demonstrate the stability of the proposed method, all experiments on MedMNIST-v2 are repeated 5 times to obtain the mean and standard deviation of the performance.
[0082] (3) Data augmentation: In the model training stage, the input image is first resized to 32×32, then padded with 4 pixels on each side and randomly cropped back to 32×32, then flipped horizontally with a probability of 0.5, then its brightness is randomly changed, and finally it is converted to a tensor and normalized. In the testing phase, the input image is first resized to 32×32, then converted to a tensor and normalized.
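A torchvision sketch of this augmentation pipeline; the normalization mean/std values and the brightness-jitter magnitude are placeholders, since the patent does not state them.

```python
# Training and test transforms following the augmentation described above.
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(32),                        # resize to 32x32
    transforms.RandomCrop(32, padding=4),         # pad by 4 pixels, random crop
    transforms.RandomHorizontalFlip(p=0.5),       # flip with probability 0.5
    transforms.ColorJitter(brightness=0.2),       # random brightness change (magnitude assumed)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # placeholder stats
])

test_tf = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```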
[0083] Hyperparameter selection: In the experiments, unless otherwise specified, ResNet18 is used as the backbone network by default, the optimizer is stochastic gradient descent (SGD), and the batch size is 64. The number of training epochs is 70, the initial learning rate is 0.001, and the learning rate is reduced by a factor of 10 at epochs 49 and 63, respectively. In the optimizer, the weight decay is 0.0002 and the momentum is 0.9.
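A sketch of this training configuration in PyTorch; the backbone call and class count are illustrative, and the epoch loop is shown only schematically.

```python
# SGD with lr 0.001, momentum 0.9, weight decay 0.0002; lr /10 at epochs 49 and 63.
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)                  # backbone; class count is illustrative
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0002)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[49, 63], gamma=0.1)
# for epoch in range(70):
#     train_one_epoch(model, optimizer, batch_size=64)   # hypothetical helper
#     scheduler.step()
```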
[0084] Comparison methods: The classic task continuous learning methods compared in this embodiment are LwF, MAS, EWC and RWalk. For a more comprehensive comparison, two methods originally designed for class-continuous learning, iCaRL and UCIR, are also adapted to the task-continuous learning scenario. It should be emphasized that, in all experiments, neither the method proposed in this embodiment nor the other classic methods retain old samples during continuous learning, while iCaRL and UCIR follow the settings of their original papers and retain 20 samples for each old class.
[0085] (4) Experimental analysis:
[0086] In order to demonstrate the performance of different models on the MedMNIST-v2 dataset as fully as possible, this embodiment randomly shuffles the order in which the 8 selected sub-datasets appear and conducts experiments on 3 of these orders. In order to compare all methods fairly, hyperparameters are tuned as well as possible for all methods so that the accuracy of all methods on the training set is as high as possible, and the optimal model is selected using the validation set. In addition, in order to fairly compare the stability of all methods, each group of experiments is repeated with 5 identical random seeds, the accuracy under different random seeds is recorded, and the average MCR and the standard deviation of the MCR at each stage are calculated.
[0087] The Joint method establishes the upper bound of model performance at different stages. Because different tasks have different difficulty, the performance of the Joint method under the three incremental orders also differs across stages, but after completing the learning of all 8 tasks, the final performance of the Joint method is about 80% regardless of the order. Meanwhile, among all methods, the method proposed in this embodiment still far exceeds most of the others without saving any old-class samples. In the experiment with incremental order 1, the performance of all methods is about 93% at the first stage because no increment is yet involved. In the end, the performance of the method proposed in this embodiment is 74.80%, only 5.55% below the Joint upper bound, which is very close. Among the other compared methods, UCIR performs best, with a final performance of 65.95%, 8.85% behind the method proposed in this embodiment. The worst method is LwF, whose final performance is only 20.33%, which shows that, due to the large domain shift between tasks, using knowledge distillation without retaining old-class samples cannot preserve old knowledge well. The classic method iCaRL, which also uses knowledge distillation, finally achieves a performance of 24.83%, a 4.5% improvement over LwF; this improvement should be attributed to saving some old-class samples. This shows that, for this type of method, retaining some old-class samples can effectively improve the model's performance during continuous learning, but because the number of saved old-class samples is limited, the mitigation of catastrophic forgetting is also limited. After learning the second task DE, the performance of all models degrades more or less because the second task DE is relatively difficult. Since iCaRL and UCIR save some old samples, their performance on the first three tasks does not drop significantly, and there is no significant gap with the method proposed in this embodiment. However, other methods such as RWalk, MAS and EWC show a sharp drop in performance on the second task, probably because these methods were originally proposed for task-incremental settings in which the domain shift between tasks is not as large as in the current MedMNIST-v2 dataset, so that the model could mitigate catastrophic forgetting to a certain extent by limiting the changes of important parameters without saving any old data. In the current, more realistic task-incremental scenario, however, the domain shift between tasks is large, so restricting parameter changes neither preserves knowledge related to old tasks well nor benefits the learning of new knowledge. In incremental order 1, when the model learns the task BL, the performance of all methods basically improves, because this task is relatively simple and has a large number of categories; therefore, if the model performs well on this task, it can to a certain extent compensate for the performance gap caused by poor performance on other tasks and finally improve the overall performance of the model.
[0088] Referring to Figure 3, in another embodiment of the present application, a continuous learning image classification system 100 based on deep learning is provided, comprising a task continuous learning model building module 101, an incremental training module 102 and a classification module 103;
[0089] The task continuous learning model building module 101 is used to construct a task continuous learning model with task-specific batch normalization, in which the parameters of all convolution kernels in the feature extractor of the task continuous learning model are fixed across all tasks; when learning each new task, the parameters of the batch normalization layer (BN) corresponding to each convolution kernel are learned together with the task-specific classification head; the convolution kernels are pre-trained in the first stage so that visually similar inputs produce similar feature vectors from the feature extractor;
[0090] The incremental training module 102 is used for incremental training of the task continuous learning model. When a new task arrives, a new task-specific batch normalization layer and classification head are added. When training progresses to stage $t$, there are $t$ batch normalization layers after each convolutional layer, and the task continuous learning model determines which batch normalization layer and classification head the data flows to according to the current task; that is, when $t>1$, the parameters of the convolution kernels in the feature extractor are not updated, and only the parameters of the batch normalization layer and classification head corresponding to the newly added task are updated;
[0091] The classification module 103 is used to obtain a trained task continuous learning model after completing the incremental training, input the image task to be classified into the trained task continuous learning model, and complete the classification task.
[0092] It should be noted that the deep-learning-based continuous learning image classification system of the present invention corresponds to the deep-learning-based continuous learning image classification method of the present invention, and the technical features of the method embodiment and their beneficial effects are all applicable to the system embodiment. For details, please refer to the description in the method embodiment of the present invention, which will not be repeated here, and this is hereby declared.
[0093] In addition, in the implementation of the deep-learning-based continuous learning image classification system of the above embodiment, the logical division of the program modules is only an example. In practical applications, the above functions may be allocated to different program modules as required, for example in consideration of the configuration of the corresponding hardware or the convenience of software implementation; that is, the internal structure of the deep-learning-based continuous learning image classification system may be divided into different program modules to complete all or part of the functions described above.
[0094] Referring to Figure 4, in one embodiment, an electronic device for implementing the deep-learning-based continuous learning image classification method is provided. The electronic device 200 may include a first processor 201, a first memory 202 and a bus, and may also include a computer program stored in the first memory 202 and executable on the first processor 201, such as a deep-learning-based continuous learning image classification program 203.
[0095] The first memory 202 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical discs, etc. In some embodiments, the first memory 202 may be an internal storage unit of the electronic device 200, such as a mobile hard disk of the electronic device 200. In other embodiments, the first memory 202 may also be an external storage device of the electronic device 200, such as a pluggable mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card (Flash Card) equipped on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit of the electronic device 200 and an external storage device. The first memory 202 can be used not only to store application software installed in the electronic device 200 and various data, such as the code of the deep-learning-based continuous learning image classification program 203, but also to temporarily store data that has been output or is to be output.
[0096] In some embodiments, the first processor 201 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or a combination of multiple central processing units (CPUs), microprocessors, digital processing chips, graphics processors, various control chips, and the like. The first processor 201 is the control unit of the electronic device; it connects the components of the entire electronic device through various interfaces and lines, runs or executes programs or modules stored in the first memory 202, and calls data stored in the first memory 202 to perform the various functions of the electronic device 200 and process data.
[0097] Figure 4 only shows an electronic device with some components. Those skilled in the art will understand that the structure shown in Figure 4 does not constitute a limitation on the electronic device 200, which may include fewer or more components than shown, or combine certain components, or have a different arrangement of components.
[0098] The deep learning-based continuous learning image classification program 203 stored in the first memory 202 in the electronic device 200 is a combination of multiple instructions, and when running in the first processor 201, can achieve:
[0099] Construct a task continuous learning model with task-specific batch normalization, in which the parameters of all convolution kernels in the feature extractor are fixed across all tasks; when learning each new task, the parameters of the batch normalization layer (BN) corresponding to each convolution kernel are learned together with the task-specific classification head; the convolution kernels are pre-trained in the first stage, so that visually similar inputs produce similar feature vectors from the feature extractor;
[0100] Perform incremental training on the task continuous learning model. When a new task arrives, a new task-specific batch normalization layer and classification head are added. When training progresses to stage $t$, there are $t$ batch normalization layers after each convolutional layer, and the task continuous learning model determines which batch normalization layer and classification head the data flows to according to the current task; that is, when $t>1$, the parameters of the convolution kernels in the feature extractor are not updated, and only the parameters of the batch normalization layer and classification head corresponding to the newly added task are updated;
[0101] After the incremental training is completed, the trained task continuous learning model is obtained, and the image task to be classified is input into the trained task continuous learning model to complete the classification task.
[0102] Further, if the modules/units integrated in the electronic device 200 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory) .
[0103] Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium; when executed, the program may include the flows of the above method embodiments. Any reference to memory, storage, database or other medium used in the various embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0104] The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, such combinations should be considered to be within the scope described in this specification.
[0105] The above-mentioned embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any other changes, modifications, substitutions, combinations and simplifications should be regarded as equivalent replacements, which are all included in the protection scope of the present invention.