A weakly supervised learning method based on explainable saliency map
By using a weakly supervised learning method that generates interpretable saliency maps, the problem of missing semantic region annotations in image data is solved. By using saliency maps to guide the attention mechanism module, the fine-grained visual classification performance of convolutional neural networks is improved, and the autonomous semantic region localization and stability of the model are enhanced.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 中船智控科技(武汉)有限公司
- Filing Date
- 2022-10-24
- Publication Date
- 2026-06-19
AI Technical Summary
The lack of semantic annotations in existing image data leads to insufficient learning ability of the attention mechanism module in the convolutional neural network model, affecting the stability of classification performance.
We employ a weakly supervised learning method based on interpretable saliency maps. This method generates interpretable saliency maps to guide the attention mechanism module, utilizes channel and spatial weights to perform autonomous visual localization of semantic regions with classification power in the image, and adds a regularization term to the loss function for supervision, thereby improving the model's classification performance.
This method achieves a stable improvement in the fine-grained visual classification performance of convolutional neural network models without relying on manual annotation, reducing labor costs and improving the classification stability of the model.
Figure CN115690492B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of image processing technology, and specifically relates to a weakly supervised learning method based on interpretable saliency maps (CAM, Class Activation Mapping).
Background Technology
[0002] With the rapid development of computer vision technology, fine-grained visual classification (FGVC) has become an important research direction with wide applications in visual tasks. In ordinary visual classification tasks (such as distinguishing birds from cars), the target objects usually differ substantially; fine-grained visual classification tasks (such as distinguishing between different species of birds or different brands of cars) instead require the algorithmic model to identify subtle differences between similar target objects. It therefore has great value and broad prospects in visual scenarios across fields such as agriculture, medicine, and industrial manufacturing.
[0003] To distinguish between different targets that are broadly similar in features but have only slight differences, fine-grained visual classification usually requires first identifying semantically meaningful parts of an image, and then performing feature mining on the semantic regions that have the greatest classification power to discover subtle differences between different categories.
[0004] The best-performing state-of-the-art methods for fine-grained visual classification tasks typically combine convolutional neural network models with attention mechanisms. By focusing on the model channels and spatial regions with the greatest classification power, the classification performance of convolutional neural network models can be significantly improved.
[0005] However, due to the lack of annotations for semantically meaningful regions in existing image data, the guidance ability of the attention mechanism module is poor. This makes the detection of semantically meaningful regions with classification power by existing methods completely dependent on the learning ability of the network model, thus affecting the stability of the attention mechanism module's improvement on the classification performance of the convolutional neural network model.
Summary of the Invention
[0006] To address the issue of the enormous workload involved in manually annotating semantically meaningful regions in images, this patent application proposes a weakly supervised learning method based on interpretable saliency maps for fine-grained visual classification. This method aims to supervise and guide the attention mechanism module, thereby steadily improving the classification performance of convolutional neural network models.
[0007] The technical solution adopted by this invention to solve its technical problem is: a weakly supervised learning method based on interpretability saliency maps, comprising the following steps.
[0008] S1, extract the depth features of the input image data using a convolutional neural network model: extract the image depth features $A = f(X) \in \mathbb{R}^{C \times H \times W}$ of the input image X using a feature extractor f(·), where C represents the number of channels in the convolutional neural network model, and H and W represent the spatial dimensions of the extracted depth features; by using the extracted depth features A as the input to the classifier g(·), the classification confidence scores of the convolutional neural network model for the different categories of the input image X are obtained as the output $y = g(f(X)) = g(A) \in \mathbb{R}^{K}$, where K represents the total number of categories, and the classification confidence score of the convolutional neural network model for the input image X belonging to the k-th category is represented as $y_k$;
[0009] S2, reweight the original input image using the mask image generated from the image depth features: for the depth features A extracted by the feature extractor f(·), let $A_i \in \mathbb{R}^{H \times W}$ denote the feature map of the i-th channel; the feature map $A_i$ is upsampled using bilinear interpolation to obtain a feature map $B_i$ with the same spatial dimensions as the original input image X; a renormalization operation is performed on the feature map $B_i$ to limit its value range to the interval [0, 1], which yields the mask image $S_i$ generated from the feature map of the i-th channel; because the mask image $S_i$ has the same spatial dimensions as the input image X, the weighted image is obtained as the Hadamard (element-wise) product of $S_i$ and X;
[0010] S3, calculate and generate an interpretability saliency map to obtain channel weights and spatial weights: when the weighted image is used again as input to the convolutional neural network model, a new set of confidence scores for the different categories is obtained; the change in the model's classification confidence score for the k-th category on the weighted image, relative to the confidence score $y_k$ of the input image X, is taken as the classification contribution of the i-th channel; the classification contribution is calculated for all channels $i \in \{1, 2, \ldots, C\}$ of the deep feature A to obtain the vector $\beta \in \mathbb{R}^{C}$, which is renormalized to obtain the feature channel weights of the interpretability saliency map, that is, the channel weights used to generate the interpretability saliency map; the weights δ of the interpretability saliency map are then obtained by weighting the feature map of each channel with these channel weights; to autonomously visually locate semantic parts with classification power in the image, the weights δ are renormalized to obtain the feature space weights based on the interpretability saliency map;
[0011] S4, take the deep features A obtained from the feature extractor as input, perform a global average over each channel, and connect the result to a small neural network consisting of fully connected layers whose number of neurons equals the number of channels C; the network output is passed through a softmax function to obtain the channel weights $\alpha = F_{CA}(A) \in \mathbb{R}^{C}$ generated by the channel attention mechanism module; the depth feature A is weighted using the channel weights α to obtain the updated depth feature $A'_i = \alpha_i A_i$, i = 1, 2, ..., C; the updated deep features A′ are then used as input and connected to a small neural network with convolutional layers, and the network output is passed through a softmax function to obtain the spatial weights $\gamma = F_{SA}(A') \in \mathbb{R}^{H \times W}$ generated by the spatial attention mechanism module; the spatial weights γ are used to weight the depth feature A to obtain the updated depth feature $A''_i$; the weighted depth feature A″ is used as the input to the classifier g(·) to calculate the classification confidence, realizing the combination of the convolutional neural network model with the attention mechanism modules;
[0012] S5, treat the channel weights α generated by the channel attention mechanism module and the feature channel weights of the interpretable saliency map as two one-dimensional probability distributions; the difference between them is measured using a symmetric Kullback-Leibler (KL) divergence, and this divergence is added as a regularization term $L_C$ to the loss function during convolutional neural network model training; similarly, the divergence between the spatial weights γ generated by the spatial attention mechanism module and the feature space weights of the interpretable saliency map is computed and added as another regularization term $L_S$ to the loss function, where the KL divergence included in $L_S$ measures the difference between two two-dimensional discretized probability distributions; finally, the loss function for training the convolutional neural network model is updated to $L_{tot} = L_{cls} + L_C + L_S$, where the classification loss $L_{cls}$ is calculated by comparing the classification confidence scores output by the convolutional neural network model with the class label of the input image X. Weak supervision of the attention mechanism module is thus achieved through interpretable saliency maps, thereby improving the classification performance of the convolutional neural network model.
[0013] Furthermore, in step S4, the small fully connected network is a two-layer neural network with ReLU as the activation function, which finally yields the channel weights α generated by the channel attention mechanism module.
[0014] Furthermore, in step S4, the convolutional network is a 3×3 convolutional kernel with 1 channel, which finally yields the spatial weights γ generated by the spatial attention mechanism module.
[0015] The beneficial effects of this invention are:
[0016] This invention generates interpretable saliency maps based on an interpretability method, which can be used to autonomously visually locate semantic regions with classification power in images. The channel weights and spatial weights obtained in this process can be used to supervise and guide the attention mechanism module to focus on the channel feature maps and spatial regions with the highest classification power among the deep features. This is achieved by adding the loss regularization terms $L_C$ and $L_S$, designed based on the saliency map weights and the attention module weights, to the training loss function, thereby realizing a weakly supervised learning process.
[0017] The convolutional neural network model trained by the weakly supervised learning method of this invention will have the ability to autonomously identify semantic parts in images that have classification power, and will no longer require the assistance of saliency maps in applications.
Attached Figure Description
[0018] Figure 1 is a flowchart of the process for generating a visual saliency map according to the present invention;
[0019] Figure 2 is a flowchart of the weakly supervised learning method of the present invention.
Detailed Implementation
[0020] The present invention will now be described in further detail with reference to specific embodiments in conjunction with the accompanying drawings. These embodiments are for illustrative purposes only and do not constitute a limitation thereof.
[0021] Research has shown that convolutional neural network (CNN) models can effectively extract deep features from image data; while the attention mechanism module, including channel attention and spatial attention modules, can guide the classification model, focusing on the feature channels and spatial regions with the highest classification efficiency among the features extracted by the CNN model. Combining these two approaches can effectively improve the fine-grained visual classification performance of CNN models.
[0022] However, because most image data lacks annotations for semantically meaningful regions, the learning process of the attention module is poorly guided during the training of convolutional neural network models. As a result, how well existing methods that incorporate an attention mechanism guide the model to focus on semantically meaningful regions with classification power depends entirely on the learning ability of the convolutional neural network model.
[0023] The randomness during the training process of convolutional neural network models may cause the attention module in the model to focus on semantic regions unrelated to the classification target, such as the background. This can render the attention mechanism ineffective or even have a negative impact on the model's classification ability, thus affecting the stability of the attention mechanism's contribution to improving model classification performance. Furthermore, manually annotating image semantic regions requires significant human resources, and the annotation boundaries are also limited by subjective human factors and the diversity of annotation tool boundaries.
[0024] An autonomous visual localization method that utilizes interpretability-based methods to generate interpretable saliency maps can achieve supervised guidance of the attention mechanism module with reasonable computational resources while avoiding the massive human cost of manual annotation. This steadily improves the model's fine-grained visual classification performance. The general idea is shown in Figure 1.
[0025] As shown in Figure 2, the present invention discloses a weakly supervised learning method based on interpretability saliency maps, which includes the following steps.
[0026] S1 extracts the depth features of the input image data using a convolutional neural network model.
[0027] A classification model based on a convolutional neural network can be roughly divided into two parts: (1) a feature extractor f(·), which extracts the depth features A of the input image; (2) a classifier g(·), which takes the extracted depth features A as input and outputs the class confidence.
[0028] The image depth features of the input image X are extracted using the feature extractor f(·).
[0029] $A = f(X) \in \mathbb{R}^{C \times H \times W}$,
[0030] Where C represents the number of channels in the convolutional neural network model, and H and W represent the spatial dimensions of the extracted depth features.
[0031] Using the extracted deep features A as input to the classification network g(·), we can obtain the classification confidence scores of the convolutional neural network model for different categories of the input image X:
[0032] $y = g(f(X)) = g(A) \in \mathbb{R}^{K}$,
[0033] Where K represents the total number of categories, and the classification confidence score of the convolutional neural network model for the input image X belonging to the k-th category is represented by $y_k$.
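The shapes involved in S1 can be illustrated with a minimal NumPy sketch. The feature extractor and classifier here are toy stand-ins, not the patent's network: the pooling factor `POOL`, the random parameters `W_f` and `W_g`, and the use of softmax to produce confidence scores are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

C, K = 8, 5   # number of feature channels and categories (toy values)
POOL = 4      # hypothetical spatial down-sampling factor

# Hypothetical stand-in parameters; a real model would learn these.
W_f = rng.normal(size=(C, 3))   # 1x1 "convolution": 3 colour channels -> C channels
W_g = rng.normal(size=(K, C))   # linear classifier on pooled features

def f(X):
    """Toy feature extractor: average-pool by POOL, then a 1x1 conv with ReLU.

    X has shape (3, H_in, W_in); the result A has shape (C, H, W).
    """
    c, h, w = X.shape
    pooled = X.reshape(c, h // POOL, POOL, w // POOL, POOL).mean(axis=(2, 4))
    return np.maximum(np.einsum("ci,ihw->chw", W_f, pooled), 0.0)

def g(A):
    """Toy classifier: global average pooling, linear layer, softmax (assumed).

    Returns y with shape (K,), one confidence score per category.
    """
    logits = W_g @ A.mean(axis=(1, 2))
    e = np.exp(logits - logits.max())
    return e / e.sum()

X = rng.random((3, 32, 32))   # input image
A = f(X)                      # depth features, shape (C, H, W)
y = g(A)                      # confidence scores, shape (K,)
print(A.shape, y.shape)
```

Splitting the model into `f` and `g` in this way is what lets the later steps read the intermediate features A and re-score modified inputs with the same classifier.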
[0034] S2, using the mask image generated from the image depth features, reweights the original input image to obtain the weighted image.
[0035] For the depth feature A extracted by the feature extractor f(·), let $A_i \in \mathbb{R}^{H \times W}$ denote the feature map of the i-th channel. Bilinear interpolation, which performs linear interpolation once along each of the two dimensions of a two-dimensional matrix, is used to upsample $A_i$ to a feature map $B_i$ with the same spatial dimensions as the original input image X. Performing a renormalization operation on $B_i$ to limit its value range to the interval [0, 1] yields the mask image generated from the feature map of the i-th channel:
[0036]
[0037] Because the mask image $S_i$ has the same spatial dimensions as the input image X, the weighted image can be obtained through the Hadamard product (the element-wise multiplication of vectors, matrices, or tensors of the same dimensions) of $S_i$ and X. This weighted image can be viewed as a semantic enhancement of the original input image X by the feature map $A_i$ of the deep features extracted by the convolutional neural network model.
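A minimal NumPy sketch of S2 follows. The text above does not reproduce the renormalization formula, so min-max scaling is assumed here; the corner-aligned sampling grid, the `eps` guard, and the function names are likewise illustrative choices rather than the patent's definitions.

```python
import numpy as np

def bilinear_upsample(a, out_h, out_w):
    """Bilinear interpolation of a 2-D array to (out_h, out_w), corner-aligned."""
    in_h, in_w = a.shape
    ys = np.linspace(0, in_h - 1, out_h)   # fractional source rows
    xs = np.linspace(0, in_w - 1, out_w)   # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Interpolate along the columns first, then along the rows.
    top = a[y0][:, x0] * (1 - wx) + a[y0][:, x1] * wx
    bot = a[y1][:, x0] * (1 - wx) + a[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def mask_from_channel(A_i, out_h, out_w, eps=1e-8):
    """Upsample A_i to B_i, then min-max scale to [0, 1] (assumed) to get S_i."""
    B_i = bilinear_upsample(A_i, out_h, out_w)
    return (B_i - B_i.min()) / (B_i.max() - B_i.min() + eps)

def weighted_image(X, S_i):
    """Hadamard product of the mask S_i with every colour channel of X."""
    return X * S_i[None, :, :]

A_i = np.array([[0.0, 1.0], [2.0, 3.0]])   # one 2x2 feature map
X = np.ones((3, 4, 4))                     # toy image, shape (3, H_in, W_in)
S_i = mask_from_channel(A_i, 4, 4)
X_weighted = weighted_image(X, S_i)
print(S_i.round(2))
```

Regions where the channel responds strongly keep their pixel values, while weakly responding regions are attenuated toward zero.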
[0038] S3, calculate and generate an interpretability saliency map to obtain channel weights and spatial weights.
[0039] When the semantically enhanced image is used again as input to the convolutional neural network model, a new set of confidence scores for the different categories is obtained:
[0040]
[0041] From these scores, the classification confidence score of the convolutional neural network model for the semantically enhanced image belonging to the k-th category is taken, where k is the category indicated by the class label. The change in this confidence score relative to the confidence score $y_k$ of the original input image X is:
[0042]
[0043] This change reflects the contribution of the feature map $A_i$ of the i-th channel of the depth feature A, extracted by the feature extractor f(·), to the model classifying the image into category k.
[0044] For all channels $i \in \{1, 2, \ldots, C\}$ of deep feature A, the classification contribution is calculated to obtain the vector $\beta \in \mathbb{R}^{C}$. Renormalization is then performed to obtain the feature channel weights of the interpretability saliency map:
[0045]
[0046]
[0047] This yields the channel weights used to generate the interpretability saliency map. Weighting the feature map of each channel with these channel weights gives the weights of the interpretability saliency map shown in Figure 1:
[0048]
[0049] These weights can be used for autonomous visual localization of semantic parts in images that have classification power. The interpretability saliency map is the visual representation of these weights.
[0050] Renormalizing the weights δ of the interpretability saliency map yields the feature space weights based on the interpretability saliency map:
[0051]
[0052]
[0053] The channel weights and spatial weights obtained by this autonomous visual localization method, which generates interpretable saliency maps based on interpretability methods, can be used in subsequent steps to supervise and guide the channel and spatial attention modules.
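The S3 computation can be sketched in NumPy as follows. The renormalization formulas are not reproduced in the text, so softmax is assumed for both the channel weights and the spatial weights; the names `beta_hat` and `delta_hat` for the normalized quantities, and the toy scoring function `toy_score`, are hypothetical.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over all elements of v (shape preserved)."""
    e = np.exp(v - v.max())
    return e / e.sum()

def channel_contributions(X, masks, score_fn, k):
    """beta_i = score_k(S_i * X) - score_k(X) for every channel mask S_i."""
    y_k = score_fn(X)[k]
    return np.array([score_fn(X * S_i[None, :, :])[k] - y_k for S_i in masks])

def saliency_weights(A, beta):
    """Return (beta_hat, delta, delta_hat) for features A of shape (C, H, W)."""
    beta_hat = softmax(beta)                    # channel weights (softmax assumed)
    delta = np.tensordot(beta_hat, A, axes=1)   # sum_i beta_hat_i * A_i, shape (H, W)
    delta_hat = softmax(delta)                  # spatial weights (softmax assumed)
    return beta_hat, delta, delta_hat

def toy_score(X):
    """Hypothetical scorer: class 0 looks at the left half, class 1 at the right."""
    return np.array([X[:, :, :2].mean(), X[:, :, 2:].mean()])

X = np.ones((1, 4, 4))
masks = np.zeros((2, 4, 4))
masks[0, :, :2] = 1.0   # mask of channel 0 keeps the left half
masks[1, :, 2:] = 1.0   # mask of channel 1 keeps the right half
A = np.arange(8, dtype=float).reshape(2, 2, 2)   # toy depth features

beta = channel_contributions(X, masks, toy_score, k=0)
beta_hat, delta, delta_hat = saliency_weights(A, beta)
print(beta, beta_hat.round(3))
```

In this toy case the mask that preserves the region class 0 depends on leaves the score unchanged, while the other mask lowers it, so channel 0 receives the larger weight.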
[0054] S4, construct a channel-spatial attention module to reweight the feature map: the channel attention mechanism module and the spatial attention mechanism module that implement the attention mechanism are two lightweight neural networks.
[0055] 1. Channel Attention Mechanism Module. The deep features A obtained from the feature extractor f(·) are used as input. After global averaging over each channel (i.e., averaging all elements of the feature map of each channel), the result is connected to a small neural network consisting of two fully connected layers, each with a number of neurons equal to the number of channels C, using ReLU as the activation function. The network output is then processed by a Softmax function to obtain the channel weights $\alpha = F_{CA}(A) \in \mathbb{R}^{C}$ generated by the channel attention mechanism module. The original depth feature A is weighted using the channel weights α to obtain the updated depth feature:
[0056] $A'_i = \alpha_i A_i$, i = 1, 2, ..., C.
[0057] 2. Spatial Attention Mechanism Module. The updated deep features A′ are used as input and connected to a small neural network consisting of a 3×3 convolutional kernel with a single channel. The network output is then processed by a Softmax function to obtain the spatial weights generated by the spatial attention mechanism module:
[0058] $\gamma = F_{SA}(A') \in \mathbb{R}^{H \times W}$,
[0059] The original depth feature A is weighted using the spatial weights γ to obtain the updated depth feature:
[0060]
[0061] The weighted depth feature $A''$ is used as the input to the classifier g(·) to calculate the classification confidence, thus combining the convolutional neural network model with the attention mechanism modules.
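The two attention modules of S4 can be sketched in NumPy as below. The parameters are random stand-ins for learned weights, ReLU is applied after the first fully connected layer only, the convolution uses zero padding, and γ is applied to the original features A as the text states; each of these details is an assumption where the text leaves the formula unspecified.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(v - v.max())
    return e / e.sum()

def channel_attention(A, W1, b1, W2, b2):
    """alpha = F_CA(A): global average pool, two FC layers of width C, softmax."""
    z = A.mean(axis=(1, 2))            # one average per channel, shape (C,)
    h = np.maximum(W1 @ z + b1, 0.0)   # first FC layer + ReLU
    return softmax(W2 @ h + b2)        # channel weights alpha, shape (C,)

def conv3x3_single(A, kernel):
    """3x3 cross-correlation with zero padding: C input channels -> one map."""
    C, H, W = A.shape
    P = np.pad(A, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((H, W))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("c,chw->hw", kernel[:, dy, dx], P[:, dy:dy + H, dx:dx + W])
    return out

def spatial_attention(A_prime, kernel):
    """gamma = F_SA(A'): 3x3 conv to one map, softmax over all H*W positions."""
    s = conv3x3_single(A_prime, kernel)
    return softmax(s.ravel()).reshape(s.shape)

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
A = rng.random((C, H, W))
W1, W2 = rng.normal(size=(C, C)), rng.normal(size=(C, C))   # stand-in weights
b1, b2 = np.zeros(C), np.zeros(C)
kernel = rng.normal(size=(C, 3, 3))                         # stand-in conv kernel

alpha = channel_attention(A, W1, b1, W2, b2)
A_prime = alpha[:, None, None] * A     # A'_i = alpha_i * A_i
gamma = spatial_attention(A_prime, kernel)
A_dprime = gamma[None, :, :] * A       # gamma applied to A, following the text
print(alpha.round(3), gamma.shape)
```

Because both α and γ come out of a softmax, each sums to one, which is what later allows them to be compared with the saliency-map weights as probability distributions.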
[0062] S5. The loss function is designed to provide weak supervision of the attention mechanism module through interpretability saliency maps, thereby improving the classification performance of the convolutional neural network model.
[0063] In the absence of annotations for semantically meaningful regions in the original image, an interpretability-based saliency map is generated to autonomously visually locate semantically meaningful regions with classification power in the image. The feature channel weights and feature space weights of the interpretability saliency map generated in this way are used to supervise the channel attention mechanism module and the spatial attention mechanism module, so that they focus on the channel feature maps and spatial regions with the greatest classification power in deep feature A.
[0064] The channel weights α generated by the channel attention mechanism module and the feature channel weights of the interpretable saliency map are both normalized one-dimensional vectors in $\mathbb{R}^{C}$, so they can be viewed as two one-dimensional probability distributions. Here, the symmetric Kullback-Leibler (KL) divergence is used to measure the difference between the two probability distributions and is added as a regularization term to the loss function during convolutional neural network model training:
[0065]
[0066] The specific form of the divergence of the discretized probability distribution can be written as:
[0067]
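For reference, the standard discrete KL divergence and one common symmetric form are given below. The symbol $\hat{\beta}$ for the normalized saliency channel weights and the plain sum of the two directions are assumptions; the formula intended here may differ, for example by a factor of 1/2.

```latex
\mathrm{KL}(p \,\|\, q) = \sum_{i=1}^{C} p_i \log \frac{p_i}{q_i},
\qquad
L_C = \mathrm{KL}(\alpha \,\|\, \hat{\beta}) + \mathrm{KL}(\hat{\beta} \,\|\, \alpha)
```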
[0068] Similarly, the divergence between the spatial weights γ generated by the spatial attention mechanism module and the feature space weights of the interpretability saliency map is computed and added as another regularization term to the loss function during convolutional neural network model training:
[0069]
[0070] Where the KL divergence included in $L_S$ measures the difference between two two-dimensional discretized probability distributions:
[0071]
[0072] Finally, the loss function for training the convolutional neural network model is updated as follows:
[0073] $L_{tot} = L_{cls} + L_C + L_S$,
[0074] Where $L_{cls}$ is the classification loss, obtained by comparing the classification confidence scores output by the convolutional neural network model with the class label of the input image X.
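A minimal NumPy sketch of the combined loss is shown below. The symmetric KL divergence is taken as the sum of the two directed divergences, and cross-entropy is used for the classification loss; both are assumptions, as are the names `beta_hat` and `delta_hat` for the normalized saliency-map weights and the `eps` guard against log(0).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL(p || q); arrays of any shape are flattened first."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def sym_kl(p, q):
    """Symmetric KL, assumed here to be KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

def cross_entropy(y, label, eps=1e-12):
    """Classification loss for confidence scores y and an integer class label."""
    return float(-np.log(y[label] + eps))

def total_loss(y, label, alpha, beta_hat, gamma, delta_hat):
    """L_tot = L_cls + L_C + L_S."""
    L_cls = cross_entropy(y, label)
    L_C = sym_kl(alpha, beta_hat)     # channel attention vs. saliency channel weights
    L_S = sym_kl(gamma, delta_hat)    # spatial attention vs. saliency spatial weights
    return L_cls + L_C + L_S

y = np.array([0.7, 0.2, 0.1])               # confidence scores for K = 3 categories
alpha = np.array([0.5, 0.3, 0.2])           # channel attention weights, C = 3
beta_hat = np.array([0.4, 0.4, 0.2])        # saliency channel weights
gamma = np.full((2, 2), 0.25)               # spatial attention weights, H = W = 2
delta_hat = np.array([[0.4, 0.1], [0.1, 0.4]])

print(total_loss(y, 0, alpha, beta_hat, gamma, delta_hat))
```

Both regularization terms vanish when the attention weights match the saliency-map weights exactly, so the loss only penalizes attention that drifts away from the saliency guidance.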
[0075] The above description is merely for illustrative purposes and not intended to limit the scope of the invention. Any person skilled in the art may make changes or modifications to the disclosed technical content to create equivalent embodiments. Those skilled in the art should understand that any modifications or equivalent substitutions that do not depart from the spirit and scope of the invention are covered within the scope of the claims of the invention.
Claims
1. A weakly supervised learning method based on an interpretability saliency map, characterized by comprising the following steps:
S1, extract the depth features A of the input image X using the feature extractor f(·) of the convolutional neural network model, where C represents the number of channels in the convolutional neural network model, and H and W represent the spatial dimensions of the extracted depth features; the depth feature A is used as the input to the classifier g(·) to obtain the classification confidence scores of the convolutional neural network model for the different categories of the input image X, where K represents the total number of categories, and the classification confidence score of the convolutional neural network model for the input image X belonging to the k-th category is represented by $y_k$;
S2, let $A_i$ denote the feature map of the i-th channel of the deep feature A; the feature map $A_i$ is upsampled using bilinear interpolation to obtain a feature map $B_i$ with the same spatial size as the input image X; a renormalization operation is performed on the feature map $B_i$ to limit its range of values to the interval [0, 1], yielding the mask image $S_i$ generated from the feature map of the i-th channel; because the mask image $S_i$ has the same spatial dimensions as the input image X, the weighted image is obtained as the Hadamard (element-wise) product of $S_i$ and X;
S3, the weighted image is used as input to the convolutional neural network model to obtain a new set of confidence scores for the different categories; the change in the classification confidence score of the convolutional neural network model for the weighted image belonging to the k-th category, compared to $y_k$, is recorded as the classification contribution of the i-th channel; the classification contribution is calculated for all channels of depth feature A to obtain a vector, which is renormalized to obtain the feature channel weights of the interpretability saliency map; the weights δ of the interpretability saliency map are obtained by weighting the feature maps of each channel with these channel weights; the feature space weights based on the interpretability saliency map are obtained by renormalizing the weights δ;
S4, take the deep feature A as input, perform a global average over each channel, and connect the result to a two-layer neural network with ReLU as the activation function, consisting of fully connected layers whose number of neurons equals the number of channels C; the output is processed by a softmax function to obtain the channel weights α generated by the channel attention mechanism module; the deep feature A is weighted using the channel weights α to obtain the updated deep feature A′; the updated depth feature A′ is then used as input and connected to a 3×3 convolutional kernel with 1 channel, and the output is processed by a softmax function to obtain the spatial weights γ generated by the spatial attention mechanism module; the deep feature A is weighted using the spatial weights γ to obtain the updated deep feature A″; the deep feature A″ is used as the input of the classifier g(·) to calculate the classification confidence, realizing the combination of the convolutional neural network model with the attention mechanism modules;
S5, treat the channel weights α generated by the channel attention mechanism module and the feature channel weights of the interpretable saliency map as two one-dimensional probability distributions, measure the difference between them using a symmetric KL divergence, and add this divergence as a regularization term $L_C$ to the loss function during convolutional neural network model training; compute the divergence between the spatial weights γ generated by the spatial attention mechanism module and the feature space weights of the interpretability saliency map, and add it as another regularization term $L_S$ to the loss function during the training of the convolutional neural network model, where the KL divergence included in $L_S$ measures the difference between two two-dimensional discretized probability distributions; finally, the loss function of the convolutional neural network model training is updated to $L_{tot} = L_{cls} + L_C + L_S$, where the classification loss $L_{cls}$ is calculated by comparing the classification confidence scores output by the convolutional neural network model with the class label of the input image X; weak supervision of the attention mechanism module is achieved through interpretable saliency maps.