A fine-grained image classification method based on multi-scale feature fusion and attention mechanism

This fine-grained image classification method, which incorporates multi-scale feature fusion and attention mechanisms, addresses the problem of insufficient utilization of multi-scale information in traditional models, achieving high accuracy and good interpretability, and improving classification performance.

CN122244500A · Pending · Publication Date: 2026-06-19 · JIANGSU UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGSU UNIV
Filing Date
2026-02-06
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional models struggle to capture multi-scale information and local features in fine-grained image classification, resulting in insufficient classification accuracy, and interpretability enhancement techniques may affect performance.

Method used

We employ a multi-scale feature fusion and attention mechanism, extracting multi-scale contextual information through Shuffle Attention and deformable receptive field convolution, and combining channel shuffling and max pooling operations to enhance feature representation capabilities.

Benefits of technology

It improves the performance of fine-grained image classification, adaptively focuses on key regions of the image, reduces background interference, and maintains high accuracy and good interpretability.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention proposes a fine-grained image classification method based on multi-scale feature fusion and attention mechanisms, belonging to the fields of computer vision and deep learning. The method designs a novel fine-grained image classification model that employs multi-scale feature fusion and attention mechanisms. First, Shuffle Attention is introduced into the convolutional neural network InceptionNeXt to enhance the model's cross-channel information sharing and feature representation capabilities. Second, a deformable multi-scale context extraction unit is designed: the unit receives the feature maps output by the previous stage and internally consists of K parallel deformable convolutional branches, each using a different dilation rate to capture contextual information. Finally, the outputs of all branches are fused.

Description

Technical Field

[0001] This invention relates to the field of image classification, specifically to a fine-grained image classification method based on multi-scale feature fusion and attention mechanisms.

Background Technology

[0002] Fine-grained image classification (FGIC) is an important topic in computer vision, exhibiting a higher sensitivity to subtle differences between categories compared to general image classification tasks. General image classification typically deals with coarse-grained categories where differences are readily apparent, such as distinguishing basic categories of animals or objects. Fine-grained image classification, however, requires models to identify and differentiate subtle differences within categories, often those that are visually very similar and distinguishable only by details and features. Categories in fine-grained classification tasks are frequently very similar in appearance, with differences usually lying in local features or details, such as bird species, car brands, or plant subspecies. Traditional classification methods may struggle to handle these subtle differences. Fine-grained image classification requires models to capture detailed features in images, such as subtle variations in texture, shape, and color, which typically necessitates sophisticated feature extraction methods and higher model accuracy. Fine-grained image classification often involves complex background information, requiring models to ignore this background noise and focus on the details of objects.

[0003] Many fine-grained classification tasks involve categories of objects with very subtle differences in appearance; these objects' shape, texture, or color features vary at different scales. Traditional models often only capture information at a single scale and struggle to handle objects of varying sizes or detailed features. For example, in fine-grained classification of birds or plants, subtle differences in bird feathers or the texture of plant leaves may present different visual features at different scales, and traditional models may fail to fully understand these details. In fine-grained classification, certain local regions of an image are more important than others, but many traditional convolutional neural networks pay equal attention to all regions when processing images, lacking the ability to dynamically weight regions based on their importance. This makes the model prone to overlooking small but crucial features, leading to reduced classification accuracy.

[0004] Some prototype-based interpretable models suffer from insufficient accuracy in fine-grained classification tasks. Because these models typically rely on fixed prototypes to represent each object class, they may fail to capture enough detail when dealing with complex and subtly different objects, thus affecting the model's accuracy. Interpretable models usually provide a transparent classification process and decision-making basis, but these models often sacrifice some classification performance to improve interpretability. Especially in fine-grained classification tasks, balancing high accuracy with good interpretability is a challenge. Traditional interpretability enhancement techniques may lead to performance degradation, making it difficult to meet the demands of efficient classification in real-world applications.

Summary of the Invention

[0005] To address the aforementioned problems of insufficient local feature capture capability and inadequate utilization of multi-scale information, this invention provides a fine-grained image classification method based on multi-scale feature fusion and attention mechanism.

[0006] To achieve the above objectives, the present invention adopts the following technical solution. A fine-grained image classification method based on multi-scale feature fusion and attention mechanism includes the following steps:
S1: The input image is processed by a convolutional neural network to extract features, yielding multi-level feature maps.
S2: The feature maps obtained in S1 are input into the feature fusion module. Proceeding from the highest level to the lowest, the Shuffle Attention mechanism weights the features of each level's feature map to obtain attention-enhanced features.
S3: At each level, a set of deformable receptive field convolutions extracts multi-scale contextual information; the features processed at the current level are then upsampled and concatenated with the features of the next level, forming the new input for that level.
S4: The feature maps obtained in S3 are fused and max pooling is applied to obtain an image encoding p, which represents the presence scores of all prototypes in the image. p is the input to a linear classification layer with weights ω_c, which links the prototypes to categories and acts as a scoring system. By accumulating the scores of the relevant prototype parts present in the image, a score is output for each category and used for the classification decision.

[0007] Furthermore, S1 specifically includes:
S11: The feature map X extracted by the convolutional layer is taken as input, and the parameters of an ImageNet-pretrained InceptionNeXt are used to initialize the SAInceptionNeXt feature extraction layers, which then extract features on the CUB200-2011 and Stanford Cars datasets.
S12: The model consists mainly of 4 stages, each of which outputs features at a different scale; the 4 stages contain 3, 3, 9, and 3 blocks, respectively.
S13: The SAInceptionNeXt backbone network reduces the spatial dimensions and increases the channel dimension, outputting a set of multi-level feature maps.

[0008] Furthermore, the attention mechanism module in S2 includes the following. At the beginning of each attention unit, the input X_k is divided along the channel dimension into two branches, X_k0 and X_k1. One branch generates a channel attention map from the relationships between channels, while the other generates a spatial attention map from the spatial relationships between features, so that the model learns both "what to focus on" and "where to focus".

[0009] X_k0 and X_k1 are expressed as:

[0010] Where C, H, and W represent the number of channels, spatial height, and width, respectively.

[0011] The channel attention branch takes X_k0 as input. Global average pooling (GAP) gathers the global information of each group and produces a channel statistics vector s. A linear transformation of this global information gives the importance of each channel. The channel attention is then obtained with the sigmoid activation function and applied to X_k0 to reweight the channel features. The channel statistic s is calculated as:

$$s = F_{gp}(X_{k0}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k0}(i, j)$$

[0012] where X_k0 denotes the feature map of branch k0, with spatial dimensions H×W; F_gp denotes the global average pooling function; and the factor 1/(H×W) normalizes the summation, dividing the sum by H×W to obtain the average value.

[0013] The final output of the channel attention is calculated as follows:

[0014] Here, the two learned parameters are used to shift and scale s.

[0015] The spatial attention branch takes X_k1 as input. Spatial statistics are computed with group normalization and adjusted with learned weights and biases. The spatial attention is then obtained with the sigmoid activation function and applied to X_k1 to reweight the spatial features. Feature aggregation and channel rearrangement then combine the information extracted from the different feature groups: all sub-features are aggregated to represent the input more comprehensively, and a channel shuffle operator lets information flow across groups along the channel dimension, enhancing the expressive power of the features.

[0016] The final output of the spatial attention is calculated as follows:

[0017] Here, the weight and bias are learned parameters.

[0018] Furthermore, the multi-scale feature fusion module in S3 specifically includes: Based on the output features, multi-scale context is extracted through a set of parallel convolutional branches.

[0019] Specifically, for a set of dilation rates containing K different basic receptive fields..., K context features are generated accordingly, where K=3.

[0020] A lightweight convolutional layer with shared parameters is used to predict offsets based on input features. Deformable convolution operations are then performed on the features using the offsets and their corresponding dilation rates. The outputs of K branches are then fused, where K=3, to form the comprehensive contextual features of the current layer.

[0021] The deep semantic context is fused with the shallow detailed features in an iterative manner, starting from the deepest layer and progressing to the shallowest. The final output is a feature map.

[0022] Furthermore, S4 specifically includes:
S41: Max pooling is applied to the D feature maps obtained in S3 to obtain a tensor p that represents the presence scores of all D prototypes in the image. For example, an image encoding p = [0.9, 0.0, 0.0, 0.1, 0.8, 1.0] means that the first, fifth, and sixth prototypes are present in the image.
S42: The image encoding p obtained in S41 is the input to a linear classification layer with weights ω_c, which connects the prototypes to the categories and serves as a scoring system. The output score for each category is the sum of the products of the prototype presence scores and that category's weights in the linear layer.
S43: By accumulating the scores of the relevant prototype parts present in the image, a score is output for each category and used for the classification decision.
S44: Alignment Loss (L_A), Tanh Loss (L_T), and Cross Entropy Loss (L_C) are used. In training, the weighted sum of these three losses serves as the total loss function. The overall loss function is expressed as:

[0023] where L_A is the alignment loss, L_T is the Tanh loss, and L_C is the cross-entropy loss.
S45: Accuracy, Precision, Recall, and F1-score are used as evaluation metrics to measure the model's predictive ability.
S46: The Adam optimizer is used to update the weights during training.

[0024] The present invention has at least the following beneficial effects: This invention employs a fine-grained image classification method based on multi-scale feature fusion and channel shuffling, which can extract local image features and use multi-scale feature information for inference.

[0025] This invention employs a fine-grained image classification model that combines ShuffleAttention channel shuffling to process image features, adaptively focusing on key regions in the image, capturing local image details, and reducing background interference, thereby improving classification performance.

[0026] This invention employs multi-scale feature fusion to integrate feature maps from deep and shallow layers of the neural network, making full use of features at different depths to enhance the model's feature representation capabilities while reducing computational load and maintaining excellent performance.

Attached Figure Description

[0027] Figure 1 is a flowchart of the overall model of the fine-grained image classification method based on multi-scale feature fusion and attention mechanism provided by the present invention. Figure 2 is a schematic diagram of the Shuffle Attention module provided by the present invention. Figure 3 is a schematic diagram of the multi-scale feature fusion module provided by the present invention.

Detailed Implementation

[0028] The present invention will be further described below with reference to the accompanying drawings and specific embodiments. It should be noted that the technical solution and design principle of the present invention will be described in detail below with reference to only one optimized technical solution, but the protection scope of the present invention is not limited thereto.

[0029] The embodiments described above are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments. Any obvious improvements, substitutions or modifications that can be made by those skilled in the art without departing from the essence of the present invention shall fall within the protection scope of the present invention.

[0030] As shown in Figure 1, this invention provides a fine-grained image classification method based on multi-scale feature fusion and attention mechanism, comprising the following steps:
S1: The input image is processed by a convolutional neural network to extract features, yielding multi-level feature maps.
S2: The feature maps obtained in S1 are input into the feature fusion module. Proceeding from the highest level to the lowest, the Shuffle Attention mechanism weights the features of each level's feature map to obtain attention-enhanced features.
S3: At each level, a set of deformable receptive field convolutions extracts multi-scale contextual information; the features processed at the current level are then upsampled and concatenated with the features of the next level, forming the new input for that level.
S4: The feature maps obtained in S3 are fused and max pooling is applied to obtain an image encoding p, which represents the presence scores of all prototypes in the image. p is the input to a linear classification layer with weights ω_c, which links the prototypes to categories and acts as a scoring system. By accumulating the scores of the relevant prototype parts present in the image, a score is output for each category and used for the classification decision.
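The scoring in step S4 can be illustrated with a minimal sketch. The function names and the weight matrix below are hypothetical and chosen only for illustration; the sketch assumes each prototype has one 2-D activation map and that the linear layer has no bias term.

```python
def prototype_presence(feature_maps):
    """Global max pooling: one presence score per prototype map (list of 2-D lists)."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


def class_scores(p, weights):
    """Linear layer without bias: weights[c][d] links prototype d to category c."""
    return [sum(w * score for w, score in zip(row, p)) for row in weights]


def predict(p, weights):
    """Return the index of the highest-scoring category."""
    scores = class_scores(p, weights)
    return max(range(len(scores)), key=scores.__getitem__)


# Example encoding with six prototypes: the first, fifth, and sixth are present.
p = [0.9, 0.0, 0.0, 0.1, 0.8, 1.0]
# Hypothetical weights for two categories (illustrative values only).
weights = [
    [1.0, 0.0, 0.0, 0.0, 1.0, 0.0],  # category 0 relies on prototypes 1 and 5
    [0.0, 1.0, 1.0, 0.0, 0.0, 1.0],  # category 1 relies on prototypes 2, 3, and 6
]
print(class_scores(p, weights), predict(p, weights))
```

Because each category score is a sum of presence score times weight, the contribution of every prototype to the decision can be read off directly, which is what makes this scoring scheme interpretable.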

[0031] As shown in Figure 1, in a preferred embodiment of the present invention, step S1 specifically includes:
S11: The feature map X extracted by the convolutional layer is taken as input, and the parameters of an ImageNet-pretrained InceptionNeXt are used to initialize the SAInceptionNeXt feature extraction layers, which then extract features on the CUB200-2011 and Stanford Cars datasets.
S12: The model consists mainly of 4 stages, each of which outputs features at a different scale; the 4 stages contain 3, 3, 9, and 3 blocks, respectively.
S13: The SAInceptionNeXt backbone network reduces the spatial dimensions and increases the channel dimension, outputting a set of multi-level feature maps.
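To make "spatial reduction with channel increase" concrete, the sketch below lists the feature-map shapes a four-stage backbone would produce. Only the block counts (3, 3, 9, 3) come from the text; the channel widths, strides, and input size are assumptions typical of small ConvNeXt-style backbones and are not stated in the source.

```python
BLOCKS_PER_STAGE = (3, 3, 9, 3)        # from the description (S12)
ASSUMED_WIDTHS = (96, 192, 384, 768)   # assumption: channel width per stage
ASSUMED_STRIDES = (4, 8, 16, 32)       # assumption: total downsampling per stage


def stage_shapes(image_size=224, widths=ASSUMED_WIDTHS, strides=ASSUMED_STRIDES):
    """Return (channels, height, width) of the feature map output by each stage."""
    return [(c, image_size // s, image_size // s) for c, s in zip(widths, strides)]


for stage, (blocks, shape) in enumerate(zip(BLOCKS_PER_STAGE, stage_shapes()), start=1):
    print(f"stage {stage}: {blocks} blocks, output shape {shape}")
```

Under these assumptions each stage halves the spatial resolution and doubles the channel count, giving the multi-level feature maps that the later fusion steps consume.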

[0032] As shown in Figure 2, in a preferred embodiment of the present invention, the attention module in step S2 specifically includes the following. At the beginning of each attention unit, the input X_k is divided along the channel dimension into two branches, X_k0 and X_k1. One branch generates a channel attention map from the relationships between channels, while the other generates a spatial attention map from the spatial relationships between features, so that the model learns both "what to focus on" and "where to focus".

[0033] X_k0 and X_k1 are expressed as:

[0034] Where C, H, and W represent the number of channels, spatial height, and width, respectively.

[0035] The channel attention branch takes X_k0 as input. Global average pooling (GAP) gathers the global information of each group and produces a channel statistics vector s. A linear transformation of this global information gives the importance of each channel. The channel attention is then obtained with the sigmoid activation function and applied to X_k0 to reweight the channel features. The channel statistic s is calculated as:

$$s = F_{gp}(X_{k0}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k0}(i, j)$$

[0036] where X_k0 denotes the feature map of branch k0, with spatial dimensions H×W; F_gp denotes the global average pooling function; and the factor 1/(H×W) normalizes the summation, dividing the sum by H×W to obtain the average value.

[0037] The final output of the channel attention is calculated as follows:

[0038] Here, the two learned parameters are used to shift and scale s.

[0039] The spatial attention branch takes X_k1 as input. Spatial statistics are computed with group normalization and adjusted with learned weights and biases. The spatial attention is then obtained with the sigmoid activation function and applied to X_k1 to reweight the spatial features. Feature aggregation and channel rearrangement then combine the information extracted from the different feature groups: all sub-features are aggregated to represent the input more comprehensively, and a channel shuffle operator lets information flow across groups along the channel dimension, enhancing the expressive power of the features.
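The two branches and the channel shuffle described above can be sketched in NumPy for a single image. This is an illustrative sketch, not the patented implementation: the parameter names w1, b1, w2, b2, the per-channel normalization statistics used for group normalization, and the choice of two shuffle groups are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_shuffle(x, groups):
    """Interleave channels across groups so information can flow between them."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)


def shuffle_attention(x, groups, w1, b1, w2, b2, eps=1e-5):
    """x: (C, H, W). w1, b1, w2, b2: arrays of shape (C // (2 * groups), 1, 1)."""
    c, h, w = x.shape
    half = c // (2 * groups)
    outputs = []
    for xk in x.reshape(groups, 2 * half, h, w):
        xk0, xk1 = xk[:half], xk[half:]            # split along the channel dimension
        # Channel branch: global average pooling, scale and shift, sigmoid gate.
        s = xk0.mean(axis=(1, 2), keepdims=True)
        xk0 = sigmoid(w1 * s + b1) * xk0
        # Spatial branch: normalize each channel over H x W (assumed GN grouping),
        # apply the learned weight and bias, then gate with a sigmoid.
        mu = xk1.mean(axis=(1, 2), keepdims=True)
        var = xk1.var(axis=(1, 2), keepdims=True)
        gn = (xk1 - mu) / np.sqrt(var + eps)
        xk1 = sigmoid(w2 * gn + b2) * xk1
        outputs.append(np.concatenate([xk0, xk1], axis=0))
    # Aggregate all sub-features, then shuffle channels (two groups assumed).
    return channel_shuffle(np.concatenate(outputs, axis=0), 2)
```

The same parameters are reused for every group here, which keeps the unit lightweight; whether the patented model shares them in this way is not stated in the text.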

[0040] The final output of the spatial attention is calculated as follows:

[0041] Here, the weight and bias are learned parameters.
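For reference, the attention equations can be written out under the assumption that the unit follows the commonly published Shuffle Attention formulation, which matches the prose description of both branches. The symbols G (number of feature groups), W_1, b_1, W_2, b_2, and the primes on the outputs are assumed names; they are not confirmed by this text.

```latex
% Sub-features after splitting one group along the channel dimension
X_{k0},\; X_{k1} \in \mathbb{R}^{\frac{C}{2G} \times H \times W}

% Channel attention: scale and shift the pooled statistic s, then gate X_{k0}
X'_{k0} = \sigma\!\left(W_1 s + b_1\right) \cdot X_{k0}

% Spatial attention: group normalization, learned weight and bias, then gate X_{k1}
X'_{k1} = \sigma\!\left(W_2 \cdot \mathrm{GN}(X_{k1}) + b_2\right) \cdot X_{k1}
```

Here σ is the sigmoid function and GN denotes group normalization; the two gated sub-features are concatenated before the channel shuffle.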

[0042] As shown in Figure 3, in a preferred embodiment of the present invention, the multi-scale feature fusion in step S3 specifically includes the following. A novel mechanism for iterative cross-layer feature fusion in the middle of the network is designed, abandoning the traditional multi-scale fusion mode that post-processes a single deep feature map. For the output features, multi-scale context is extracted through a set of parallel convolutional branches.

[0043] Specifically, for a set of dilation rates D = {r_1, r_2, ..., r_K} containing K different basic receptive fields, K contextual features are generated accordingly.

[0044] A lightweight convolutional layer with shared parameters is used to predict offsets based on input features. Deformable convolution operations are then performed on the features using the offsets and their corresponding dilation rates. The outputs of K branches are then fused, where K=3, to form the comprehensive contextual features of the current layer.
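How a dilation rate and a predicted offset combine in one deformable branch can be shown with a simplified 1-D sketch. It is an illustration only: real deformable convolution operates on 2-D feature maps with bilinear sampling, the offsets here are supplied by hand rather than predicted by a convolution, and fusing the branches by summation is an assumption, since the text only says the outputs are fused.

```python
import math


def sample_linear(signal, pos):
    """Linearly interpolate a 1-D signal at a fractional position (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_dilated_conv1d(signal, weights, dilation, offsets):
    """Tap k of output i samples position i + (k - half) * dilation + offsets[i][k]."""
    half = len(weights) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            pos = i + (k - half) * dilation + offsets[i][k]
            acc += w * sample_linear(signal, pos)
        out.append(acc)
    return out


def multi_scale_context(signal, weights, dilations, offsets):
    """Run K branches with different dilation rates and fuse them by summation."""
    branches = [deformable_dilated_conv1d(signal, weights, r, offsets) for r in dilations]
    return [sum(values) for values in zip(*branches)]


signal = [1.0, 2.0, 4.0, 8.0]
zero_offsets = [[0.0, 0.0, 0.0] for _ in signal]
print(multi_scale_context(signal, [0.25, 0.5, 0.25], [1, 2, 3], zero_offsets))
```

With zero offsets each branch reduces to an ordinary dilated convolution; non-zero offsets let the sampling points drift toward informative positions, which is the adaptive behaviour the description attributes to the deformable branches.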

[0045] The deep semantic context is fused with the shallow detailed features in an iterative manner, starting from the deepest layer and progressing to the shallowest. The final output is a feature map.
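The deep-to-shallow iteration can be sketched as a loop over the multi-level feature maps. This is a minimal sketch under stated assumptions: nearest-neighbour upsampling is assumed, and `process` is a hypothetical placeholder for the per-level attention and multi-scale context extraction, defaulting to the identity so that only the data flow is shown.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def top_down_fuse(features, process=lambda x: x):
    """features: list of (C, H, W) arrays ordered from shallowest to deepest."""
    current = features[-1]                       # start at the deepest level
    for shallower in reversed(features[:-1]):
        current = process(current)               # attention + context at this level
        factor = shallower.shape[1] // current.shape[1]
        upsampled = upsample_nearest(current, factor)
        # Concatenate along channels to form the new input of the shallower level.
        current = np.concatenate([upsampled, shallower], axis=0)
    return process(current)                      # final fused feature map


levels = [np.zeros((2, 8, 8)), np.zeros((4, 4, 4)), np.zeros((8, 2, 2))]
print(top_down_fuse(levels).shape)
```

The channel count grows at each concatenation in this sketch; a real model would typically follow each concatenation with a projection to a fixed width, which the text does not specify.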

[0046] In a preferred embodiment of the present invention, step S4 specifically includes:
S41: Max pooling is applied to the D feature maps obtained in S3 to obtain a tensor p that represents the presence scores of all D prototypes in the image. For example, an image encoding p = [0.9, 0.0, 0.0, 0.1, 0.8, 1.0] means that the first, fifth, and sixth prototypes are present in the image.
S42: The image encoding p obtained in S41 is the input to a linear classification layer with weights ω_c, which connects the prototypes to the categories and serves as a scoring system. The output score for each category is the sum of the products of the prototype presence scores and that category's weights in the linear layer.
S43: By accumulating the scores of the relevant prototype parts present in the image, a score is output for each category and used for the classification decision.
S44: Alignment Loss (L_A), Tanh Loss (L_T), and Cross Entropy Loss (L_C) are used. In training, the weighted sum of these three losses serves as the total loss function. The alignment loss (L_A) is expressed as follows:

[0047] where the two compared quantities are two views of the same image patch.

[0048] The Tanh loss (L_T) is expressed as follows:

[0049] The overall loss function is expressed as:

[0050] Where LA is the alignment loss, LT is the Tanh loss, and LC is the cross-entropy loss.
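One formulation consistent with this description is given below as an assumption, not as the patent's own equations. The weighted sum follows directly from the text; the weights λ_A, λ_T, λ_C are assumed names. The forms of L_A and L_T are borrowed from prototype-based models that use the same loss names, with z' and z'' denoting the patch encodings of two views, p_{b,d} the presence score of prototype d for image b in a batch of size B, and ε a small constant.

```latex
% Total loss: weighted sum of the three terms
L = \lambda_A L_A + \lambda_T L_T + \lambda_C L_C

% Assumed alignment loss: two views of the same patch should activate the same prototypes
L_A = -\frac{1}{HW} \sum_{(h,w)} \log\left( z'_{h,w} \cdot z''_{h,w} \right)

% Assumed Tanh loss: every prototype should be present at least once in a batch
L_T = -\frac{1}{D} \sum_{d=1}^{D} \log\left( \tanh\left( \sum_{b=1}^{B} p_{b,d} \right) + \epsilon \right)
```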

[0051] S45 uses the evaluation metrics Accuracy, Recall, Precision, and F1-score to measure the model's predictive ability. The formulas for calculating these metrics are as follows:

[0052]

[0053]

[0054]
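For reference, the four metrics have standard definitions in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below uses these textbook definitions for a single class; how the patent averages them over classes is not stated, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

For a multi-class task such as CUB200-2011, these per-class values are usually combined by macro or weighted averaging.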

[0055] S46 employs the Adam optimizer to update weights during training.

[0056] The Adam optimizer algorithm mainly includes the following three steps: 1) First, calculate the first-order moment estimate and the second-order moment estimate. 2) Since the first-order and second-order moment estimates may be strongly biased in the early stages of training, Adam applies bias correction to both estimates so that they are more stable at the start of training.

[0057] 3) Update each parameter using the corrected estimates. The parameter update is expressed as follows:

$$\theta_{t} = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

[0058] where θ is the parameter, α is the learning rate, \hat{m}_t is the corrected first-order moment estimate, \hat{v}_t is the corrected second-order moment estimate, and ε is a very small constant used to prevent division by zero.
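The three steps can be sketched for a single scalar parameter. The default hyperparameters below (decay rates 0.9 and 0.999, ε = 1e-8) are the values commonly used with Adam and are assumptions here; the text does not state which values were used in training.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # 1) Update the first-order and second-order moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # 2) Bias correction, which matters most during the first steps.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # 3) Parameter update using the corrected estimates.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
print(theta)
```

On the first step the corrected estimates equal the gradient and its square, so the parameter moves by almost exactly the learning rate regardless of the gradient's magnitude.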

Claims

1. A fine-grained image classification method based on multi-scale feature fusion and attention mechanism, characterized in that it includes the following steps: S1, the input image is processed by a convolutional neural network to extract features, yielding multi-level feature maps; S2, the feature maps obtained in S1 are input into the feature fusion module and, proceeding from the highest level to the lowest, the Shuffle Attention mechanism weights the features of each level's feature map to obtain attention-enhanced features; S3, at each level, a set of deformable receptive field convolutions extracts multi-scale contextual information, and the features processed at the current level are upsampled and concatenated with the features of the next level, forming the new input for that level; S4, the feature maps obtained in S3 are fused and max pooling is applied to obtain an image encoding p representing the presence scores of all prototypes in the image; p is the input to a linear classification layer with weights ω_c, which links the prototypes to categories and acts as a scoring system; by accumulating the scores of the relevant prototype parts present in the image, a score is output for each category and used for the classification decision. S1 specifically includes: S11, the feature map X extracted by the convolutional layer is taken as input, and the parameters of an ImageNet-pretrained InceptionNeXt are used to initialize the SAInceptionNeXt feature extraction layers, which then extract features on the CUB200-2011 and Stanford Cars datasets; S12, the model consists mainly of 4 stages, each of which outputs features at a different scale, and the 4 stages contain 3, 3, 9, and 3 blocks, respectively; S13, the SAInceptionNeXt backbone network reduces the spatial dimensions and increases the channel dimension, outputting a set of multi-level feature maps.

2. The fine-grained image classification method based on multi-scale feature fusion and attention mechanism according to claim 1, characterized in that the attention mechanism module in S2 specifically includes: at the beginning of each attention unit, the input X_k is divided along the channel dimension into two branches, X_k0 and X_k1; one branch generates a channel attention map from the interrelationships between channels, while the other generates a spatial attention map from the spatial relationships between features, so that the model learns both "what to focus on" and "where to focus". X_k0 and X_k1 are expressed as: ; where C, H, and W represent the number of channels, the spatial height, and the width, respectively. The channel attention branch takes X_k0 as input; global average pooling (GAP) gathers the global information of each group and produces a channel statistics vector s; a linear transformation of this global information gives the importance of each channel; the channel attention is obtained with the sigmoid activation function and applied to X_k0 to reweight the channel features. The channel statistic s is calculated as: $s = F_{gp}(X_{k0}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k0}(i, j)$; where X_k0 denotes the feature map of branch k0, with spatial dimensions H×W; F_gp denotes the global average pooling function; and the factor 1/(H×W) normalizes the summation, that is, the sum is divided by H×W to obtain the average value. The final output of the channel attention is calculated as follows: ; where the two learned parameters are used to shift and scale s. The spatial attention branch takes X_k1 as input; spatial statistics are computed with group normalization and adjusted with learned weights and biases; the spatial attention is obtained with the sigmoid activation function and applied to X_k1 to reweight the spatial features. Feature aggregation and channel rearrangement combine the information extracted from the different feature groups: all sub-features are aggregated to represent the input more comprehensively, and a channel shuffle operator lets information flow across groups along the channel dimension, enhancing the expressive power of the features. The final output of the spatial attention is calculated as follows: ; where the weight and bias are learned parameters.

3. The fine-grained image classification method based on multi-scale feature fusion and attention mechanism according to claim 1, characterized in that the multi-scale feature fusion module in S3 includes: for the output features, multi-scale context is extracted through a set of parallel convolutional branches; specifically, for a set of dilation rates containing K different basic receptive fields..., K context features are generated accordingly; a lightweight convolutional layer with shared parameters predicts the offsets from the input features; the offsets and the corresponding dilation rates are used to perform deformable convolution operations on the features; the outputs of the K branches are fused, where K=3, to form the comprehensive context features of the current layer; the deep semantic context is fused with the shallow detailed features iteratively, starting from the deepest layer and progressing to the shallowest layer; the final output is a feature map.

4. The fine-grained image classification method based on multi-scale feature fusion and attention mechanism according to claim 1, characterized in that S4 specifically includes: S41, max pooling is applied to the D feature maps obtained in S3 to obtain a tensor p that represents the presence scores of all D prototypes in the image; for example, an image encoding p = [0.9, 0.0, 0.0, 0.1, 0.8, 1.0] means that the first, fifth, and sixth prototypes are present in the image; S42, the image encoding p obtained in S41 is the input to a linear classification layer with weights ω_c, which connects the prototypes to the categories and serves as a scoring system; the output score for each category is the sum of the products of the prototype presence scores and that category's weights in the linear layer; S43, by accumulating the scores of the relevant prototype parts present in the image, a score is output for each category and used for the classification decision; S44, Alignment Loss (L_A), Tanh Loss (L_T), and Cross Entropy Loss (L_C) are used, and in training the weighted sum of these three losses serves as the total loss function; the overall loss function is expressed as: ; where L_A is the alignment loss, L_T is the Tanh loss, and L_C is the cross-entropy loss; S45, Accuracy, Precision, Recall, and F1-score are used as evaluation metrics to measure the model's predictive ability; S46, the Adam optimizer is used to update the weights during training.