A component-aware, fine-grained classification method and system for small samples

By performing structure perception and segmentation on the target instance bounding box, generating a mask image and fusing global and local features, and combining a meta-learning training paradigm and loss optimization, the problems of difficulty in capturing local features and insufficient generalization with small samples in fine-grained classification are solved, achieving stable and efficient fine-grained classification.

CN122090182B · Active · Publication Date: 2026-06-30 · CENT SOUTH UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CENT SOUTH UNIV
Filing Date: 2026-04-23
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, local discriminative features are difficult to capture effectively during fine-grained classification, and the model's generalization ability is insufficient under small sample conditions, making it prone to overfitting.

Method used

A component-aware, small-sample, fine-grained classification method is proposed. This method generates a mask image by performing structure perception and segmentation on the target instance bounding box, extracts and fuses global and local features, adopts a support set-query set meta-learning training paradigm, and optimizes the feature extraction network using angle anisotropy discrimination loss and supervised contrastive loss.

Benefits of technology

It significantly enhances the ability to distinguish fine-grained categories that are highly similar in appearance but have different component arrangements, and has good generalization performance and scalability, enabling it to obtain stable and high-confidence classification results under small sample conditions.



Abstract

This invention discloses a small-sample fine-grained classification method and system based on component perception, relating to the fields of computer vision and intelligent perception technology. The method includes training and inference phases. In the training phase: support sets and query sets are acquired for each sampled category; structural perception and segmentation are performed on the target instance bounding box to generate a key component mask image; global features are extracted and local features are extracted based on the mask image, and fused to obtain target fusion features; feature prototypes are obtained based on the target fusion features in the support set; loss is calculated based on query set features and feature prototypes, and network parameters are updated. In the inference phase: using the trained network, target fusion features of labeled samples of candidate categories are extracted in the same manner, and the average is taken as the feature prototype; target fusion features of the samples to be classified are extracted, and the classification result is output after matching. This invention can effectively capture local discriminative features and improve the generalization ability of small-sample fine-grained classification.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and intelligent perception technology, and particularly relates to a small-sample fine-grained classification method and system based on component perception.

Background Technology

[0002] Fine-grained visual recognition aims to distinguish objects that are highly similar in appearance but have significant semantic differences (such as bird subspecies, aircraft models, insulator cracks, etc.), and has important application value in fields such as ecological protection, industrial quality inspection, and intelligent security. However, this type of task faces two core challenges:

[0003] First, local discriminative features are difficult to capture. Differences between fine-grained categories often exist only in local parts (such as beak shape, wing texture, or tiny cracks in insulators), while the overall appearance is highly similar. Existing methods mostly employ global feature aggregation strategies, so subtle local discriminative information is easily obscured by background, pose changes, or the overall similarity of the targets. Some studies have attempted to introduce part annotation or weakly supervised attention mechanisms, but the former relies on expensive manual annotation (requiring experts to draw a bounding box around each part individually), while the latter is prone to getting stuck in local optima when samples are scarce, making it difficult to locate key parts stably.

[0004] Second, the model's generalization ability is severely insufficient under small-sample conditions. In practical applications, high-quality fine-grained labeled data is difficult to obtain (high professional threshold and high collection cost), and often only 1 to 5 usable samples are available per class. Most existing few-shot learning methods focus on general object recognition: they learn global feature representations through cross-task training and assume that these representations remain discriminative on new categories. In fine-grained scenarios, however, the discriminative information is highly localized, so global features are extremely prone to overfitting under small-sample conditions: the model may only memorize the overall appearance of the training samples and fail to generalize to subtle changes in component arrangement or texture in new categories. Furthermore, traditional loss functions (such as cross-entropy) lack explicit constraints on intra-class compactness and inter-class separability, further increasing the risk of overfitting.

[0005] In summary, existing technologies urgently need a fine-grained classification method that can automatically perceive key components of a target, fuse global and local features, and enhance the discriminative power of the feature space under conditions of scarce samples, in order to overcome the above-mentioned shortcomings.

Summary of the Invention

[0006] To address the aforementioned deficiencies in the existing technology, the present invention aims to provide a small-sample fine-grained classification method and system based on component perception, in order to solve the technical problems in the existing technology where local discriminative features are difficult to capture effectively during fine-grained classification, and where the model has insufficient generalization ability and is prone to overfitting under small-sample conditions.

[0007] This invention solves the above-mentioned technical problems through the following technical solution: a small-sample fine-grained classification method based on component perception, comprising a training phase and an inference phase, wherein the training phase includes:

[0008] Multiple categories are sampled from the sample dataset; for each category, its support set and query set are obtained; both the support set and the query set contain at least one target instance bounding-box image with a category label;

[0009] Structure perception and segmentation are performed on the target instance bounding-box image to generate a mask image of each key component of the target;

[0010] The global features of the target instance bounding box are extracted using a feature extraction network, and the local features of each key component are extracted based on the mask image; the global features and the local features are then fused to obtain the target fused features.

[0011] For each category, the feature prototype of that category is obtained based on the average value of the fused features of each target in its support set;

[0012] The loss value is calculated based on the fused features of each target in the query set and the feature prototypes of the corresponding categories, and the feature extraction network is updated accordingly;

[0013] The inference phase includes: obtaining at least one labeled target instance bounding-box image for each candidate category; using the trained feature extraction network, and the same structure perception and segmentation, feature extraction, and fusion steps as in the training phase, extracting the target fusion features of the bounding-box images of each candidate category and taking their average as the feature prototype of that category; extracting the target fusion features of the samples to be classified in the same manner;

[0014] The target fusion features of the sample to be classified are matched with the feature prototypes of each candidate category to output fine-grained classification results.
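The prototype construction and matching steps above can be illustrated with a minimal sketch. It assumes the fused features are already available as NumPy vectors and uses cosine similarity as the matching score; the text does not name the matching metric, so that choice, the function names `build_prototypes` and `classify`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def build_prototypes(features, labels):
    """Average the fused features of each category to obtain its prototype."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query, classes, protos):
    """Match one fused feature against all prototypes.

    Cosine similarity is an assumed metric; the source only says "matched".
    """
    q = query / np.linalg.norm(query)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    scores = p @ q
    return classes[int(np.argmax(scores))], scores

# Toy 2-D "fused features" for two candidate categories (made-up values).
support = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]])
support_labels = ["little_egret", "little_egret", "great_egret", "great_egret"]
classes, protos = build_prototypes(support, support_labels)
pred, scores = classify(np.array([0.9, 0.1]), classes, protos)
```

Because the prototypes are computed from whatever labeled samples are supplied, adding a candidate category only requires calling `build_prototypes` again with the extra samples.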

[0015] This invention generates mask images of key components by performing structure perception and segmentation on the target instance bounding-box image. Based on these mask images, local and global features are extracted and fused, enabling the model to explicitly focus on key component regions of the target (such as bird beaks, wing textures, and micro-cracks in insulators). Compared to existing global feature aggregation methods, this invention avoids the problem of subtle local differences being obscured by background or pose changes, significantly enhancing the ability to distinguish fine-grained categories that are highly similar in appearance but have different component arrangements.

[0016] This invention employs a support set-query set meta-learning training paradigm: for each sampled category, a feature prototype is constructed using the average fused features of the support set samples, and the feature extraction network is updated based on the loss calculated between the fused features of the query set samples and the corresponding feature prototype. This mechanism forces the model to learn the general ability to quickly construct discriminative prototypes from a small number of samples, rather than memorizing specific samples, thus achieving good generalization performance under small sample conditions and effectively mitigating overfitting.

[0017] During the inference phase, this invention uses the trained feature extraction network to compute and average, in real time, the fused features of the labeled samples of each candidate category, thereby obtaining the feature prototype for that category. This approach does not rely on a fixed set of categories known during the training phase, so it can flexibly handle scenarios where the number of candidate categories differs from the number of training categories. Furthermore, when a new category is added, only a small number of labeled samples are needed to generate its feature prototype dynamically, demonstrating good scalability and practicality.

[0018] By fusing global features with component-level local features, the resulting target fused features possess both overall semantic consistency and local structural sensitivity. Building upon this, the difference between the support set prototype and query set features is used as the optimization objective, resulting in clearer inter-class boundaries and a more compact intra-class distribution in the feature space. This allows for stable and high-confidence classification results even with only a small number of labeled samples.

[0019] This invention presents a complete differentiable process from component perception, segmentation, feature extraction and fusion to prototype matching, where all trainable modules (such as feature extraction networks) can be jointly optimized using a loss function. This avoids the error accumulation problem inherent in the traditional two-stage method of "detecting components first and then classifying," thus improving overall recognition performance.

[0020] Furthermore, the process of obtaining the sample dataset includes:

[0021] Objects are detected in the original image using a pre-trained object detection model to obtain the bounding box information of the target instances;

[0022] The region containing the target instance is cropped from the original image based on the bounding box information to obtain the target instance bounding-box image;

[0023] The target instance bounding-box image is labeled by category to obtain the corresponding category label;

[0024] The sample dataset is constructed from the target instance bounding-box images and their category labels.

[0025] By further defining how the sample dataset is constructed, this invention automatically localizes and extracts target instances, avoids the high cost and low efficiency of manually cropping or annotating target regions, and can quickly build high-quality small-sample fine-grained classification datasets in batches from the original images. It is especially suitable for images containing complex backgrounds or multiple targets.

[0026] Furthermore, the object detection model is built based on the YOLOv8 network architecture and includes:

[0027] An attention mechanism module placed at the input, used to enhance the input image along the spatial and channel dimensions.

[0028] The backbone network, using the HGNetV2 architecture, is used to extract multi-scale features from the feature maps enhanced by the attention mechanism module.

[0029] The neck network employs a C2f structure and SPPF pooling modules to perform cross-layer fusion of multi-scale features output by the backbone network.

[0030] The head network is configured with a multi-scale detection head, which is used to output the bounding box information and confidence level of the target instance based on the fusion features output by the neck network.

[0031] The object detection model structure defined in this invention enhances the model's sensitivity to target contours, edges, and local structures, improving the detection accuracy of multi-scale targets, especially small targets. At the same time, it reduces the number of parameters and the computational cost through the grouped convolution and channel reuse in HGNetV2, achieving efficient and accurate target instance localization.

[0032] Further, performing structure perception and segmentation on the target instance bounding-box image to generate mask images of each key component of the target includes:

[0033] Structure perception is performed on the target instance bounding-box image using a multimodal visual large model to identify key component regions;

[0034] Using the key component regions as prompt information, the target instance bounding-box image is segmented at the pixel level using a segmentation model to obtain a mask image of each key component of the target.

[0035] This invention utilizes a multimodal visual large model to automatically identify key component regions, and then uses these regions as prompts to guide a segmentation model in pixel-level segmentation, obtaining a mask image of each key component. This achieves automatic localization and fine segmentation of key components under zero-shot / few-shot conditions, eliminating the need for manual annotation of component-level ground truth and significantly reducing data preparation costs. At the same time, the semantic understanding capability of the visual large model supports accurate identification of component regions, and the segmentation model provides pixel-level mask accuracy, giving reliable spatial constraints for subsequent local feature extraction.

[0036] Furthermore, extracting the global features of the target instance bounding-box image using the feature extraction network, extracting the local features of each key component based on the mask image, and fusing the global features and local features to obtain the target fused features specifically includes:

[0037] The target instance bounding-box image is feature-encoded using the feature extraction network to obtain an intermediate feature map.

[0038] Global average pooling is performed on the intermediate feature map to obtain global features;

[0039] For each key component, its mask image is resized to the same spatial size as the intermediate feature map and then multiplied element-wise with the intermediate feature map. The multiplication result is then subjected to global average pooling to obtain the local features of the key component.

[0040] The global features are fused with the local features to obtain the initial fused features;

[0041] The initial fused features are refined using convolutional layers to obtain the final target fused features.

[0042] This invention ensures that global and local features originate from the same intermediate feature map, avoiding computational redundancy caused by repeated forward propagation. At the same time, mask alignment and element-wise multiplication operations ensure that local features strictly correspond to key component regions, and the subsequent convolutional layer refinement further enhances the nonlinear expressive power of the features, thereby obtaining more discriminative fused features.
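The masked pooling and fusion steps can be sketched in NumPy as follows. The text does not state how global and local features are fused or how the convolutional refinement is configured, so this sketch assumes concatenation for the fusion and stands in a single linear projection (equivalent to a 1×1 convolution on a pooled vector) for the refinement. The names `resize_mask_nearest`, `fuse_features`, and `proj` are hypothetical.

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of an (h, w) mask to (out_h, out_w)."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return mask[rows][:, cols]

def fuse_features(feature_map, masks, proj):
    """feature_map: (C, H, W); masks: list of 2-D arrays; proj: (D, C * (1 + len(masks)))."""
    _, h, w = feature_map.shape
    global_feat = feature_map.mean(axis=(1, 2))        # global average pooling
    local_feats = []
    for mask in masks:
        m = resize_mask_nearest(mask, h, w).astype(feature_map.dtype)
        masked = feature_map * m[None, :, :]           # element-wise multiplication
        # GAP over the full map, as described (not normalized by mask area).
        local_feats.append(masked.mean(axis=(1, 2)))
    initial = np.concatenate([global_feat] + local_feats)  # assumed fusion: concatenation
    return proj @ initial                              # stand-in for the conv refinement

# Toy example: 8-channel 4x4 feature map, two component masks at image resolution.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))
masks = [np.zeros((16, 16)), np.zeros((16, 16))]
masks[0][:8, :] = 1      # upper half of the image, e.g. one component
masks[1][8:, :] = 1      # lower half, e.g. another component
proj = rng.standard_normal((16, 8 * 3))
fused = fuse_features(fmap, masks, proj)
```

Both the global and local vectors are pooled from the same `feature_map`, which mirrors the single forward pass described above.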

[0043] Furthermore, the loss value is a weighted sum of the angular anisotropy discrimination loss and the supervised contrastive loss. The angular anisotropy discrimination loss is used to constrain the angular relationship between the target fusion feature and the feature prototype of the corresponding category. The supervised contrastive loss is used to bring the fusion features of same-class samples closer together and push the fusion features of different-class samples apart.

[0044] The angular anisotropy discrimination loss directly constrains the angular relationship between the target fused features and the corresponding category feature prototypes, prompting same-class features to converge towards the prototypes and different-class features to move away, forming a clear angular discrimination boundary. The supervised contrastive loss brings same-class samples closer together and pushes different-class samples further apart at the sample level, further reducing intra-class variance. The joint optimization of these two losses enhances the discriminative ability of the feature space at both the "sample-prototype" and "sample-sample" levels, significantly improving the accuracy and stability of fine-grained classification for small samples.

[0045] Furthermore, the angular anisotropy discrimination loss is expressed as:

[0046] [formula not preserved in the source text];

[0047] [formulas not preserved in the source text];

[0048] $\theta_{i,c} = \arccos\left( \dfrac{f_i^{T} p_c}{\lVert f_i \rVert \, \lVert p_c \rVert} \right)$ (reconstructed from the definitions in [0049]);

[0049] where (the symbols themselves were lost with the formulas and are restored here for readability) $L_{aa}$ is the anisotropic loss; $N$ is the total number of query-set samples of the categories sampled in the current task; $\gamma$ is the gradient adjustment coefficient; $\varphi_i^{+}$ and $\varphi_i^{-}$ are, respectively, the positive-class and negative-class angle transformation results corresponding to the fused feature of the i-th target; $\tau$ is the angle discrimination threshold; $\theta_{i,y_i}$ is the angle between the fused feature of the i-th target and the feature prototype of its true category $y_i$; $\theta_{i,n_i}$ is the angle between the fused feature of the i-th target and the feature prototype of the category given by its negative-class prototype index $n_i$, where $n_i$ is the index of the class whose feature prototype is most similar to the sample's target fused feature among all classes other than the sample's true class; $m$ is the angular margin, a preset positive angle value added to $\theta_{i,y_i}$; $\theta_{i,c}$ is the angle between the fused feature of the i-th target and the feature prototype of the c-th category; $f_i$ is the fused feature of the i-th target; $p_c$ is the feature prototype of the c-th category; the superscript $T$ denotes the matrix transpose; and $\lVert \cdot \rVert$ denotes the norm;

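[0050] The supervised contrastive loss is expressed as:

[0051] $L_{sc} = -\sum_{i} \dfrac{1}{N_i^{+}} \sum_{j \neq i} M_{ij} \log \dfrac{\exp(s_{ij})}{\sum_{r \neq i} \exp(s_{ir})}$ (reconstructed from the definitions in [0052] and the description in [0054]; any outer normalization over samples is not recoverable from the text);

[0052] where (the symbols themselves were lost with the formula and are restored here for readability) $L_{sc}$ is the supervised contrastive loss; $M_{ij}$ is the positive sample pair mask, equal to 1 when the fused feature of the i-th target and the fused feature of the j-th target belong to the same category, and 0 otherwise; $N_i^{+}$ is the number of positive samples belonging to the same category as the fused feature of the i-th target; $s_{ij}$ is the normalized similarity between the fused features of the i-th target and the j-th target; $s_{ir}$ is the normalized similarity between the fused features of the i-th target and the r-th target; and $\sum_{r \neq i} \exp(s_{ir})$ is the sum of the exponentials of the normalized similarities between the fused feature of the i-th target and the fused features of all other targets except itself.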

[0053] The angular anisotropy discrimination loss introduces an angular margin, an angular discrimination threshold, and a gradient adjustment coefficient to impose asymmetric constraints on the positive-class and negative-class angles. Specifically, the positive-class transformation adds an angular margin to the angle between the sample and its true-class feature prototype, forcing the sample to cluster more closely around the true-class prototype; the negative-class transformation uses the angle between the sample and the most similar non-true-class feature prototype, so that the model focuses on the negative class that is hardest to distinguish. When the positive-class similarity is insufficient or the negative-class similarity is too large, the two loss terms work together to compress the intra-class angular distance and widen the inter-class angular boundary.
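The angle computation and hardest-negative selection described above can be sketched as follows. This is only the part of the loss that the text specifies unambiguously: the angle between each fused feature and each prototype, the true-class angle plus a margin, and the index of the most similar non-true-class prototype. The loss formula itself (with its threshold and gradient adjustment coefficient) is not reproduced here; the function names and the margin value 0.2 are illustrative assumptions.

```python
import numpy as np

def prototype_angles(features, protos):
    """Angle in radians between every fused feature (N, D) and every prototype (C, D)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    cos = np.clip(f @ p.T, -1.0, 1.0)   # clip guards arccos against rounding error
    return np.arccos(cos)

def positive_and_hardest_negative(angles, labels, margin):
    """Return (true-class angle + margin, hardest negative index, its angle)."""
    idx = np.arange(angles.shape[0])
    pos_angle = angles[idx, labels] + margin   # margin added to the true-class angle
    masked = angles.copy()
    masked[idx, labels] = np.inf               # exclude the true class
    neg_index = masked.argmin(axis=1)          # smallest angle = most similar prototype
    neg_angle = angles[idx, neg_index]
    return pos_angle, neg_index, neg_angle

# Toy example: three 2-D prototypes, two query features with true classes 0 and 1.
protos = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
features = np.array([[1.0, 0.1], [0.1, 1.0]])
labels = np.array([0, 1])
margin = 0.2  # illustrative value, not from the source
angles = prototype_angles(features, protos)
pos_angle, neg_index, neg_angle = positive_and_hardest_negative(angles, labels, margin)
```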

[0054] The supervised contrastive loss takes a log-probability form normalized over positive sample pairs: for each sample i (that is, each target instance bounding-box image, each of which corresponds to one target fusion feature), the ratio of its exponentiated similarity with each positive sample j to the sum of its exponentiated similarities with all other samples r is calculated, and the negative logarithm is taken and averaged over the positive samples. This loss constrains the feature space at the sample-to-sample level by maximizing the similarity of positive sample pairs and minimizing the similarity of negative sample pairs, thereby bringing the fusion features of same-class samples closer together and pushing the fusion features of different-class samples further apart.
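A minimal NumPy sketch of a supervised contrastive loss of this form is given below. Several details are assumptions rather than statements from the text: the "normalized similarity" is taken to be cosine similarity divided by a temperature, the `temperature` parameter itself is an addition, and the per-sample terms are averaged over samples that have at least one positive.

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: (N, D) fused features; labels: length-N class ids. Needs N >= 2."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature                    # assumed normalized similarity
    n = sim.shape[0]
    self_mask = np.eye(n, dtype=bool)
    labels = np.asarray(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    # Denominator: sum over r != i of exp(s_ir), computed in a numerically stable way.
    sim_no_self = np.where(self_mask, -np.inf, sim)
    row_max = sim_no_self.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim_no_self - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom                     # log of the ratio for every pair (i, j)
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0                          # skip samples without any positive
    per_sample = -np.where(pos_mask, log_prob, 0.0).sum(axis=1)[valid] / pos_count[valid]
    return per_sample.mean()                       # assumed outer averaging

# Toy check: labels that agree with the geometry give a lower loss than mismatched labels.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_good = supervised_contrastive_loss(feats, [0, 0, 1, 1])
loss_bad = supervised_contrastive_loss(feats, [0, 1, 0, 1])
```

If no sample in the batch has a positive partner, `per_sample` is empty and the mean is undefined, so a real training loop would need to guard against that case.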

[0055] Both losses are given explicit mathematical expressions, providing computable optimization objectives for model training. The angular anisotropy discrimination loss operates on the "sample-prototype" relationship, while the supervised contrastive loss operates on the "sample-sample" relationship. Used together, they enhance the discriminative ability of the feature space at both levels, improving the accuracy and stability of fine-grained classification with small samples.

[0056] Furthermore, before the feature extraction network is applied, the method also includes: performing super-resolution reconstruction on the target instance bounding-box image to enhance image details;

[0057] The super-resolution reconstruction employs a stepwise recovery mechanism based on a diffusion model, including:

[0058] A progressive restoration process with M steps is constructed, with the target instance bounding-box image to be restored as the conditional input. At step t, a prediction model predicts the amount of image restoration from the intermediate state of the current step and the target instance bounding-box image, and the intermediate state is updated accordingly. After M iterations, the enhanced image is obtained.
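The control flow of the M-step restoration can be sketched as a simple loop. The sketch makes several assumptions the text does not settle: the intermediate state is initialized from the input image, the update is additive, and `predict_restoration` is a hypothetical callable standing in for the prediction model. The toy predictor below is not a diffusion model; it only exercises the loop.

```python
import numpy as np

def progressive_restore(low_quality, predict_restoration, num_steps):
    """Run num_steps conditional refinement steps and return the enhanced image."""
    state = low_quality.copy()                 # assumed initial intermediate state
    for t in range(num_steps):
        # The predictor sees the current state, the conditioning image, and the step.
        delta = predict_restoration(state, low_quality, t)
        state = state + delta                  # assumed additive update
    return state

# Toy stand-in predictor: moves the state halfway toward a known target each step.
target = np.full((4, 4), 0.8)

def toy_predictor(state, condition, t):
    return 0.5 * (target - state)

enhanced = progressive_restore(np.zeros((4, 4)), toy_predictor, num_steps=10)
```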

[0059] The super-resolution reconstruction operation of this invention can effectively improve the detail clarity of low-resolution, blurry, and noisy target instance images, restore high-frequency edge and texture information, and provide higher quality input images for subsequent structure perception, component segmentation and feature extraction. It is especially suitable for image quality degradation scenarios caused by long-distance shooting, device shaking or compressed transmission, and enhances the adaptability of the method of this invention to complex imaging conditions.

[0060] Based on the same concept, the present invention also provides a component-aware small-sample fine-grained classification system, including a memory, a processor, and a computer program or instructions stored in the memory, wherein the processor executes the computer program or instructions to implement the component-aware small-sample fine-grained classification method as described above.

[0061] Based on the same concept, the present invention also provides a computer-readable storage medium having a computer program or instructions stored thereon, which, when executed by a processor, implements the component-aware small-sample fine-grained classification method as described above.

Attached Figure Description

[0062] To illustrate the technical solution of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only one embodiment of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

[0063] Figure 1 is a flowchart illustrating the training process of the component-aware small-sample fine-grained classification method in an embodiment of the present invention;

[0064] Figure 2 is a schematic diagram of the masks of key components in an embodiment of the present invention;

[0065] Figure 3 is a visualization of the mask image mapped back onto the instance in an embodiment of the present invention.

Detailed Implementation

[0066] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0067] The technical solution of the present invention will be described in detail below with reference to specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0068] Example 1

[0069] This invention provides a few-shot fine-grained classification method based on component perception, comprising a training phase and an inference phase. In the training phase, a few-shot learning task is constructed (each task includes a support set and a query set for multiple selected categories). A large visual model is used to assist in structure perception and component segmentation, and global and component-level local features are extracted and fused. A feature prototype is constructed based on the support set, and the feature extraction network is jointly optimized using the angular anisotropy discrimination loss and the supervised contrastive loss. In the inference phase, the trained feature extraction network is used to calculate the feature prototypes in real time from labeled samples of the candidate categories; fused features are then extracted from the samples to be classified, and matching and classification are performed. As shown in Figure 1, the training phase of this method includes the following steps:

[0070] Step S11: Construct the sample dataset.

[0071] First, a pre-trained object detection model is constructed to automatically extract object instances from the original image. In this embodiment, the object detection model is built based on the YOLOv8 network architecture, and its specific structure is as follows:

[0072] Attention mechanism module: Located at the network input, it includes spatial attention subunits and channel attention subunits. The spatial attention subunit highlights the target region spatially, enhancing the target contour, edges, and local structural information; the channel attention subunit filters effective feature responses along the channel dimension, suppressing background interference and redundant information. After the input image (or initial feature map) is processed by the spatial attention subunit and the channel attention subunit respectively, the two outputs are fused to obtain a recalibrated enhanced feature map, which is then fed into the subsequent backbone network.

[0073] Backbone Network: The default YOLOv8 CSPDarknet backbone network is replaced with the HGNetV2 backbone network. HGNetV2 consists of an initial convolutional unit, multiple HGBlocks, and multiple downsampling units connected sequentially. The enhanced input features first enter the initial convolutional unit, which extracts shallow edge and texture features; its output serves as the input to the first HGBlock. Each HGBlock performs convolutional transformations, channel reorganization, and feature aggregation on the input features, and its output is fed into the corresponding downsampling unit. The downsampling unit reduces the resolution of the feature map before feeding its output into the next HGBlock. Through sequential processing at each stage, multi-scale feature maps at different semantic levels are gradually obtained. Specifically, the HGBlocks employ grouped convolutions and channel reuse to reduce the number of parameters and the computational cost; the downsampling units use cascaded small-receptive-field convolutions and deformable convolutions to strengthen the representation of target pose changes, shape deformations, and local structural differences while reducing resolution.

[0074] The neck network employs a C2f structure combined with an improved SPPF pooling module. The deep features output from the backbone network possess strong semantic expressive power, while the shallow features output from the front of the backbone network retain rich edge, texture, and positional information. Specifically, the deep features output from the backbone network are first input into the SPPF pooling module to expand the receptive field and aggregate deep semantic information. Subsequently, the deep features processed by the SPPF pooling module are fed together with the corresponding shallow features into the C2f unit, which performs cross-layer feature fusion and outputs the fused multi-scale feature map. Through this approach, the effective combination of shallow detail information and deep semantic information is achieved, enhancing the network's representation and recognition capabilities for targets of different scales (especially small and fine-grained targets).

[0075] Head Network: A three-level detection head is set up based on a multi-scale detection strategy, receiving feature maps of different scales from the neck network to complete the detection of small, medium, and large targets. Each detection head employs a decoupled structure of classification and regression branches. The classification branch outputs target category information (coarse category), while the regression branch outputs target location, bounding box parameters, and target confidence information. The outputs of all detection heads are aggregated to form target instance information (including target bounding box, spatial location, target category, and confidence). By separating the classification and localization tasks, mutual interference between the two tasks is reduced, improving the accuracy and stability of the detection results.

[0076] This object detection model is pre-trained on an image dataset containing object bounding boxes and coarse category labels, and the localization loss, classification loss, and confidence loss are jointly optimized. After training, the object detection model can automatically detect and accurately locate object instances in the input original image.

[0077] For each original image, the pre-trained object detection model described above is used for inference, outputting the bounding box information (spatial coordinates, width, and height) and confidence score of each object instance in the image. Valid object regions are retained according to a preset confidence threshold, and the region containing each object instance is then cropped from the original image based on its bounding box information, yielding a target instance bounding-box image. This cropping process effectively suppresses most complex backgrounds and irrelevant regions while preserving the structural information of the object.
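The filtering and cropping step can be sketched as follows. The box layout is an assumption: each box is taken to be `(x, y, w, h)` in pixels with `(x, y)` the top-left corner, whereas a given detector may emit center coordinates or normalized values instead. The function name `crop_instances` and the threshold 0.5 are illustrative.

```python
import numpy as np

def crop_instances(image, boxes, scores, conf_threshold=0.5):
    """Keep detections above the threshold and crop them from an (H, W, 3) image.

    boxes: iterable of (x, y, w, h), with (x, y) the top-left corner (assumed layout).
    """
    h_img, w_img = image.shape[:2]
    crops = []
    for (x, y, w, h), score in zip(boxes, scores):
        if score < conf_threshold:
            continue                                   # drop low-confidence detections
        # Clamp the box to the image so slicing never goes out of range.
        x0, y0 = max(int(x), 0), max(int(y), 0)
        x1, y1 = min(int(x + w), w_img), min(int(y + h), h_img)
        if x1 > x0 and y1 > y0:
            crops.append(image[y0:y1, x0:x1].copy())
    return crops

# Toy example: a blank 100x120 image with three made-up detections.
image = np.zeros((100, 120, 3), dtype=np.uint8)
boxes = [(10, 20, 30, 40), (90, 80, 50, 50), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.2]
crops = crop_instances(image, boxes, scores)
```

The second box extends past the image border and is clipped rather than discarded; the third falls below the threshold and is dropped.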

[0078] For each cropped target instance bounding box, manual or semi-automatic annotation is performed according to its fine-grained category to obtain the corresponding category label (such as "Little Egret" or "Great Egret"). All target instance bounding boxes and their category labels are combined to form a sample dataset. This sample dataset is used for subsequent small-sample fine-grained classification training.

[0079] Step S12: Sample multiple categories from the sample dataset; for each category, obtain its support set and query set, both of which contain at least one target instance bounding box with category labels.

[0080] From the constructed sample dataset, C classes (e.g., C=5 or C=10) are randomly sampled according to a uniform distribution, with each class corresponding to a fine-grained class label. The sampled classes do not overlap, and random sampling is performed in each training iteration to enhance task diversity.

[0081] For each sampled category c, K samples (K≥1, usually 1, 3, or 5) are randomly selected from all target instance bounding boxes contained in that category. Each sample includes a target instance bounding box and its corresponding category label. These samples constitute the support set for that category. The support set is used to subsequently compute the feature prototypes for that category.

[0082] For each class c, Q samples (Q≥1, typically 5, 10, or 15) are randomly selected from the remaining samples of that class (i.e., samples not included in the support set). Each sample also includes the target instance bounding box and its class label. These samples constitute the query set for that class. The query set is used to calculate the loss and update the feature extraction network parameters. There is no sample overlap between the support set and the query set to ensure that no information leakage occurs during training.

[0083] The support sets and query sets of all the above categories are merged to form the few-shot task for the current training iteration. In each training iteration, new categories (i.e., a new task) are sampled independently, enabling the model to learn to generalize quickly from a small number of samples.
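The episode construction of step S12 (C categories, K support samples and Q query samples per category, with no overlap) can be sketched as below. The function name `sample_episode` and the dictionary layout of the dataset are assumptions made for illustration.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_query=15, rng=random):
    """Sample one few-shot task.

    dataset: dict mapping category label -> list of sample identifiers.
    Returns (support, query), two dicts keyed by the sampled categories.
    """
    # Only categories with enough samples can supply K + Q distinct items.
    eligible = [c for c, items in dataset.items() if len(items) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)  # uniform, without replacement
    support, query = {}, {}
    for c in classes:
        picked = rng.sample(dataset[c], k_shot + n_query)
        support[c] = picked[:k_shot]   # used to build the prototype
        query[c] = picked[k_shot:]     # used to compute the loss
    return support, query

dataset = {f"class_{i}": [f"class_{i}_img_{j}" for j in range(30)] for i in range(8)}
support, query = sample_episode(dataset, rng=random.Random(0))
```

Drawing the K + Q samples in one call and then splitting them guarantees that the support set and query set of a category never share a sample.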

[0084] Step S13: Perform structure perception and segmentation on the target instance bounding box to generate mask images of each key component of the target.

[0085] For each target instance bounding box obtained in step S12 (including all samples in the support set and query set), a pre-loaded large multimodal vision model (e.g., GroundingDINO, GLIP, or OWLv2) is used for structure-aware analysis. Specifically, the target instance bounding box is input into the large multimodal vision model. Through its semantic understanding and fine-grained structure awareness capabilities, the model automatically identifies key components of the target object (e.g., the head, body, wings, and tail of a bird) and outputs candidate regions (i.e., key component regions) corresponding to each key component. Candidate regions are typically given in the form of bounding boxes, but can also be given as point coordinates or scribbles. This process requires no manual annotation of component positions and relies entirely on the zero-shot or few-shot generalization capabilities of the large multimodal vision model.

[0086] Using the aforementioned key component regions as prompts, a high-precision segmentation model (e.g., the Segment Anything Model, SAM) is invoked to perform pixel-level segmentation of the same target instance bounding box. For each key component, the corresponding candidate region is used as the input prompt for the segmentation model, which outputs a fine-grained mask image of that key component, where each pixel value indicates whether the pixel belongs to that component (usually a binary mask or a probability mask).

[0087] Each mask image has the same spatial dimensions (width W, height H) as the target instance bounding box, so that it can later be aligned with the intermediate feature maps of the feature extraction network.

[0088] In actual computation, the mask image needs to be resized before it can be multiplied element-wise with the intermediate feature map generated by the feature extraction network. Since the feature extraction network typically performs multiple downsampling operations (e.g., with an overall stride of 32), the spatial size of the intermediate feature map is much smaller than that of the target instance bounding box. Therefore, in step S15 (feature extraction), each mask image is adjusted to the same spatial size as the intermediate feature map using methods such as bilinear interpolation. This adjustment operation is part of the feature extraction step; in the present step, only the original-size mask image needs to be generated.

[0089] To visually demonstrate the component segmentation effect, please refer to Figure 2 and Figure 3. Figure 2 shows the mask images of each key component generated by the segmentation model, where each key component corresponds to a binary mask region (white represents the component region). Figure 3 renders the mask images in semi-transparent pseudo-colors, one color per component, and overlays them on the original target instance bounding box, clearly showing the actual position and outline of each key component in the instance. This visualization can be used to verify segmentation accuracy and to analyze model interpretability.

[0090] Step S14: Perform super-resolution reconstruction on the target instance bounding box to enhance image details.

[0091] This embodiment employs a progressive restoration mechanism based on a diffusion model to achieve super-resolution reconstruction. This mechanism includes a pre-trained prediction model used to estimate the amount of image restoration at each step during the progressive restoration process.

[0092] To train the prediction model, training sample pairs are first constructed: using a high-resolution image as a reference, corresponding low-quality images are generated through one or more methods, such as downsampling, adding noise, and blurring / degradation, forming "low-quality-high-resolution" image pairs. During prediction model training, a diffusion process with a total number of steps (denoted as M steps, where M is a positive integer, e.g., M equals 100) is defined, progressively adding noise starting from the high-resolution image to obtain a series of intermediate states. The prediction model is designed to receive three inputs: the intermediate state of the current step, the conditional input (i.e., the low-quality image), and the current step number, and output the amount of image restoration required for that step. The training goal is to make the model's predicted restoration amount as close as possible to the actual noise or residual, thereby enabling the model to learn an inverse restoration mapping that progressively approximates high-resolution images from low-quality images.
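One way to build such a "low-quality-high-resolution" pair is sketched below. The specific degradation (block-average downsampling, pixel-repetition upsampling so both images share one size, then Gaussian noise) and the name `make_training_pair` are assumptions chosen for illustration; the embodiment only states that downsampling, noise, and blurring / degradation may be used.

```python
import numpy as np

def make_training_pair(high_res, scale=2, noise_std=0.05, rng=None):
    """Return (low_quality, high_res) with matching shapes.

    high_res: H x W float array in [0, 1]; H and W divisible by `scale`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = high_res.shape
    # Downsample by block averaging, which also blurs fine detail.
    small = high_res.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # Upsample back by pixel repetition so both images share one size.
    low = np.repeat(np.repeat(small, scale, axis=0), scale, axis=1)
    # Add Gaussian noise and keep values in the valid range.
    low = np.clip(low + rng.normal(0.0, noise_std, size=low.shape), 0.0, 1.0)
    return low, high_res

high = np.random.default_rng(0).random((8, 8))
low, ref = make_training_pair(high, rng=np.random.default_rng(1))
```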

[0093] For the target instance bounding box to be reconstructed (which may be a low-quality image), perform the following operations:

[0094] The target instance bounding box to be reconstructed is used as the initial state, i.e., the state at step M, and is also used as the conditional input. The total number of iterations is set to M (e.g., M is 50 or 100); starting from step M, the step number decreases sequentially to step 1, and the following operations are performed at each step:

[0095] The intermediate state of the current step, the conditional input (i.e., the target instance bounding box), and the current step number are input into the prediction model. The model outputs the image restoration amount for this step. The intermediate state is updated using this restoration amount: the new intermediate state is equal to the current intermediate state plus the predicted restoration amount. Then, the next step (step number decremented by one) is entered to continue the iteration.

[0096] After M iterations, the final state obtained is the enhanced image after super-resolution reconstruction.
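The iteration described in the preceding paragraphs can be sketched as a simple loop. The trained prediction model is replaced here by a toy callable, `toy_predictor`, which is purely hypothetical and only serves to make the loop runnable; `progressive_restore` is likewise an illustrative name.

```python
import numpy as np

def progressive_restore(low_quality, predict_step, num_steps=50):
    """Run the step M -> step 1 restoration loop.

    predict_step(state, condition, step) returns the restoration amount
    to add to the current intermediate state.
    """
    state = low_quality.astype(np.float64)  # initial state (step M)
    condition = state.copy()                # the same image is the condition
    for step in range(num_steps, 0, -1):    # M, M-1, ..., 1
        state = state + predict_step(state, condition, step)
    return state

# Toy stand-in for the trained model: move 10% towards a fixed target.
target = np.ones((4, 4))

def toy_predictor(state, condition, step):
    return 0.1 * (target - state)

restored = progressive_restore(np.zeros((4, 4)), toy_predictor, num_steps=50)
```

With this toy predictor the remaining error shrinks by a factor of 0.9 per step, so after 50 steps the state is within about 0.5% of the target.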

[0097] The enhanced image output from the above process replaces the original target instance bounding box and is input into the subsequent structure perception and segmentation module and feature extraction network. This super-resolution reconstruction step can effectively recover high-frequency edge, texture, and detail information in low-quality images, improving the accuracy of subsequent part perception and fine-grained classification.

[0098] Step S15: Use a feature extraction network to extract global features of the target instance bounding box and extract local features of each key component based on the mask image; fuse the global features and local features to obtain the target fused features.

[0099] The reconstructed target instance bounding box from step S14 (or the original target instance bounding box if step S14 is not performed) is input into the feature extraction network. The feature extraction network can employ architectures such as a residual network (e.g., ResNet-50) or a Vision Transformer. After a series of convolutional or Transformer operations, an intermediate feature map is output. This intermediate feature map has a smaller spatial size than the original input image, but an increased number of channels (e.g., 2048 channels). It contains deep semantic information about the target instance while preserving its spatial relationships.

[0100] A global average pooling operation is performed on the intermediate feature map. Specifically, the average value of all spatial locations in each channel of the intermediate feature map is taken to obtain a one-dimensional vector. The length of this vector is equal to the number of channels in the intermediate feature map. This one-dimensional vector is the global feature of the target instance, representing the overall semantic information of the entire image.

[0101] For each key component mask image generated in step S13 (e.g., head, body, wings, etc.), perform the following sub-operations:

[0102] Since the spatial size of the intermediate feature map is smaller than that of the original target instance bounding box, the mask image of each key component needs to be reduced to the same spatial size as the intermediate feature map using methods such as bilinear interpolation. The adjusted mask image still maintains a binary or probability distribution form, with pixel values ranging from 0 to 1.

[0103] The adjusted mask image is multiplied element-wise with the intermediate feature map at corresponding spatial locations. After multiplication, the response of non-component regions is set to zero, and only the regions belonging to the key component retain their original feature responses.

[0104] The multiplication result is then subjected to global average pooling again, which averages the results across all spatial locations in each channel, resulting in a one-dimensional vector. The length of this vector is also equal to the number of channels in the intermediate feature map. This vector represents the local features of the key component.

[0105] Repeat the above steps for all key components to obtain the local feature vector of each component.
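The global and masked local pooling described above can be sketched as follows. For brevity the mask is resized by block averaging, which is a stand-in for the bilinear interpolation named in the text and assumes the mask size is an exact multiple of the feature map size; all function names are illustrative.

```python
import numpy as np

def downsample_mask(mask, stride):
    """Block-average an H x W mask by `stride` (stand-in for bilinear resizing)."""
    h, w = mask.shape
    assert h % stride == 0 and w % stride == 0
    return mask.reshape(h // stride, stride, w // stride, stride).mean(axis=(1, 3))

def global_feature(feature_map):
    """feature_map: C x h x w -> length-C vector (global average pooling)."""
    return feature_map.mean(axis=(1, 2))

def local_feature(feature_map, mask, stride):
    """Mask the feature map with one component mask, then average-pool it."""
    small = downsample_mask(mask.astype(np.float64), stride)  # values in [0, 1]
    masked = feature_map * small[None, :, :]  # zero out non-component regions
    return masked.mean(axis=(1, 2))           # average over all locations

feature_map = np.arange(8, dtype=float).reshape(2, 2, 2)  # C=2, h=w=2
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0  # component occupies the top-left quadrant
g = global_feature(feature_map)
l = local_feature(feature_map, mask, stride=2)
```

As in the text, the local feature is averaged over all spatial locations of the masked map, not only over the component's pixels.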

[0106] For the same target instance bounding box, its corresponding global features and all local features are concatenated along the channel dimension (i.e., concatenated end-to-end) to form a longer initial fused feature vector. The length of this vector is equal to (number of key components plus 1) multiplied by the number of feature channels. The initial fused feature is then input into one or more convolutional layers (or fully connected layers) for non-linear transformation. In a specific embodiment, a fully connected layer containing 256 neurons (which can be considered a 1×1 convolution) is used for dimensionality reduction mapping, followed by processing with an activation function (such as ReLU) to output the final target fused feature. The dimension of this target fused feature can be set as needed (e.g., 256-dimensional or 512-dimensional). This final target fused feature simultaneously contains global information about the entire target and local discriminative information about each key component, which is used for subsequent feature prototype construction and loss calculation.
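The concatenation and dimensionality-reduction step can be sketched as below. The weights are random placeholders rather than trained parameters, and the small sizes (8 channels, 3 components, 4 output dimensions) are chosen only to keep the example readable.

```python
import numpy as np

def fuse_features(global_feat, local_feats, weight, bias):
    """Concatenate global and local features, then apply FC + ReLU."""
    concat = np.concatenate([global_feat] + list(local_feats))
    return np.maximum(weight @ concat + bias, 0.0)

rng = np.random.default_rng(0)
channels, num_parts, out_dim = 8, 3, 4
g = rng.normal(size=channels)                            # global feature
parts = [rng.normal(size=channels) for _ in range(num_parts)]  # local features
# Input length is (number of key components + 1) * number of channels.
W = rng.normal(size=(out_dim, (num_parts + 1) * channels)) * 0.1
b = np.zeros(out_dim)
fused = fuse_features(g, parts, W, b)
```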

[0107] Each sample (i.e., a target instance bounding box) corresponds to a target fusion feature.

[0108] Step S16: For each category, obtain the feature prototype of that category based on the average value of the target fusion features in its support set.

[0109] After completing step S15, for each category sampled in the current few-shot task, the corresponding target fusion features for each target instance bounding box in its support set have been obtained through the feature extraction network. The support set typically contains one or more labeled samples of that category (e.g., 1, 5, or 10 samples per category). These target fusion features are then grouped according to category.

[0110] For each category, perform the following:

[0111] Extract the target fusion feature vectors from all samples in the support set for that category. Assuming the support set contains K samples, there are K corresponding target fusion features, each with the same dimension (e.g., 256 dimensions). Add these K target fusion features element-wise along their corresponding dimensions, then divide by K (i.e., take the arithmetic mean) to obtain a new vector. This new vector is the feature prototype for that category.

[0112] The feature prototype has the same dimensionality as the fused feature vector for a single target. This feature prototype represents the standard feature representation for this category, aggregating semantic information from multiple samples in the support set, and exhibits better stability.

[0113] The calculated feature prototypes for each category are used in subsequent loss calculation steps. During training, because the sampled categories differ each time and the support set samples may change, the feature prototypes are dynamically recalculated for each training task, rather than remaining fixed. This dynamic calculation mechanism forces the feature extraction network to learn the ability to quickly construct discriminative prototypes from a small number of samples.

[0114] In one specific embodiment, assume the current task samples 5 categories, and each category's support set contains 5 samples. The target fusion feature output by the feature extraction network has a dimension of 256. For the category "Little Egret," the 5 samples in its support set yield 5 256-dimensional target fusion features. Adding these 5 target fusion features element-wise and dividing by 5 yields a new 256-dimensional vector, which is the feature prototype for the "Little Egret" category. Other categories (such as "Great Egret," "Intermediate Egret," etc.) calculate their respective feature prototypes similarly.
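The prototype computation reduces to an element-wise mean per category, as in this minimal sketch (2-dimensional features are used instead of 256-dimensional ones so the numbers can be checked by hand; `class_prototypes` is an illustrative name).

```python
import numpy as np

def class_prototypes(support_features):
    """Map each category to the element-wise mean of its support features."""
    return {label: np.mean(np.stack(feats), axis=0)
            for label, feats in support_features.items()}

support_features = {
    "Little Egret": [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
    "Great Egret": [np.array([2.0, 2.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])],
}
prototypes = class_prototypes(support_features)
```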

[0115] Step S17: Calculate the loss value based on the fused features of each target in the query set and the feature prototypes of the corresponding categories, and update the feature extraction network.

[0116] After completing steps S15 and S16, for the current small sample task, we have obtained:

[0117] The target fusion features of all query samples, where the target fusion feature of the i-th query sample is denoted as $f_i$;

[0118] The feature prototype of each category, where the feature prototype of the c-th category is denoted as $p_c$;

[0119] The true category label of each query sample, denoted as $y_i$ for the i-th query sample.

[0120] Suppose there are N query samples in the current task (the sum of the number of query samples for all categories).

[0121] To enhance the discriminative power of target fusion features, anisotropic discriminative loss and supervised contrastive loss are jointly employed to constrain the feature space. Specifically, the anisotropic discriminative loss constrains the angular relationship between the target fusion features and the corresponding class feature prototypes, making the distribution of similar samples more compact on the unit hypersphere and forming clearer angular discriminative boundaries between different classes. In this embodiment, the anisotropic discriminative loss is expressed as:

[0122] (1)

[0123] (2)

[0124] (3)

[0125] $$\theta_{i,c} = \arccos\left(\frac{f_i^{T} p_c}{\lVert f_i \rVert \, \lVert p_c \rVert}\right) \qquad (4)$$

[0126] where $L_{ang}$ is the anisotropic discriminative loss; N is the total number of samples in the query sets of the categories sampled in the current task; $\gamma$ is the gradient adjustment coefficient, a preset hyperparameter (e.g., set to 10) used to control the sensitivity of the loss function to the difference between positive and negative classes; $\phi_i^{+}$ is the positive class angle transformation result corresponding to the fusion feature of the i-th target, which represents the positive class similarity after introducing an additional discriminant margin; $\phi_i^{-}$ is the negative class angle transformation result corresponding to the fusion feature of the i-th target; $\tau$ is the angle discrimination threshold, a preset constant (e.g., 0.5) used to define the acceptable range of positive and negative class similarity; $\theta_{i,y_i}$ is the angle between the fusion feature of the i-th target and the feature prototype of its true category $y_i$, calculated using formula (4); $\theta_{i,n_i}$ is the angle between the fusion feature of the i-th target and the feature prototype of the category given by its negative class prototype index $n_i$; m is the angle margin, a preset positive angle value added to the angle $\theta_{i,y_i}$; $\theta_{i,c}$ is the angle between the fusion feature of the i-th target and the feature prototype of the c-th category; $f_i$ is the fusion feature of the i-th target; $p_c$ is the feature prototype of the c-th category; the superscript T denotes the matrix transpose; and $\lVert \cdot \rVert$ denotes the norm.

[0127] The negative class prototype index $n_i$ is defined as the index of the category, among all categories different from the true category of the sample (i.e., $c \neq y_i$), whose feature prototype is most similar to the target fusion feature of the sample. In this embodiment, the negative class prototype index $n_i$ can be expressed as:

[0128] $$n_i = \arg\max_{c \neq y_i} \cos\theta_{i,c} \qquad (5)$$

[0129] where $\arg\max_{c \neq y_i}$ selects, among all categories satisfying $c \neq y_i$, the category c that maximizes the cosine similarity. That is, it finds the index of the incorrect category that is most similar to the features of the current sample.
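The sample-to-prototype angle and the selection of the most similar incorrect category can be sketched as follows. The sketch covers only these two quantities and does not attempt the loss itself; the function names and the integer encoding of labels are assumptions.

```python
import numpy as np

def cosine_to_prototypes(features, prototypes):
    """features: N x D, prototypes: C x D -> N x C cosine similarities."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return f @ p.T

def angles_and_hard_negatives(features, prototypes, labels):
    """Return the N x C angle matrix and each sample's hardest negative index."""
    cos = np.clip(cosine_to_prototypes(features, prototypes), -1.0, 1.0)
    theta = np.arccos(cos)  # angle between sample i and prototype c
    masked = cos.copy()
    masked[np.arange(len(labels)), labels] = -np.inf  # exclude the true category
    neg_index = masked.argmax(axis=1)  # most similar incorrect category
    return theta, neg_index

prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[1.0, 0.1], [0.0, 2.0]])
labels = [0, 1]
theta, neg_index = angles_and_hard_negatives(features, prototypes, labels)
```

Both samples pick category 2 as their hardest negative, because the diagonal prototype lies closest in angle to each of them once the true category is excluded.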

[0130] Supervised contrastive loss is used to treat samples of the same class as positive pairs and samples of different classes as negative pairs. By explicitly compressing intra-class distances and increasing inter-class distances, it further reduces intra-class variance and improves the stability of feature representations. In this embodiment, the supervised contrastive loss is expressed as:

[0131] $$L_{sup} = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{N_i^{+}}\sum_{j \neq i} M_{ij}\,\log\frac{\exp(s_{ij})}{\sum_{r \neq i}\exp(s_{ir})} \qquad (6)$$

[0132] where $L_{sup}$ is the supervised contrastive loss; $M_{ij}$ is the positive sample pair mask, whose value is 1 when the fusion feature of the i-th target and the fusion feature of the j-th target belong to the same category, and 0 otherwise; $N_i^{+}$ is the number of positive samples belonging to the same category as the fusion feature of the i-th target; $s_{ij}$ is the normalized similarity between the fusion feature of the i-th target and the fusion feature of the j-th target; $s_{ir}$ is the normalized similarity between the fusion feature of the i-th target and the fusion feature of the r-th target; and $\sum_{r \neq i}\exp(s_{ir})$ is the sum of the exponentials of the normalized similarities between the fusion feature of the i-th target and the fusion features of all other targets except itself.
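A minimal sketch of a supervised contrastive loss in its commonly used form is given below. It is written from the definitions in this paragraph; the temperature scaling of the cosine similarity and the averaging over samples that have at least one positive are assumptions, not details confirmed by the embodiment.

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """features: N x D array, labels: length-N sequence. Returns a scalar loss."""
    labels = np.asarray(labels)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = (f @ f.T) / temperature              # normalized similarity
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Stable log of the sum over r != i of exp(sim[i, r]).
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denominator
    # Positive pair mask: same category, excluding the sample itself.
    positive = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positive.sum(axis=1)
    valid = pos_count > 0                      # skip samples with no positive
    per_sample = -(np.where(positive, log_prob, 0.0).sum(axis=1)[valid]
                   / pos_count[valid])
    return per_sample.mean()

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_good = supervised_contrastive_loss(features, [0, 0, 1, 1])
loss_bad = supervised_contrastive_loss(features, [0, 1, 0, 1])
```

The loss is small when same-category features are close together (`loss_good`) and large when the labels pair up dissimilar features (`loss_bad`).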

[0133] The anisotropic discriminative loss and the supervised contrastive loss are weighted and summed according to preset weighting coefficients to obtain the total loss value L:

[0134] $$L = \lambda_1 L_{ang} + \lambda_2 L_{sup} \qquad (7)$$

[0135] where $\lambda_1$ and $\lambda_2$ are preset weighting coefficients (e.g., 0.5). The total loss considers both "sample-prototype" discrimination and "sample-sample" contrastive discrimination.

[0136] Then, the gradient of the total loss L with respect to all trainable parameters in the feature extraction network (and possibly the fusion layer) is calculated using the backpropagation algorithm. An optimizer (such as Adam or SGD) is used to update the network parameters based on the gradient, causing the loss value to gradually decrease. A parameter update is performed after each small sample task (one episode). Steps S12 to S17 are repeated until the feature extraction network converges.

[0137] The inference phase includes the following steps:

[0138] Step S21: Obtain support samples for the candidate categories (i.e., the target instance bounding boxes in the support set).

[0139] For each candidate (fine-grained) category that needs to be distinguished during inference (e.g., there may be 60 candidate categories in a real-world application, but only 50 categories have been learned during training), obtain at least one labeled target instance bounding box for that category. These labeled samples can be pre-collected and annotated reference images, typically providing one or more (e.g., five) samples per category. These samples will be used to compute the feature prototypes for that category.

[0140] Step S22: Perform structure perception and segmentation on the supporting samples of the candidate categories.

[0141] For each target instance bounding box of each candidate category, perform structure awareness and segmentation in exactly the same way as during the training phase:

[0142] A large multimodal vision model performs structure perception on the target instance bounding box, automatically identifying the key component regions of the target (such as the head, body, and wings). Using the identified key component regions as prompts, a segmentation model performs pixel-level segmentation of the target instance bounding box, generating a mask image for each key component.

[0143] This process is exactly the same as step S13 in the training phase, outputting the mask images of each key component for each supporting sample.

[0144] Step S23: Preprocess the supporting samples for the candidate categories.

[0145] If the training phase includes the super-resolution reconstruction of step S14, the same super-resolution reconstruction needs to be performed here on the target instance bounding boxes of each candidate category: each target instance bounding box is input into the progressive restoration mechanism based on a diffusion model, and after M iterations an enhanced image is obtained for subsequent processing. If super-resolution reconstruction is not used in the training phase, the original target instance bounding boxes are used directly.

[0146] Step S24: Extract the target fusion features of the candidate category supporting samples.

[0147] For each supporting sample (or the enhanced image after super-resolution reconstruction) of each candidate category, the target fused features of each supporting sample are obtained using the trained feature extraction network, following the same feature extraction and fusion method as step S15 in the training phase. Specifically, this includes:

[0148] The target instance bounding box is input into the feature extraction network for feature encoding, yielding an intermediate feature map. Global features are extracted from the intermediate feature map through global average pooling. For each key component, its mask image is adjusted to the same spatial size as the intermediate feature map, multiplied element-wise with the feature map, and pooled to obtain the local features of that component. The global features are concatenated with all local features and then refined through convolutional layers to obtain the final target fusion features.

[0149] Step S25: Calculate the feature prototype for each candidate category.

[0150] For each candidate category, the element-wise average of the target fusion feature vectors of all supporting samples in that category is taken to obtain the feature prototype of that category. That is, if a category has K supporting samples, and the target fusion feature dimension of each supporting sample is D, then the sum of each dimension of the K target fusion features is divided by K to form a new D-dimensional vector, which serves as the feature prototype of that category. This prototype represents the standard feature representation of that category.

[0151] Step S26: Perform the same processing on the samples to be classified.

[0152] Obtain a target instance bounding box to be classified (of unknown category), and process it in the same way as steps S22 to S24 above to obtain the target fusion features of the sample to be classified.

[0153] Step S27: Match and output classification results.

[0154] The target fused features of the sample to be classified are matched with the feature prototypes of each candidate category. The matching method uses cosine similarity: the cosine similarity between the target fused features of the sample to be classified and the feature prototypes of each candidate category is calculated (i.e., the dot product of the two vectors divided by the product of their respective magnitudes). The larger the cosine similarity value, the closer the features are.

[0155] The candidate category with the highest cosine similarity is selected as the fine-grained classification result for the sample to be classified. This similarity value can also be output as the classification confidence score.
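The matching step can be sketched as a nearest-prototype search under cosine similarity. The 2-dimensional prototypes and the function name `classify` are illustrative only.

```python
import numpy as np

def classify(query_feature, prototypes):
    """Return (best_label, cosine_similarity) over a dict of prototypes."""
    q = query_feature / np.linalg.norm(query_feature)
    best_label, best_score = None, -np.inf
    for label, proto in prototypes.items():
        # Cosine similarity: dot product of the two unit-length vectors.
        score = float(q @ (proto / np.linalg.norm(proto)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

prototypes = {
    "Little Egret": np.array([1.0, 0.0]),
    "Great Egret": np.array([0.0, 1.0]),
}
label, confidence = classify(np.array([0.9, 0.1]), prototypes)
```

The returned similarity doubles as the confidence score mentioned above.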

[0156] If the candidate category set in the actual application differs from the category set in the training phase (for example, only 50 bird species were learned in the training phase, while 60 species need to be identified in the inference phase), the above steps in the inference phase do not require any modification: simply add the labeled samples of the 10 newly added species to the candidate category support set and recalculate the feature prototypes of these categories. This demonstrates the good scalability of this invention for incremental categories with small samples.
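The extension to new categories amounts to adding prototypes without retraining or touching existing ones, as in this sketch. `PrototypeBank` is a hypothetical helper introduced for illustration, and the 3-dimensional features stand in for the real target fusion features.

```python
import numpy as np

class PrototypeBank:
    """Holds one prototype per candidate category (illustrative helper)."""

    def __init__(self):
        self.prototypes = {}

    def add_category(self, label, support_features):
        # A new category only needs the mean of its own support features.
        self.prototypes[label] = np.mean(np.stack(support_features), axis=0)

    def predict(self, feature):
        scores = {
            label: float(feature @ p / (np.linalg.norm(feature) * np.linalg.norm(p)))
            for label, p in self.prototypes.items()
        }
        return max(scores, key=scores.get)

bank = PrototypeBank()
bank.add_category("Little Egret", [np.array([1.0, 0.0, 0.0])])
bank.add_category("Great Egret", [np.array([0.0, 1.0, 0.0])])
query = np.array([0.1, 0.2, 1.0])
before = bank.predict(query)

# Register a category that was never seen during training.
bank.add_category("Intermediate Egret",
                  [np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.2, 1.0])])
after = bank.predict(query)
```

Before the new category is registered, the query is forced onto the closest known category; afterwards it matches the newly added prototype, while the earlier prototypes are unchanged.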

[0157] The above online inference process makes full use of the overall semantic information of the target and the local structural information of key components extracted in the aforementioned feature extraction and fusion step (step S15). Through the deep fusion of the two, it effectively differentiates categories that are highly similar in appearance but have subtle differences in component arrangement.

[0158] By employing a hierarchical collaborative inference architecture that proceeds from object detection (steps S11-S12), image enhancement (step S14), and structure perception and segmentation (step S13) to feature extraction and fusion (step S15) and prototype matching, this invention ensures high recognition accuracy while also taking into account processing efficiency and system stability in engineering applications, thus meeting the requirements for real-time deployment.

[0159] This invention was deployed and tested on a Titan V platform with x86 architecture under Linux. Using a small-sample training scenario (1-5 samples per class), the system achieved an accuracy of 93.16% for fine-grained recognition of multiple classes in a 5-way 5-shot scenario (5 classes selected per task, 5 target instance bounding boxes provided for each class) and 83.36% in a 5-way 1-shot scenario (5 classes selected per task, 1 target instance bounding box provided for each class). The end-to-end inference frame rate remained stable at 16 FPS, fully meeting the dual requirements of real-time performance and accuracy.

[0160] The main technical advantages of this invention include:

[0161] Super-resolution reconstruction technology (step S14) effectively restores high-frequency details of low-quality images, improving feature extraction quality from the source.

[0162] By using a multimodal visual large model to assist in structural perception (step S13), the automatic identification and fine segmentation of key components can be achieved without the need for expensive manual component annotation.

[0163] By using a dual-branch (global features and local features) feature fusion (step S15) and a multi-prototype discrimination mechanism (step S16), the fine-grained classification capability is significantly improved, effectively addressing small sample sizes, long-tail categories, and cross-domain scenarios.

[0164] By jointly optimizing the anisotropic discriminative loss and the supervised contrastive loss (step S17), the discriminativeness of the feature space is enhanced from both the "sample-prototype" perspective and the "sample-sample" contrastive perspective, thereby achieving inter-class separation and intra-class aggregation, and improving the clarity of the model's classification boundary and the reliability of the output confidence.

[0165] This invention not only possesses excellent engineering deployability and hardware adaptability, but also continuously tracks model inference behavior, confidence distribution, and resource consumption through a built-in performance monitoring module, providing data support for long-term stable operation and iterative optimization in complex environments.

[0166] Example 2

[0167] This invention also provides a component-aware small-sample fine-grained classification system, which includes a memory, a processor, and a computer program or instructions stored in the memory. The processor executes the computer program or instructions to implement the component-aware small-sample fine-grained classification method of this invention.

[0168] Although not shown, the system includes a processor that performs various appropriate operations and processes based on programs and / or data stored in read-only memory (ROM) or loaded from a storage portion into random access memory (RAM). The processor may be a multi-core processor or may contain multiple processors. In some embodiments, the processor may include a general-purpose main processor and one or more specialized coprocessors, such as a central processing unit, graphics processing unit (GPU), neural network processor (NPU), digital signal processor (DSP), etc. Various programs and data required for system operation are also stored in RAM. The processor, ROM, and RAM are interconnected via a bus. Input / output (I / O) interfaces are also connected to the bus.

[0169] The processor and memory described above are used together to execute programs / instructions stored in the memory. When the program / instructions are executed by the computer, they can implement the methods, steps, or functions described in the above embodiments.

[0170] Although not shown, embodiments of the present invention also provide a computer-readable storage medium having a computer program or instructions stored thereon, which, when executed by a processor, implements the component-aware small-sample fine-grained classification method of the present invention.

[0171] Readable storage media include both permanent and non-permanent, removable and non-removable media that can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0172] The above description only discloses specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or modifications that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A component-aware, small-sample, fine-grained classification method, comprising a training phase and an inference phase, characterized in that:
the training phase includes:
sampling multiple categories from a sample dataset and, for each category, obtaining its support set and query set, each of which contains at least one target instance bounding box image with a category label;
performing structure perception and segmentation on the target instance bounding box image to generate a mask image of each key component of the target;
extracting global features of the target instance bounding box image using a feature extraction network, extracting local features of each key component based on the mask images, and fusing the global features and the local features to obtain target fusion features;
for each category, obtaining the feature prototype of that category as the average of the target fusion features of the targets in its support set;
calculating a loss value based on the target fusion features of each target in the query set and the feature prototypes of the corresponding categories, and updating the feature extraction network accordingly;
the inference phase includes:
obtaining at least one labeled target instance bounding box image for each candidate category;
using the trained feature extraction network, extracting the target fusion features of the target instance bounding box images of each candidate category with the same structure perception and segmentation, feature extraction, and fusion as in the training phase, and taking their average as the feature prototype of that category;
extracting the target fusion features of the sample to be classified in the same manner as in the training phase;
matching the target fusion features of the sample to be classified against the feature prototypes of the candidate categories to output the fine-grained classification result.
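The prototype and matching steps of claim 1 can be illustrated with a minimal NumPy sketch. It assumes the target fusion features have already been extracted as fixed-length vectors, and it assumes cosine similarity as the matching rule, which the claim does not specify. The function names `class_prototypes` and `match_to_prototypes` and the toy data are hypothetical.

```python
import numpy as np

def class_prototypes(features, labels):
    """Average the fusion features of each category's labeled samples."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def match_to_prototypes(queries, classes, protos):
    """Assign each query to the prototype with the highest cosine similarity.

    Cosine similarity is an assumption; the claim only says "matched".
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = q @ p.T                      # (num_queries, num_classes)
    return classes[sims.argmax(axis=1)], sims

# Toy fusion features: two categories, two labeled samples each.
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(support, support_labels)

queries = np.array([[0.8, 0.2], [0.2, 0.7]])
predicted, sims = match_to_prototypes(queries, classes, protos)
print(predicted)
```

Because the prototypes are plain averages, adding a new candidate category at inference only requires averaging the fusion features of its labeled samples, with no retraining.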

2. The component-aware, small-sample, fine-grained classification method according to claim 1, characterized in that the process of obtaining the sample dataset includes:
performing target detection on an original image using a pre-trained target detection model to obtain bounding box information of the target instances;
cropping the region containing each target instance from the original image based on the bounding box information to obtain the target instance bounding box image;
labeling the target instance bounding box image with its category to obtain the corresponding category label;
constructing the sample dataset from the target instance bounding box images and their category labels.
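The cropping and dataset construction steps of claim 2 might look like the sketch below. It assumes the detector has already returned boxes as `(x1, y1, x2, y2)` pixel coordinates and that category labels are supplied alongside them; the detector itself is not modeled. The name `build_dataset`, the box format, and the label strings are hypothetical.

```python
import numpy as np

def build_dataset(image, boxes, labels):
    """Crop each detected box from an H x W x C image and pair it with its label.

    Boxes are assumed to be (x1, y1, x2, y2) in pixels; they are clipped to
    the image bounds, and boxes with no area after clipping are skipped.
    """
    h, w = image.shape[:2]
    dataset = []
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        x1, x2 = max(0, int(x1)), min(w, int(x2))
        y1, y2 = max(0, int(y1)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            dataset.append((image[y1:y2, x1:x2].copy(), label))
    return dataset

image = np.zeros((100, 200, 3), dtype=np.uint8)   # stand-in for an original image
boxes = [(10, 20, 60, 80), (150, 50, 260, 120), (5, 5, 5, 30)]
labels = ["type_a", "type_b", "type_a"]
dataset = build_dataset(image, boxes, labels)
for crop, label in dataset:
    print(label, crop.shape)
```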

3. The component-aware, small-sample, fine-grained classification method according to claim 2, characterized in that the target detection model is built on the YOLOv8 network architecture and includes:
an attention mechanism module at the input end, used to enhance the input image in the spatial and channel dimensions;
a backbone network using the HGNetV2 architecture, used to extract multi-scale features from the feature maps enhanced by the attention mechanism module;
a neck network using a C2f structure and an SPPF pooling module, used to perform cross-layer fusion of the multi-scale features output by the backbone network;
a head network configured with multi-scale detection heads, used to output the bounding box information and confidence of the target instances based on the fused features output by the neck network.

4. The component-aware, small-sample, fine-grained classification method according to claim 1, characterized in that performing structure perception and segmentation on the target instance bounding box image to generate a mask image of each key component of the target includes: performing structure perception on the target instance bounding box image using a large multimodal vision model to identify the key component regions; using the key component regions as prompt information, segmenting the target instance bounding box image at the pixel level with a segmentation model to obtain the mask image of each key component of the target.

5. The component-aware, small-sample, fine-grained classification method according to claim 1, characterized in that extracting global features of the target instance bounding box image using a feature extraction network, extracting local features of each key component based on the mask images, and fusing the global features and the local features to obtain the target fusion features specifically includes:
encoding the target instance bounding box image with the feature extraction network to obtain an intermediate feature map;
performing global average pooling on the intermediate feature map to obtain the global features;
for each key component, resizing its mask image to the same spatial size as the intermediate feature map, multiplying it element-wise with the intermediate feature map, and performing global average pooling on the product to obtain the local features of that key component;
fusing the global features with the local features to obtain initial fusion features;
performing feature extraction on the initial fusion features using convolutional layers to obtain the final target fusion features.
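The pooling and fusion steps of claim 5 can be sketched in NumPy as follows. The sketch assumes a channel-first intermediate feature map, nearest-neighbour mask resizing, and concatenation as the fusion operator; none of these choices is fixed by the claim. The learned convolutional layers applied afterwards are omitted, and the names `resize_mask_nearest` and `fuse_features` are hypothetical.

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D binary mask to (out_h, out_w)."""
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols[None, :]]

def fuse_features(feature_map, masks):
    """Build the initial fusion feature from a (C, H, W) feature map.

    Global feature: global average pooling over H and W.
    Local feature per component: mask the map, then global average pooling.
    Fusion by concatenation is an assumption made for this sketch.
    """
    _, h, w = feature_map.shape
    global_feat = feature_map.mean(axis=(1, 2))
    local_feats = []
    for mask in masks:
        m = resize_mask_nearest(mask, h, w).astype(feature_map.dtype)
        local_feats.append((feature_map * m[None]).mean(axis=(1, 2)))
    return np.concatenate([global_feat] + local_feats)

feature_map = np.ones((4, 8, 8))                      # toy intermediate feature map
left = np.zeros((32, 32)); left[:, :16] = 1           # toy component masks at
right = np.zeros((32, 32)); right[:, 16:] = 1         # image resolution
fused = fuse_features(feature_map, [left, right])
print(fused.shape)
```

Note that pooling the masked map over all positions, as the claim describes, scales each local feature by the fraction of the map the component covers.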

6. The component-aware, small-sample, fine-grained classification method according to claim 1, characterized in that the loss value is a weighted sum of the angular anisotropy discrimination loss and the supervised contrastive loss; the angular anisotropy discrimination loss is used to constrain the angular relationship between the target fusion feature and the feature prototype of the corresponding category; the supervised contrastive loss is used to pull the fusion features of same-category samples closer together and push the fusion features of different-category samples apart.

7. The component-aware, small-sample, fine-grained classification method according to claim 6, characterized in that the angular anisotropy discrimination loss is expressed as:
[formulas for the loss and for the positive-class and negative-class angle transformations are missing from the source text]
where the expression involves the following quantities: the angular anisotropy discrimination loss itself; $N$, the total number of samples in the query sets of the categories sampled in the current task; the gradient adjustment coefficient; the positive-class angle transformation result and the negative-class angle transformation result corresponding to the fusion feature of the $i$-th target; the angle discrimination threshold; $\theta_{i,y_i}$, the angle between the fusion feature of the $i$-th target and the feature prototype of its true category $y_i$; $\theta_{i,k_i}$, the angle between the fusion feature of the $i$-th target and the feature prototype of the category with negative-class prototype index $k_i$, where $k_i$ is the index of the category whose feature prototype is most similar to the target fusion feature of the sample among all categories different from the sample's true category; $m$, the angular margin, a preset positive angle value added on top of the angle between the two classes; $\theta_{i,c}$, the angle between the fusion feature $f_i$ of the $i$-th target and the feature prototype $p_c$ of the $c$-th category, with $\cos\theta_{i,c} = \dfrac{f_i^{T} p_c}{\lVert f_i \rVert \, \lVert p_c \rVert}$; the superscript $T$ denotes the matrix transpose; $\lVert \cdot \rVert$ denotes the norm.
The supervised contrastive loss is expressed as:
[formula missing from the source text]
where the expression involves the following quantities: the supervised contrastive loss itself; the positive sample pair mask, whose value is 1 when the fusion feature of the $i$-th target and the fusion feature of the $j$-th target belong to the same category, and 0 otherwise; the number of positive samples that belong to the same category as the fusion feature of the $i$-th target; the normalized similarity between the fusion feature of the $i$-th target and the fusion feature of the $j$-th target; the normalized similarity between the fusion feature of the $i$-th target and the fusion feature of the $r$-th target; and the sum of the exponentials of the normalized similarities between the fusion feature of the $i$-th target and the fusion features of all other targets except itself.
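The quantities that claim 7 defines in words can be computed as in the sketch below. It covers only the feature-to-prototype angles, the hardest negative prototype index, and a supervised contrastive loss in its commonly used form; the angle transformations and the exact loss expressions of the claim are not reproduced here. The temperature parameter, the averaging over anchors, the exclusion of each sample from its own positives, and all function names are assumptions. Labels are assumed to be integer indices into the prototype rows.

```python
import numpy as np

def angles_to_prototypes(features, protos):
    """Angle (radians) between every fusion feature and every prototype."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return np.arccos(np.clip(f @ p.T, -1.0, 1.0))      # shape (N, C)

def hardest_negative(theta, labels):
    """Index of the most similar prototype among the non-true categories."""
    masked = theta.copy()
    masked[np.arange(len(labels)), labels] = np.inf     # exclude the true category
    return masked.argmin(axis=1)                        # smallest angle = most similar

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Common supervised contrastive form (temperature is an assumption)."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    pos = (labels[:, None] == labels[None, :]) & not_self   # positive pair mask
    masked = np.where(not_self, sim, -np.inf)               # drop self-similarity
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                                       # anchors with a positive
    per_anchor = -(log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return per_anchor.mean()

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
protos = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

theta = angles_to_prototypes(features, protos)
neg_idx = hardest_negative(theta, labels)
loss_clustered = supervised_contrastive_loss(features, labels)
loss_mixed = supervised_contrastive_loss(features, np.array([0, 1, 0, 1]))
print(neg_idx, loss_clustered < loss_mixed)
```

In the toy data the third prototype lies between the other two, so it is the hardest negative for every sample, and labels that agree with the feature clusters give a lower contrastive loss than labels that cut across them.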

8. The component-aware, small-sample, fine-grained classification method according to any one of claims 1 to 7, characterized in that, before the feature extraction network is used, the method further includes: performing super-resolution reconstruction on the target instance bounding box image to enhance image details; the super-resolution reconstruction employs a stepwise restoration mechanism based on a diffusion model, including: constructing a progressive restoration process with M steps; using the target instance bounding box image to be restored as the conditional input; at step t, using a prediction model to predict the image restoration amount from the intermediate state of the current step and the target instance bounding box image, and updating the intermediate state; obtaining the enhanced image after M iterations.
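The control flow of the M-step restoration in claim 8 can be sketched as a simple loop. The sketch assumes the intermediate state is initialized from the conditional image and that the update is additive; the claim fixes neither. The trained prediction model is replaced by a toy stand-in that is told the reference image, purely to exercise the loop. The names `progressive_restore` and `toy_predictor` are hypothetical.

```python
import numpy as np

def progressive_restore(condition, predict_update, num_steps):
    """Run num_steps restoration steps conditioned on the input image.

    predict_update(state, condition, t) plays the role of the prediction
    model and returns the restoration amount for step t.
    """
    condition = np.asarray(condition, dtype=np.float64)
    state = condition.copy()            # assumption: start from the condition image
    for t in range(1, num_steps + 1):
        state = state + predict_update(state, condition, t)
    return state

M = 5
low_res = np.full((4, 4), 0.5)                         # toy image to be restored
reference = np.linspace(0.0, 1.0, 16).reshape(4, 4)    # toy "detailed" image

def toy_predictor(state, condition, t):
    # Stand-in only: closes an equal share of the remaining gap at each step.
    return (reference - state) / (M - t + 1)

enhanced = progressive_restore(low_res, toy_predictor, M)
print(np.abs(enhanced - reference).max())
```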

9. A component-aware, small-sample, fine-grained classification system, comprising a memory, a processor, and a computer program or instructions stored in the memory, characterized in that, The processor executes the computer program or instructions to implement the component-aware small-sample fine-grained classification method as described in any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program or instructions stored thereon, characterized in that, when the computer program or instructions are executed by a processor, they implement the component-aware, small-sample, fine-grained classification method as described in any one of claims 1 to 8.