A two-stage prototype classification method and system for small sample image recognition
By adopting a decoupled two-stage recognition method, combined with bi-branch feature extraction and prototype discrimination, the problem of difficulty in distinguishing subtle differences and unstable confidence calibration in small sample image instance recognition is solved. This achieves high-precision, low-latency end-to-end target recognition, which is suitable for target recognition and judgment under complex imaging conditions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID HUBEI ELECTRIC POWER RES INST
- Filing Date
- 2026-04-24
- Publication Date
- 2026-06-26
AI Technical Summary
In scenarios involving small sample and fine-grained image instance recognition, single-stage multi-class target detection suffers from problems such as difficulty in distinguishing subtle differences, sensitivity to ROI bias, and unstable confidence calibration due to feature sharing and general classification head design. Existing two-stage solutions fail to specifically model fine-grained and small sample scenarios, resulting in fragile discrimination boundaries and large fluctuations in confidence.
A decoupled two-stage recognition method is adopted. A YOLOv8-based detection model and an EMO-based dual-branch feature extraction network are combined with learnable-temperature prototype discrimination, momentum-updated multi-prototype memory, prototype angle loss, and supervised contrastive loss to improve fine-grained discrimination ability, small-sample generalization, and calibration performance.
It significantly improves fine-grained discrimination capability and small-sample generalization and calibration performance, achieving high-precision, low-latency end-to-end target recognition. It is suitable for target recognition and analysis under complex imaging conditions, meeting the real-time and accuracy requirements of tactical scenarios.
Description
Technical Field
[0001] This invention relates to the fields of computer vision and intelligent perception, and specifically to a method and system for automatic image instance category recognition under small-sample, fine-grained conditions.
Background Technology
[0002] In general task scenarios, single-stage multi-class object detection networks are commonly used to perform localization and classification simultaneously. In small-sample, fine-grained image instance recognition, however, these methods share features between detection and classification and use a generic classification head, so they struggle to characterize the subtle differences between device models. The result is confusion between similar models, unstable confidence calibration, and limited overall performance. The generalization ability of single-stage multi-class detection is further reduced by the limited scale and uneven distribution of training data; under missing details, changing viewpoints, and resolution fluctuations, it is difficult to meet the accuracy and robustness requirements of instance category recognition. To alleviate these problems, engineering practice typically decouples detection and classification into two stages, reducing the representational compromises caused by task coupling. However, existing two-stage schemes often use generic single-branch features and standard losses in the classification module and do not specifically model the uncertainty introduced by fine-grained categories, small sample sizes, and detection box jitter. This results in fragile discrimination boundaries, sensitivity to ROI bias, and large confidence fluctuations. Conventional data preprocessing and augmentation (such as basic geometric / color augmentation and general super-resolution) improve image quality and data validity to some extent, but the gains are limited and do not resolve the core problem of insufficient fine-grained features and inadequate small-sample generalization.
Therefore, there is an urgent need for a two-stage method oriented towards fine-grained and small-sample constraints that introduces targeted representation aggregation and discrimination optimization mechanisms at the classification module level. Such a method would close the performance gap of single-stage multi-class detection in this type of task and improve the overall stability and usability of recognition.
Summary of the Invention
[0003] This invention addresses the problems of single-stage multi-class target detection in small-sample, fine-grained image instance recognition: difficulty in distinguishing subtle differences, sensitivity to ROI bias, and unstable confidence calibration, all caused by feature sharing and a generic classification head design. To overcome these bottlenecks, the invention proposes a decoupled two-stage recognition method and system. While keeping the detection module simple and reliable, it introduces into the classification module dual-branch feature aggregation, learnable-temperature prototype discrimination, momentum-updated multi-prototype memory, and a joint loss combining prototype angle loss and supervised contrastive loss. This significantly improves fine-grained discrimination capability as well as small-sample generalization and calibration performance.
[0004] The technical solution of this invention is: a two-stage prototype classification method for small sample image recognition, characterized by comprising the following steps:
[0005] Step S1: Data Acquisition and Preprocessing
[0006] S11. Collect a small sample dataset of images;
[0007] S12. Perform basic data augmentation on the original samples, such as random scaling, horizontal flipping, color perturbation, and random erasure, to increase data diversity;
[0008] Step S2: Detection Model Training
[0009] S21. The detection model is trained using an architecture modified based on YOLOv8;
[0010] S22. Save the optimal weight file of the detection model to participate in subsequent S41 inference;
[0011] Step S3: Classification Model Training
S31. Construct a dual-branch feature extraction network based on EMO, and fuse the shallow texture branch and deep shape branch features using global GeM pooling and learnable weights to initially train the classification model;
[0012] S32. Attach a prototype projection head with a learnable temperature to generate normalized features and automatically adjust the temperature parameter;
[0013] S33. Combine angular margin loss and supervised contrastive loss, and dynamically update the class prototype vectors using a multi-prototype bank;
[0014] S34. Unfreeze and fine-tune part of the backbone until convergence to obtain the optimal classification model;
[0015] Step S4: Inference Stage
[0016] S41. Input the original image, call the detection model, and output a set of target candidate boxes;
[0017] S42. Perform super-resolution reconstruction on each candidate box in the target candidate box set to restore details;
[0018] S43. Feed the super-resolution target candidate boxes into the optimal classification model to obtain the class probability distribution;
[0019] S44. Select the category with the highest class probability as the classification result category in the inference stage, and combine it with the target location obtained in the target detection stage S41 to accurately map it to the original image to obtain the final target location and category result.
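The inference steps S41 to S44 can be read as a simple cascade. The following is a minimal sketch of that control flow only; the callables `detect`, `crop`, `super_resolve`, and `classify` are hypothetical placeholders standing in for the trained detection model, the cropping routine, the super-resolution module, and the optimal classification model, and are not names used by the invention.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in original-image pixels


@dataclass
class Recognition:
    box: Box            # location kept from the detection stage (S41)
    category: int       # index of the highest-probability class (S44)
    probability: float  # probability of that class


def recognize(image,
              detect: Callable[[object], List[Box]],
              crop: Callable[[object, Box], object],
              super_resolve: Callable[[object], object],
              classify: Callable[[object], Sequence[float]]) -> List[Recognition]:
    """Run the detect -> super-resolve -> classify cascade on one image."""
    results = []
    for box in detect(image):                    # S41: candidate boxes
        patch = super_resolve(crop(image, box))  # S42: restore details per box
        probs = classify(patch)                  # S43: class probability distribution
        best = max(range(len(probs)), key=probs.__getitem__)  # S44: arg-max class
        results.append(Recognition(box, best, probs[best]))   # box is reused unchanged
    return results
```

Because the box from S41 is carried through unchanged, the category is attached to the original-image location without any coordinate conversion.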
[0020] In step S21, the original YOLOv8 network is made lightweight through modifications in four aspects: the backbone network, the neck network, the detection head, and the loss function.
[0021] S211 Backbone Network
[0022] (1) Attention backbone
[0023] The input original image first enters the improved Stem layer, which uses depthwise separable convolution combined with a lightweight coordinate attention mechanism;
[0024] (2) Lightweight network backbone
[0025] The default YOLOv8 backbone CSPDarknet is replaced with the HGNet V2 backbone. HGNet V2 introduces HGBlock (grouped residual block) in each feature extraction unit, which greatly reduces the number of parameters and computation through grouped convolution and channel reuse techniques. At the same time, in the downsampling stage, cascaded small receptive field convolution and deformable convolution are used to capture the deformation features of instances under different poses.
[0026] S212 Neck Network
[0027] The neck employs a C2F architecture combined with an improved SPPF pooling module.
[0028] C2F cross-layer connections integrate shallow details with deep semantics, enhancing the ability to identify targets at multiple scales;
[0029] SPPF is replaced with a combination of depthwise separable convolution and dilated convolution to expand the effective receptive field and improve the robustness of detecting small, distant targets.
[0030] S213 Detection Head Network
[0031] A multi-scale detection strategy is adopted: a three-level detection head (P3 / P4 / P5) decouples the classification branch from the regression branch; the classification branch uses a lightweight convolutional structure to reduce inference latency; and the localization branch embeds a learnable attention gate to improve the localization accuracy of small instance boxes.
[0032] S214 Loss Function
[0033] A unified IoU Focal Loss-inv is proposed, which models multiple geometric metrics such as IoU, GIoU, DIoU, and CIoU in a unified manner. It adaptively increases the gradient weights of low-IoU and difficult-to-detect samples, thereby enhancing the model's learning ability for difficult examples and small targets. Simultaneously, a modulation factor consistent with the classification Focal Loss is introduced, ensuring consistency in the weighting mechanism for difficult samples between the localization and classification branches, achieving collaborative optimization. This modulation factor is a dynamic weighting function constructed based on prediction confidence or error level, used to suppress the loss contribution of easy samples and strengthen the gradient of difficult samples, thus achieving adaptive attention to difficult examples during training.
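To illustrate the idea of a modulation factor that suppresses easy samples and emphasizes hard ones in the localization loss, here is a small sketch. It is not the invention's unified IoU Focal Loss-inv: the weight `(1 - IoU) ** gamma` and the default `gamma=2.0` are assumptions chosen only because they mirror the focal-style weighting described above, and only plain IoU is modeled.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def modulated_iou_loss(pred, target, gamma=2.0):
    """IoU loss scaled by an assumed focal-style modulation factor.

    base   = 1 - IoU            (plain IoU loss)
    weight = (1 - IoU) ** gamma (near 1 for low-IoU boxes, near 0 for well-fitted boxes)
    """
    base = 1.0 - iou(pred, target)
    return (base ** gamma) * base
```

With this weighting, a box that already overlaps its target well contributes almost nothing, while a poorly localized box keeps nearly its full loss.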
[0034] In step S31, (1) Design of the dual-branch feature extraction network structure
[0035] Based on the EMO backbone network, a dual-branch feature extraction structure is introduced to explicitly separate and enhance discriminative information at different levels; wherein:
[0036] On the one hand, a shallow texture branch is constructed, which extracts shallow semantic information from the early feature layers of the EMO network and enhances it through multi-layer lightweight convolutions. It focuses on characterizing high-frequency local features such as details of the fuselage skin, serial numbers, and paint textures, thereby improving the sensitivity of the classification model to differences in detail.
[0037] On the other hand, a deep shape branch is constructed, which adopts the deep layers of the EMO network and uses methods such as dilated convolution to expand the receptive field, so as to extract overall geometric and contour information such as the wing layout, air intake structure, and vertical tail shape. This enhances the ability to represent differences in the macroscopic structure of instances;
[0038] The output features of the two branches are compressed into fixed-dimensional representations using GeM pooling, L2-normalized, and then adaptively fused through a learnable linear weighting mechanism, yielding a unified image representation that accounts for both local texture and overall shape information:
[0039] $f = w_t \, \hat{f}_t + w_s \, \hat{f}_s$;
[0040] where $\hat{f}_t$ and $\hat{f}_s$ are the L2-normalized, GeM-pooled features of the texture branch and the shape branch, and $w_t$ and $w_s$ are learnable weights;
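The pooling and fusion step can be sketched in a few lines of plain Python. This is an illustration under assumptions: feature maps are represented as lists of channels, the GeM exponent `p=3.0` is a commonly used default rather than a value stated here, and the two fusion weights are passed in as fixed numbers although in training they would be learnable parameters.

```python
import math


def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling: one value per channel.

    feature_map is a list of channels, each a flat list of spatial activations.
    p=1 gives average pooling; large p approaches max pooling.
    """
    return [(sum(max(v, eps) ** p for v in ch) / len(ch)) ** (1.0 / p)
            for ch in feature_map]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def fuse(texture_map, shape_map, w_texture=0.5, w_shape=0.5):
    """Pool each branch, L2-normalize it, then take a weighted sum of the two."""
    t = l2_normalize(gem_pool(texture_map))
    s = l2_normalize(gem_pool(shape_map))
    return [w_texture * a + w_shape * b for a, b in zip(t, s)]
```

Normalizing each branch before the weighted sum keeps one branch from dominating the fused vector merely because its activations are larger.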
[0041] In step S32, the fused feature $f$ is passed through the prototype projection head $g(\cdot)$ to generate the normalized feature $\hat{z}$, which is then scaled by a learnable temperature to obtain the optimized discriminative feature $z$;
[0042] where:
[0043] $\hat{z} = g(f) / \lVert g(f) \rVert_2$;
[0044] To characterize the distribution differences of different categories in the feature space, a learnable temperature parameter is introduced to scale the features, thereby obtaining the optimized discriminative feature:
[0045] $z = \hat{z} / \tau$
[0046] where $\tau$ is the temperature parameter.
[0047] In step S33,
[0048] (1) Angular margin discrimination constraint
[0049] An ArcFace angular margin constraint is introduced in the prototype discrimination process so that samples of the same class cluster more tightly on the unit sphere and clearer angular boundaries form between the classification logits of different categories, thereby improving the discriminative power of the classification model;
[0050] The target-class classification logit is calculated as follows:
[0051] $\ell_y = s \cdot \cos(\theta_y + m)$
[0052] where $s$ is the similarity scaling parameter, $m$ is the ArcFace margin parameter, $\theta_y$ is the angle between the current sample feature vector and its class prototype vector, and $y$ is the class label of the current sample;
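As a concrete illustration of the angular margin, the sketch below computes prototype logits for one sample and adds the margin only to the angle of the true class. It assumes unit-length features and prototypes; the defaults `s=30.0` and `m=0.5` are typical ArcFace values used here as placeholders, and the special handling ArcFace implementations usually apply when the shifted angle exceeds pi is omitted.

```python
import math


def arcface_logits(feature, prototypes, label, s=30.0, m=0.5):
    """Return one logit per class for an L2-normalized feature.

    prototypes[c] is the unit-length prototype of class c.
    The margin m is added to the angle of the true class only, which lowers
    its logit and forces the feature closer to its prototype during training.
    """
    logits = []
    for c, proto in enumerate(prototypes):
        cos = sum(f * p for f, p in zip(feature, proto))
        cos = max(-1.0, min(1.0, cos))         # guard acos against rounding
        if c == label:
            cos = math.cos(math.acos(cos) + m)  # cos(theta_y + m)
        logits.append(s * cos)
    return logits
```

The resulting logits would normally be passed to a softmax cross-entropy loss.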
[0053] (2) Supervised contrastive learning constraint
[0054] In addition, a supervised contrastive learning mechanism is introduced at the batch level, treating samples of the same class as positive pairs and samples of different classes as negative pairs. By explicitly compressing intra-class distance and increasing inter-class distance, intra-class variance is further reduced, improving the classification model's ability to identify subtle differences. The contrastive loss $L_{con}$ is calculated as follows:
[0055] $L_{con} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(z_i, z_p) / \tau_c)}{\sum_{a \neq i} \exp(\mathrm{sim}(z_i, z_a) / \tau_c)}$; where $P(i)$ is the set of samples of the same class as sample $i$, $\tau_c$ is the contrastive temperature, and $N$ is the number of samples in the current batch.
[0056] $z_i$: the optimized discriminative feature of the $i$-th sample;
[0057] $z_p$: the feature vector of a positive sample of the same category as sample $i$;
[0058] $z_a$: the feature vector of any sample in the batch other than sample $i$;
[0059] $P(i)$: the set of samples belonging to the same category as sample $i$ (excluding $i$ itself);
[0060] $|P(i)|$: the size of the positive sample set;
[0061] $\mathrm{sim}(z_i, z_p)$: the cosine similarity between sample $i$ and a positive sample;
[0062] $\mathrm{sim}(z_i, z_a)$: the cosine similarity between sample $i$ and another sample;
[0063] $\tau_c$: the contrastive learning temperature parameter, which adjusts the smoothness of the similarity distribution.
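The batch-level contrastive term can be sketched as follows. This follows the standard supervised contrastive formulation, which is assumed to match the loss described above; features are assumed to be unit-length so that the dot product equals cosine similarity, the default `tau=0.1` is a placeholder, and anchors with no positive in the batch are skipped.

```python
import math


def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss over one batch of unit-length features."""
    n = len(features)
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # no same-class partner in this batch
        # Scaled similarity of the anchor to every other sample in the batch.
        sims = {a: sum(x * y for x, y in zip(features[i], features[a])) / tau
                for a in range(n) if a != i}
        top = max(sims.values())
        # log of the softmax denominator, computed stably
        log_denom = top + math.log(sum(math.exp(v - top) for v in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
    return total / n
```

The loss is small when same-class features point in the same direction and different-class features do not, and grows when the labels cut across the clusters.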
[0064] (3) Multi-prototype bank mechanism
[0065] For the optimized discriminative feature $z$ extracted from an input sample with true category $y$, compute the cosine similarity between $z$ and each of the $K_y$ prototypes of category $y$, and select the index $k^*$ of the nearest prototype:
[0066] $k^* = \arg\max_{k \in \{1, \dots, K_y\}} \mathrm{sim}(z, p_{y,k})$;
[0067] where $p_{y,k}$ is the $k$-th prototype vector of category $y$, each prototype vector representing a sub-distribution center of the category in the feature space; $K_y$ is the number of prototypes maintained for category $y$; $\{p_{y,k}\}_{k=1}^{K_y}$ is the set of prototype vectors of category $y$; and $\mathrm{sim}(z, p_{y,k})$ is the cosine similarity between $z$ and $p_{y,k}$. The selected prototype is updated according to the momentum rule:
[0068] $p_{y,k^*}^{new} = \mu \, p_{y,k^*}^{old} + (1 - \mu) \, z$
[0069] where $\mu$ is the momentum coefficient, with a value range of (0,1); $p_{y,k^*}^{new}$ is the updated $k^*$-th prototype vector of class $y$; $p_{y,k^*}^{old}$ is the $k^*$-th prototype vector of class $y$ before the update; $y$ is the true category label of the current input sample; and $k^*$ is the index, among the $K_y$ prototypes of class $y$, of the prototype most similar to the features of the current sample.
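The nearest-prototype selection and momentum update can be illustrated with a small in-memory bank. This is a sketch under assumptions: the bank is a plain dictionary from class label to a list of prototype vectors, the exponential-moving-average form of the update and the default `momentum=0.9` are assumed, and no re-normalization of the updated prototype is performed because none is described here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def update_prototype_bank(bank, feature, label, momentum=0.9):
    """Update, in place, the prototype of `label` that is closest to `feature`.

    bank maps each class label to a list of prototype vectors.
    Returns the index of the prototype that was updated.
    """
    protos = bank[label]
    nearest = max(range(len(protos)), key=lambda k: cosine(feature, protos[k]))
    protos[nearest] = [momentum * p + (1.0 - momentum) * f
                       for p, f in zip(protos[nearest], feature)]
    return nearest
```

Only one prototype per sample moves, so the other prototypes of the class stay free to represent different viewpoints or appearances.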
[0070] In step S34, after completing the construction of the discriminative feature space, the classification model is optimized using a phased parameter unfreezing and stepwise fine-tuning strategy, including the following steps:
[0071] (1) Initial stage: frozen backbone
[0072] Freeze the parameters of the bottom and middle layers of the feature extraction network, and update only the parameters of the high-level feature layers and the prototype classification head;
[0073] The learning rate for this stage is set to the initial learning rate, which is used to quickly establish stable classification and discrimination boundaries;
[0074] (2) Intermediate stage: partial unfreezing
[0075] After training reaches a preset number of epochs, or after the validation set loss converges, gradually unfreeze the high-level parameters of the backbone network and reduce the learning rate to the initial learning rate multiplied by a constant scaling factor;
[0076] This stage is used to improve feature representation capabilities, enabling the model to adapt to fine-grained differences;
[0077] (3) Later stage: global fine-tuning
[0078] Further unfreeze all network parameters and continue to reduce the learning rate using a second scaling factor (the scaling factor for the second unfreezing), so as to achieve fine-tuning of global parameters and stable convergence;
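The three-stage schedule can be expressed as a small lookup from epoch to training configuration. The sketch below is illustrative only: the parameter-group names are hypothetical labels, the stage boundaries are given as epoch counts although the text also allows switching on validation-loss convergence, and applying the second scaling factor on top of the first is an assumption.

```python
def stage_config(epoch, unfreeze_1, unfreeze_2, lr0, gamma1=0.1, gamma2=0.1):
    """Return which parameter groups train and at what learning rate.

    unfreeze_1 / unfreeze_2: epochs at which stages (2) and (3) begin.
    gamma1 / gamma2: scaling factors in (0, 1); the defaults are placeholders.
    """
    if epoch < unfreeze_1:   # (1) frozen backbone
        return {"stage": "frozen",
                "trainable": ("high_level_layers", "prototype_head"),
                "lr": lr0}
    if epoch < unfreeze_2:   # (2) partial unfreezing
        return {"stage": "partial",
                "trainable": ("backbone_high_level", "high_level_layers",
                              "prototype_head"),
                "lr": gamma1 * lr0}
    return {"stage": "global",  # (3) global fine-tuning
            "trainable": ("all",),
            "lr": gamma2 * gamma1 * lr0}
```

A training loop would call this once per epoch and freeze or unfreeze parameter groups accordingly.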
[0079] (4) Convergence control mechanism
[0080] During training, the following control strategy is introduced:
[0081] Learning rate scheduling: employing cosine annealing or piecewise decay strategies;
[0082] Gradient clipping: Limits the gradient norm to no more than a threshold G;
[0083] Weight regularization: controls model complexity by using a weight decay coefficient λ;
[0084] Mixed precision training: Improves training efficiency and stabilizes numerical computation;
[0085] (5) Model selection strategy
[0086] Based on the validation set performance metrics, an early stopping mechanism is triggered when performance has not improved for E consecutive epochs, and the best-performing model parameters are selected as the final output.
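The convergence controls and the model selection rule are easy to show in isolation. The following sketch uses plain Python and hypothetical helper names; it assumes a higher validation metric is better, a cosine schedule without warm restarts, and clipping by the global L2 norm.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max (step 0) down to lr_min (step total_steps)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * step / total_steps))


def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm (threshold G)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


class EarlyStopping:
    """Stop when the validation metric has not improved for `patience` (E) epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = None
        self.best_epoch = None
        self.stale = 0

    def step(self, epoch, metric):
        """Record one epoch; return True when training should stop."""
        if self.best is None or metric > self.best:
            self.best, self.best_epoch, self.stale = metric, epoch, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

After stopping, `best_epoch` identifies which saved checkpoint to keep as the final model.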
[0087] Step S41 Target Detection Stage:
[0088] The original image is scale-normalized, the optimal weight file of the detection model saved in S22 is loaded, and a set of target candidate boxes is output.
[0089] The target candidate boxes are filtered using a non-maximum suppression (NMS) strategy. The specific steps are as follows:
[0090] (1) Sort all target candidate boxes output by the detection model from high to low according to the detection confidence;
[0091] (2) Select the target candidate box with the highest current confidence as the baseline box and add it to the final detection result set;
[0092] (3) Calculate the degree of overlap between the baseline box and the other target candidate boxes, and use the intersection-over-union (IoU) ratio as the metric;
[0093] (4) Delete all target candidate boxes whose IoU with the baseline box is greater than a preset threshold to remove highly overlapping redundant detections;
[0094] (5) Repeat steps (2) to (4) in the remaining candidate boxes until all target candidate boxes have been processed;
[0095] Finally, a set of target candidate boxes with high confidence and low redundancy is obtained.
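Steps (1) to (5) above correspond to greedy non-maximum suppression, which can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the default threshold of 0.5 is a placeholder for the preset IoU threshold.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, in descending score order."""
    # (1) sort candidates by detection confidence, highest first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:                      # (5) repeat until no candidates remain
        best = order.pop(0)           # (2) highest-confidence box becomes the baseline
        keep.append(best)
        # (3) + (4) drop every remaining box that overlaps the baseline too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping boxes and one distant box reduce to the higher-scoring overlapping box plus the distant one.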
[0096] Step S42: Local super-resolution reconstruction based on target candidate boxes
Each candidate box in the target candidate box set obtained in S41 is sent to the super-resolution module for resolution reconstruction. This process enhances the wing edges, fuselage outline, and paint details while outputting a heat map that indicates the structural areas the model focuses on enhancing during super-resolution.
[0097] The super-resolution module recovers high-frequency texture and edge information in low-resolution, blurred, or compressed images through image super-resolution and detail enhancement techniques. This invention uses ResShift as the core reconstruction engine to achieve state equation learning based on residual shifting in a diffusion model. Specifically, the super-resolution module first maps low-resolution images to latent space representations using a VQGAN autoencoder, then constructs a residual-shifting diffusion chain within the latent space to gradually recover high-resolution details. Finally, a UNet state prediction module learns the residual function to guide reverse sampling, combining multi-scale features and attention mechanisms to improve the quality of texture and structure reconstruction. Various degradation and augmentation strategies (noise, compression, scaling, etc.) are introduced during training to enhance the model's robustness in real low-quality scenes, thus providing reliable input for structure perception and fine-grained classification.
Step S43: Prototype-based classification inference
After local super-resolution is complete, the target candidate boxes are grouped into inference batches and fed into the optimal classification model trained in step S3. The model outputs the similarity distribution of each target candidate box in the category prototype space and gives the most likely category prediction accordingly. Extremely small targets with insufficient detail are screened using the confidence threshold of the detection stage and marked as suspicious targets or objects to be reviewed, which avoids introducing unreliable classification results and propagating misjudgments while preserving overall reliability. Based on the enhanced target candidate boxes, the optimized discriminative features are extracted using the optimal classification model.
For the optimized discriminative feature $z$, its similarity with the prototype vector $p_c$ of class $c$ is calculated as follows: $\mathrm{sim}_c = z^\top p_c$; the classification logits are then obtained by temperature scaling: $\ell_c = \mathrm{sim}_c / \tau$. The symbols are defined as follows:
$z$: the feature vector of the input sample after passing through the feature extraction network and normalization;
$p_c$: the prototype vector of the $c$-th class, which characterizes the central representation of this class in the feature space;
$c$: the category index, $c \in \{1, 2, \dots, C\}$, where $C$ is the total number of categories;
$\mathrm{sim}_c$: the similarity between the input feature vector and the $c$-th prototype, calculated as a vector dot product, which is equivalent to cosine similarity when the features are normalized;
$\tau$: the temperature scaling parameter, which adjusts the smoothness or discriminative power of the similarity distribution; it can be a fixed constant or a learnable parameter;
$\ell_c$: the logit of class $c$, used in the subsequent Softmax calculation of class probabilities.
The category logits $\ell_c$ are then input to the Softmax function, and the predicted probability of each class is obtained through exponential normalization.
Step S44: Output of the inference result
Finally, the inference process integrates the results through location anchoring and semantic re-attachment. During the detection phase, S41 saves in full the coordinates of each target candidate box relative to the input original image, as the spatial reference throughout the pipeline; subsequent super-resolution reconstruction (S42) and classification inference (S43) are performed within the cropped local region of interest (ROI), i.e., the target candidate box.
The final output strictly adopts the original position of the target candidate box and adds the classification semantic attributes, forming a structured result that includes the target index, the spatial location in the original image, the predicted category, the classification confidence, and the verification status. For subtypes with similar classification confidence or targets marked as suspicious, the system retains the detection box position and attaches a verification label, associating multiple candidate categories with the same spatial anchor point for efficient manual verification. This two-stage mechanism follows the cascaded "detection-classification" idea: the first stage locates targets in the original image and generates candidate regions, and the second stage performs further category identification on those regions. Compared with directly fusing multiple results by confidence weighting or empirical rules, this mechanism achieves step-by-step discrimination through task separation, reducing the subjectivity of the fusion strategy and improving the clarity and interpretability of the overall judgment process. The category label y, classification confidence, and verification label output by the optimal classification model are mapped onto the original image coordinates recorded in the detection stage, thereby aligning the semantic results with the spatial location.
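The prototype scoring of S43 and the structured output of S44 can be sketched together. Several details are assumptions made for illustration: one prototype per class is used, the thresholds `det_threshold` and `margin` and the dictionary field names are hypothetical, and a result is flagged for review either when the detection confidence is low or when the top two class probabilities are close.

```python
import math


def classify(feature, prototypes, tau=0.1):
    """Temperature-scaled softmax over dot-product similarities to class prototypes.

    prototypes maps class id -> unit-length prototype vector; feature is unit-length.
    Returns a dict mapping class id -> probability.
    """
    ids = sorted(prototypes)
    logits = [sum(f * p for f, p in zip(feature, prototypes[c])) / tau for c in ids]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]  # subtract max for numerical stability
    total = sum(exps)
    return {c: e / total for c, e in zip(ids, exps)}


def build_result(index, box, det_conf, probs, det_threshold=0.3, margin=0.1):
    """Attach class semantics to the original detection box (no coordinate change)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best, p1 = ranked[0]
    p2 = ranked[1][1] if len(ranked) > 1 else 0.0
    review = det_conf < det_threshold or (p1 - p2) < margin
    return {
        "index": index,
        "box": box,                  # spatial anchor recorded in S41
        "category": best,
        "confidence": p1,
        "status": "review" if review else "accepted",
        # ambiguous results keep the top two candidates on the same anchor
        "candidates": [c for c, _ in ranked[:2]] if review else [best],
    }
```

Keeping the detection box as the only spatial reference means the classification stage never needs to convert coordinates back from the super-resolved crop.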
[0098] First, the target candidate boxes obtained in S41 are cropped from the original image to obtain the low-resolution observation $y_0$. A $T$-step residual-shifting Markov chain is then constructed within the latent space to achieve a gradual recovery from low resolution (LR) to high resolution (HR). The forward diffusion process of the residual-shifting Markov chain is defined as follows:
$q(x_t \mid x_{t-1}, y_0) = \mathcal{N}(x_t;\, x_{t-1} + \alpha_t e_0,\, \kappa^2 \alpha_t I)$, where $e_0 = y_0 - x_0$;
$x_t$: the latent space state at step $t$;
$x_{t-1}$: the latent space state at step $t-1$;
$\eta_t$: the cumulative residual shifting coefficient up to step $t$, which controls the degree of state shift from the high-resolution representation to the low-resolution observation;
$\alpha_t$: the incremental shifting coefficient of step $t$, satisfying $\alpha_t = \eta_t - \eta_{t-1}$;
$\kappa$: the noise intensity adjustment coefficient, which controls the amplitude of random disturbances during the diffusion process;
$\kappa^2 \alpha_t$: the noise variance of step $t$.
The marginal distribution can then be written as:
$q(x_t \mid x_0, y_0) = \mathcal{N}(x_t;\, x_0 + \eta_t e_0,\, \kappa^2 \eta_t I)$;
$x_0$: the ideal high-resolution representation of the target, i.e., a clear image;
$y_0$: the input low-resolution degraded observation;
$q(x_t \mid x_{t-1}, y_0)$: the forward transition distribution from step $t-1$ to step $t$, given the degraded observation $y_0$;
$q(x_t \mid x_0, y_0)$: the marginal distribution of the state at step $t$, given the ideal high-resolution representation $x_0$ and the degraded observation $y_0$;
$I$: the identity matrix.
The corresponding reverse process is $p_\theta(x_{t-1} \mid x_t, y_0)$;
$p_\theta(x_{t-1} \mid x_t, y_0)$: the conditional distribution from step $t$ back to step $t-1$ during reverse recovery;
$f_\theta$: the residual prediction network with parameters $\theta$, which estimates the recovery information at the current step and guides reverse sampling.
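To make the role of the cumulative shifting coefficient and the noise coefficient concrete, the sketch below draws a sample from the forward marginal, i.e. the high-resolution state shifted towards the low-resolution observation plus Gaussian noise. It is an illustration under assumptions: latent states are flat lists of floats, the geometric schedule in `eta_schedule` and its endpoint values are placeholders rather than the schedule used by ResShift, and `kappa=2.0` is a placeholder default.

```python
import math
import random


def eta_schedule(steps, eta_first=0.001, eta_last=0.999):
    """Monotonically increasing cumulative shifting coefficients (steps >= 2)."""
    ratio = eta_last / eta_first
    return [eta_first * ratio ** (t / (steps - 1)) for t in range(steps)]


def increments(etas):
    """Per-step coefficients: the first equals eta_1, the rest are differences."""
    return [etas[0]] + [b - a for a, b in zip(etas, etas[1:])]


def sample_marginal(x0, y0, eta_t, kappa=2.0, rng=random):
    """Draw x_t = x0 + eta_t * (y0 - x0) + kappa * sqrt(eta_t) * noise."""
    std = kappa * math.sqrt(eta_t)
    return [x + eta_t * (y - x) + std * rng.gauss(0.0, 1.0)
            for x, y in zip(x0, y0)]
```

As the cumulative coefficient grows from near 0 to near 1, the mean of the sampled state moves from the clear representation to the degraded observation, which is the path the reverse process retraces.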
[0099] The method also includes Step S5: Results visualization and statistics
S51. Generate a detection-classification fusion image annotated with category labels;
S52. Obtain visualization results of the intermediate processes of interest.
[0100] A two-stage prototype classification system for small sample image recognition, including:
Data acquisition and preprocessing module
S11. Collect a small sample dataset of images;
S12. Perform basic data augmentation on the original samples, such as random scaling, horizontal flipping, color perturbation, and random erasure, to increase data diversity;
Detection model training module
S21. The detection model is trained using an architecture modified based on YOLOv8;
S22. Save the optimal weight file of the detection model to participate in subsequent S41 inference;
Classification model training module
S31. Construct a dual-branch feature extraction network based on EMO, and fuse the shallow texture branch and deep shape branch features using global GeM pooling and learnable weights to initially train the classification model;
S32. Attach a prototype projection head with a learnable temperature to generate normalized features and automatically adjust the temperature parameter;
S33. Combine angular margin loss and supervised contrastive loss, and dynamically update the class prototype vectors using a multi-prototype bank;
S34. Unfreeze and fine-tune part of the backbone until convergence to obtain the optimal classification model;
Inference stage module
S41. Input the original image, call the detection model, and output a set of target candidate boxes;
S42. Perform super-resolution reconstruction on each candidate box in the target candidate box set to restore details;
S43. Feed the super-resolved target candidate boxes into the optimal classification model to obtain the class probability distribution;
S44. Select the category with the highest class probability as the classification result of the inference stage, and combine it with the target location obtained in the target detection stage S41 to map it accurately onto the original image, obtaining the final target location and category result.
[0101] This invention provides a two-stage prototype classification method and system for small sample image instance category recognition. In the detection stage, a target detection network based on YOLOv8 extracts candidate boxes from the input image and outputs the candidate boxes and their confidence information. In the classification stage, a dual-branch backbone is constructed that extracts complementary representations from deep structural semantics and from shallow and mid-level texture details respectively, and fuses them after unit normalization with learnable weights to form stable fine-grained features.
[0102] Specifically, on top of the fused features, the present invention sets up a prototype metric projection discriminant head that maintains at least one prototype for each class. Cosine similarity is first calculated between the optimized discriminative features of a sample and the class prototypes to obtain the base similarity, which is used as the input to the subsequent computation. A learnable temperature is incorporated into the logits scaling factor to achieve joint calibration of decision boundaries and confidence under small sample conditions.
[0103] To characterize intra-class multimodal differences, a multi-prototype memory is introduced for each class. An exponential moving average is used for momentum updates, and the nearest-prototype strategy is employed for local updates, so that the prototypes adaptively fit the data distribution during training. In terms of loss design, an angular margin loss and a supervised contrastive loss are combined to enhance inter-class separation and intra-class aggregation along both the angular discrimination path and the intra-batch contrast path. Combined with cross-entropy and temperature regularization, the output is both discriminative and well calibrated.
[0104] This method integrates A. image enhancement and domain alignment super-resolution reconstruction with B. target detection and fine-grained classification. It is suitable for target recognition and judgment under complex imaging conditions and can achieve high-precision, low-latency, and deployable recognition capabilities in scenarios with scarce samples and significant cross-platform domain shifts.
[0105] This invention achieves end-to-end identification and processing of multiple instance targets in a single frame within milliseconds while ensuring high recognition accuracy. The invention was deployed and tested on a Kirin V8 system ARM architecture platform equipped with a dedicated NPU910B2 acceleration unit: the system achieved a fine-grained recognition accuracy of 76.4% for multiple instance targets in complex airspace scenarios, with an end-to-end inference frame rate stably maintained at 20.3 FPS (single frame processing time approximately 49 milliseconds), fully meeting the dual requirements of real-time performance and accuracy in tactical scenarios.
[0106] Compared with the prior art, the present invention has the following beneficial effects: (1) This invention constructs a two-stage recognition framework that decouples detection and classification, achieving end-to-end fast processing of multiple instance targets in a single frame while ensuring high recognition accuracy. Actual deployment verification shows that on a Kirin V8 system with an ARM architecture and a dedicated NPU910B2 acceleration unit, the system achieves a fine-grained recognition accuracy of 76.4% for multiple instance targets in complex airspace scenarios, with an end-to-end inference frame rate stable at 20.3 FPS and a single-frame processing time of approximately 49 milliseconds, meeting the requirements of applications with high real-time requirements.
[0107] (2) By introducing a feature extraction network based on a dual-branch structure, shallow texture information and deep structure information are integrated, which effectively improves the feature expression ability under small sample conditions, thereby significantly enhancing the model's ability to identify fine-grained differences.
[0108] (3) By constructing a prototype-based classification and discrimination mechanism and combining angular interval constraints and supervised contrastive learning, the sample features form a more compact intra-class distribution and a clearer inter-class boundary in the unit hypersphere space, thereby improving classification stability and generalization ability.
[0109] (4) By introducing a multi-prototype bank mechanism, the multiple sub-distributions formed by the same category under different perspectives, configurations and appearances are modeled, which alleviates the insufficient representation of a single prototype and improves recognition accuracy in complex scenarios.
[0110] (5) By introducing candidate box-level super-resolution reconstruction in the inference stage, the detail of low-resolution targets can be effectively restored, thereby further improving classification accuracy.
Description of the Attached Figures
[0111] Figure 1 is a schematic diagram of the network architecture of the optimized YOLOv8 model in this invention;
Figure 2 is a schematic diagram of the overall architecture of the Hybrid Attention Module (HAT) in this invention;
Figure 3 is a schematic diagram of the residual hybrid attention group of Figure 2 in the HAT module;
Figure 4 is a schematic diagram of the residual hybrid attention block of Figure 3 in the HAT module;
Figure 5 is a schematic diagram of the structure of the overlapping cross-window attention block of Figure 4;
Figure 6 is a schematic diagram of the structure of the central channel attention block of Figure 5;
Figure 7 is a flowchart of the method of the present invention.
Detailed Implementation
[0112] As shown in Figures 1-7, the method of the present invention includes:
S1. Data Acquisition and Preprocessing
S11. Collect a small sample dataset of images;
1.1.1 Data Source
a) Satellite remote sensing imagery: resolution 0.5 m/pixel to 1 m/pixel;
b) Telephoto images taken by ground-based aerial cameras and drones;
c) Synthetic data: instance images with different lighting, poses and background conditions, rendered with a large visual model and used to compensate for the scarcity of rare categories.
[0113] In this embodiment, the publicly available dataset Mar20 is used as the primary example data source.
[0114] 1.1.2 Data Cleaning
Consistency check scripts are used to remove blurry images, samples whose labels have inconsistent aspect ratios, and redundant or irrelevant labels, to ensure sample quality.
[0115] S12. Data Augmentation Process
To address the challenges of small sample sizes and complex imaging conditions, this invention constructs a data augmentation pipeline that combines offline synthetic augmentation with online random augmentation. By introducing perturbations across multiple dimensions, including sample size, appearance variations, and imaging degradation, the generalization ability and robustness of the model are effectively improved.
[0116] (1) Offline synthetic augmentation strategy
Offline augmentation employs a one-time generation and caching approach, primarily used to expand the training sample base and mitigate the overfitting risk caused by small sample sizes. This stage focuses on introducing multimodal appearance perturbations, including simulating complex atmospheric conditions such as fog and clouds, as well as overlaying camouflage paint or texture variations, to enhance the model's adaptability to complex backgrounds and camouflaged targets. Simultaneously, a style transfer method based on generative adversarial networks is introduced to map daytime samples to nighttime or low-light styles, effectively expanding the illumination domain distribution and improving the model's stability across time periods and imaging conditions.
[0117] (2) Online random augmentation strategy
Online augmentation is performed dynamically and probabilistically during the training phase, focusing on improving the model's robustness to pose changes, scale changes, and local occlusion. Multi-scale views are generated through random cropping and scaling to enhance scale invariance while maintaining subject integrity; horizontal and vertical flipping, as well as small-angle rotation, simulate target pose changes under different viewing angles. Color perturbation and perspective transformation are used to simulate different imaging conditions and shooting angles, helping to improve the model's adaptability to lighting changes and perspective distortion. Furthermore, random occlusion operations enhance the model's robustness to locally missing and occluded scenes; a cross-sample mixing strategy is introduced to alleviate class imbalance to some extent and improve the smoothness of decision boundaries; and blurring and compression distortion simulations enhance the model's tolerance to imaging degradation caused by actual transmission and compression.
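The probabilistic nature of the online strategy can be illustrated with a minimal sketch. It assumes an image stored as a nested list of pixel values and covers only flipping and random occlusion; the function names and the probabilities `p_hflip`, `p_vflip`, `p_erase` and `max_frac` are hypothetical and are not specified in this description.

```python
import random

def hflip(img):
    """Horizontal flip: mirror every row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]

def random_erase(img, rng, max_frac=0.3, fill=0):
    """Overwrite a random rectangle with `fill` to mimic local occlusion."""
    h, w = len(img), len(img[0])
    eh = rng.randint(1, max(1, int(h * max_frac)))
    ew = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - eh)
    left = rng.randint(0, w - ew)
    out = [row[:] for row in img]
    for r in range(top, top + eh):
        for c in range(left, left + ew):
            out[r][c] = fill
    return out

def augment(img, rng, p_hflip=0.5, p_vflip=0.5, p_erase=0.5):
    """Apply each operation independently with its own probability."""
    out = [row[:] for row in img]
    if rng.random() < p_hflip:
        out = hflip(out)
    if rng.random() < p_vflip:
        out = vflip(out)
    if rng.random() < p_erase:
        out = random_erase(out, rng)
    return out
```

Passing an explicit `random.Random` instance keeps every augmented view reproducible, which is convenient when comparing training runs on a small dataset.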
[0118] S2. Detection Model Training
In the detection model of this invention, the original YOLOv8 network is modified in a targeted and lightweight manner. The core improvements cover four aspects: the backbone network, the neck network, the detection head, and the loss function.
[0119] S21. Object Detection Network Architecture and Training
1. Backbone Network
(1) Attention backbone
The features first enter the improved stem layer. This layer uses depthwise separable convolution combined with a lightweight coordinate attention mechanism to further enhance the shallow features, providing high-quality input for subsequent downsampling.
[0120] (2) Lightweight network backbone
The default YOLOv8 backbone, CSPDarknet, is replaced with the HGNet V2 backbone. HGNet V2 introduces HGBlock (a grouped residual block) in each feature extraction unit, which greatly reduces the number of parameters and the amount of computation through grouped convolution and channel reuse; at the same time, in the downsampling stage, cascaded small-receptive-field convolution and deformable convolution are used to capture the deformation features of instances under different poses.
[0121] 2. Neck network
The neck employs a C2F architecture combined with an improved SPPF pooling module. C2F cross-layer connections integrate shallow details with deep semantics, enhancing the ability to identify targets at multiple scales; SPPF is replaced with a combination of depthwise separable convolution and dilated convolution to expand the effective receptive field and improve the robustness of detecting small targets at a distance.
[0122] 3. Head network
A multi-scale detection strategy is adopted: a three-level detection head (P3/P4/P5) with decoupled classification and regression branches. The classification branch uses a lightweight convolutional structure to reduce inference latency; the localization branch embeds a learnable attention gate to improve the localization accuracy of small instance boxes.
[0123] 4. Loss Function
This invention proposes the Unified IoU Focal Loss-inv. This loss unifies geometric metrics such as IoU, GIoU, DIoU, and CIoU into a single framework, and adaptively increases gradient weights on low-IoU and difficult-to-detect samples to strengthen the model's focus on difficult examples and small targets. At the same time, it shares a modulation factor with the classification Focal Loss, so that localization and classification are optimized together.
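The geometric metrics named above can be sketched as follows. This is an illustration only: the exact form of the Unified IoU Focal Loss-inv is not given in this description, so `hard_example_box_loss` is a hypothetical stand-in that merely shows how a weight growing as IoU falls shifts emphasis towards difficult boxes. Boxes are assumed to be `(x1, y1, x2, y2)` tuples and `gamma` is an assumed parameter.

```python
def _inter_union(a, b):
    """Intersection and union areas of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter, area_a + area_b - inter

def iou(a, b):
    inter, union = _inter_union(a, b)
    return inter / union if union > 0 else 0.0

def giou(a, b):
    """IoU minus the fraction of the enclosing box not covered by the union."""
    inter, union = _inter_union(a, b)
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    if union <= 0 or enclose <= 0:
        return 0.0
    return inter / union - (enclose - union) / enclose

def diou(a, b):
    """IoU minus squared centre distance over squared enclosing diagonal."""
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    diag2 = cw * cw + ch * ch
    if diag2 <= 0:
        return iou(a, b)
    dx = (a[0] + a[2]) / 2 - (b[0] + b[2]) / 2
    dy = (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2
    return iou(a, b) - (dx * dx + dy * dy) / diag2

def hard_example_box_loss(pred, target, metric=iou, gamma=1.0):
    """Hypothetical focal-style box loss: the weight rises as IoU drops."""
    weight = (1.0 - iou(pred, target)) ** gamma
    return weight * (1.0 - metric(pred, target))
```

Swapping `metric` between `iou`, `giou` and `diou` changes only the geometric term, which is one simple way to treat the metrics as members of a single family.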
[0124] Note: The YOLOv8 model retains most of the feature fusion capability of the neck network and completely replaces the original backbone network, so that feature extraction is optimized in a task-oriented manner. The head network is optimized for target recognition and regression, and the focal loss function dynamically allocates gradients to obtain more accurate final detection results.
5. Training Process
Using the augmented training set from step S1, the model is trained for 120 epochs. The optimizer is AdamW, with an initial learning rate of... Cosine annealing with warm-up is used for learning rate scheduling. Automatic mixed precision and gradient clipping are enabled to ensure stability during mini-batch training;
S22. Weight Freezing and Export
The model file is saved when the validation set accuracy reaches its maximum, facilitating subsequent online inference deployment. This detection model can achieve a real-time detection performance of approximately 20 frames per second on an embedded platform, providing accurate and low-latency candidate box information for subsequent super-resolution reconstruction and prototype classification.
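A cosine annealing schedule with linear warm-up, as used in the training process above, can be sketched as below. The function name `lr_at` and the parameters `warmup_steps`, `base_lr` and `min_lr` are hypothetical; the initial learning rate is not stated in this description, so it is left as an argument rather than fixed.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` (0-based): linear warm-up, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    # Cosine curve from base_lr (progress 0) down to min_lr (progress 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `total_steps` set to the 120 epochs mentioned above, the schedule can be evaluated once per epoch; the warm-up length remains a free choice.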
[0125] S3. Classification Model Training
To address the fine-grained classification problem, in which intra-class and inter-class differences between instance categories are subtle under small sample conditions, this invention proposes a dual-branch prototype discriminative classification training method based on the EMO network. This method significantly improves fine-grained classification performance while ensuring training stability by simultaneously modeling local texture and overall geometric features, and combining prototype projection, contrastive constraints, and a phased fine-tuning strategy. The core innovations lie in the learnable temperature mechanism, the multi-constraint loss design, and the dynamic prototype bank. The specific implementation process is as follows.
[0126] S31 Construction and Initial Training of the Dual-branch Feature Extraction Network
(1) Design of the dual-branch network structure
Based on the EMO backbone network, this invention introduces a dual-branch feature extraction structure to explicitly separate and enhance discriminative information at different levels. Specifically: on the one hand, a shallow texture branch is constructed, which extracts shallow semantic information from the front feature layers of the EMO network. It uses multi-layer lightweight convolution to characterize high-frequency local features, such as the details of the instance's fuselage skin, serial number markings, and paint textures, which enhances the model's sensitivity to differences in detail.
[0127] On the other hand, a deep shape branch is constructed, which adopts the deep structure of the EMO network and combines methods such as dilated convolution to expand the receptive field, so as to extract the overall geometric and contour information of the wing layout, air intake structure, and vertical tail shape. This enhances the ability to represent the differences in the macroscopic structure of instances.
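[0128] The output features of the two branches are compressed into fixed-dimensional representations using GeM pooling, and then adaptively fused through a learnable linear weighting mechanism to obtain a unified image representation that accounts for both local texture and overall shape information. Specifically, the dual-branch features are weighted and fused after L2 normalization:
$$f = w_1 \frac{f_{\mathrm{tex}}}{\lVert f_{\mathrm{tex}} \rVert_2} + w_2 \frac{f_{\mathrm{shape}}}{\lVert f_{\mathrm{shape}} \rVert_2}$$
where $f_{\mathrm{tex}}$ and $f_{\mathrm{shape}}$ are the pooled features of the shallow texture branch and the deep shape branch, and $w_1$ and $w_2$ are learnable weights.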
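The pooling and fusion step can be sketched as follows, assuming each branch output is a list of channels holding flattened spatial activations. The GeM exponent `p`, the helper names and the fixed weights passed to `fuse` are assumptions; in training, the two weights would be learned rather than supplied by hand.

```python
import math

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling: one value per channel.

    p = 1 gives average pooling; larger p moves towards max pooling.
    """
    pooled = []
    for channel in feature_map:
        clamped = [max(v, eps) for v in channel]
        mean_p = sum(v ** p for v in clamped) / len(clamped)
        pooled.append(mean_p ** (1.0 / p))
    return pooled

def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def fuse(f_tex, f_shape, w_tex, w_shape):
    """Weighted sum of the two L2-normalized branch descriptors."""
    a, b = l2_normalize(f_tex), l2_normalize(f_shape)
    return [w_tex * x + w_shape * y for x, y in zip(a, b)]
```

Normalizing each branch before weighting keeps one branch from dominating the fused vector simply because its activations are larger in magnitude.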
[0129] (2) Preliminary training strategy
To avoid destabilizing training when samples are limited, only the lightweight convolutional modules and fusion weights in the two branches are trained in the initial stage. This stage uses a conventional classification loss for supervision, enabling the dual-branch structure to complete initial adaptation in a stable feature space and laying the foundation for subsequent high-discrimination training.
[0130] S32 Prototype Projection Head and Learnable Temperature Mechanism
After feature fusion, this invention introduces a prototype-based projection discriminant head (ProjHead), also called the prototype projection head. The head first applies L2 normalization to the fused feature $f$ so that the features are distributed on a unit hypersphere. The normalized feature $\hat{f}$ produced by the head is then scaled by a learnable temperature to obtain the optimized discriminative feature $z$. Next, the similarity between $z$ and the prototype vector $p_c$ of each category is computed. In the initial training phase, the prototype vectors $p_c$ are randomly initialized or obtained from pre-trained feature statistics; during training they are treated as learnable parameters or momentum-updated variables and are dynamically updated through the multi-prototype bank mechanism. The similarity is used to construct the classification logits and serves as the basic input for the angular margin constraint and the supervised contrastive loss, driving the feature space towards intra-class compactness and inter-class separation.
[0131] $$\hat{f} = \frac{f}{\lVert f \rVert_2}$$ where:
[0132] $f$ denotes the aforementioned dual-branch fused feature.
[0133] To characterize the distribution differences of different categories in the feature space, this invention introduces a learnable temperature parameter to scale the features, thereby obtaining the optimized discriminative features. Specifically:
[0134] $$z = \frac{\hat{f}}{\tau}$$ where $\tau$ is the temperature parameter. The normalization above maps the features onto a unit hypersphere, and the temperature-scaled result $z$ is the sample discriminative feature used uniformly in this invention; it serves as the sole input for the subsequent discrimination constraints and prototype matching calculations.
[0135] S33 Multi-constraint Discriminative Learning and Multi-prototype Bank Mechanism
(1) Angular margin discrimination constraint
To further enhance the separation of different instance categories in the feature space, this invention introduces the ArcFace angular margin constraint during prototype discrimination. This pulls samples of the same class closer together on the unit sphere and forms clearer angular boundaries between the logits that the network outputs for different categories, thereby improving the discriminative power of the model.
[0136] The target class classification logits are calculated as follows:
$$\ell_y = s \cdot \cos(\theta_y + m)$$
[0137] where $s$ is the similarity scaling parameter, $m$ is the ArcFace margin parameter, and $\theta_y$ is the angle between the current sample feature vector and its class prototype vector.
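The angular margin can be illustrated with a short sketch in which the margin is applied only to the logit of the true class, while the other classes keep the plain scaled cosine. The values `s=30.0` and `m=0.5` are common ArcFace defaults used here as assumptions, and the handling of angles where `theta + m` exceeds pi, which practical implementations usually add, is omitted for brevity.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def arcface_logits(feature, prototypes, target, s=30.0, m=0.5):
    """Scaled cosine logits with an additive angular margin on the target class."""
    logits = []
    for c, proto in enumerate(prototypes):
        # Clamp to guard acos against floating-point overshoot.
        cos_theta = max(-1.0, min(1.0, cosine(feature, proto)))
        if c == target:
            theta = math.acos(cos_theta)
            logits.append(s * math.cos(theta + m))
        else:
            logits.append(s * cos_theta)
    return logits
```

Because the target logit is lowered by the margin, the network has to shrink the angle to the correct prototype further than plain cosine classification would require, which is what produces the tighter intra-class clusters.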
[0138] (2) Supervised contrastive learning constraint
Simultaneously, a supervised contrastive learning mechanism is introduced at the batch level, treating samples of the same class as positive pairs and samples of different classes as negative pairs. By explicitly compressing intra-class distance and increasing inter-class distance, intra-class variance is further reduced, improving the model's ability to identify subtle differences. This contrastive loss $L_{\mathrm{sup}}$ can be calculated as:
$$L_{\mathrm{sup}} = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i, z_p)/\tau_c\right)}{\sum_{a \neq i}\exp\left(\mathrm{sim}(z_i, z_a)/\tau_c\right)}$$
where:
$N$: the number of samples in the current batch;
$z_i$: the optimized discriminative feature of the $i$-th sample;
$z_p$: the feature vector of a positive sample of the same category as sample $i$;
$z_a$: the feature vector of any sample in the batch other than sample $i$;
$P(i)$: the set of samples belonging to the same category as sample $i$ (the positive sample set);
$|P(i)|$: the size of the positive sample set;
$\mathrm{sim}(z_i, z_p)$: the cosine similarity between sample $i$ and a positive sample;
$\mathrm{sim}(z_i, z_a)$: the cosine similarity between sample $i$ and another sample in the batch;
$\tau_c$: the contrastive temperature parameter, used to adjust the smoothness of the similarity distribution.
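A direct, unvectorized sketch of a batch-level supervised contrastive loss is given below. It assumes cosine similarity on L2-normalized features and a contrastive temperature `tau`; anchors with no positive partner in the batch are skipped and contribute zero, which is an implementation choice rather than something stated in this description.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss averaged over the batch size."""
    z = [_normalize(f) for f in features]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]
        positives = [a for a in others if labels[a] == labels[i]]
        if not positives:
            continue  # no positive pair for this anchor in the batch
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau for a in others}
        # Log-sum-exp over all other samples, computed stably.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
    return total / n
```

The loss is small when same-class features are close and different-class features are far apart, and it grows when the labels cut across the natural clusters.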
[0139] (3) Multi-prototype bank mechanism
In real-world data, the same model often exhibits multiple sub-clusters in the feature space due to differences in factors such as shooting angle, mounting status, paint scheme, and mission configuration. Using only a single prototype per category can easily distort the features within the category, reducing discrimination accuracy.
[0140] Based on this, this invention maintains multiple prototype vectors for each category, forming a multi-prototype bank. During training, different samples automatically approach the best-matching prototype based on their feature distribution, so that different substructures and sub-morphologies within the same category are represented by different prototypes. By introducing a momentum update mechanism, the multi-prototype bank remains stable even under small-batch training conditions, avoiding prototype drift. Specifically, for an input sample with optimized discriminative feature $z$ and true category $y$, calculate
$$k^{*} = \arg\max_{k \in \{1, \dots, K_y\}} \mathrm{sim}(z, p_{y,k})$$
[0141] where:
$z$: the optimized discriminative feature obtained after the input sample has undergone the complete feature extraction process of this invention, used for multi-prototype matching and classification;
$p_{y,k}$: the $k$-th prototype vector of category $y$; each prototype vector represents a sub-distribution center of the category in the feature space;
[0142] $K_y$: the number of prototypes maintained for category $y$;
[0143] $\{p_{y,k}\}_{k=1}^{K_y}$: the set of multi-prototype vectors for category $y$ (i.e., all prototypes of this class in the multi-prototype bank);
$\mathrm{sim}(z, p_{y,k})$: the similarity between $z$ and $p_{y,k}$.
[0144] Select the prototype $p_{y,k^{*}}$ and update it according to the momentum rule:
$$p_{y,k^{*}}^{\mathrm{new}} = \mu \, p_{y,k^{*}}^{\mathrm{old}} + (1 - \mu)\, z$$
[0145] where:
$\mu$: the momentum coefficient, with a value range of (0,1), which controls the weight allocation between the historical prototype information and the current feature $z$;
$p_{y,k^{*}}^{\mathrm{new}}$: the $k^{*}$-th prototype vector of the $y$-th class after the update;
$p_{y,k^{*}}^{\mathrm{old}}$: the $k^{*}$-th prototype vector of the $y$-th class before the update;
$y$: the true category label of the current input sample;
$k^{*}$: the index, among the $K_y$ prototypes of the $y$-th class, of the prototype most similar to the features of the current sample.
In a preferred embodiment, a similarity threshold may be added as an optional robustness strategy, so that the prototype is updated only when the similarity reaches the threshold.
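The nearest-prototype selection and momentum update can be sketched as a small class. Cosine similarity is assumed as the matching measure, and the class name `PrototypeBank`, the default `momentum=0.9` and the optional `sim_threshold` argument are hypothetical choices, since no concrete values are given in this description.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class PrototypeBank:
    """Several prototypes per class; only the best match is updated."""

    def __init__(self, prototypes, momentum=0.9, sim_threshold=None):
        # prototypes: {class label: [prototype vector, ...]}
        self.prototypes = {y: [list(p) for p in ps] for y, ps in prototypes.items()}
        self.momentum = momentum
        self.sim_threshold = sim_threshold

    def nearest(self, z, y):
        """Index and similarity of the class-y prototype closest to z."""
        sims = [cosine(z, p) for p in self.prototypes[y]]
        k_star = max(range(len(sims)), key=sims.__getitem__)
        return k_star, sims[k_star]

    def update(self, z, y):
        """EMA update of the nearest prototype; returns its index or None."""
        k_star, sim = self.nearest(z, y)
        if self.sim_threshold is not None and sim < self.sim_threshold:
            return None  # optional robustness gate: skip poorly matched samples
        mu = self.momentum
        old = self.prototypes[y][k_star]
        self.prototypes[y][k_star] = [mu * p + (1.0 - mu) * x for p, x in zip(old, z)]
        return k_star
```

Updating only the nearest prototype lets the other prototypes of the class keep tracking their own sub-clusters; whether the updated vector is re-normalized afterwards is left open here.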
[0146] The above design significantly enhances the model's ability to adapt to subtle differences within and between classes, and is one of the key improvements of this invention in the scenario of fine-grained classification with small sample sizes.
[0147] S34 Staged Fine-tuning and Model Convergence Control
After constructing the discriminative feature space, this invention employs a phased unfreezing and incremental fine-tuning strategy to finely optimize the model. During the initial training phase, the modules capturing geometric and contour information are updated together with the projection discriminant head (ProjHead); more backbone parameters are then gradually unfrozen, and the learning rate is reduced at the same time, to avoid catastrophic forgetting and achieve smooth convergence.
[0148] During training, stability is ensured through optimizer regularization, learning rate scheduling, mixed-precision training, and gradient clipping. Simultaneously, early stopping and model selection strategies based on validation set performance are introduced to ensure the final model achieves an optimal balance between accuracy and generalization ability.
[0149] Through the above classification model training process, this invention achieves significantly better performance than traditional single-branch classification models on a small sample dataset of only a hundred or so images, with a stable classification accuracy of over 75%. It seamlessly integrates with the detection module and subsequent super-resolution and inference processes, providing a highly reliable foundation for class determination in image instance recognition tasks in complex scenes.
[0150] Phased fine-tuning and model convergence control
After completing the construction of the discriminative feature space, this invention employs a phased parameter unfreezing and gradual fine-tuning strategy to optimize the model, specifically including the following steps:
(1) Initial stage (freezing the backbone)
Freeze the parameters of the bottom and middle layers of the feature extraction network, and only update the parameters of the high-level feature layers and the prototype classification head.
[0151] The learning rate for this phase is set to the initial learning rate $lr_0$. This stage is used to quickly establish stable classification and discrimination boundaries.
[0152] (2) Intermediate stage (partial unfreezing)
After training reaches the preset number of epochs $T_1$ or, alternatively, after the validation set loss converges, gradually unfreeze the high-level parameters of the backbone network (such as stage 3 and stage 4) and adjust the learning rate to $lr_1$, where $lr_1 < lr_0$.
[0153] This stage is used to improve feature representation capabilities, enabling the model to adapt to fine-grained differences.
[0154] (3) Later stage (global fine-tuning)
Further unfreeze all network parameters and continue to reduce the learning rate to $lr_2$, where $lr_2 < lr_1$. This allows for fine-tuning of global parameters and stable convergence.
[0155] (4) Convergence control mechanism
During training, the following control strategies are introduced:
Learning rate scheduling: cosine annealing or piecewise decay;
Gradient clipping: limits the gradient norm to no more than a threshold G;
Weight regularization: controls model complexity through the weight decay coefficient λ;
Mixed precision training: improves training efficiency and stabilizes numerical computation;
(5) Model selection strategy
Based on validation set performance metrics (such as classification accuracy or the loss value), an early stopping mechanism is triggered when performance has not improved for E consecutive epochs, and the best-performing model parameters are selected as the final output.
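The three-stage schedule and the early stopping rule can be sketched as follows. The stage boundaries `t1` and `t2`, the parameter-group names and the learning rates are hypothetical placeholders, and the sketch switches stages on epoch count only, whereas the description also allows switching once the validation loss converges.

```python
def training_stage(epoch, t1, t2, lrs):
    """Return (stage name, trainable parameter groups, learning rate)."""
    lr0, lr1, lr2 = lrs
    if epoch < t1:
        return "initial", ("high_level_layers", "prototype_head"), lr0
    if epoch < t2:
        groups = ("stage3", "stage4", "high_level_layers", "prototype_head")
        return "intermediate", groups, lr1
    return "late", ("all",), lr2

class EarlyStopping:
    """Stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, metric):
        """Record a validation metric (higher is better); True means stop."""
        if self.best is None or metric > self.best:
            self.best, self.best_epoch, self.bad_epochs = metric, epoch, 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The `best_epoch` attribute identifies which checkpoint to keep as the final output once stopping is triggered.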
[0156] S4 Online Inference Stage
To ensure recognition accuracy while meeting the real-time requirements of engineering applications, this invention employs a pipelined collaborative inference architecture covering detection, super-resolution, and classification in the online inference stage, and achieves efficient collaboration among the modules through parallel processing. Each submodule can operate independently, while results are aligned using a unified candidate box index, thus balancing speed and accuracy in end-to-end inference.
[0157] S41 Candidate Object Detection
In the candidate object detection stage (S41), the optimal weight file of the detection model saved in S22 is loaded and used as the fixed parameters of the improved object detection model, and forward inference is performed on the original image to rapidly localize and output candidate object regions. At the beginning of the inference stage, the system first performs scale normalization on the original image to adapt it to the optimal receptive range of the detection network. The scale-normalized image is then fed into the improved object detection model trained in step S2 to quickly locate candidate regions where instance objects may exist. Specifically, for each candidate target the improved detection model outputs the bounding box coordinates (spatial location), the scale level, the corresponding category probabilities, and the detection confidence score. Based on the bounding box coordinates, the corresponding region is cropped from the original input image to obtain the target candidate box; optionally, the bounding box is expanded by a preset ratio to preserve the contextual information of the target. To avoid duplicate detection and redundant computation, a non-maximum suppression strategy is used to filter highly overlapping candidate boxes, ultimately yielding a set of high-confidence, low-redundancy candidate boxes. This set provides accurate and controllable input for the subsequent super-resolution and classification stages, effectively reducing the overall inference load.
[0158] The candidate boxes are filtered using a non-maximum suppression strategy. The specific steps are as follows:
(1) Sort all candidate boxes output by the improved target detection model from high to low by detection confidence;
(2) Select the candidate box with the highest confidence as the baseline box and add it to the final detection result set;
(3) Calculate the degree of overlap between the baseline box and the other candidate boxes, using the intersection-over-union (IoU) ratio as the metric;
(4) Delete all candidate boxes whose IoU with the baseline box is greater than a preset threshold (e.g., 0.5) to remove highly overlapping redundant detections;
(5) Repeat steps (2) to (4) on the remaining candidate boxes until all candidate boxes have been processed.
Finally, a set of target candidate boxes with high confidence and low redundancy is obtained.
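The five filtering steps above can be sketched directly. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function returns the indices of the boxes that are kept, in descending order of confidence.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    # Step (1): sort candidates from high to low confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step (2): the most confident remaining box becomes the baseline.
        base = order.pop(0)
        keep.append(base)
        # Steps (3)-(4): drop boxes overlapping the baseline above the threshold.
        order = [i for i in order if iou(boxes[base], boxes[i]) <= iou_threshold]
        # Step (5): the loop repeats on whatever remains.
    return keep
```

For example, two heavily overlapping boxes and one distant box reduce to the higher-scoring member of the overlapping pair plus the distant box.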
[0159] S42 Local Super-Resolution Reconstruction Based on Candidate Boxes
To address the issue of blurred details in long-distance shooting or small target scenes, this invention does not perform super-resolution processing on the entire image, but only performs local super-resolution reconstruction on the detected candidate regions, thereby significantly reducing computational overhead while ensuring the enhancement of details.
[0160] Specifically, each target candidate box in the target candidate box set obtained by S41 is sent to the super-resolution module for resolution reconstruction. This process enhances the wing edges, fuselage outline, and paint details while simultaneously outputting a heat map to indicate the structural areas that the model focuses on enhancing during the super-resolution process.
[0161] The super-resolution module recovers high-frequency texture and edge information in low-resolution, blurred, or compressed images through image super-resolution and detail enhancement techniques. This invention uses ResShift as the core reconstruction engine, which learns the state transitions of a diffusion model based on residual transfer. Specifically, the super-resolution module maps low-resolution images to latent space representations using a VQGAN autoencoder, and then constructs a residual transfer diffusion chain within the latent space to gradually recover high-resolution details. Finally, the UNet state prediction module learns the residual function to guide reverse sampling, while combining multi-scale features and attention mechanisms to improve the quality of texture and structure reconstruction. Various degradation and augmentation strategies (noise, compression, scaling, etc.) are introduced during training to enhance the model's robustness in real-world low-quality scenes, thus providing reliable input for structure perception and fine-grained classification.
[0162] First, the target candidate boxes obtained in S41 are cropped from the original image to obtain the low-resolution observation $y_0$. Subsequently, a T-step residual transfer Markov chain is constructed within its latent space to achieve a gradual recovery from low resolution (LR) to high resolution (HR). Its forward diffusion process is defined as follows:
$$q(x_t \mid x_{t-1}, y_0) = \mathcal{N}\left(x_t;\ x_{t-1} + \alpha_t e_0,\ \kappa^2 \alpha_t I\right), \quad t = 1, \dots, T,$$
where the residual can be set as $e_0 = y_0 - x_0$.
[0163] $x_t$: the latent space state at step $t$;
$x_{t-1}$: the latent space state at step $t-1$;
$\eta_t$: the cumulative residual transfer coefficient up to step $t$, which controls how far the state has shifted from the high-resolution representation $x_0$ towards the low-resolution observation $y_0$;
$\alpha_t$: the incremental transfer coefficient of step $t$, satisfying $\alpha_t = \eta_t - \eta_{t-1}$ (with $\alpha_1 = \eta_1$);
$\kappa$: the noise intensity adjustment coefficient, used to control the amplitude of random disturbances during the diffusion process;
$\kappa^2 \alpha_t$: the noise variance corresponding to step $t$.
Its marginal distribution can then be written as:
$$q(x_t \mid x_0, y_0) = \mathcal{N}\left(x_t;\ x_0 + \eta_t e_0,\ \kappa^2 \eta_t I\right)$$
[0164] $x_0$: the ideal high-resolution representation of the target, i.e., a clear image;
$y_0$: the input low-resolution degraded observation;
$q(x_t \mid x_{t-1}, y_0)$: the forward transition distribution from step $t-1$ to step $t$, given the degraded observation $y_0$;
$q(x_t \mid x_0, y_0)$: the marginal distribution of the state at step $t$, given the ideal high-resolution representation $x_0$ and the degraded observation $y_0$;
$I$: the identity matrix.
The corresponding reverse process is:
$$p_\theta(x_{t-1} \mid x_t, y_0) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, y_0, t),\ \kappa^2 \frac{\eta_{t-1}}{\eta_t} \alpha_t I\right)$$
$p_\theta(x_{t-1} \mid x_t, y_0)$: the conditional distribution from step $t$ back to step $t-1$ during reverse recovery;
$\mu_\theta$: the residual prediction network with parameters $\theta$, which estimates the recovery information at the current step $t$ and guides reverse sampling.
[0165] Based on this process, the super-resolution module obtains a high-fidelity high-resolution result $\hat{x}_0$ by performing T-step sampling during reverse reconstruction.
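The relationship between the cumulative and incremental transfer coefficients, and sampling from the forward marginal, can be illustrated with a toy sketch on plain lists of numbers. The geometric `eta_schedule`, its end points and the function names are assumptions chosen for illustration and are not the schedule of any particular ResShift release; the marginal is taken to have mean `x0 + eta_t * (y0 - x0)` and standard deviation `kappa * sqrt(eta_t)`.

```python
import math
import random

def eta_schedule(T, eta_1=0.001, eta_T=0.999):
    """Monotonically increasing cumulative coefficients (assumed geometric)."""
    if T == 1:
        return [eta_T]
    return [eta_1 * (eta_T / eta_1) ** (t / (T - 1)) for t in range(T)]

def alphas(etas):
    """Incremental coefficients: the first equals eta_1, then successive differences."""
    return [etas[0]] + [etas[t] - etas[t - 1] for t in range(1, len(etas))]

def sample_marginal(x0, y0, eta_t, kappa, rng):
    """Draw x_t given the clean state x0 and the degraded observation y0."""
    std = kappa * math.sqrt(eta_t)
    return [x + eta_t * (y - x) + std * rng.gauss(0.0, 1.0) for x, y in zip(x0, y0)]
```

As `eta_t` moves from near 0 to near 1, the sampled state drifts from the clean representation towards the degraded observation, which is the path the reverse process later retraces.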
[0166] S43 Prototype-based Classification Inference
After local super-resolution is completed, the candidate regions that meet the minimum size requirement are rescaled to a uniform size, combined into inference batches, and fed into the best classification model trained in step S3. The model outputs the similarity distribution of each candidate target in the category prototype space, from which the most likely category prediction and its classification confidence are obtained.
[0167] For extremely small targets with insufficient detailed information, in order to avoid introducing unreliable classification results, this invention filters them based on the confidence threshold in the detection stage and marks them as suspicious targets or objects to be verified, thereby avoiding the spread of misjudgments while ensuring overall reliability.
[0168] For the optimized discriminative feature $z$, its similarity with the prototype vector $p_c$ of class $c$ is calculated as follows:
$$s_c = z^{\top} p_c;$$
the classification logits are then obtained by temperature scaling:
$$\ell_c = \frac{s_c}{\tau};$$
The symbols are defined as follows:
$z$: the feature vector of the input sample after passing through the feature extraction network and normalization processing;
$p_c$: the prototype vector corresponding to the $c$-th class, used as the central representation of that class in the feature space;
$c$: the category index, $c \in \{1, 2, \dots, C\}$, where $C$ is the total number of categories;
$s_c$: the similarity between the input feature vector and the $c$-th prototype, calculated as a vector dot product, which is equivalent to cosine similarity when the features are normalized;
$\tau$: the temperature scaling parameter, used to adjust the smoothness or discriminative power of the similarity distribution. It can be a fixed constant or a learnable parameter.
[0169] $\ell_c$: the category logit corresponding to class $c$, used in the subsequent Softmax calculation of class probabilities. The category logits $\ell_c$ are then input to the Softmax function, and the predicted probability of each class is obtained through exponential normalization.
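Prototype-based inference then reduces to a dot product, a temperature division and a Softmax. The sketch below assumes one prototype per class and unit-norm inputs, as in the formulas above; the temperature value `tau=0.1` and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable exponential normalization."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(z, prototypes, tau=0.1):
    """Return (predicted class index, confidence, all class probabilities)."""
    sims = [sum(x * y for x, y in zip(z, p)) for p in prototypes]
    logits = [s / tau for s in sims]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best], probs
```

A smaller `tau` sharpens the probability distribution for the same similarities, which is why the temperature has a direct effect on how confident the classifier appears.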
[0170] S44 Inference Result Output
Finally, in the inference process, this invention integrates the results by location anchoring and semantic back-filling. Each candidate box generated in the detection phase (S41) fully preserves its coordinates (x1, y1, x2, y2) in the original input image, serving as a spatial reference throughout the pipeline. Subsequent super-resolution reconstruction (S42) and classification inference (S43) are performed within the cropped local region of interest (ROI), i.e., the target candidate box.
[0171] The final output strictly uses the original location of the detection box, with added classification semantic attributes to form a structured result: including the target index, original image spatial location, predicted category, classification confidence score, and verification status. For subtypes with similar classification confidence scores or targets marked as suspicious, the system retains their detection box locations and marks them with verification indicators, associating multiple candidate categories with the same spatial anchor point for efficient manual verification. This two-stage mechanism follows a cascaded processing approach of "detection-classification," where the first stage is responsible for locating the target from the original image and generating candidate regions, and the second stage performs further category identification on the candidate regions. Compared to directly using multi-result confidence weighting or empirical rule fusion, this mechanism achieves step-by-step discrimination through task separation, effectively reducing the subjectivity of the fusion strategy and improving the clarity and interpretability of the overall judgment process.
[0172] The category label y, classification confidence, and verification label output by the best classification model are attached to the target location obtained in the target detection stage and mapped back to the original image, yielding the final target location and category result.
[0173] S5 Results Visualization and Statistical Analysis. To facilitate engineering deployment, model evaluation, and subsequent algorithm iteration, this invention performs multi-level visualization and performance statistical analysis on the recognition results after inference is completed.
[0174] S51 Detection and Classification Fusion Results Visualization. The system redraws the final recognition results on the original image: it draws target bounding boxes at the corresponding locations and labels each with the predicted category and fusion confidence level. Different instance models are distinguished by different colors, and the line style is associated with the confidence level, facilitating the rapid identification of high-confidence and low-confidence targets.
[0175] S52 Intermediate Process and Model Behavior Analysis. Optionally, in debug or research mode, the system can also output visualizations of intermediate results, including the original candidate regions, the super-resolution enhancement results, and heatmaps of the classification model, to check whether the regions the model attends to are consistent with the key structures of the instance.
[0176] Furthermore, the feature embedding space can be visualized with dimensionality reduction to aid in the analysis of the distribution of different instance categories in the feature space. Simultaneously, by visualizing the evolution trend of multiple prototypes during training, the stability and differentiation of class centers can be intuitively observed, providing a basis for further optimization of model structure and training strategies. This invention possesses excellent engineering deployment capabilities and hardware adaptability, and its performance monitoring module continuously tracks inference behavior, confidence distribution, and resource consumption, supporting long-term stable system operation and iterative optimization.
Claims
1. A two-stage prototype classification method for few-sample image recognition, characterized in that it includes the following steps:
Step S1: Data Acquisition and Preprocessing
S11. Collect a small-sample image dataset;
S12. Perform basic data augmentation on the original samples to increase data diversity;
Step S2: Detection Model Training
S21. Train the detection model using an architecture modified from YOLOv8;
S22. Save the optimal weight file of the detection model;
Step S3: Classification Model Training
S31. Construct a dual-branch feature extraction network based on EMO, and fuse the shallow texture branch and deep shape branch features using global GeM pooling and learnable weights to initially train the classification model;
S32. Attach a prototype projection head with a learnable temperature, generate normalized features, and automatically adjust the temperature parameter;
S33. Combine the angular margin loss and the supervised contrastive loss, and dynamically update the class prototype vectors using a multi-prototype bank;
S34. Unfreeze and fine-tune part of the backbone until convergence to obtain the optimal classification model;
Step S4: Inference Stage
S41. Input the original image, call the detection model, and output a set of target candidate boxes;
S42. Perform super-resolution reconstruction on each candidate box in the target candidate box set to restore details;
S43. Feed the super-resolved target candidate boxes into the optimal classification model to obtain the class probability distribution;
S44. Select the category with the highest class probability as the classification result of the inference stage, and map it, together with the target location obtained in detection step S41, back to the original image to obtain the final target location and category result.
2. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that:
In step S21, the original YOLOv8 network is made lightweight by modifying four aspects: the backbone network, the neck network, the detection head, and the loss function.
S211 Backbone Network
(1) Attention backbone: The input original image first enters the improved backbone Stem layer, which uses depthwise separable convolution combined with a lightweight coordinate attention mechanism;
(2) Lightweight network backbone: The default YOLOv8 backbone CSPDarknet is replaced with the HGNet V2 backbone. HGNet V2 introduces HGBlock grouped residual blocks in each feature extraction unit, which greatly reduce the number of parameters and the amount of computation through grouped convolution and channel reuse. At the same time, cascaded small-receptive-field convolution and deformable convolution are used in the downsampling stage to capture the deformation features of instances under different poses.
S212 Neck Network
The neck employs a C2F architecture combined with an improved SPPF pooling module. C2F cross-layer connections integrate shallow details with deep semantics, enhancing the ability to identify targets at multiple scales; SPPF is replaced with a combination of depthwise separable convolution and dilated convolution to expand the effective receptive field and improve the robustness of detecting small, distant targets.
S213 Head Network
A three-level detection head based on a multi-scale detection strategy is used, and the classification branch is decoupled from the regression branch; the classification branch uses a lightweight convolutional structure to reduce inference latency; the localization branch embeds a learnable attention gate to improve the localization accuracy of small instance boxes.
S214 Loss Function
A unified IoU-Focus Reverse Loss is proposed, which models multiple geometric metrics such as IoU, GIoU, DIoU, and CIoU in a unified manner and adaptively increases the gradient weights for low-IoU and difficult-to-detect samples, thereby enhancing the model's ability to learn from difficult examples and small targets. At the same time, a modulation factor consistent with the classification Focal Loss is introduced, so that the localization branch and the classification branch weight difficult samples in the same way and are optimized jointly.
3. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that:
In step S31,
(1) Design of the dual-branch feature extraction network structure
Based on the EMO backbone network, a dual-branch feature extraction structure is introduced to explicitly separate and enhance discriminative information at different levels; wherein:
On the one hand, a shallow texture branch is constructed, which extracts shallow semantic information from the front feature layers of the EMO network. This information is further enhanced through multiple lightweight convolutions to focus on characterizing high-frequency local features, thereby improving the sensitivity of the classification model to subtle differences.
On the other hand, a deep shape branch is constructed, which adopts the deep structure of the EMO network and expands the receptive field by combining methods such as dilated convolution, so as to extract the overall geometric and contour information. This enhances the ability to represent differences in the macroscopic structure of instances;
The output features of the two branches are compressed into fixed-dimensional representations using GeM pooling, and then adaptively fused through a learnable linear weighting mechanism to obtain a unified image representation that takes into account both local texture and overall shape information. The dual-branch features are weighted and fused after L2 normalization to obtain the fused feature:
$\hat{f}_t = f_t / \lVert f_t \rVert_2$, $\hat{f}_s = f_s / \lVert f_s \rVert_2$;
$f = w_t \hat{f}_t + w_s \hat{f}_s$;
where $f_t$ and $f_s$ are the GeM-pooled features of the texture branch and the shape branch, and $w_t$ and $w_s$ are learnable weights;
In step S32, the fused feature $f$ is passed through the prototype projection head $g(\cdot)$ to generate the normalized feature $\hat{z}$, which, after being scaled by a learnable temperature, gives the optimized discriminative feature $z$; where:
$\hat{z} = g(f) / \lVert g(f) \rVert_2$;
To characterize the distribution differences of different categories in the feature space, a learnable temperature parameter is introduced to scale the features, thereby obtaining the optimized discriminative feature:
$z = \hat{z} / \tau$;
where $\tau$ is the temperature parameter.
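The pooling and fusion steps of this claim can be illustrated with a short sketch. It assumes the standard generalized-mean (GeM) definition, in which each channel is reduced to the p-th root of the mean of its activations raised to the power p, and it assumes that both branches yield the same number of channels. The exponent `p=3.0`, the fusion weights, and the function names are hypothetical; in training the weights would be learned rather than fixed.

```python
import math

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """GeM pooling: feature_map is a list of channels, each a flat list of activations."""
    return [(sum(max(v, eps) ** p for v in ch) / len(ch)) ** (1.0 / p)
            for ch in feature_map]

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]

def fuse(texture_map, shape_map, w_texture, w_shape):
    """L2-normalize each pooled branch, then combine with scalar weights."""
    t = l2_normalize(gem_pool(texture_map))
    s = l2_normalize(gem_pool(shape_map))
    return [w_texture * a + w_shape * b for a, b in zip(t, s)]

# Two channels per branch, two spatial positions per channel (illustrative values).
fused = fuse([[3.0, 3.0], [4.0, 4.0]], [[4.0, 4.0], [3.0, 3.0]], 0.5, 0.5)
```

With `p=1` GeM reduces to average pooling, and as `p` grows it approaches max pooling, so the exponent controls how strongly the descriptor emphasizes peak activations.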
4. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that:
In step S33,
(1) Angular margin discrimination constraint
An ArcFace angular margin constraint is introduced in the prototype discrimination process to pack samples of the same class more tightly on the unit sphere and to form clearer angular boundaries between the classification logits of different categories, thereby improving the discriminative power of the classification model;
The target-class classification logit is calculated as follows:
$\ell_y = s \cdot \cos(\theta_y + m)$;
where $s$ is the similarity scaling parameter, $m$ is the ArcFace margin parameter, $\theta_y$ is the angle between the current sample feature vector and its class prototype vector, and $y$ is the class label of the current sample;
(2) Supervised contrastive learning constraint
Simultaneously, a supervised contrastive learning mechanism is introduced at the batch level, treating samples of the same class as positive pairs and samples of different classes as negative pairs. By explicitly compressing intra-class distance and increasing inter-class distance, intra-class variance is further reduced, improving the classification model's ability to identify subtle differences; the contrastive loss is calculated as follows:
$L_{\mathrm{sup}} = \frac{1}{N} \sum_{i=1}^{N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau_{\mathrm{con}})}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau_{\mathrm{con}})}$;
where:
$N$: the number of samples in the current batch;
$z_i$: the optimized discriminative feature of the i-th sample;
$z_p$: the feature vector of a positive sample of the same category as sample $i$;
$z_a$: the feature vector of any sample in the batch other than sample $i$;
$P(i)$: the set of samples belonging to the same category as sample $i$;
$|P(i)|$: the size of the positive sample set;
$z_i \cdot z_p$: the cosine similarity between sample $i$ and a positive sample;
$z_i \cdot z_a$: the cosine similarity between sample $i$ and another sample;
$\tau_{\mathrm{con}}$: the contrastive learning temperature parameter, which adjusts the smoothness of the similarity distribution.
(3) Multi-prototype bank mechanism
For the optimized discriminative feature $z$ extracted from an input sample with true category $y$, calculate the cosine similarity between the input sample and each prototype of category $y$, and select the index of the nearest prototype:
$k^* = \arg\max_{k \in \{1, \dots, K_y\}} \cos(z, p_{y,k})$;
where:
$p_{y,k}$: the k-th prototype vector of category $y$; each prototype vector represents a sub-distribution center of the category in the feature space;
$K_y$: the number of prototypes maintained by category $y$;
$\{p_{y,k}\}_{k=1}^{K_y}$: the set of prototype vectors of category $y$;
$\cos(z, p_{y,k})$: the cosine similarity between $z$ and $p_{y,k}$;
The selected prototype is updated according to the momentum rule:
$p_{y,k^*}^{\mathrm{new}} = \mu \, p_{y,k^*}^{\mathrm{old}} + (1 - \mu) \, z$;
where:
$\mu$: the momentum coefficient, with a value range of (0,1);
$p_{y,k^*}^{\mathrm{new}}$: the $k^*$-th prototype vector of the y-th class after the update;
$p_{y,k^*}^{\mathrm{old}}$: the $k^*$-th prototype vector of the y-th class before the update;
$y$: the true category label of the current input sample;
$k^*$: the index, among the $K_y$ prototypes of the y-th class, of the prototype most similar to the features of the current sample.
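The three constraints of this claim can be sketched together in plain Python. The sketch assumes unit-norm feature and prototype vectors (so dot products equal cosine similarities), the standard ArcFace form for the target-class logit, and the standard batch-level supervised contrastive loss averaged over the batch. The default values `s=30.0`, `m=0.5`, `tau=0.1`, and `momentum=0.9` are hypothetical, and whether prototypes are re-normalized after the update is left open because the source does not say.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def arcface_target_logit(z, prototype, s=30.0, m=0.5):
    """s * cos(theta + m), where theta is the angle between z and its class prototype."""
    cos_theta = max(-1.0, min(1.0, dot(z, prototype)))
    return s * math.cos(math.acos(cos_theta) + m)

def supcon_loss(features, labels, tau=0.1):
    """Supervised contrastive loss over one batch of unit-norm features."""
    n, total = len(features), 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        denom = sum(math.exp(dot(features[i], features[a]) / tau)
                    for a in range(n) if a != i)
        log_probs = [dot(features[i], features[p]) / tau - math.log(denom)
                     for p in positives]
        total += -sum(log_probs) / len(positives)
    return total / n

def update_prototype_bank(bank, z, y, momentum=0.9):
    """Move the prototype of class y nearest to z toward z; return its index."""
    k_star = max(range(len(bank[y])), key=lambda k: dot(z, bank[y][k]))
    bank[y][k_star] = [momentum * p + (1.0 - momentum) * f
                       for p, f in zip(bank[y][k_star], z)]
    return k_star

bank = {0: [[1.0, 0.0], [0.0, 1.0]]}          # class 0 keeps two prototypes
chosen = update_prototype_bank(bank, [0.6, 0.8], y=0)
```

Only the nearest prototype moves on each update, which lets a class with several appearance modes keep one center per mode instead of averaging them into a single blurred center.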
5. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that:
In step S34, after the discriminative feature space has been constructed, the classification model is optimized using a phased parameter unfreezing and stepwise fine-tuning strategy, including the following steps:
(1) Initial stage: freezing the core
Freeze the parameters of the bottom and middle layers of the feature extraction network, and update only the parameters of the high-level feature layers and the prototype classification head; the learning rate for this stage is set to the initial learning rate $lr_0$. This stage is used to quickly establish stable classification and discrimination boundaries;
(2) Intermediate stage: partial unfreezing
After training reaches the preset number of rounds $T_1$, or alternatively after the validation set loss converges, gradually unfreeze the high-level parameters of the backbone network and adjust the learning rate to $lr_1$, where $lr_1 < lr_0$; this stage is used to improve feature representation capabilities, enabling the model to adapt to fine-grained differences;
(3) Later stage: global fine-tuning
Further unfreeze all network parameters and continue to reduce the learning rate to $lr_2$, where $lr_2 < lr_1$, to achieve fine-tuning of global parameters and stable convergence;
(4) Convergence control mechanism
During training, the following control strategies are introduced:
Learning rate scheduling: employs cosine annealing or piecewise decay;
Gradient clipping: limits the gradient norm to no more than a threshold G;
Weight regularization: controls model complexity through a weight decay coefficient λ;
Mixed-precision training: improves training efficiency while keeping numerical computation stable;
(5) Model selection strategy
Based on the validation set performance metrics, an early stopping mechanism is triggered when the performance has not improved for E consecutive periods, and the optimal model parameters are selected as the final output.
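The phased schedule and the convergence controls of this claim can be sketched without any training framework. Everything numeric below is hypothetical: the stage boundaries (epochs 10 and 20), the three learning rates, and the parameter-group names `head`, `backbone_top`, and `backbone_rest` are placeholders for the preset rounds, learning rates, and layer groups that the claim leaves unspecified.

```python
import math

# (first_epoch, trainable parameter groups, base learning rate); values are hypothetical.
STAGES = [
    (0, {"head"}, 1e-3),                                    # initial: core frozen
    (10, {"head", "backbone_top"}, 3e-4),                   # partial unfreezing
    (20, {"head", "backbone_top", "backbone_rest"}, 1e-4),  # global fine-tuning
]

def stage_for_epoch(epoch):
    """Return the last stage whose first epoch has been reached."""
    current = STAGES[0]
    for stage in STAGES:
        if epoch >= stage[0]:
            current = stage
    return current

def cosine_lr(base_lr, step, total_steps, min_lr=0.0):
    """Cosine annealing from base_lr (step 0) down to min_lr (step total_steps)."""
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def clip_by_norm(grads, max_norm):
    """Rescale a flat gradient list so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

class EarlyStopping:
    """Signal a stop after `patience` consecutive evaluations without improvement."""
    def __init__(self, patience):
        self.patience, self.best, self.bad = patience, -math.inf, 0

    def step(self, metric):
        if metric > self.best:
            self.best, self.bad = metric, 0
            return False
        self.bad += 1
        return self.bad >= self.patience
```

In a real training loop the set returned by `stage_for_epoch` would decide which parameters have gradients enabled, and the best checkpoint seen by `EarlyStopping` would be the one kept as the final model.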
6. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that:
Step S41: Target detection stage
The original image is scale-normalized, the optimal weight file of the detection model saved in S22 is loaded, and the set of target candidate boxes is output. The target candidate boxes are filtered using a non-maximum suppression strategy. The specific steps are as follows:
(1) Sort all target candidate boxes output by the detection model in descending order of detection confidence;
(2) Select the remaining target candidate box with the highest confidence as the baseline box and add it to the final detection result set;
(3) Calculate the degree of overlap between the baseline box and each of the other target candidate boxes, using intersection over union (IoU) as the metric;
(4) Delete all target candidate boxes whose IoU with the baseline box is greater than a preset threshold to remove highly overlapping redundant detections;
(5) Repeat steps (2) to (4) on the remaining candidate boxes until all target candidate boxes have been processed;
Finally, a set of target candidate boxes with high confidence and low redundancy is obtained.
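Steps (1) to (5) above can be sketched as a greedy loop in plain Python. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixel coordinates, and the default threshold of 0.5 is a hypothetical value standing in for the preset threshold.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    # Step (1): sort candidates by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step (2): the best remaining box becomes the baseline and is kept.
        best = order.pop(0)
        keep.append(best)
        # Steps (3)-(4): drop every box that overlaps the baseline too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
        # Step (5): the loop repeats on whatever remains.
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

Here the second box overlaps the first with an IoU of 81/119 (about 0.68) and is removed, while the third box does not overlap and is kept.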
7. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that:
Step S42: Local super-resolution reconstruction based on target candidate boxes
Each candidate box in the target candidate box set obtained in S41 is fed into the super-resolution module for resolution reconstruction. The super-resolution module maps the low-resolution image to a latent-space representation through the VQGAN autoencoder, and then constructs a residual transfer diffusion chain in the latent space to gradually restore high-resolution details. Finally, the UNet state prediction module learns the residual function to guide reverse sampling, and combines multi-scale features and attention mechanisms to improve the quality of texture and structure reconstruction. Various degradation and enhancement strategies are introduced during training to enhance the robustness of the model in real low-quality scenes, thereby providing reliable input for structure perception and fine-grained classification.
Step S43: Prototype-based classification inference
After local super-resolution is completed, the target candidate boxes are grouped into inference batches and fed into the best classification model trained in step S3. The similarity distribution of each target candidate box in the category prototype space is output, and the most likely category prediction is given accordingly. Based on the enhanced target candidate boxes, the optimized discriminative features are extracted using the best classification model.
For the optimized discriminative feature $z$, its similarity with the prototype vector $p_c$ of class c is calculated as follows:
$s_c = z^{\top} p_c$;
The classification logits are then obtained by temperature scaling:
$\ell_c = s_c / \tau$;
The symbols are defined as follows:
$z$: the feature vector of the input sample after passing through the feature extraction network and normalization processing;
$p_c$: the prototype vector corresponding to the c-th class, used to characterize the central representation of this class in the feature space;
$c$: the category index, c ∈ {1,2,…,C}, where C is the total number of categories;
$s_c$: the similarity between the input feature vector and the c-th prototype, calculated as a vector dot product, which is equivalent to cosine similarity when the features are normalized;
$\tau$: the temperature scaling parameter, used to adjust the smoothness or discriminative power of the similarity distribution; it can be a fixed constant or a learnable parameter;
$\ell_c$: the category logit corresponding to the c-th category;
Step S44: Output the inference result
Finally, the inference process follows the steps of location anchoring and semantic re-attachment to complete result integration; each candidate box generated in the detection phase fully preserves its relative coordinates in the original input image, which serve as a spatial reference throughout the pipeline. Subsequent super-resolution reconstruction and classification inference are performed within the cropped local region of interest (ROI), i.e., the target candidate box. The final output strictly uses the original location of the target candidate box, with classification semantic attributes added to form a structured result that includes the target index, original spatial location, predicted category, classification confidence, and verification status.
For subtypes with similar classification confidence or targets marked as suspicious, their detection box locations are retained and marked with verification labels, associating multiple candidate categories with the same spatial anchor point for efficient manual verification. This two-stage mechanism follows the cascaded "detection-classification" approach, where the first stage locates the target in the original image and generates candidate regions, and the second stage performs further category identification on the candidate regions. The category label y, classification confidence, and verification label output by the best classification model are attached to the target location obtained in the target detection stage and mapped to the original image coordinates recorded in that stage, thereby aligning the semantic results with the spatial location.
8. The two-stage prototype classification method for few-sample image recognition according to claim 7, characterized in that:
In step S42, firstly, the target candidate boxes obtained in S41 are cropped from the original image to obtain low-resolution observations.
Subsequently, for the low-resolution observation $y_0$, a T-step residual transfer Markov chain is constructed within the latent space to achieve gradual recovery from low resolution to high resolution. The forward diffusion process of the residual transfer Markov chain is defined as follows:
$q(x_t \mid x_{t-1}, y_0) = \mathcal{N}\left(x_t;\; x_{t-1} + \alpha_t e_0,\; \kappa^2 \alpha_t I\right)$, $t = 1, \dots, T$;
In this diffusion process, set $e_0 = y_0 - x_0$;
$x_t$: the latent-space state at step $t$;
$x_{t-1}$: the latent-space state at step $t-1$;
$\eta_t$: the cumulative residual transfer coefficient up to step $t$, used to control the degree of state shift from the high-resolution representation to the low-resolution observation;
$\alpha_t$: the incremental transfer coefficient at step $t$, satisfying $\alpha_t = \eta_t - \eta_{t-1}$;
$\kappa$: the noise intensity adjustment coefficient, used to control the amplitude of random disturbances during the diffusion process;
$\sigma_t^2$: the noise variance corresponding to step $t$, generally written as $\sigma_t^2 = \kappa^2 \alpha_t$;
Its marginal distribution can then be written as:
$q(x_t \mid x_0, y_0) = \mathcal{N}\left(x_t;\; x_0 + \eta_t e_0,\; \kappa^2 \eta_t I\right)$;
$x_0$: the ideal high-resolution representation of the target, i.e., a clear image;
$y_0$: the input low-resolution degraded observation;
$q(x_t \mid x_{t-1}, y_0)$: given the degraded observation $y_0$, the forward transition distribution from step $t-1$ to step $t$;
$q(x_t \mid x_0, y_0)$: given the ideal high-resolution representation $x_0$ and the degraded observation $y_0$, the marginal distribution of the state at step $t$;
$I$: the identity matrix;
The corresponding reverse process is described by the conditional distribution $p_\theta(x_{t-1} \mid x_t, y_0)$;
$p_\theta(x_{t-1} \mid x_t, y_0)$: during the reverse recovery process, the conditional distribution from step $t$ back to step $t-1$;
$f_\theta$: the residual prediction network with parameters $\theta$, used to estimate the recovery information at the current step and guide reverse sampling;
In step S43, extremely small targets that lack detailed information are screened according to the confidence threshold of the detection stage and marked as suspicious targets or objects to be reviewed, in order to avoid introducing unreliable classification results; this prevents misjudgments from propagating while ensuring overall reliability.
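To make the role of the coefficients concrete, the following sketch draws a latent state from the marginal distribution of a residual transfer chain. It assumes the Gaussian form commonly used in residual-shifting diffusion models, in which the mean moves from the clean latent toward the degraded latent in proportion to the cumulative coefficient and the variance grows with the same coefficient. The linear `eta_schedule`, its endpoints, and the value of `kappa` are hypothetical; the source does not give a schedule.

```python
import math
import random

def eta_schedule(T, eta_min=0.001, eta_max=0.999):
    """Hypothetical monotone schedule of cumulative coefficients for T >= 2 steps."""
    return [eta_min + (eta_max - eta_min) * t / (T - 1) for t in range(T)]

def increments(etas):
    """Incremental coefficients: the first cumulative value, then successive differences."""
    return [etas[0]] + [b - a for a, b in zip(etas, etas[1:])]

def sample_marginal(x0, y0, eta_t, kappa, rng):
    """Draw x_t with mean x0 + eta_t * (y0 - x0) and variance kappa^2 * eta_t per element."""
    std = kappa * math.sqrt(eta_t)
    return [a + eta_t * (b - a) + std * rng.gauss(0.0, 1.0) for a, b in zip(x0, y0)]

rng = random.Random(0)
etas = eta_schedule(15)
alphas = increments(etas)
x0 = [0.0, 1.0]           # stand-in for a clean latent (illustrative values)
y0 = [2.0, 3.0]           # stand-in for the degraded latent
x_mid = sample_marginal(x0, y0, etas[7], kappa=2.0, rng=rng)
```

Because the chain ends near the degraded observation rather than at pure noise, reverse sampling can start from the low-resolution latent itself, which is what allows a small number of steps.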
9. The two-stage prototype classification method for few-sample image recognition according to claim 1, characterized in that it also includes step S5: Results visualization and statistics
S51. Generate a detection-classification fusion image annotated with category labels;
S52. Obtain visualization results of the intermediate processes of interest.
10. A two-stage prototype classification system for small-sample image recognition, characterized in that it includes:
Data acquisition and preprocessing module
S11. Collect a small-sample image dataset;
S12. Perform basic data augmentation on the original samples to increase data diversity;
Detection model training module
S21. Train the detection model using an architecture modified from YOLOv8;
S22. Save the optimal weight file of the detection model;
Classification model training module
S31. Construct a dual-branch feature extraction network based on EMO, and fuse the shallow texture branch and deep shape branch features using global GeM pooling and learnable weights to initially train the classification model;
S32. Attach a prototype projection head with a learnable temperature, generate normalized features, and automatically adjust the temperature parameter;
S33. Combine the angular margin loss and the supervised contrastive loss, and dynamically update the class prototype vectors using a multi-prototype bank;
S34. Unfreeze and fine-tune part of the backbone until convergence to obtain the optimal classification model;
Inference stage module
S41. Input the original image, call the detection model, and output a set of target candidate boxes;
S42. Perform super-resolution reconstruction on each candidate box in the target candidate box set to restore details;
S43. Feed the super-resolved target candidate boxes into the optimal classification model to obtain the class probability distribution;
S44. Select the category with the highest class probability as the classification result of the inference stage, and map it, together with the target location obtained in detection step S41, back to the original image to obtain the final target location and category result.