An open set fine-grained image recognition method and system based on retrieval-enhanced multi-modal reasoning

By employing a retrieval-enhanced multimodal reasoning approach, this method addresses the challenge of existing fine-grained identification methods struggling to classify known species and discover unknown species within a unified framework. It improves identification accuracy and discovery rate, provides interpretable visual evidence, reduces the cost of constructing supervised data, adapts to different resource constraints, and exhibits strong scalability.

CN122265720A · Pending · Publication Date: 2026-06-23 · HUNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2026-03-25
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing fine-grained identification methods struggle to simultaneously classify known species and discover unknown species within a unified framework. They suffer from performance trade-offs that are difficult to reconcile between the accuracy of known species and the discovery rate of unknown species. High recall results are difficult to translate into high-precision decisions. Furthermore, they lack interpretability and have high costs associated with constructing supervisory data.

Method used

We employ a retrieval-enhanced multimodal reasoning approach. By acquiring query images, we retrieve candidate species and their example images from a species reference retrieval database. We then combine this with a multimodal reasoning model to perform chain-like comparative reasoning, generating structured reasoning results. We use retrieval evidence to make decisions and automatically determine supervision labels based on the retrieval results, thereby reducing data construction costs.

Benefits of technology

It achieves unified processing of known species classification and unknown species discovery, improves identification accuracy and discovery rate, eliminates the retrieval-decision performance gap, provides interpretable visual evidence, reduces the cost of constructing supervised data, adapts to different resource constraints, and has strong scalability.

✦ Generated by Eureka AI based on patent content.

Figure CN122265720A_ABST

Abstract

The application discloses an open-set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning, comprising the following steps: obtaining a query image to be recognized; using the query image to recall candidate species in a pre-established species reference retrieval library, so as to obtain k candidate species and the example images corresponding to each candidate species, where all example images corresponding to all candidate species form a retrieval-enhanced context; and inputting the query image and the retrieval-enhanced context into a pre-constructed multimodal reasoning model for chain-of-thought comparative reasoning, so as to obtain a reasoning result. The application addresses the technical problem that existing closed-set classification methods often handle known species classification and unknown species discovery separately, making it difficult to achieve both within a unified framework.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence and computer vision technology, and more specifically, relates to an open-set fine-grained image recognition method and system based on retrieval-enhanced multimodal reasoning.

Background Technology

[0002] As global biodiversity recording efforts continue to advance, the number of species to be identified has reached tens of thousands. There are often very small visual differences between different categories, so fine-grained identification technology is needed. In addition, there are unknown species outside the training set in real-world scenarios. Therefore, fine-grained identification technology not only needs to accurately classify known species, but also needs to be able to detect and discover unknown species.

[0003] Existing fine-grained identification methods mainly fall into the following categories. The first category is closed-set classification methods, which are standard image classification models based on convolutional neural networks (such as ResNet) or Transformer-based vision models (such as ViT). These methods can achieve high classification accuracy for known categories. The second category is nearest-neighbor retrieval methods, which use pre-trained visual encoders (such as OpenCLIP) to extract features from the query image and directly take the species with the highest similarity in the retrieval database as the classification result (i.e., Top-1 retrieval). The third category is open-set detection methods based on confidence thresholds (such as MSP, ODIN, EnergyScore, ViM, ReAct, etc.), which apply a threshold to the classifier's confidence score and classify low-confidence samples as unknown species. In essence, this type of method makes an indirect judgment based on parameter memory. The fourth category is methods based on Multimodal Large Language Models (MLLM) (such as GPT-4, QwenVL, etc.). These methods use the image and text understanding capabilities of MLLMs to identify and classify species.

[0004] However, the aforementioned existing fine-grained recognition methods all have significant drawbacks: (1) Existing closed-set classification methods often handle the classification of known species and the discovery of unknown species separately, making it difficult to achieve both within a unified framework. (2) Although existing nearest-neighbor retrieval methods can complete both tasks, the discovery of unknown species often depends on adjusting a posterior threshold, and there is a performance trade-off between the accuracy of known species and the discovery rate of unknown species that is difficult to reconcile. (3) In a large-scale fine-grained species space, the correct species is often already in the retrieval candidate set, yet existing open-set detection methods based on confidence thresholds cannot effectively convert these high-recall retrieval results into high-precision decision results, resulting in a significant retrieval-decision performance gap. (4) Existing MLLM-based methods generally lack interpretability because they usually output the results of known species classification and unknown species discovery directly, and cannot provide biological researchers with visual evidence and a reasoning basis to support their judgments. (5) Existing MLLM-based methods have extremely high costs for constructing supervised data for unknown species: a large number of open-set samples must be labeled manually to determine their category, and the test labels and training labels must not overlap at all, which limits the scalability of the method.

Summary of the Invention

[0005] To address the aforementioned deficiencies or improvement needs of existing technologies, this invention provides an open-set fine-grained image recognition method and system based on retrieval-enhanced multimodal reasoning. Its purpose is to solve the following problems: First, existing closed-set classification methods often handle known species classification and unknown species discovery separately, making it difficult to achieve simultaneous processing within a unified framework. Second, while existing nearest-neighbor retrieval methods can accomplish both tasks, the discovery of unknown species often relies on posterior threshold adjustment, resulting in a difficult performance trade-off between known species accuracy and unknown species discovery rate. Third, existing open-set detection methods based on confidence thresholds cannot effectively convert high-recall retrieval results into high-precision decision results, leading to a significant gap in retrieval-decision performance. Fourth, existing MLLM-based methods typically directly provide known species classification and unknown species discovery results, generally lacking interpretability and failing to provide biological researchers with supporting visual evidence and reasoning basis. Fifth, the construction cost of supervised data for unknown species is extremely high, requiring manual labeling of a large number of open-set samples with completely non-overlapping test and training labels, thus limiting the scalability of the method.

[0006] To achieve the above objectives, according to one aspect of the present invention, an open-set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning is provided, comprising the following steps: (1) Obtain the query image to be identified, and use it to recall candidate species in a pre-established species reference retrieval database to obtain the top k candidate species and n example images corresponding to each candidate species; all k × n example images corresponding to the candidate species constitute the retrieval-enhanced context, where k ranges from 4 to 16 and n ranges from 1 to 4; (2) Input the query image to be identified and the retrieval-enhanced context obtained in step (1) into the pre-built multimodal reasoning model for chain-of-thought comparative reasoning to obtain the reasoning result.

[0007] Preferably, step (1) specifically includes the following sub-steps: (1-1) Obtain the query image to be identified, and perform a scaling operation on the query image to obtain a normalized query image; (1-2) Input the normalized query image obtained in step (1-1) into the visual encoder ViT for forward inference to obtain the query feature vector; (1-3) Input the query feature vector obtained in step (1-2) into the pre-established species reference retrieval database to obtain k candidate species; (1-4) For each of the k candidate species obtained in step (1-3), recall from the species reference retrieval database the n images of that candidate species with the highest similarity, which serve as the example image set corresponding to this candidate species; (1-5) Combine the k candidate species and their taxonomic names obtained in step (1-3) with the set of example images corresponding to each candidate species obtained in step (1-4) to obtain the retrieval-enhanced context $\mathcal{C} = \{(s_i, E_i)\}_{i=1}^{k}$, where $s_i$ is the taxonomic name of the $i$-th candidate species, $E_i$ is the set of example images corresponding to the $i$-th candidate species, and $i \in [1, k]$.
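Sub-steps (1-3) to (1-5) can be illustrated with a minimal sketch. It assumes that the feature vectors already exist, that similarity is cosine similarity to the query, that a species is ranked by its best-matching reference image, and that the example images of a species are its images most similar to the query. None of these details is fixed by the text above, and the function names and toy data are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_retrieval_context(query_vec, ref_vecs, ref_species, ref_image_ids, k=4, n=1):
    """Return [(species_name, [example_image_id, ...]), ...] for the top-k species."""
    sims = l2_normalize(ref_vecs) @ l2_normalize(query_vec[None, :])[0]
    order = np.argsort(-sims)  # reference images, most similar first
    context = {}               # insertion order = species rank (assumption)
    for idx in order:
        name = ref_species[idx]
        if name not in context:
            if len(context) == k:
                continue       # k candidate species already collected
            context[name] = []
        if len(context[name]) < n:
            context[name].append(ref_image_ids[idx])
    return list(context.items())

# Toy reference library: five images from three species (hypothetical data).
ref_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]])
ref_species = ["A", "A", "B", "B", "C"]
ref_image_ids = ["a1", "a2", "b1", "b2", "c1"]
query = np.array([1.0, 0.2])
context = build_retrieval_context(query, ref_vecs, ref_species, ref_image_ids, k=2, n=2)
```

With k=2 and n=2 the context holds two species with two example images each, so the model later sees k × n = 4 example images in addition to the query.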

[0008] Preferably, step (2) specifically includes the following sub-steps: (2-1) Use the prompt template to combine the query image to be identified and the retrieval-enhanced context obtained in step (1) into an inference input sequence; (2-2) Input the inference input sequence obtained in step (2-1) into the pre-constructed multimodal reasoning model to obtain the reasoning result; (3) Parse the reasoning result obtained in step (2) to obtain the final open-set fine-grained image recognition result, and determine whether it is in classification mode or discovery mode. In classification mode, the query image to be identified is recognized as a known species; in discovery mode, the query image to be identified is recognized as a new species.

[0009] Preferably, the reasoning result includes two types: classification mode and discovery mode. When the format of the reasoning result is [Classification]: taxonomic path, it indicates that the reasoning result is a classification mode. When the format of the reasoning result is [Discovery], it indicates that the reasoning result is a discovery mode.
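The two output formats can be told apart with a small parser. The sketch below assumes that the decision appears on the last non-empty line of the model output, after the reasoning chain; that placement, the `invalid` fallback, and the example taxonomic path are assumptions, not part of the described method.

```python
import re

def parse_reasoning_result(text):
    """Return ('classification', path), ('discovery', None) or ('invalid', None)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return ("invalid", None)
    last = lines[-1]  # assumption: the decision is the final line
    match = re.fullmatch(r"\[Classification\]:\s*(.+)", last)
    if match:
        return ("classification", match.group(1).strip())
    if last == "[Discovery]":
        return ("discovery", None)
    return ("invalid", None)

# Hypothetical model outputs.
known = parse_reasoning_result(
    "The wing bars match candidate 2.\n[Classification]: Aves > Passeriformes")
novel = parse_reasoning_result("No candidate matches the query.\n[Discovery]\n")
```

Returning an explicit `invalid` mode keeps malformed outputs from being silently counted as either a known or a new species.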

[0010] Preferably, the species reference database in step (1) is established according to the following steps: (A1) Obtain reference images of multiple known species categories, group and store all reference images according to species category to obtain the reference image set corresponding to each species category; (A2) Use a pre-trained visual encoder to perform forward reasoning on each reference image in the reference image set corresponding to each species category obtained in step (A1) to obtain the feature vector corresponding to the reference image. The feature vectors corresponding to all reference images in the reference image set corresponding to all species categories constitute the feature vector set. (A3) Write the set of feature vectors obtained in step (A2) into the vector index structure to obtain the species reference retrieval library.

[0011] Preferably, the retrieval-enhanced supervised training dataset in step (2) is constructed according to the following steps: (B1) Obtain the iNaturalist dataset, take its training set, and divide the training set at a ratio of 1:4 into two non-overlapping parts: a dataset of query images to be identified and a dataset for the species reference retrieval database. (B2) Use the species reference retrieval library obtained in step (1) to perform candidate species recall on the dataset of query images to be identified obtained in step (B1) to obtain k candidate species corresponding to each query image to be identified and n example images corresponding to each candidate species, where k∈[4,16] and n∈[1,4]; (B3) Set the counter cnt1=1; (B4) Determine whether the true species of the cnt1-th query image in the dataset of query images to be identified appears among the k candidate species obtained in step (B2) for the cnt1-th query image. If so, mark the cnt1-th query image as a known species identification training sample, set its supervision label to the taxonomic path of the corresponding candidate species, and then proceed to step (B5); otherwise, mark the cnt1-th query image as a discovery training sample for new species, set its supervision label to discovery, and then proceed to step (B5). (B5) Set cnt1 = cnt1 + 1; (B6) Mix all known species identification training samples and all discovery training samples to obtain a training sample set with supervision labels; (B7) Call the large vision-language model to perform chain-of-thought (CoT) reasoning on each training sample in the training sample set with supervision labels obtained in step (B6) to obtain the structured reasoning result corresponding to the training sample. All training samples whose structured reasoning results are consistent with their supervision labels constitute the retrieval-enhanced supervised training dataset, and all training samples whose structured reasoning results are inconsistent with their supervision labels constitute the prediction failure sample set.

[0012] Preferably, the multimodal reasoning model adopts the Qwen2.5-VL-7B-Instruct model and is constructed according to the following steps: (C1) Obtain the retrieval-enhanced supervised training dataset, input each training sample in it into the multimodal reasoning model to obtain the output sequence of the training sample and the estimated output sequence of the multimodal reasoning model, and obtain the negative log-likelihood loss value $\mathcal{L}_{\mathrm{NLL}}$ from these two sequences; (C2) Based on the negative log-likelihood loss value calculated in step (C1), iteratively update the parameters of the multimodal reasoning model using the backpropagation algorithm to obtain a multimodal reasoning model with chain-of-thought reasoning ability; (C3) Repeat steps (C1) and (C2) until the parameters of the multimodal reasoning model converge, thereby obtaining a pre-trained multimodal reasoning model; (C4) Obtain the prediction failure samples of the large vision-language model, perform candidate species recall on them to obtain the initial hard samples, obtain the cross-domain dataset ImageNet-Think from the website, and perform candidate species recall on ImageNet-Think and the iNaturalist dataset in the species reference retrieval library to obtain the hard sample training set; (C5) Input each training sample from the hard sample training set obtained in step (C4) into the pre-trained multimodal reasoning model obtained in step (C3) for inference to obtain $G$ reasoning chains corresponding to the training sample, and obtain the advantage value $A_i$ corresponding to the $i$-th reasoning chain; (C6) Use the advantage value $A_i$ corresponding to each reasoning chain obtained in step (C5) to obtain the policy gradient objective $J(\theta)$; (C7) Use the policy gradient objective $J(\theta)$ obtained in step (C6) to update the parameters of the multimodal reasoning model, and repeat steps (C5) and (C6) until the parameters of the multimodal reasoning model converge, thereby obtaining the established multimodal reasoning model.

[0013] Preferably, the negative log-likelihood loss value $\mathcal{L}_{\mathrm{NLL}}$ in step (C1) is calculated as: $\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{x_q \in \mathcal{D}} \sum_{t=1}^{T} \log \pi_\theta\left(y_{q,t} \mid x_q, y_{q,<t}\right)$; where $y_q$ is the estimated output sequence of the $q$-th training sample in the retrieval-enhanced supervised training dataset, $y_{q,t}$ is the $t$-th token in the estimated output sequence, $y_{q,<t}$ is the tokens preceding the $t$-th token in the estimated output sequence, $T$ is the length of the estimated output sequence, $\pi_\theta$ is the multimodal reasoning model to be trained, $\mathcal{D}$ is the retrieval-enhanced supervised training dataset, and $x_q$ is the $q$-th training sample in the retrieval-enhanced supervised training dataset, with $q \in [1, N]$, where $N$ is the total number of training samples in the retrieval-enhanced supervised training dataset, and $t \in [1, T]$.
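As a numerical illustration of a token-level negative log-likelihood, the sketch below computes the loss for a single training sample from per-step logits. It assumes the model exposes one logit vector per output position and that the per-token terms are summed rather than averaged; both are assumptions, and the function name is hypothetical.

```python
import numpy as np

def sequence_nll(logits, target_ids):
    """Sum over t of -log p(y_t | x, y_<t), given logits of shape (T, vocab)."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.sum())

# A model that is uniform over a 4-token vocabulary pays log(4) per token.
uniform_loss = sequence_nll(np.zeros((3, 4)), [0, 1, 2])

# A model that puts most mass on the correct tokens pays much less.
confident = np.full((3, 4), -5.0)
confident[np.arange(3), [0, 1, 2]] = 5.0
confident_loss = sequence_nll(confident, [0, 1, 2])
```

Summing such per-sample values over the training set gives the quantity that backpropagation then minimizes.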

[0014] Preferably, step (C5) specifically adopts the following formula: $A_i = \frac{r(o_i) - \mathrm{mean}(\{r(o_j)\}_{j=1}^{G})}{\mathrm{std}(\{r(o_j)\}_{j=1}^{G})}$; where $r(\cdot)$ represents the reward function, whose value for the reasoning chain $o_i$ is 0 or 1, and $G$ represents the number of reasoning chains obtained, which is equal to 8; Step (C6) specifically uses the following formula: $J(\theta) = \frac{1}{G}\sum_{i=1}^{G}\min\left(\rho_i A_i,\ \mathrm{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,A_i\right)$; where $\rho_i$ represents the policy probability ratio for the $i$-th sample, with a value range of $1-\epsilon$ to $1+\epsilon$, preferably 1; $\mathrm{clip}(\cdot)$ represents the clipping function, which limits $\rho_i$ of the $i$-th sample to between $1-\epsilon$ and $1+\epsilon$: if $\rho_i$ is greater than $1+\epsilon$, the value $1+\epsilon$ is taken; if $\rho_i$ is lower than $1-\epsilon$, the value $1-\epsilon$ is taken; otherwise the value of $\rho_i$ itself is taken; $\epsilon$ represents the update amplitude limit parameter, which ranges from 0.1 to 0.2, with 0.2 being the preferred value.
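The advantage and clipping steps can be tried on toy numbers. The sketch below assumes a group-relative advantage normalized by the group mean and standard deviation, and a clipped surrogate that takes the minimum of the clipped and unclipped terms, as is common in PPO-style and GRPO-style training. The exact form used by the invention is not fully recoverable from the text, so this is an illustration and not the patented formula; all function names are hypothetical.

```python
import numpy as np

def group_advantages(rewards, tiny=1e-8):
    """Advantage of each reasoning chain relative to its group of G chains."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + tiny)  # tiny avoids division by zero

def clipped_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) over the group."""
    rho = np.asarray(ratios, dtype=np.float64)
    adv = np.asarray(advantages, dtype=np.float64)
    clipped = np.clip(rho, 1.0 - epsilon, 1.0 + epsilon)
    return float(np.minimum(rho * adv, clipped * adv).mean())

# G = 8 reasoning chains with binary rewards: two correct, six wrong.
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
adv = group_advantages(rewards)

# A ratio of 1.5 with positive advantage is capped at 1 + epsilon = 1.2.
capped = clipped_objective([1.5], [1.0], epsilon=0.2)
```

Correct chains receive a positive advantage and wrong chains a negative one, while the clip keeps any single update from moving the policy ratio's contribution beyond the band set by epsilon.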

[0015] According to another aspect of the present invention, an open-set fine-grained image recognition system based on retrieval-enhanced multimodal reasoning is provided, comprising the following modules: The first module is used to acquire the query image to be identified, and then use this image to recall candidate species from a pre-established species reference retrieval database to obtain the top k candidate species and n example images corresponding to each candidate species; all k × n example images corresponding to the candidate species constitute the retrieval-enhanced context, where k ranges from 4 to 16 and n ranges from 1 to 4; The second module is used to input the query image to be identified obtained by the first module and the retrieval-enhanced context into a pre-built multimodal reasoning model for chain-of-thought comparative reasoning to obtain the reasoning result.

[0016] In general, compared with the prior art, the technical solution conceived by this invention can achieve the following beneficial effects: (1) Since the present invention adopts steps (1) to (3), it unifies species identification and new species discovery into a single decision problem based on retrieval evidence and processes the two tasks end to end. Therefore, it can solve the technical problem that existing closed-set classification methods cannot simultaneously achieve the classification of known species and the discovery of unknown species within a unified framework. (2) Since the present invention adopts steps (2) to (3), it redefines the discovery of new species as an explicit decision problem based on retrieval evidence, rather than an implicit confidence judgment based on parameter memory. Therefore, it can solve the technical problem of the performance trade-off between known species accuracy and unknown species discovery rate in existing nearest-neighbor retrieval methods. (3) Since the present invention adopts steps (1) to (2), it designs a chain-of-thought comparative reasoning mechanism based on the retrieval candidate set, which can effectively transform high-recall retrieval results into high-precision decision results. Therefore, it can solve the technical problem of the significant retrieval-decision performance gap in existing open-set detection methods based on confidence thresholds. (4) Since the present invention adopts steps (2) to (3), it requires the multimodal reasoning model to generate a structured reasoning chain first, so that each judgment has verifiable visual evidence. Therefore, it can solve the technical problem of the lack of interpretability in existing MLLM-based methods. (5) Since the present invention adopts steps (2-1) to (2-7), it automatically determines the supervision label using the retrieval results, which greatly reduces the cost of building open-world data. Therefore, it can solve the technical problems of existing MLLM-based methods: the high cost of building supervised data for unknown species, the need to manually label the categories of a large number of open-set samples, and the requirement that test labels and training labels do not overlap at all, which together limit the scalability of the method. (6) The invention is simple to implement: the inference depth can be dynamically controlled by adjusting k and n to adapt to different deployment resource constraints. (7) The invention is widely applicable: it manages the candidate species set through an external retrieval database, so expanding the retrieval database does not require retraining the model, which gives strong scalability. (8) The invention generalizes well and achieves significantly better performance than baseline methods on large-scale fine-grained species identification benchmarks and multiple cross-domain datasets.

Attached Figure Description

[0017] Figure 1 is a flowchart of the open-set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning of the present invention; Figure 2 compares the theoretical curve with the accuracy of the model of the present invention on the iNaturalist test set: Figure 2(a) compares the theoretical upper bound of the pre-trained visual encoder (ViT-B) with the accuracy of the model of the present invention; Figure 2(b) compares the theoretical upper bound of the pre-trained visual encoder (ViT-L) with the accuracy of the model of the present invention; Figure 2(c) compares the theoretical upper bound of the pre-trained visual encoder (ViT-H) with the accuracy of the model of the present invention.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0019] The basic idea of this invention is to address the long-standing problem of separate modeling of known species classification and unknown species discovery in open-world species identification tasks, and to propose a unified method based on retrieval-enhanced multimodal reasoning.

[0020] As shown in Figure 1, this invention provides an open-set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning, comprising the following steps: (1) Obtain the query image to be identified, and use it to recall candidate species in a pre-established species reference retrieval database to obtain the top k candidate species and n example images corresponding to each candidate species; all k × n example images corresponding to the candidate species constitute the retrieval-enhanced context, where k ranges from 4 to 16, preferably 16, and n ranges from 1 to 4, preferably 4; This step specifically includes the following sub-steps: (1-1) Obtain the query image to be identified, and scale it (specifically to a size of 224×224) to obtain a normalized query image; (1-2) Input the normalized query image obtained in step (1-1) into the visual encoder (Vision Transformer, or ViT for short) for forward inference to obtain the query feature vector; (1-3) Input the query feature vector obtained in step (1-2) into the pre-established species reference retrieval database to obtain k candidate species; (1-4) For each of the k candidate species obtained in step (1-3), recall from the species reference retrieval database the n images of that candidate species with the highest similarity, which serve as the example image set corresponding to this candidate species; (1-5) Combine the k candidate species and their taxonomic names obtained in step (1-3) with the set of example images corresponding to each candidate species obtained in step (1-4) to obtain the retrieval-enhanced context $\mathcal{C} = \{(s_i, E_i)\}_{i=1}^{k}$, where $s_i$ is the taxonomic name of the $i$-th candidate species, $E_i$ is the set of example images corresponding to the $i$-th candidate species, and $i \in [1, k]$; (2) Input the query image to be identified and the retrieval-enhanced context obtained in step (1) into the pre-built multimodal reasoning model for chain-of-thought comparative reasoning to obtain the reasoning result; This step specifically includes the following sub-steps: (2-1) Use the prompt template to combine the query image to be identified and the retrieval-enhanced context obtained in step (1) into an inference input sequence; (2-2) Input the inference input sequence obtained in step (2-1) into the pre-constructed multimodal reasoning model to obtain the reasoning result; Specifically, the reasoning result takes one of two forms: classification mode and discovery mode. When the format of the reasoning result is [Classification]: taxonomic path, the reasoning result is in classification mode. When the format of the reasoning result is [Discovery], the reasoning result is in discovery mode.

[0021] (3) Parse the reasoning result obtained in step (2) to obtain the final open-set fine-grained image recognition result, and determine whether it is in classification mode or discovery mode. In classification mode, the query image to be identified is recognized as a known species; in discovery mode, the query image to be identified is recognized as a new species.
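Sub-step (2-1) combines the query image and the retrieved candidates into one input sequence. The patent does not disclose the prompt template, so the sketch below uses a hypothetical template and a generic list-of-parts representation in which images are placeholders (file names here); the wording, the part layout, and the function name are all assumptions made for illustration.

```python
def build_inference_input(query_image, context):
    """Interleave text and image parts into one input sequence (a list of dicts)."""
    parts = [{"type": "text", "text": "Query image:"},
             {"type": "image", "image": query_image}]
    for rank, (name, examples) in enumerate(context, start=1):
        parts.append({"type": "text", "text": f"Candidate {rank}: {name}"})
        parts.extend({"type": "image", "image": img} for img in examples)
    # Hypothetical instruction asking for a reasoning chain and a final decision.
    parts.append({"type": "text", "text": (
        "Compare the query with each candidate step by step, then answer with "
        "'[Classification]: <taxonomic path>' or '[Discovery]'.")})
    return parts

# Two candidate species with two and one example images (hypothetical file names).
context = [("Species A", ["a1.jpg", "a2.jpg"]), ("Species B", ["b1.jpg"])]
sequence = build_inference_input("query.jpg", context)
num_images = sum(1 for part in sequence if part["type"] == "image")
```

Keeping each candidate's name next to its example images lets the model cite a specific candidate when it writes its comparison.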

[0022] In a typical embodiment, the inference depth can be dynamically controlled by adjusting the number of candidate species k and the number of example images n in step (1): the classification accuracy of this method increases steadily with (k, n), from 44.45% at k=4 and n=1 to 61.63% at k=32 and n=4, showing good test-time scaling without the need to retrain the model.
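The resource side of this trade-off can be estimated from the definition of the context: each query is accompanied by k × n example images. The short sketch below counts the images the model must read per query for a few settings; treating image count as a proxy for inference cost is an assumption, and the function name is hypothetical.

```python
def context_image_count(k, n):
    """Images read per query: the query itself plus k * n example images."""
    return 1 + k * n

settings = [(4, 1), (16, 4), (32, 4)]
counts = {(k, n): context_image_count(k, n) for k, n in settings}
```

Going from (k=4, n=1) to (k=32, n=4) raises the per-query image count from 5 to 129, which is the cost paid for the accuracy gain reported above.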

[0023] Specifically, the species reference retrieval database in step (1) of this invention is established according to the following steps: (A1) Obtain reference images of multiple known species categories, group and store all reference images according to species category to obtain the reference image set corresponding to each species category; (A2) Use a pre-trained visual encoder (Vision Transformer, or ViT for short) to perform forward reasoning on each reference image in the reference image set corresponding to each species category obtained in step (A1) to obtain the feature vector corresponding to the reference image. The feature vectors corresponding to all reference images in the reference image set corresponding to all species categories constitute the feature vector set. (A3) Write the set of feature vectors obtained in step (A2) into the vector index structure to obtain the species reference retrieval library.

[0024] In a typical implementation, a subset of the iNaturalist dataset, containing 10,000 species categories and approximately 500,000 reference images, is used as the reference image source. Feature vectors are extracted using the OpenCLIP ViT-H/14 encoder, and a FAISS index is constructed to obtain the species reference retrieval library.
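Steps (A1) to (A3) can be sketched without any external service. The example below replaces the OpenCLIP encoder with a toy stand-in and the FAISS index with an exact inner-product search over a NumPy array, so that it runs on its own; both substitutions, the data layout, and all names are assumptions made for illustration.

```python
import numpy as np

def toy_encoder(image):
    """Stand-in for a pre-trained visual encoder: flatten and L2-normalize."""
    v = np.asarray(image, dtype=np.float32).ravel()
    return v / np.linalg.norm(v)

def build_reference_library(images_by_species, encoder=toy_encoder):
    """Group images by species, encode each one, and store vectors with labels."""
    vectors, species = [], []
    for name, images in images_by_species.items():  # (A1) grouped by species
        for img in images:
            vectors.append(encoder(img))             # (A2) one vector per image
            species.append(name)
    return {"vectors": np.stack(vectors), "species": species}  # (A3) flat index

def search(library, query_vec, top=3):
    """Exact inner-product search over the flat index."""
    scores = library["vectors"] @ query_vec
    best = np.argsort(-scores)[:top]
    return [(library["species"][i], float(scores[i])) for i in best]

# Tiny 1x2 "images" for two species (hypothetical data).
images = {"A": [np.array([[1.0, 0.0]]), np.array([[0.8, 0.2]])],
          "B": [np.array([[0.0, 1.0]])]}
library = build_reference_library(images)
hits = search(library, toy_encoder([[1.0, 0.1]]), top=2)
```

Because the library is only a set of vectors with species labels, new species can be added by appending rows, without touching the reasoning model.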

[0025] The retrieval-enhanced supervised training dataset in step (2) of this invention is constructed according to the following steps: (B1) Obtain the iNaturalist dataset, take its training set, and divide the training set at a ratio of 1:4 into two non-overlapping parts: a dataset of query images to be identified and a dataset for the species reference retrieval database.

[0026] (B2) Use the species reference retrieval library obtained in step (1) to perform candidate species recall on the dataset of query images to be identified obtained in step (B1) to obtain k candidate species corresponding to each query image to be identified and n example images corresponding to each candidate species, where k∈[4,16] and n∈[1,4]; (B3) Set the counter cnt1=1; (B4) Determine whether the true species of the cnt1-th query image in the dataset of query images to be identified appears among the k candidate species obtained in step (B2) for the cnt1-th query image. If so, mark the cnt1-th query image as a known species identification training sample, set its supervision label to the taxonomic path of the corresponding candidate species, and then proceed to step (B5); otherwise, mark the cnt1-th query image as a discovery training sample for new species, set its supervision label to discovery, and then proceed to step (B5). (B5) Set cnt1 = cnt1 + 1; (B6) Mix all known species identification training samples and all discovery training samples to obtain a training sample set with supervision labels; (B7) Call the large vision-language model to perform chain-of-thought (CoT) reasoning on each training sample in the training sample set with supervision labels obtained in step (B6) to obtain the structured reasoning result corresponding to the training sample. All training samples whose structured reasoning results are consistent with their supervision labels constitute the retrieval-enhanced supervised training dataset, and all training samples whose structured reasoning results are inconsistent with their supervision labels constitute the prediction failure sample set.
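The labeling rule in step (B4) and the filtering in step (B7) can be sketched as two small functions. The sketch assumes each query carries its ground-truth species and taxonomic path, that membership is tested by species name, and that labels reuse the [Classification] and [Discovery] output formats; the field names and the teacher predictions are hypothetical.

```python
def assign_supervision_labels(queries):
    """Label each query from its retrieval result, with no manual annotation."""
    labelled = []
    for q in queries:
        if q["true_species"] in q["candidates"]:   # recalled: known-species sample
            kind, label = "known", f"[Classification]: {q['true_path']}"
        else:                                      # missed: discovery sample
            kind, label = "discovery", "[Discovery]"
        labelled.append({**q, "kind": kind, "label": label})
    return labelled

def split_by_teacher_agreement(labelled, teacher_predictions):
    """Keep samples whose teacher output matches the label; the rest are failures."""
    kept, failed = [], []
    for sample, prediction in zip(labelled, teacher_predictions):
        (kept if prediction == sample["label"] else failed).append(sample)
    return kept, failed

# Two hypothetical queries that share the same retrieved candidate set.
queries = [
    {"true_species": "A", "true_path": "Aves > A", "candidates": ["A", "B"]},
    {"true_species": "Z", "true_path": "Aves > Z", "candidates": ["A", "B"]},
]
labelled = assign_supervision_labels(queries)
kept, failed = split_by_teacher_agreement(
    labelled, ["[Classification]: Aves > A", "[Classification]: Aves > B"])
```

The second query becomes a discovery sample only because retrieval missed its species, which is how the procedure produces open-set supervision from a closed labeled dataset.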

[0027] In a typical embodiment, the above process yields 76,621 retrieval-enhanced supervised training samples, comprising 59,709 known species identification samples and 16,912 discovery training samples, together with a separate set of 44,354 prediction failure samples. The known species identification samples and the discovery training samples together constitute the DeepTaxon-SFT-76k retrieval-enhanced supervised training dataset.

[0028] The multimodal reasoning model in step (3) of this invention adopts the Qwen2.5-VL-7B-Instruct model and is constructed according to the following steps: (C1) Obtain the retrieval-enhanced supervised training dataset, and input each training sample in the retrieval-enhanced supervised training dataset into the multimodal reasoning model to obtain the output sequence of the training sample and the estimated output sequence of the multimodal reasoning model. Obtain the negative log-likelihood loss value $\mathcal{L}_{\mathrm{NLL}}$ based on the output sequence of the training sample and the estimated output sequence of the model corresponding to the training sample. Specifically, the negative log-likelihood loss value $\mathcal{L}_{\mathrm{NLL}}$ in this step is calculated as:

$$\mathcal{L}_{\mathrm{NLL}} = -\mathbb{E}_{x_q \sim D}\left[\sum_{t=1}^{T} \log \pi_\theta\left(y_t \mid x_q,\, y_{<t}\right)\right]$$

where $y$ is the estimated output sequence of the q-th training sample in the retrieval-enhanced supervised training dataset, $y_t$ is the t-th token of the estimated output sequence, $y_{<t}$ denotes the tokens of the estimated output sequence preceding position t, $T$ is the length of the estimated output sequence, $\pi_\theta$ is the multimodal reasoning model to be trained, $D$ is the retrieval-enhanced supervised training dataset, and $x_q$ is the q-th training sample in the retrieval-enhanced supervised training dataset, with q∈[1, the total number of training samples in the retrieval-enhanced supervised training dataset] and t∈[1, T].
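As a numerical illustration of the negative log-likelihood in step (C1), the sketch below computes the loss from per-token probabilities that a model assigns to the target tokens. The function names and the choice to average over samples are assumptions made for this example; in practice the probabilities would come from the multimodal model's output distribution.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of one target sequence.

    token_probs[t] is the probability the model assigns to the t-th target
    token given the input sample and all preceding target tokens.
    """
    return -sum(math.log(p) for p in token_probs)


def dataset_nll(batch: List[List[float]]) -> float:
    """Average the per-sequence loss over the samples (assumed reduction)."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)
```

A sequence whose target tokens all receive probability 1 has zero loss, and the loss grows as the model assigns lower probability to any target token.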

[0029] (C2) Based on the negative log-likelihood loss value calculated in step (C1), iteratively update the parameters of the multimodal reasoning model using the backpropagation algorithm to obtain a multimodal reasoning model with chain-of-thought reasoning ability; (C3) Repeat steps (C1) and (C2) until the parameters of the multimodal reasoning model (including the LoRA layers) converge, thereby obtaining a pre-trained multimodal reasoning model; (C4) Obtain the prediction failure samples produced by the large vision-language model (e.g., Doubao), perform candidate species recall on the prediction failure samples to obtain the initial hard samples, obtain the cross-domain dataset ImageNet-Think from the website, and perform candidate species recall in the species reference retrieval database on the obtained cross-domain dataset ImageNet-Think and the iNaturalist dataset to obtain the hard sample training set. Specifically, in this step the ImageNet-Think cross-domain dataset is obtained from https://huggingface.co/datasets/krishnateja95/ImageNet-Think; (C5) Input each training sample from the hard sample training set obtained in step (C4) into the pre-trained multimodal reasoning model obtained in step (C3) for inference, so as to obtain multiple reasoning chains corresponding to the training sample, and obtain from the i-th reasoning chain $o_i$ its corresponding advantage value $A_i$. This step specifically uses the following formula:

$$A_i = \frac{R(o_i) - \operatorname{mean}\left(\{R(o_j)\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R(o_j)\}_{j=1}^{G}\right)}$$

where $R$ is the reward function, whose value for the reasoning chain $o_i$ is 0 or 1, and $G$ is the number of reasoning chains obtained, which is equal to 8 in this invention.
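The advantage computation of step (C5) can be illustrated with a small sketch. It assumes a group-normalized advantage (each chain's reward minus the group mean, divided by the group standard deviation), which is a common choice for group-based policy optimization; the function name and the small constant `eps` that guards against division by zero are assumptions added for this example.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the 0/1 rewards of one group of reasoning chains.

    rewards: one reward per reasoning chain sampled for the same training sample.
    eps:     assumed stabilizer so that a group with identical rewards yields zeros.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# A group of G = 8 chains, three of which earned reward 1.
advantages = group_advantages([1, 0, 0, 0, 1, 1, 0, 0])
```

Chains that earned reward 1 receive a positive advantage and the others a negative one; when all chains in a group receive the same reward, every advantage is zero and the group contributes no learning signal.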

[0030] (C6) From the advantage value $A_i$ corresponding to each reasoning chain obtained in step (C5), obtain the policy gradient objective:

$$J(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i A_i,\ \operatorname{clip}\left(\rho_i,\, 1-\epsilon,\, 1+\epsilon\right) A_i\right)$$

where $G$ is the number of reasoning chains, $\rho_i$ is the policy probability ratio of the i-th sample, whose value range is $1-\epsilon$ to $1+\epsilon$, preferably 1, and $\operatorname{clip}(\cdot)$ is a truncation function that restricts a value to a specified range; a value outside the range is truncated to the nearest boundary. In this invention, $\operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)$ restricts $\rho_i$ of the i-th sample to between $1-\epsilon$ and $1+\epsilon$: if $\rho_i$ is greater than $1+\epsilon$, the value is $1+\epsilon$; if $\rho_i$ is lower than $1-\epsilon$, the value is $1-\epsilon$; otherwise the value is $\rho_i$ itself. $\epsilon$ is the update amplitude limit parameter, which ranges from 0.1 to 0.2, with 0.2 being the preferred value.
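The truncation described in step (C6) can be sketched as below. This is an illustrative assumption of a clipped surrogate objective of the usual form (the minimum of the unclipped and clipped terms, averaged over the group); the function names and the averaging are choices made for the example, while the default amplitude limit 0.2 follows the preferred value in the text.

```python
from typing import List


def clip(value: float, low: float, high: float) -> float:
    """Restrict value to [low, high]; out-of-range values snap to the boundary."""
    return max(low, min(value, high))


def clipped_objective(
    ratios: List[float], advantages: List[float], epsilon: float = 0.2
) -> float:
    """Average clipped policy objective over one group of reasoning chains.

    ratios:     policy probability ratio of each chain (new policy / old policy).
    advantages: advantage value of each chain.
    epsilon:    update amplitude limit parameter (0.1 to 0.2 in the text).
    """
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = clip(rho, 1.0 - epsilon, 1.0 + epsilon)
        # Taking the minimum removes any incentive to move the ratio
        # outside the trusted range in the direction the advantage favors.
        terms.append(min(rho * adv, clipped_rho * adv))
    return sum(terms) / len(terms)
```

For example, a ratio of 1.5 with a positive advantage is treated as 1.2, so a single update cannot increase the probability of a rewarded chain by more than the amplitude limit allows.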

[0031] (C7) Use the policy gradient objective obtained in step (C6) to update the parameters of the multimodal reasoning model (including the LoRA layers), and repeat steps (C5) and (C6) until the parameters of the multimodal reasoning model converge, thereby obtaining the trained multimodal reasoning model.

[0032] In a typical embodiment, Qwen2.5-VL-7B-Instruct was selected as the base model. Supervised fine-tuning (steps (C1) to (C3)) used a learning rate of 1e-4 for 1 epoch. The reinforcement learning phase (steps (C4) to (C7)) used a learning rate of 1e-5 for 2 epochs with a group size G of 8. Training was run on an 8-GPU H20 server. After reinforcement learning, the image recognition accuracy on iNaturalist improved from 57.80% (after supervised fine-tuning) to 60.07%.

[0033] Experimental results. This invention was systematically verified on the iNaturalist large-scale species benchmark and six fine-grained cross-domain datasets, namely Flowers102 (dataset access link: https://www.robots.ox.ac.uk/~vgg/data/flowers/102/), Butterfly-200 (dataset access link: https://cse.sysu.edu.cn/hcp/article/145), Stanford Dogs (dataset access link: http://vision.stanford.edu/aditya86/ImageNetDogs/), Food-101 (dataset access link: https://docs.pytorch.ac.cn/vision/stable/generated/torchvision.datasets.Food101.html), Stanford Cars (dataset access link: https://github.com/cyizhuo/Stanford_Cars_dataset), and FGVC-Aircraft (dataset access link: https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft).

[0034] In image recognition tasks, this invention achieves an accuracy of 60.07% on the iNaturalist test set, which is 25.43 percentage points higher than the nearest neighbor Top-1 retrieval baseline (34.64%). On six cross-domain datasets, this invention outperforms the Top-1 retrieval baseline by an average of more than 20 percentage points, reaching 99.17% on Flowers102 and 93.02% on Stanford Cars.

[0035] In the task of discovering unknown species, this invention achieved a discovery rate of over 95% on cross-domain datasets (Food-101, Stanford Cars, FGVC-Aircraft) without modifying any parameters, and a discovery rate of 49.4% on fine-grained biological datasets (such as Flowers102).

[0036] The above results verify the effectiveness of the present invention in unifying image recognition and discovery of unknown species, eliminating the performance gap between retrieval and decision-making, and cross-domain generalization.

[0037] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A fine-grained image recognition method for open sets based on retrieval-enhanced multimodal reasoning, characterized in that it includes the following steps: (1) Obtain the query image to be identified, and use the query image to be identified to recall candidate species in a pre-established species reference retrieval database, obtaining the top k candidate species and n example images corresponding to each candidate species; all example images corresponding to all k candidate species constitute the retrieval enhancement context, where k ranges from 4 to 16 and n ranges from 1 to 4; (2) Input the query image to be identified and the retrieval enhancement context obtained in step (1) into the pre-built multimodal reasoning model for chain-of-thought comparative reasoning to obtain the reasoning result.

2. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 1, characterized in that step (1) specifically includes the following sub-steps: (1-1) Obtain the query image to be identified, and perform a scaling operation on the query image to obtain a normalized query image; (1-2) Input the normalized query image obtained in step (1-1) into the visual encoder ViT for forward inference to obtain the query feature vector; (1-3) Input the query feature vector obtained in step (1-2) into the pre-established species reference retrieval database to obtain k candidate species; (1-4) For each of the k candidate species obtained in step (1-3), recall from the species reference retrieval database the n images with the highest similarity to that candidate species, as the example image set corresponding to that candidate species; (1-5) Combine the k candidate species and their taxonomic names obtained in step (1-3) with the example image set corresponding to each candidate species obtained in step (1-4) to obtain the retrieval enhancement context $C = \{(s_i, E_i)\}_{i=1}^{k}$, where $s_i$ is the taxonomic name of the i-th candidate species and $E_i$ is the example image set corresponding to the i-th candidate species, with i∈[1, k].

3. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 1 or 2, characterized in that step (2) specifically includes the following sub-steps: (2-1) Use the prompt template to combine the query image to be identified and the retrieval enhancement context obtained in step (1) into a reasoning input sequence; (2-2) Input the reasoning input sequence obtained in step (2-1) into the pre-built multimodal reasoning model to obtain the reasoning result; and the method further includes: (3) Parse the reasoning result obtained in step (2) to obtain the final open set fine-grained image recognition result, and determine whether the recognition result is in classification mode or discovery mode. If it is in classification mode, the query image to be identified is determined to be a known species; if it is in discovery mode, the query image to be identified is determined to be a new species.

4. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to any one of claims 1 to 3, characterized in that, The inference results include two types: classification mode and discovery mode. When the format of the inference result is [Classification]: taxonomic path, it indicates that the inference result is a classification mode. When the format of the inference result is [Discovery], it indicates that the inference result is a discovery mode.

5. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 4, characterized in that the species reference retrieval database in step (1) is established according to the following steps: (A1) Obtain reference images of multiple known species categories, and group and store all reference images by species category to obtain the reference image set corresponding to each species category; (A2) Use a pre-trained visual encoder to perform forward inference on each reference image in the reference image set corresponding to each species category obtained in step (A1) to obtain the feature vector corresponding to that reference image; the feature vectors of all reference images in the reference image sets of all species categories constitute the feature vector set; (A3) Write the feature vector set obtained in step (A2) into the vector index structure to obtain the species reference retrieval database.

6. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 5, characterized in that the retrieval-enhanced supervised training dataset in step (2) is constructed according to the following steps: (B1) Obtain the iNaturalist dataset, take its training set, and divide the training set at a ratio of 1:4 into two non-overlapping parts: a dataset of query images to be identified and a dataset for the species reference retrieval database; (B2) Use the species reference retrieval database obtained in step (1) to perform candidate species recall on the dataset of query images to be identified obtained in step (B1), obtaining k candidate species for each query image to be identified and n example images for each candidate species, where k∈[4,16] and n∈[1,4]; (B3) Set the counter cnt1=1; (B4) Determine whether the true species of the cnt1-th query image in the dataset of query images to be identified appears among the k candidate species recalled for that query image in step (B2). If so, mark the cnt1-th query image as a known species identification training sample, set its supervision label to the taxonomic path of the matching candidate species, and proceed to step (B5); otherwise, mark the cnt1-th query image as a discovery training sample (new species discovery), set its supervision label to the discovery label, and proceed to step (B5); (B5) Set cnt1 = cnt1 + 1 and return to step (B4) until all query images have been processed; (B6) Mix all known species identification training samples and all discovery training samples to obtain a training sample set with supervision labels; (B7) Invoke a large vision-language model to perform chain-of-thought (CoT) reasoning on each training sample in the training sample set obtained in step (B6), obtaining the structured reasoning result corresponding to that training sample. All training samples whose structured reasoning result is consistent with the supervision label constitute the retrieval-enhanced supervised training dataset, and all training samples whose structured reasoning result is inconsistent with the supervision label constitute the prediction failure sample set.

7. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 6, characterized in that the multimodal reasoning model adopts the Qwen2.5-VL-7B-Instruct model and is constructed according to the following steps: (C1) Obtain the retrieval-enhanced supervised training dataset, and input each training sample in the retrieval-enhanced supervised training dataset into the multimodal reasoning model to obtain the output sequence of the training sample and the estimated output sequence of the multimodal reasoning model; obtain the negative log-likelihood loss value $\mathcal{L}_{\mathrm{NLL}}$ based on the output sequence of the training sample and the estimated output sequence of the model corresponding to the training sample; (C2) Based on the negative log-likelihood loss value calculated in step (C1), iteratively update the parameters of the multimodal reasoning model using the backpropagation algorithm to obtain a multimodal reasoning model with chain-of-thought reasoning ability; (C3) Repeat steps (C1) and (C2) until the parameters of the multimodal reasoning model converge, thereby obtaining a pre-trained multimodal reasoning model; (C4) Obtain the prediction failure samples produced by the large vision-language model, perform candidate species recall on the prediction failure samples to obtain the initial hard samples, obtain the cross-domain dataset ImageNet-Think from the website, and perform candidate species recall in the species reference retrieval database on the obtained cross-domain dataset ImageNet-Think and the iNaturalist dataset to obtain the hard sample training set; (C5) Input each training sample from the hard sample training set obtained in step (C4) into the pre-trained multimodal reasoning model obtained in step (C3) for inference, so as to obtain multiple reasoning chains corresponding to the training sample, and obtain from the i-th reasoning chain $o_i$ its corresponding advantage value $A_i$; (C6) From the advantage value $A_i$ corresponding to each reasoning chain obtained in step (C5), obtain the policy gradient objective $J(\theta)$; (C7) Use the policy gradient objective $J(\theta)$ obtained in step (C6) to update the parameters of the multimodal reasoning model, and repeat steps (C5) and (C6) until the parameters of the multimodal reasoning model converge, thereby obtaining the trained multimodal reasoning model.

8. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 7, characterized in that the negative log-likelihood loss value $\mathcal{L}_{\mathrm{NLL}}$ in step (C1) is calculated as:

$$\mathcal{L}_{\mathrm{NLL}} = -\mathbb{E}_{x_q \sim D}\left[\sum_{t=1}^{T} \log \pi_\theta\left(y_t \mid x_q,\, y_{<t}\right)\right]$$

where $y$ is the estimated output sequence of the q-th training sample in the retrieval-enhanced supervised training dataset, $y_t$ is the t-th token of the estimated output sequence, $y_{<t}$ denotes the tokens of the estimated output sequence preceding position t, $T$ is the length of the estimated output sequence, $\pi_\theta$ is the multimodal reasoning model to be trained, $D$ is the retrieval-enhanced supervised training dataset, and $x_q$ is the q-th training sample in the retrieval-enhanced supervised training dataset, with q∈[1, the total number of training samples in the retrieval-enhanced supervised training dataset] and t∈[1, T].

9. The open set fine-grained image recognition method based on retrieval-enhanced multimodal reasoning according to claim 8, characterized in that step (C5) specifically uses the following formula:

$$A_i = \frac{R(o_i) - \operatorname{mean}\left(\{R(o_j)\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R(o_j)\}_{j=1}^{G}\right)}$$

where $R$ is the reward function, whose value for the reasoning chain $o_i$ is 0 or 1, and $G$ is the number of reasoning chains obtained, which is equal to 8; step (C6) specifically uses the following formula:

$$J(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i A_i,\ \operatorname{clip}\left(\rho_i,\, 1-\epsilon,\, 1+\epsilon\right) A_i\right)$$

where $\rho_i$ is the policy probability ratio of the i-th sample, whose value range is $1-\epsilon$ to $1+\epsilon$, preferably 1, and $\operatorname{clip}(\cdot)$ is the clipping function: $\operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)$ restricts $\rho_i$ of the i-th sample to between $1-\epsilon$ and $1+\epsilon$; if $\rho_i$ is greater than $1+\epsilon$, the value is $1+\epsilon$; if $\rho_i$ is lower than $1-\epsilon$, the value is $1-\epsilon$; otherwise the value is $\rho_i$ itself. $\epsilon$ is the update amplitude limit parameter, which ranges from 0.1 to 0.2, with 0.2 being the preferred value.

10. A fine-grained image recognition system for open sets based on retrieval-enhanced multimodal reasoning, characterized in that it includes the following modules: a first module, used to obtain the query image to be identified and to use the query image to be identified to recall candidate species in a pre-established species reference retrieval database, obtaining the top k candidate species and n example images corresponding to each candidate species, where all example images corresponding to all k candidate species constitute the retrieval enhancement context, k ranges from 4 to 16, and n ranges from 1 to 4; and a second module, used to input the query image to be identified and the retrieval enhancement context obtained by the first module into a pre-built multimodal reasoning model for chain-of-thought comparative reasoning to obtain the reasoning result.
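As an illustration of the two modules, the sketch below shows one possible way to recall the top k candidate species with n example images each, and to parse a reasoning result in the "[Classification]: taxonomic path" or "[Discovery]" format. It is a minimal sketch under stated assumptions: feature vectors are plain Python lists, similarity is cosine similarity, species are ranked by their best-matching reference image, and the names `recall_candidates` and `parse_result` are hypothetical. A real system would use a visual encoder and a vector index structure instead of a linear scan.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_candidates(
    query: Vector,
    library: List[Tuple[str, str, Vector]],
    k: int = 4,
    n: int = 1,
) -> List[Tuple[str, List[str]]]:
    """Build the retrieval enhancement context as (species, example image ids).

    library entries are (species_name, image_id, feature_vector).
    """
    scored = sorted(
        ((cosine(query, vec), species, image_id) for species, image_id, vec in library),
        reverse=True,
    )
    context: Dict[str, List[str]] = {}
    for _, species, image_id in scored:
        if species not in context:
            if len(context) == k:  # already have k candidate species
                continue
            context[species] = []
        if len(context[species]) < n:  # keep at most n example images per species
            context[species].append(image_id)
    return list(context.items())


def parse_result(text: str) -> Tuple[str, Optional[str]]:
    """Map a reasoning result to (mode, taxonomic path or None)."""
    text = text.strip()
    if text.startswith("[Classification]:"):
        return "classification", text[len("[Classification]:"):].strip()
    if text.startswith("[Discovery]"):
        return "discovery", None
    raise ValueError("unrecognized reasoning result format")
```

Because the discovery decision is read from the model's output rather than from a similarity threshold, the same parsing step serves both known species classification and unknown species discovery.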