Method and system for recognizing similar target images based on hierarchical semantics and attribute-guided reasoning
The method builds a hierarchical-semantics and attribute-guided approach to recognizing closely related targets. It uses hierarchical semantic description information and attribute description information to generate semantic-visual fusion category prototypes, which addresses the problem of distinguishing similar-looking targets in open-world and fine-grained recognition and improves recognition accuracy and generalization ability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2026-03-02
- Publication Date
- 2026-06-19
[Figure: abstract drawing, CN122244415A]
Abstract
Description
Technical Field
[0001] This application relates to the field of computer vision technology, specifically to a method and system for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning.
Background Technology
[0002] With the development of computer vision and intelligent sensing technologies, target recognition methods have been widely applied in various fields such as ecological monitoring, industrial inspection, intelligent manufacturing, traffic management, and product identification. Automating the identification and classification of complex targets can effectively improve system efficiency and reduce reliance on human experience. However, in real-world applications, target categories are often not fixed but rather exhibit continuous expansion, evolution, or derivation, transforming the target detection and recognition task from a closed-set problem to an open-world recognition problem.
[0003] In open-world environments, newly emerging target categories often lack sufficient labeled samples, with a limited number of available samples or even just a few examples. This makes it difficult for traditional models trained on large-scale supervised data to form stable and discriminative feature representations. Therefore, how to effectively identify and generalize new target categories under conditions of scarce samples has become a crucial problem that urgently needs to be solved in the development of current target recognition technologies.
[0004] To address these challenges, recent research has focused on extending model generalization beyond predefined categories. One important direction is few-shot object detection, which aims to learn new category representations using a limited number of labeled samples. These methods have validated that detection models can still achieve effective generalization under data-scarce conditions by transferring recognition capabilities from existing categories, constructing category prototypes, or introducing meta-learning mechanisms. Furthermore, some methods incorporate external semantic information such as text descriptions to supplement the lack of visual knowledge through cross-modal approaches, enabling the model to maintain a certain level of generalization ability even with limited samples. For example, in patent "CN116246287B; Target Object Recognition Method, Training Method, Device, and Storage Medium," feature extraction is performed on the target object description text to obtain descriptive text features and keyword text features. These features are then fused to obtain target text features. Combining these target text features with initial image features and incorporating external semantic information such as text descriptions, the method identifies target objects in the initial image that match the target object description text.
[0005] Building on this, vision-language models learn the semantic alignment between images and natural language, which enables them to identify targets from text descriptions. Overall, these methods have moved visual recognition from a closed-set paradigm that relies on fixed labels towards an open paradigm that relies on semantic inference, enabling models to recognize targets that they have never been explicitly trained on.
[0006] However, as the recognition task shifts further from general target recognition to fine-grained recognition scenarios, the effectiveness of the aforementioned methods faces new challenges. In fine-grained recognition tasks, new categories of targets typically do not differ significantly from existing categories in overall appearance or semantics. Instead, they are highly similar to existing categories in structure, form, or design style, with only minor differences in local attributes, detailed features, or combinations. For example, different generations of products, models within the same series, or closely related targets are highly consistent in overall structure, differing only in local components, textures, proportions, or configuration details. These differences are often difficult to effectively distinguish using coarse-grained features or global semantic information.
[0007] Furthermore, newly emerging fine-grained categories are often accompanied by new category names or descriptions, making it difficult for open-vocabulary detection methods that rely solely on predefined category labels or simple text prompts to accurately correspond to local visual regions in images, thus further increasing the difficulty of recognition and generalization. Existing models trained on large-scale general data can typically only learn relatively broad visual-semantic relevance, lacking the ability to model domain-specific, discriminative fine-grained attributes and hierarchical relationships. This results in highly similar-looking targets having overly similar representations in the feature space, leading to false detections or confusion.
[0008] Therefore, there is an urgent need for a recognition method that can explicitly model the hierarchical relationship between target categories under conditions of limited samples and continuous category evolution, and combine fine-grained attribute information to guide visual feature learning and reasoning. This would enable the model to infer newly emerging closely related targets or derived categories from existing category knowledge, thereby improving the generalization ability and reliability of target recognition technology in complex open scenarios.
Summary of the Invention
[0009] To address one of the shortcomings of existing technologies, the purpose of this application is to provide a method and system for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning.
[0010] A first aspect of this application provides a method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, comprising: Obtain text description information for existing target categories in the support set, wherein the text description information includes hierarchical semantic description information and attribute description information; Encode the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category; Guided by a preset hierarchical alignment unit, use a pre-trained visual encoder to perform feature extraction on target sample images of the existing target categories in the support set, and determine the attribute-guided visual feature representation of the existing target categories; Input the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target category into a pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category; Based on the semantic-visual fusion category prototype of the existing target category, identify the target regions to be identified in the query set and determine their predicted categories.
[0011] Optionally, obtaining text description information for existing target categories in the support set includes: Based on the category identifier information of the existing target categories in the support set, determine the attribute description information of the existing target categories; Extract attributes from the existing target category's attribute description information to determine the attribute set; Based on the existing target categories in the support set, determine the upper-level category information of the existing target categories; Based on the upper-level category information of the existing target category, determine the parent node category and intermediate node category of the existing target category; Based on the existing target category, the parent node category of the existing target category, and the intermediate node category of the existing target category, construct the hierarchical semantic structure of the existing target category; Based on the hierarchical semantic structure of the existing target categories, determine the hierarchical semantic description information of the existing target categories.
[0012] Optionally, the step of encoding the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category includes: Encode the attribute description information using the pre-trained text encoder to determine the attribute semantic feature representation; Encode the hierarchical semantic description information using the pre-trained text encoder to determine the hierarchical semantic feature representation; Fuse the attribute semantic feature representation and the hierarchical semantic feature representation to determine the hierarchical semantic embedding representation of the existing target category.
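The fusion step above can be illustrated with a minimal sketch. The passage does not specify the fusion operator, so the normalized weighted sum below is only an assumption, and the names `fuse_embeddings` and `weight` are hypothetical; the input vectors stand in for the outputs of the pre-trained text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(attr_feat, hier_feat, weight=0.5):
    """Fuse attribute and hierarchical semantic features.

    Assumption: fusion is a weighted sum of the two unit-normalized
    vectors, followed by re-normalization. The source does not fix
    the operator; concatenation or a learned layer would also fit.
    """
    if len(attr_feat) != len(hier_feat):
        raise ValueError("both features must have the same dimension")
    a = l2_normalize(attr_feat)
    h = l2_normalize(hier_feat)
    mixed = [weight * x + (1.0 - weight) * y for x, y in zip(a, h)]
    return l2_normalize(mixed)


# Toy vectors standing in for text-encoder outputs.
attr_feat = [3.0, 4.0]
hier_feat = [0.0, 2.0]
fused = fuse_embeddings(attr_feat, hier_feat)
```

Normalizing before mixing keeps either branch from dominating merely because its encoder output has a larger magnitude.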
[0013] Optionally, the method for constructing and training the preset hierarchical alignment unit includes: Based on the attribute set of the existing target category, the parent node category of the existing target category, and the intermediate node category of the existing target category, determine multiple learnable hierarchical alignment units; During visual feature extraction, input the sequence composed of the multiple learnable hierarchical alignment units and the visual features corresponding to the candidate boxes into the pre-trained visual encoder, where the hierarchical alignment units interact with the visual features corresponding to the candidate boxes through the self-attention mechanism of the pre-trained visual encoder to determine the visual features corresponding to each hierarchical alignment unit; Using a local attention module, take the visual features corresponding to each hierarchical alignment unit as the query vectors and the attribute semantic feature representation as the keys and values, and let them interact through a multi-head cross-attention mechanism to determine the attribute semantic feature representation aligned with the visual features corresponding to each hierarchical alignment unit; Based on the contrastive loss between the visual features of each existing target category at the same semantic level and the attribute semantic feature representation aligned with those visual features, optimize the learnable hierarchical alignment units to determine the preset hierarchical alignment units.
[0014] Optionally, the step of using a pre-trained visual encoder to extract features from target sample images of existing target categories in the support set, guided by a preset hierarchical alignment unit, to determine the attribute-guided visual feature representation of the existing target categories, includes: Use the pre-trained visual encoder to perform feature extraction on the target sample images of the existing target categories in the support set to determine their visual features; Under the guidance of the preset hierarchical alignment unit, align the visual features of the target sample images of the existing target categories in the support set with the attribute semantic feature representation at each semantic level, and determine the attribute semantic feature representation aligned with the visual features at each semantic level; Fuse the attribute semantic feature representations aligned with the visual features at each semantic level to generate the attribute-guided visual feature representation of the existing target category.
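The query/key/value roles described above can be made concrete with a small sketch. It is a simplification under stated assumptions: a single attention head with no learned projection matrices, whereas the text describes a multi-head mechanism inside a trained module. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: visual features of the hierarchical alignment units
    keys, values: attribute semantic feature representations
    Returns one attribute-aligned vector per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Two alignment-unit visual features attend over two attribute features.
unit_visual_feats = [[1.0, 0.0], [0.0, 1.0]]
attr_feats = [[4.0, 0.0], [0.0, 4.0]]
aligned = cross_attention(unit_visual_feats, attr_feats, attr_feats)
```

Each output row is a convex combination of the attribute features, weighted toward the attributes most similar to that unit's visual feature.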
[0015] Optionally, the method for determining the pre-trained alignment network includes: Obtain a training set, which contains existing target category data; Based on the training set, obtain the hierarchical semantic embedding representation of the existing target categories in the training set and the attribute-guided visual feature representation of the existing target categories in the training set; Combine the hierarchical semantic embedding representation of the existing target categories in the training set with the attribute-guided visual feature representation of the existing target categories in the training set to determine the joint feature representation; Input the joint feature representation into a preset alignment network to determine the predicted semantic-visual fusion category prototype, and train the preset alignment network to determine the trained alignment network; Based on the visual cluster centers of the existing target categories in the training set and the predicted semantic-visual fusion category prototypes, use a preset loss function to optimize the trained alignment network and determine the pre-trained alignment network.
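As a rough illustration of the last step, the sketch below computes a visual cluster center per category and compares it with a predicted prototype. Two things are assumptions rather than details from the source: the cluster center is taken to be the mean of the category's visual features, and mean squared error stands in for the unspecified "preset loss function". The names `cluster_center` and `prototype_loss` are hypothetical.

```python
def cluster_center(features):
    """Mean of a category's visual feature vectors (assumed definition)."""
    if not features:
        raise ValueError("a category needs at least one feature vector")
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def prototype_loss(predicted_prototypes, visual_features_by_class):
    """Average per-class mean squared error between each predicted
    prototype and the visual cluster center of the same class.
    MSE is a placeholder for the preset loss function."""
    total = 0.0
    for name, proto in predicted_prototypes.items():
        center = cluster_center(visual_features_by_class[name])
        total += sum((p - c) ** 2 for p, c in zip(proto, center)) / len(proto)
    return total / len(predicted_prototypes)


# Toy data: one category with two visual features; its center is [2.0, 0.0].
visual_features = {"class_a": [[1.0, 0.0], [3.0, 0.0]]}
loss_exact = prototype_loss({"class_a": [2.0, 0.0]}, visual_features)
loss_off = prototype_loss({"class_a": [0.0, 0.0]}, visual_features)
```

In an actual training loop this scalar would be minimized with respect to the alignment network's parameters; the sketch only shows how the target and the loss value could be formed.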
[0016] Optionally, the step of identifying the target region to be identified in the query set based on the semantic-visual fusion category prototype of the existing target category, and determining the predicted category of the target region to be identified, includes: Guided by the preset hierarchical alignment unit, use the pre-trained visual encoder to perform feature extraction on the target region to be identified in the query set, and determine the attribute-guided visual feature representation of the target region to be identified; Determine the cosine similarity between the semantic-visual fusion category prototype of each existing target category and the attribute-guided visual feature representation of the target region to be identified; Take the existing target category whose semantic-visual fusion category prototype has the maximum cosine similarity as the predicted category of the target region to be identified.
[0017] A second aspect of this application provides a system for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, comprising: The text description information acquisition module is used to acquire text description information for existing target categories in the support set. The text description information includes hierarchical semantic description information and attribute description information. The hierarchical semantic embedding representation acquisition module is used to encode the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category. The attribute-guided visual learning module is used to perform feature extraction processing on target sample images of existing target categories in the support set under the guidance of a preset hierarchical alignment unit, and to determine the attribute-guided visual feature representation of the existing target categories. The semantic-visual fusion category prototype construction module is used to input the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target category into a pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category. The category recognition module is used to identify the target region to be identified in the query set based on the semantic-visual fusion category prototype of the existing target category, and determine the predicted category of the target region to be identified.
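The prototype-matching step can be sketched as follows. The category names, the three-dimensional toy vectors, and the function names are hypothetical; in practice the vectors would be the prototypes produced by the alignment network and the attribute-guided feature of a query region.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_category(region_feature, prototypes):
    """Return the category whose prototype is most similar to the region
    feature, together with all similarity scores."""
    scores = {
        name: cosine_similarity(region_feature, proto)
        for name, proto in prototypes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical prototypes for two existing target categories.
prototypes = {
    "category_a": [0.9, 0.1, 0.2],
    "category_b": [0.1, 0.9, 0.3],
}
region_feature = [0.8, 0.2, 0.1]
predicted, scores = predict_category(region_feature, prototypes)
```

Because cosine similarity ignores vector length, the decision depends only on the direction of the region feature relative to each prototype.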
[0018] A third aspect of this application provides a non-transitory computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements the steps of any of the methods provided in the first aspect of this application.
[0019] A fourth aspect of this application provides an electronic device comprising: A memory on which computer programs are stored; A processor for executing the computer program in the memory to implement the steps of any of the methods provided in the first aspect of this application.
[0020] This application presents a method for image recognition of closely related targets based on hierarchical semantics and attribute-guided reasoning. By introducing hierarchical semantic description information and using hierarchical relationships between categories for reasoning, it effectively improves the ability to recognize closely related categories or derived targets. By introducing attribute description information and guiding the learning process of visual features, it designs semantic-visual fusion category prototypes for existing target categories, enabling the transfer of knowledge of known target categories to newly emerging closely related targets. This improves the ability to distinguish highly similar targets and the generalization performance for unknown categories. It can be adapted to various target recognition scenarios such as biometrics, industrial product recognition, vehicle recognition, and commodity recognition, and has good versatility and scalability.
[0021] Other technical effects resulting from the additional features will be further illustrated in the corresponding embodiments.
Attached Figure Description
[0022] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings: Figure 1 is a flowchart illustrating a method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, according to an exemplary embodiment.
[0023] Figure 2 is a schematic diagram illustrating the reasoning process of another method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, according to an exemplary embodiment.
[0024] Figure 3 is a schematic diagram illustrating the structure of a system for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, according to an exemplary embodiment.
Detailed Implementation
[0025] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application. These all fall within the protection scope of the present application.
[0026] The terms "comprising" and "having," and any variations thereof, in the embodiments of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the steps or units listed, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to such processes, methods, products, or devices.
[0027] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature.
[0028] In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0029] Existing target recognition methods suffer from strong sample dependence in open-world and fine-grained recognition scenarios, and generalize poorly to newly emerging closely related categories or derived targets. This is particularly true when target categories are constantly evolving, labeled samples are scarce, or targets are visually similar, making it difficult for existing methods to fully utilize existing category knowledge for effective reasoning, leading to decreased recognition accuracy and stability. To address these issues, this application provides a method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning.
[0030] Figure 1 is a flowchart illustrating a method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, according to an exemplary embodiment. Figure 2 is a schematic diagram illustrating the reasoning process of another method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, according to an exemplary embodiment.
[0031] Referring to Figures 1 and 2, one embodiment of this application provides a method for recognizing images of closely related targets based on hierarchical semantics and attribute-guided reasoning, including steps S11 to S15.
[0032] S11, Obtain text description information for existing target categories in the support set.
[0033] Specifically, the support set can be a reference sample set for few-shot learning. The support set can consist of N categories, with M labeled samples for each category. These samples serve as the basis for model learning or reference.
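A support set of this shape can be represented with a simple container. The sketch below is illustrative only: the class name `SupportSet`, the use of string identifiers for samples, and the check that every category holds the same number of samples are assumptions, not details from the source.

```python
from dataclasses import dataclass, field


@dataclass
class SupportSet:
    """Labeled reference samples grouped by category label."""
    samples: dict = field(default_factory=dict)  # label -> list of sample ids

    def add(self, label, sample_id):
        self.samples.setdefault(label, []).append(sample_id)

    def shape(self):
        """Return (N, M): number of categories and samples per category.

        Raises ValueError if the set is empty or categories are unbalanced.
        """
        counts = {len(ids) for ids in self.samples.values()}
        if len(counts) != 1:
            raise ValueError("every category must hold the same number of samples")
        return len(self.samples), counts.pop()


# Hypothetical 2-category, 3-sample support set.
support = SupportSet()
for label in ("category_a", "category_b"):
    for index in range(3):
        support.add(label, f"{label}_img_{index}")
n_categories, m_samples = support.shape()
```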
[0034] Existing target categories refer to categories that have already been manually labeled in the support set.
[0035] Textual description information includes hierarchical semantic description information and attribute description information. Hierarchical semantic description information is used to describe the subordinate or evolutionary relationships between target categories, while attribute description information is used to describe the target's appearance features, structural features, and local features.
[0036] S12, Encode the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category.
[0037] S13, Guided by the preset hierarchical alignment unit, use a pre-trained visual encoder to perform feature extraction on target sample images of the existing target categories in the support set, and determine the attribute-guided visual feature representation of the existing target categories.
[0038] S14, Input the hierarchical semantic embedding representation of the existing target category and the attribute-guided visual feature representation of the existing target category into the pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category.
[0039] S15, Based on the semantic-visual fusion category prototype of the existing target category, identify the target region to be identified in the query set and determine the predicted category of the target region to be identified.
[0040] By introducing hierarchical semantic description information and using the hierarchical relationships between categories for reasoning, the embodiments described above effectively improve the ability to identify closely related or derived targets. By introducing attribute description information and guiding the visual feature learning process, a semantic-visual fusion category prototype of an existing target category is designed, enabling the transfer of knowledge of known target categories to newly emerging closely related targets. This improves the ability to distinguish highly similar targets and the generalization performance on unknown categories. It can be adapted to various target recognition scenarios such as biometrics, industrial product recognition, vehicle recognition, and commodity recognition, and has good versatility and scalability.
[0041] In order to obtain the hierarchical semantic description information and attribute description information of the existing target categories in the support set, in some specific embodiments of this application, S11 can be implemented through steps S111 to S116.
[0042] S111, Based on the category identification information of the existing target categories in the support set, determine the attribute description information of the existing target categories.
[0043] Specifically, the category identification information is the target category label.
[0044] Based on the target category labels of existing target categories in the support set, multidimensional semantic description text of the target categories is generated as attribute description information of the existing target categories. The multidimensional semantic description text describes the typical characteristics of the target categories in terms of appearance features, structural features, local features, and combinations thereof. Appearance features include, but are not limited to, color features and texture features.
[0045] In this embodiment, the attribute description information of the existing target category can be generated automatically based on a preset rule template, knowledge base query, or pre-trained visual-language model, and then manually verified or corrected.
[0046] For example, for each existing target category in the support set, the attribute description information is generated as follows: a pre-trained vision-language model processes the label of the target category according to a preset template and automatically generates a natural language description text, which serves as the attribute description information of that category.
[0047] S112, Extract attributes from the existing target category attribute description information to determine the attribute set.
[0048] Specifically, the attribute description information of existing target categories is summarized, filtered, and compared and analyzed, and attributes related to the target recognition task are extracted to form an attribute set. The attribute set is used to represent the distinguishable features of existing target categories.
[0049] Following the example above, a pre-trained vision-language model can summarize, generalize, and compare the natural language description texts to extract key features and determine a candidate attribute set, which represents the attribute features of the target category at a fine-grained level. The candidate attributes are then sorted by their frequency of occurrence in the natural language description texts, from highest to lowest; a higher frequency is taken to indicate stronger discriminative power for the target category. The top J candidate attributes are selected as the attribute set.
[0050] Here, J is a preset positive integer and A denotes the attribute set, whose j-th element is the candidate attribute ranked j-th after sorting (j = 1, ..., J).
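The frequency-based selection of the top J attributes can be sketched as below. The function name `select_attributes`, the substring-based counting, and the sample texts are assumptions for illustration; the source does not say how occurrences are counted.

```python
from collections import Counter


def select_attributes(descriptions, candidates, top_j):
    """Rank candidate attributes by how often they occur in the
    description texts and keep the top_j most frequent ones.

    Counting is a case-insensitive substring match (an assumption).
    Candidates that never occur are dropped; ties keep input order.
    """
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for candidate in candidates:
            counts[candidate] += lowered.count(candidate.lower())
    ranked = sorted(candidates, key=lambda c: -counts[c])  # stable sort
    return [c for c in ranked if counts[c] > 0][:top_j]


# Hypothetical description texts for one target category.
descriptions = [
    "Elongated body with short legs and folded wings.",
    "Short legs; hard shiny texture; folded wings.",
    "Body is brownish-gray with short legs.",
]
candidates = ["wings", "legs", "texture", "antennae"]
attribute_set = select_attributes(descriptions, candidates, top_j=2)
```

Here "legs" occurs three times and "wings" twice, so with J = 2 the attribute set is those two, in that order.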
[0051] In some specific embodiments of this application, in order to enhance the ability of attribute descriptions to distinguish close targets, attribute description information containing contrast semantics is determined based on the difference relationship between existing target categories and similar target categories, and the attribute description information containing contrast semantics is added to the aforementioned attribute set.
[0052] The attribute set is then manually verified and corrected to form the final attribute set.
[0053] In some specific embodiments of this application, a predefined first conjunction is used to describe the relationship between the target category and the attribute. For example, "has" can be used as the predefined first conjunction, that is, the relationship between the target category and the attribute can be expressed in the form of "target category has attribute".
[0054] The attribute description information obtained above is used in the subsequent semantic encoding process, and the final attribute set is used in the visual feature-guided learning process.
[0055] S113, Based on the existing target categories in the support set, determine the upper-level category information of the existing target categories.
[0056] Specifically, by querying pre-trained visual-language models, domain knowledge interfaces, and officially released category information APIs, one can obtain upper-level category information associated with existing target categories, i.e., the subordinate or evolutionary relationships between target categories.
[0057] S114. Based on the upper-level category information of the existing target category, determine the parent node category and intermediate node category of the existing target category.
[0058] S115. Based on the existing target category, the parent node category of the existing target category, and the intermediate node category of the existing target category, construct the hierarchical semantic structure of the existing target category.
[0059] Specifically, existing target categories, their parent node categories, and their intermediate node categories are represented in a structured form to construct a hierarchical semantic structure containing at least three semantic levels, thereby simultaneously characterizing the common and differential semantic information of existing target categories.
[0060] The hierarchical semantic structure of an existing target category comprises the level corresponding to the parent node category, the level corresponding to the intermediate node category, and the level corresponding to the existing target category itself.
[0061] The hierarchical semantic structure of an existing target can be represented using a tree structure to support complex hierarchical relationships for subsequent semantic encoding and reasoning processes.
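One way to hold such a tree is sketched below. The class `CategoryNode`, its methods, and the placeholder category names are hypothetical; the level labels mirror the three levels described in the text (parent node, intermediate node, existing target category).

```python
class CategoryNode:
    """A node in the hierarchical semantic structure."""

    def __init__(self, name, level, parent=None):
        self.name = name
        self.level = level  # e.g. "parent", "intermediate", "target"
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_from_root(self):
        """Names from the top-level node down to this node."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))

    def siblings(self):
        """Names of nodes sharing this node's direct parent, i.e. the
        most closely related categories in the tree."""
        if self.parent is None:
            return []
        return [c.name for c in self.parent.children if c is not self]


# Placeholder three-level hierarchy with two closely related targets.
root = CategoryNode("parent_category", "parent")
mid = CategoryNode("intermediate_category", "intermediate", parent=root)
target_1 = CategoryNode("target_category_1", "target", parent=mid)
target_2 = CategoryNode("target_category_2", "target", parent=mid)
```

Walking `path_from_root()` yields the chain of levels that the hierarchical semantic description would verbalize, and `siblings()` gives natural candidates for contrastive descriptions.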
[0062] In some specific embodiments of this application, a predefined second conjunction can be used to describe the subordinate relationship between categories. For example, "belongs to" can be used as a predefined second conjunction to describe the hierarchical relationship between the target category and its parent node category, i.e., in the form of "target category belongs to parent node category".
[0063] S116, Based on the hierarchical semantic structure of the existing target categories, determine the hierarchical semantic description information of the existing target categories.
[0064] Specifically, the hierarchical semantic description information of the existing target category is the hierarchical semantic description information corresponding to the hierarchical semantic structure of the existing target category, including the semantic description text of the parent node category, the semantic description text of the intermediate node, and the semantic description text of the existing target category itself.
[0065] The hierarchical semantic description information of existing target categories can be directly generated based on the existing hierarchical semantic structure of target categories using pre-trained visual-language models, official APIs, or predefined knowledge bases.
[0066] Taking steps S111 to S116 above as an example, the existing target category Bean Weevil is selected as the example object. According to the pre-constructed hierarchical semantic structure, the semantic position of Bean Weevil in the hierarchical semantic structure is determined: the parent node category (family) is Seed beetles, the intermediate node (mid) is Acanthoscelides, and the existing target category node (species) is Bean Weevil. Thus, a three-layer hierarchical semantic structure containing the parent node category, the intermediate node, and the existing target category itself is constructed to describe the subordinate relationship between the existing target category Bean Weevil and its closely related categories.
[0067] Based on this, the top 7 distinguishable attributes are selected from the appearance and structural features of the existing target category to form an attribute set. Meanwhile, Larger grain borer is selected as the contrastive semantic category to be distinguished from Bean Weevil.
[0068] Based on the hierarchical semantic structure, attribute set, and comparative semantic categories, the hierarchical semantic description information of Bean Weevil is constructed. An example of its structured representation is shown below:
{
  "Bean Weevil": {
    "family": "Seed beetles",
    "mid": "Acanthoscelides",
    "shape": "Elongated and cylindrical",
    "color": "Brownish-gray to black",
    "texture": "Hard and shiny",
    "wings": "Present but small and folded back along the body",
    "legs": "Short and stout",
    "pattern": "No distinct patterns; uniform coloration",
    "antennae": "Long and slender, about twice the length of the body",
    "difference": "Larger grain borer"
  }
}
The hierarchical relationship is represented in the form of "target category belongs to parent node category", for example: "Bean Weevil belongs to Seed beetles". The relationship between a target category and its attributes is expressed in the form of "target category has attribute", for example: "Bean Weevil has elongated and cylindrical shape".
[0069] The aforementioned hierarchical semantic description information is represented in a combination of natural language and structured form, and serves as input for subsequent text encoding and hierarchical semantic embedding construction.
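The conversion from the structured record to natural-language descriptions can be sketched as follows. This is only an illustration under stated assumptions: the function `build_descriptions`, the key groupings, and the choice to apply the "belongs to" template to the intermediate node as well are hypothetical; only the two sentence templates and the field names come from the example above.

```python
# Abbreviated copy of the structured record from the Bean Weevil example.
record = {
    "Bean Weevil": {
        "family": "Seed beetles",
        "mid": "Acanthoscelides",
        "shape": "Elongated and cylindrical",
        "color": "Brownish-gray to black",
        "difference": "Larger grain borer",
    }
}

HIERARCHY_KEYS = ("family", "mid")  # parent node and intermediate node fields
CONTRAST_KEY = "difference"         # contrastive semantic category field


def build_descriptions(data):
    """Turn each record into hierarchy sentences and attribute sentences."""
    out = {}
    for name, fields in data.items():
        hierarchy = [
            f"{name} belongs to {fields[key]}"
            for key in HIERARCHY_KEYS
            if key in fields
        ]
        attributes = [
            # Lower-case the first letter so the value reads as part of a sentence.
            f"{name} has {value[:1].lower() + value[1:]} {key}"
            for key, value in fields.items()
            if key not in HIERARCHY_KEYS and key != CONTRAST_KEY
        ]
        out[name] = {
            "hierarchy": hierarchy,
            "attributes": attributes,
            "contrast": fields.get(CONTRAST_KEY),
        }
    return out


descriptions = build_descriptions(record)
for sentence in descriptions["Bean Weevil"]["attributes"]:
    print(sentence)
```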
[0070] By introducing fine-grained attribute description information, the embodiments described above enable the model to anchor on the discriminative core local features of the target category rather than relying vaguely on overall visual features. The model can thus accurately capture the attribute features that are key to distinguishing highly similar targets, which enhances its feature discrimination ability for similar targets, improves the accuracy and stability of fine-grained recognition at the feature representation level, and provides an interpretable attribute basis for target recognition, making recognition decisions more targeted and reasonable. In addition, by explicitly constructing a hierarchical semantic structure between target categories, the model can reason over the hierarchical relationships between categories, which effectively improves its ability to recognize closely related categories and derived targets.
[0071] In order to uniformly map hierarchical semantic description information and attribute description information to a feature space that can be aligned with visual features and construct a hierarchical semantic embedding representation of existing target categories, in some specific embodiments of this application, S121 to S123 can be used for S12.
[0072] S121, a pre-trained text encoder is used to encode the attribute description information to determine the semantic feature representation of the attributes.
[0073] Specifically, the pre-trained text encoder can be the text encoder $E_T$ of the pre-trained vision-language model CLIP.
[0074] For example, for an existing target category $c$ and its corresponding attribute set $A_c = \{a_1, a_2, \dots, a_J\}$, the attribute description information is input into the pre-trained text encoder $E_T$ and encoded to obtain a set of attribute semantic feature representations:
$T_c = \{E_T(d_c^{j})\}_{j=1}^{J}$
[0075] where $d_c^{j}$ indicates the attribute description information of the $j$-th attribute $a_j$ of the existing target category $c$, $J$ represents the total number of attributes, and $j$ is the attribute index.
[0076] Subsequently, the attribute semantic feature representations in the set are summed and averaged to obtain the fine-grained attribute semantic feature representation:
$t_s = \frac{1}{J} \sum_{j=1}^{J} E_T(d_c^{j})$
[0077] where $t_s$ represents the species-level semantic representation, and the subscript $s$ denotes the species.
[0078] Specifically, $t_s$ is a compact species-level semantic representation obtained by summing and averaging the multiple attribute semantic feature representations.
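The sum-and-average step of S121 can be illustrated with the short sketch below. Note the assumptions: `toy_text_encoder` is a hash-based stand-in with no semantic meaning, used only so the example runs without a model; it is not the CLIP text encoder, and the function names and the 8-dimensional feature size are hypothetical.

```python
import hashlib


def toy_text_encoder(text, dim=8):
    """Stand-in for the pre-trained text encoder (NOT CLIP): hash text to a vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


attribute_descriptions = [
    "Bean Weevil has elongated and cylindrical shape",
    "Bean Weevil has brownish-gray to black color",
    "Bean Weevil has short and stout legs",
]

# One feature vector per attribute description (J vectors in total).
attribute_features = [toy_text_encoder(d) for d in attribute_descriptions]

# Summing and averaging yields the compact species-level representation.
species_feature = mean_vector(attribute_features)
print(len(attribute_features), len(species_feature))
```

In practice the stand-in encoder would be replaced by the text encoder of the pre-trained vision-language model; the averaging step itself is unchanged.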
[0079] S122, a pre-trained text encoder is used to encode the hierarchical semantic description information to determine the hierarchical semantic feature representation.
[0080] Specifically, the hierarchical semantic feature representation includes the hierarchical semantic feature representation of the existing target category, the hierarchical semantic feature representation of the parent node category, and the hierarchical semantic feature representation of the intermediate nodes.
[0081] The hierarchical semantic description information of the parent node category of the existing target category $c$ is input into the pre-trained text encoder $E_T$ and encoded to obtain the hierarchical semantic feature representation $t_f$ of the parent node category.
[0082] The hierarchical semantic description information of the intermediate node of the existing target category $c$ is input into the pre-trained text encoder $E_T$ and encoded to obtain the hierarchical semantic feature representation $t_m$ of the intermediate node.
[0083] The hierarchical semantic description information of the parent node category and that of the intermediate node of the existing target category $c$ are both encoded with the same shared pre-trained text encoder $E_T$.
[0084] S123, perform fusion processing on attribute semantic feature representation and hierarchical semantic feature representation to determine the hierarchical semantic embedding representation of the existing target category.
[0085] Specifically, the hierarchical semantic embedding representation characterizes the hierarchical semantic information of the target category.
[0086] Specifically, mean pooling can be used for the fusion processing to obtain the hierarchical semantic embedding representation of the existing target category, that is:
$h_c = \frac{1}{3}\left(t_f + t_m + t_s\right)$
[0087] where $h_c$ represents the hierarchical semantic embedding representation, $t_f$ is the hierarchical semantic feature representation of the parent node category, $t_m$ is the hierarchical semantic feature representation of the intermediate node, and $t_s$ represents the attribute semantic feature representation of the existing target category.
[0088] Specifically, the fusion processing can also employ methods including, but not limited to, concatenation operations, weighted fusion, and mapping transformations.
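The fusion options named in S123 can be compared with a small sketch. The three toy feature vectors and the function names are assumptions chosen for readability; the weights used in `weighted_fuse` are arbitrary and are not specified by this application.

```python
def mean_pool(features):
    """Mean pooling: element-wise average of the feature vectors."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def weighted_fuse(features, weights):
    """Weighted fusion: normalized weighted average of the feature vectors."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*features)
    ]


def concatenate(features):
    """Concatenation: join the feature vectors along the feature dimension."""
    return [x for feature in features for x in feature]


parent_feature = [3.0, 0.0, 6.0]     # toy parent node feature
mid_feature = [0.0, 3.0, 6.0]        # toy intermediate node feature
attribute_feature = [6.0, 3.0, 0.0]  # toy attribute (species-level) feature
features = [parent_feature, mid_feature, attribute_feature]

embedding = mean_pool(features)
print(embedding)
print(weighted_fuse(features, [1.0, 1.0, 2.0]))
print(len(concatenate(features)))
```

Mean pooling and weighted fusion keep the feature dimension unchanged, whereas concatenation triples it, so a mapping transformation would be needed afterwards if the result must stay in the feature space shared with the visual encoder.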
[0089] In this embodiment, the output of the pre-trained text encoder and the output of the pre-trained visual encoder are both constrained to an alignable feature space. This enables the semantic feature representation generated by the pre-trained text encoder to be aligned with the visual feature representation generated by the pre-trained visual encoder in the feature space, thereby supporting effective alignment and reasoning of visual features under the guidance of attribute and hierarchical semantics.
[0090] The embodiments described above in this application construct a system that simultaneously includes common semantic information of existing target categories at different semantic levels as well as fine-grained attribute information, which is used for subsequent attribute-guided visual feature learning and semantic-visual fusion prototype generation processes.
[0091] In some specific embodiments of this application, the method for constructing and training a preset hierarchical alignment unit may employ steps S101 to S104.
[0092] S101, based on the existing target category's attribute set, the existing target category's parent node category, and the existing target category's intermediate level node category, determine multiple learnable hierarchical alignment units.
[0093] Specifically, learnable hierarchical alignment units are added to the visual encoder of the pre-trained visual-language model in a randomly initialized manner. The learnable hierarchical alignment units take the form of learnable semantic representations and are used to learn the visual feature representations of existing target categories at different semantic levels. They participate in the alignment process of visual features of target categories in both the training and inference phases.
[0094] The multiple learnable hierarchical alignment units are:
$U = \{u_f, u_m, u_1, \dots, u_J\}$
[0095] where $U$ represents the set of learnable hierarchical alignment units, $u_f$ represents the hierarchical alignment unit of the parent node category corresponding to the existing target category, $u_m$ represents the hierarchical alignment unit of the intermediate node corresponding to the existing target category, and $\{u_1, \dots, u_J\}$ represents the hierarchical alignment units corresponding to the attribute set of the existing target category.
[0096] Each attribute in the attribute set of the existing target category corresponds to one hierarchical alignment unit.
[0097] The number of learnable hierarchical alignment units is determined by the number of attributes in the existing target category's attribute set, the number of parent node categories of the existing target category, and the number of intermediate node categories of the existing target category. For each attribute, each parent node category, and each intermediate node category, a corresponding hierarchical alignment unit is set. The number of hierarchical alignment units is equal to the sum of the number of attributes, the number of parent nodes, and the number of intermediate nodes, thereby guiding the visual feature learning process at different attribute dimensions and different semantic levels.
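The counting rule above can be checked with a short sketch that creates one randomly initialized unit per parent node, intermediate node, and attribute. The function name, the naming scheme for the units, the 4-dimensional size, and the Gaussian initialization scale are all illustrative assumptions.

```python
import random


def init_alignment_units(attributes, parent_nodes, mid_nodes, dim=4, seed=0):
    """Create one randomly initialized unit per parent, intermediate node, and attribute."""
    rng = random.Random(seed)
    names = (
        [f"parent:{p}" for p in parent_nodes]
        + [f"mid:{m}" for m in mid_nodes]
        + [f"attr:{a}" for a in attributes]
    )
    # Small Gaussian noise is an assumed initialization scheme.
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in names}


attributes = ["shape", "color", "texture", "wings", "legs", "pattern", "antennae"]
units = init_alignment_units(attributes, ["Seed beetles"], ["Acanthoscelides"])

# 7 attributes + 1 parent node + 1 intermediate node = 9 units.
print(len(units))
```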
[0098] The hierarchical alignment units of the parent node category and the intermediate node corresponding to the existing target category are used to characterize the shared visual features of the existing target category in high-level semantics, while the hierarchical alignment units corresponding to the attribute set of the existing target category are used to characterize the fine-grained visual features related to the attributes.
[0099] S102, In the process of visual feature extraction, the input sequence composed of multiple learnable hierarchical alignment units and visual features corresponding to candidate boxes is input into the pre-trained visual encoder. Multiple learnable hierarchical alignment units interact and learn with the visual features corresponding to candidate boxes through the self-attention mechanism in the pre-trained visual encoder to determine the visual features corresponding to each hierarchical alignment unit.
[0100] S103, a local attention module is employed, using the visual features corresponding to each hierarchical alignment unit as the query vectors and the attribute semantic feature representations as the keys and values. Through a multi-head cross-attention mechanism, the visual features corresponding to each hierarchical alignment unit interact with the attribute semantic feature representations to determine the attribute semantic feature representation aligned with the visual features of each hierarchical alignment unit.
[0101] Specifically, introducing the local attention module further enhances the ability of the visual features to perceive attribute information. The local attention module explicitly aligns the hierarchical alignment units with their corresponding attribute semantic feature representations, so that the pre-trained visual-language model can associate each hierarchical alignment unit with its most relevant attribute semantic feature representations.
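The query/key/value roles in S103 can be illustrated with a simplified sketch. It is a single-head scaled dot-product cross-attention without learned projection matrices, which is an assumed simplification of the multi-head mechanism described above; the toy vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]
        )
    return outputs, weights


# Queries: visual features of two hierarchical alignment units (toy values).
unit_visual_features = [[4.0, 0.0], [0.0, 4.0]]
# Keys and values: attribute semantic feature representations (toy values).
attribute_semantic_features = [[1.0, 0.0], [0.0, 1.0]]

aligned, attention = cross_attention(
    unit_visual_features, attribute_semantic_features, attribute_semantic_features
)
print(attention)
```

Each row of `attention` shows how strongly one alignment unit attends to each attribute; in this toy case every unit concentrates on the attribute whose semantic feature points in the same direction.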
[0102] S104, based on the contrastive loss between the visual features of each existing target category at the same semantic level and the attribute semantic feature representation aligned with the visual features, optimize the multiple learnable hierarchical alignment units to determine the preset hierarchical alignment units.
[0103] Specifically, during training, to enable the hierarchical alignment unit to learn stable and discriminative visual feature representations, a contrastive learning objective is introduced to optimize the hierarchical alignment unit. Specifically, contrastive loss is used to constrain the similarity relationship between visual features at different semantic levels and their corresponding attribute semantic feature representations, thereby improving the hierarchical alignment unit's ability to model fine-grained attribute differences.
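[0104] After the pre-trained visual encoder is applied, a visual representation of the image region of the existing target category is obtained, denoted as $V = \{v_f, v_m, v_1, \dots, v_J\}$, where $V$ is the set of visual representations, $v_f$ represents the visual feature representation corresponding to the semantics of the parent node, $v_m$ represents the visual feature representation corresponding to the semantics of the intermediate node, $v_1$ represents the species-level visual feature representation guided by the first attribute, $v_J$ indicates the species-level visual feature representation guided by the $J$-th attribute, and $J$ is the number of attributes. Taking the parent node as an example, its contrastive loss function is:
$\mathcal{L}_f = -\log \frac{\exp\left(\mathrm{sim}(v_f, t_f)/\tau\right)}{\sum_{k=1}^{N_f} \exp\left(\mathrm{sim}(v_f, t_f^{k})/\tau\right)}$
[0105] where $\mathcal{L}_f$ represents the contrastive loss of the parent node, $t_f$ is the semantic feature representation of the parent node category to which the image region belongs, $t_f^{k}$ is the semantic feature representation of the $k$-th parent node category, $N_f$ indicates the total number of parent node categories, $\tau$ indicates the temperature parameter, and $\mathrm{sim}(\cdot,\cdot)$ represents the cosine similarity.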
[0106] The contrastive losses of the intermediate node and of the existing target category have the same form as the contrastive loss of the parent node and are not elaborated here.
[0107] The overall loss is the sum of the contrastive loss of the existing target categories, the contrastive loss of the intermediate nodes, and the contrastive loss of the parent node.
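A contrastive objective of this kind can be sketched as follows. The exact loss form is an assumption: the snippet uses a standard softmax-based contrastive loss over cosine similarities scaled by a temperature, which matches the quantities named above (number of categories, temperature parameter, cosine similarity) but is not confirmed as the formula of this application. All category names other than Seed beetles, Acanthoscelides, and Bean Weevil, as well as the feature values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(visual, semantic_by_class, positive, temperature=0.1):
    """Negative log-softmax of the positive class over temperature-scaled similarities."""
    logits = {
        name: cosine(visual, feature) / temperature
        for name, feature in semantic_by_class.items()
    }
    peak = max(logits.values())
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits.values()))
    return log_denominator - logits[positive]


# Toy semantic features per level; the "other ..." entries are hypothetical negatives.
parent_semantics = {"Seed beetles": [1.0, 0.0], "other parent (hypothetical)": [0.0, 1.0]}
mid_semantics = {"Acanthoscelides": [0.8, 0.6], "other mid (hypothetical)": [-0.6, 0.8]}
species_semantics = {"Bean Weevil": [0.6, 0.8], "other species (hypothetical)": [0.8, -0.6]}

visual_parent = [0.9, 0.1]
visual_mid = [0.7, 0.7]
visual_species = [0.5, 0.9]

parent_loss = contrastive_loss(visual_parent, parent_semantics, "Seed beetles")
mid_loss = contrastive_loss(visual_mid, mid_semantics, "Acanthoscelides")
species_loss = contrastive_loss(visual_species, species_semantics, "Bean Weevil")

# The overall loss is the sum of the three level-wise losses.
total_loss = parent_loss + mid_loss + species_loss
print(total_loss)
```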
[0108] Through the above process, training can guide the alignment of visual features with corresponding attribute semantic feature representations at different semantic levels. Then, the preset hierarchical alignment units can be used to obtain attribute-guided visual feature representations that fuse hierarchical semantic and attribute information, which can be used in subsequent semantic-visual fusion category prototype generation and reasoning recognition processes.
[0109] To achieve visual feature alignment of existing target categories at different semantic levels and construct attribute-guided visual feature representations, in some specific embodiments of this application, for S13, under the guidance of a preset hierarchical alignment unit, a pre-trained visual encoder is used to perform feature extraction processing on target sample images of existing target categories in the support set to determine the attribute-guided visual feature representation of the existing target categories. This can be achieved using S131 to S133.
[0110] S131, a pre-trained visual encoder is used to perform feature extraction processing on target sample images of existing target categories in the support set, and the visual features of target sample images of existing target categories in the support set are determined.
[0111] Specifically, the visual encoder $E_V$ of the pre-trained vision-language model CLIP performs feature extraction on the target image region to obtain a visual feature representation that characterizes the target category.
[0112] S132, using a preset hierarchical alignment unit to guide the visual feature alignment of target sample images of existing target categories in the support set, and determining the attribute semantic feature representation under each semantic level of visual feature alignment.
[0113] Specifically, the preset hierarchical alignment unit is the preset hierarchical alignment unit constructed and trained using the above steps S101 to S104.
[0114] S133, fuse the attribute semantic feature representations of each semantic level aligned with the visual features to generate attribute-guided visual feature representations of the existing target category.
[0115] The embodiments described above in this application introduce fine-grained attribute description information and use preset hierarchical alignment units to guide the visual feature learning process, enabling the pre-trained visual-language model to focus on local attribute features that play a key role in distinguishing highly similar targets, thereby improving the accuracy of fine-grained recognition.
[0116] In some specific embodiments of this application, visual modal information and textual modal semantic information are complementary in target recognition tasks. To construct a semantic-visual fusion category prototype, an alignment network is introduced to further fuse and map the features of the two modalities.
[0117] Methods for determining the pre-trained alignment network can be found in S201 to S205.
[0118] S201, Obtain the training set.
[0119] Specifically, the training set contains image data for existing target categories.
[0120] S202, Based on the training set, obtain the hierarchical semantic embedding representation $h_c$ of the existing target categories in the training set and the attribute-guided visual features $v_c(x)$ of the existing target categories in the training set.
[0121] where $x$ indicates the visual region of the existing target category.
[0122] Specifically, step S202 can refer to the methods of steps S11 and S12 above to obtain the hierarchical semantic embedding representation of the existing target categories in the training set, and refer to the method of step S13 above to obtain the attribute-guided visual features of the existing target categories in the training set, which will not be elaborated here.
[0123] S203, the hierarchical semantic embedding representation of the existing target categories in the training set and the attribute-guided visual feature representation of the existing target categories in the training set are concatenated to determine the joint feature representation.
[0124] Specifically, hierarchical semantic embedding representation and attribute-guided visual feature representation are connected along the feature dimension to form a joint feature representation, which is used to simultaneously preserve visual modal information and text modal semantic information.
[0125] S204, input the joint feature representation into the preset alignment network, determine the predicted semantic-visual fusion category prototype, train the preset alignment network G, and determine the trained alignment network.
[0126] S205, based on the visual cluster centers of the existing target categories in the training set and the predicted semantic-visual fusion category prototypes, a predefined $L_1$ loss function is used to optimize the trained alignment network and determine the pre-trained alignment network.
[0127] Specifically, the visual cluster centers of the existing target categories in the training set are obtained as follows: the visual features of each existing target category $c$ in the training set are averaged to obtain its visual cluster center $\mu_c$.
[0128] The visual clustering center is used to represent the central distribution position of existing target categories in the visual feature space in the training set, providing stable supervision constraints for the pre-defined alignment network.
[0129] Visual cluster centers can also be obtained using weighted aggregation or other statistical methods.
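Both ways of obtaining a visual cluster center mentioned above, plain averaging and weighted aggregation, fit in one small helper. The function name, the toy features, and the example weights are assumptions for illustration only.

```python
def cluster_center(features, weights=None):
    """Average the visual features of one category; optional weights give weighted aggregation."""
    if weights is None:
        weights = [1.0] * len(features)
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*features)
    ]


# Toy visual features of two training images of the same category.
category_features = [[0.0, 2.0], [2.0, 4.0]]

print(cluster_center(category_features))              # plain mean
print(cluster_center(category_features, [1.0, 3.0]))  # weighted aggregation
```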
[0130] In this embodiment, the preset alignment network $G$ consists of two multilayer perceptrons and a nonlinear activation function, connected in series in the order of first multilayer perceptron, nonlinear function, and second multilayer perceptron. Through the preset alignment network, the joint features are progressively aligned to the visual cluster centers of the corresponding existing target categories during training. The joint features are input into the first multilayer perceptron for linear transformation to obtain intermediate hidden features; these intermediate hidden features undergo nonlinear transformation via the nonlinear function to obtain nonlinearly mapped hidden features; these nonlinearly mapped hidden features are then input into the second multilayer perceptron for linear transformation to obtain the semantic-visual fusion category prototype. Here, the intermediate hidden features represent the linear features of the joint features after passing through the first multilayer perceptron.
[0131] During training, an $L_1$ loss function is adopted to optimize the alignment network by minimizing the absolute error between the output features of the preset alignment network and the corresponding visual cluster centers, thereby guiding the preset alignment network to learn to map joint features to reasonable locations in the visual feature space under semantic information constraints. The loss function is as follows:
$\mathcal{L}_{align} = \left\| G\left([h_c ; v_c(x)]\right) - \mu_c \right\|_1$
[0132] where $\mathcal{L}_{align}$ represents the alignment loss function, $G$ is the preset alignment network, $h_c$ indicates the hierarchical semantic embedding representation of the existing target category $c$, $v_c(x)$ indicates its attribute-guided visual feature representation, $\mu_c$ indicates the visual cluster center of the target category $c$, and $[\cdot\,;\cdot]$ indicates the feature concatenation operation.
[0133] The alignment loss function $\mathcal{L}_{align}$ measures the difference between the output features of the alignment network and the target visual cluster centers.
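The forward pass of the alignment network and the absolute-error objective can be sketched in plain Python. Several details are assumptions not fixed by this description: ReLU as the nonlinear activation function, a hidden size of 8, random untrained weights, and summing (rather than averaging) the absolute errors. Only the structure (linear layer, nonlinearity, linear layer, applied to the concatenated features) follows the text above.

```python
import random


def linear(x, weight, bias):
    """Linear transformation: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Assumed nonlinear activation function."""
    return [max(0.0, v) for v in x]


def init_linear(in_dim, out_dim, rng):
    """Random weights and zero biases for an untrained layer."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def alignment_network(joint, params):
    """First linear layer -> nonlinearity -> second linear layer."""
    (w1, b1), (w2, b2) = params
    hidden = linear(joint, w1, b1)       # intermediate hidden features
    activated = relu(hidden)             # nonlinearly mapped hidden features
    return linear(activated, w2, b2)     # predicted category prototype


def l1_loss(prediction, target):
    """Sum of absolute errors between prediction and target."""
    return sum(abs(p - t) for p, t in zip(prediction, target))


rng = random.Random(0)
semantic_embedding = [0.2, -0.1, 0.4]    # toy hierarchical semantic embedding
visual_feature = [0.5, 0.3, -0.2]        # toy attribute-guided visual feature
joint = semantic_embedding + visual_feature  # concatenation along the feature dimension

params = (init_linear(6, 8, rng), init_linear(8, 3, rng))
prototype = alignment_network(joint, params)

target_center = [0.4, 0.2, -0.1]         # toy visual cluster center
loss = l1_loss(prototype, target_center)
print(loss)
```

During training, the parameters would be updated by gradient descent to reduce `loss`; an automatic differentiation framework would normally handle that step, which is omitted here.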
[0134] Through the above training process, a pre-trained alignment network is obtained, and the pre-trained alignment network is made capable of effectively fusing semantic and visual information and generating category discrimination representations, which provides a foundation for generating semantic-visual fusion category prototypes in the subsequent reasoning stage.
[0135] To achieve category-prototype-based reasoning for closely related target recognition, during the inference phase, the support set and the query set are used as inputs to the pre-trained visual-language model to predict the category of the closely related targets to be identified.
[0136] In some specific embodiments of this application, for S14, the hierarchical semantic embedding representation of the existing target category and the attribute-guided visual feature representation of the existing target category are input into the pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category. This can be achieved by using S141 to S142.
[0137] S141, input the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target category into the pre-trained alignment network to determine the initial semantic-visual fusion category prototype of the existing target category.
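[0138] Specifically:
$\hat{p}_c = \frac{1}{K} \sum_{k=1}^{K} G\left([h_c ; v_c(x_k)]\right)$
[0139] where $\hat{p}_c$ indicates the initial semantic-visual fusion category prototype corresponding to the existing target category $c$, $G$ is the pre-trained alignment network, $h_c$ is the hierarchical semantic embedding representation, $v_c(x_k)$ is the attribute-guided visual feature representation of the $k$-th support sample, $[\cdot\,;\cdot]$ denotes feature concatenation, and $K$ indicates the number of support samples in the support set.
[0140] The number of support samples in the support set can be one or more.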
[0141] S142, the initial semantic visual fusion category prototype of the existing target category is weighted and fused with the mean of the attribute-guided visual feature representation of the existing target category in the support set to obtain the semantic visual fusion category prototype of the existing target category.
[0142] Specifically:
[0143] $p_c = \alpha\, \hat{p}_c + (1 - \alpha)\, \bar{v}_c$
[0144] where $p_c$ represents the semantic-visual fusion category prototype of the existing target category, $\hat{p}_c$ is the initial semantic-visual fusion category prototype, $\alpha$ represents the empirical coefficient, $\bar{v}_c$ represents the mean of the attribute-guided visual feature representations of the existing target category in the support set, $v_c(x_i)$ indicates the attribute-guided visual feature of the $i$-th image region $x_i$ of the existing target category in the support set, and $N$ indicates the total number of image regions.
[0145] Specifically, $\bar{v}_c = \frac{1}{N} \sum_{i=1}^{N} v_c(x_i)$.
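The weighted fusion of S142 can be illustrated with a short sketch. The convex-combination form with a single coefficient `alpha` is an assumption about how the empirical coefficient enters the fusion; the function names and toy values are likewise hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def fuse_prototype(initial_prototype, support_features, alpha=0.5):
    """Blend the initial prototype with the mean support feature (assumed convex form)."""
    visual_mean = mean_vector(support_features)
    return [
        alpha * p + (1.0 - alpha) * v
        for p, v in zip(initial_prototype, visual_mean)
    ]


initial_prototype = [1.0, 0.0]                # toy output of the alignment network
support_features = [[0.0, 2.0], [0.0, 4.0]]   # toy attribute-guided visual features

prototype = fuse_prototype(initial_prototype, support_features, alpha=0.5)
print(prototype)
```

Under this assumed form, `alpha = 1` keeps only the semantic-visual output of the alignment network, while `alpha = 0` falls back to the purely visual mean of the support samples.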
[0146] The semantic-visual fusion category prototype of the existing target category serves as the reference representation of that category. Its similarity to the attribute-guided visual features of the target to be identified is calculated to complete the identification and reasoning of closely related targets.
[0147] The embodiments described above in this application, through a semantic-visual fusion-based category prototype modeling approach, achieve stable inference and recognition of unknown targets even when the number of samples is limited or new categories are not involved in training, thereby enhancing the model's generalization ability in open-world scenarios.
[0148] In some specific embodiments of this application, for S15, the target region to be identified in the query set is identified based on the existing target category semantic-visual fusion category prototype, and the predicted category of the region to be identified is determined. This can be done using S151 to S153.
[0149] S151, under the guidance of the preset hierarchical alignment unit, a pre-trained visual encoder is used to perform feature extraction processing on the target region to be identified in the query set, and the attribute-guided visual feature representation of the target region to be identified is determined.
[0150] Specifically, the pre-trained visual encoder is the visual encoder $E_V$ of the pre-trained visual-language model CLIP.
[0151] S152, determine the cosine similarity between the existing target category semantic-visual fusion category prototype and the attribute-guided visual feature representation of the target region to be identified.
[0152] S153, the existing target category corresponding to the semantic-visual fusion category prototype of the existing target category with the maximum cosine similarity is used as the predicted category of the region to be identified.
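Steps S152 and S153 amount to a nearest-prototype decision under cosine similarity, which can be sketched as follows. The prototype and query vectors are toy values, and the function names are assumptions; the two category names come from the Bean Weevil example in this description.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def predict(query_feature, prototypes):
    """Return the category whose prototype is most similar to the query feature."""
    scores = {
        name: cosine_similarity(query_feature, prototype)
        for name, prototype in prototypes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy semantic-visual fusion category prototypes.
prototypes = {
    "Bean Weevil": [0.9, 0.1, 0.2],
    "Larger grain borer": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]  # toy attribute-guided visual feature of the query region

predicted, scores = predict(query, prototypes)
print(predicted)
```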
[0153] The embodiments described above in this application, based on the similarity reasoning process of semantic-visual fusion category prototypes, enable the pre-trained visual-language model to effectively distinguish between categories that are highly similar in appearance and closely related in semantics, thereby achieving accurate identification of closely related targets.
[0154] The preferred features in the above embodiments can be used individually in any embodiment, or in any combination thereof, provided they do not conflict with each other. Furthermore, parts not described in detail in the embodiments can be implemented using existing technologies.
[0155] The following examples and comparative examples will be used to further illustrate this application in order to better understand the above-mentioned technical solutions. It should be understood that the following are only some examples and are not intended to limit this application.
[0156] This application provides a method for identifying closely related targets based on hierarchical semantics and attribute-guided reasoning, given a support set and a query set.
[0157] Given a support set, for each existing target category $c$ in the support set, the corresponding text description information is generated based on the category identifier information of the existing target category. The text description information includes hierarchical semantic description information and attribute description information of the existing target category, which is used to characterize the features of the existing target category at different semantic levels.
[0158] Subsequently, guided by the preset hierarchical alignment units, the pre-trained visual encoder $E_V$ performs visual feature extraction on the image regions $x$ of the existing target category $c$ in the support set to obtain the corresponding attribute-guided visual feature representations $v_c(x)$. Simultaneously, the pre-trained text encoder $E_T$ encodes the text description information to obtain the hierarchical semantic embedding representation $h_c$ corresponding to the existing target category.
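[0159] Furthermore, feature fusion is performed on the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target categories. The fused features are then input into a pre-trained alignment network $G$, which generates the initial semantic-visual fusion category prototypes corresponding to the existing target categories:
$\hat{p}_c = \frac{1}{K} \sum_{k=1}^{K} G\left([h_c ; v_c(x_k)]\right)$
[0160] where $\hat{p}_c$ indicates the initial semantic-visual fusion category prototype corresponding to the existing target category $c$, $h_c$ is the hierarchical semantic embedding representation, $v_c(x_k)$ is the attribute-guided visual feature representation of the $k$-th support sample, $[\cdot\,;\cdot]$ denotes feature concatenation, and $K$ indicates the number of support samples in the support set.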
[0161] To balance semantic priors and visual representations, the initial semantic-visual fusion category prototype $\hat{p}_c$ is weighted and fused with the mean $\bar{v}_c$ of the attribute-guided visual feature representations in the support set to obtain the semantic-visual fusion category prototype $p_c$ of the existing target category:
[0162] $p_c = \alpha\, \hat{p}_c + (1 - \alpha)\, \bar{v}_c$
[0163] where $\alpha$ represents the empirical coefficient.
[0164] Specifically, $\bar{v}_c$ is the mean of the attribute-guided visual feature representations in the support set, defined as:
$\bar{v}_c = \frac{1}{N} \sum_{i=1}^{N} v_c(x_i)$
[0165] where $v_c(x_i)$ indicates the attribute-guided visual feature of the $i$-th image region $x_i$ of the existing target category in the support set, and $N$ indicates the total number of image regions.
[0166] After completing the prototype construction of the semantic-visual fusion category, the target regions to be identified in the query set are processed.
[0167] Guided by the pre-defined hierarchical alignment units, a pre-trained visual encoder is used to extract features for each query, thereby obtaining the attribute-guided visual feature representation of the target region to be identified.
[0168] Subsequently, the attribute-guided visual feature representation of the query is compared, by similarity calculation, with the semantic-visual fusion category prototypes corresponding to each existing target category. In this embodiment, cosine similarity is used as the similarity measure. The degree of matching between the query and each existing target category is determined based on the cosine similarity result. Finally, the target category corresponding to the category prototype with the highest similarity to the query is taken as the predicted category of the query, thereby completing the classification and identification of the query target.
[0169] This application provides a closely related target image recognition method based on hierarchical semantics and attribute-guided reasoning. Through semantic-visual fusion category prototype modeling, it achieves stable reasoning and recognition of unknown targets even with a limited number of samples or when new categories are not included in the training. This enhances the model's generalization ability in open-world scenarios. The method does not depend on specific application scenarios or fixed category systems, and is applicable to various target recognition scenarios such as biometrics, industrial product recognition, vehicle recognition, and commodity recognition. It has good versatility and scalability.
[0170] Figure 3 is a schematic diagram illustrating the structure of a closely related target image recognition system based on hierarchical semantics and attribute-guided reasoning, according to an exemplary embodiment.
[0171] Referring to Figure 3, one embodiment of this application provides a closely related target image recognition system 100 based on hierarchical semantics and attribute-guided reasoning, including: a text description information acquisition module 110, a hierarchical semantic embedding representation acquisition module 120, an attribute-guided visual learning module 130, a semantic-visual fusion category prototype construction module 140, and a category recognition module 150.
[0172] The text description information acquisition module 110 is used to acquire text description information of existing target categories in the support set, where the text description information includes hierarchical semantic description information and attribute description information. The hierarchical semantic embedding representation acquisition module 120 is used to encode the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category. The attribute-guided visual learning module 130 is used to perform feature extraction processing on target sample images of existing target categories in the support set under the guidance of a preset hierarchical alignment unit, and to determine the attribute-guided visual feature representation of the existing target categories. The semantic-visual fusion category prototype construction module 140 is used to input the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target category into the pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category. The category recognition module 150 is used to identify the target region to be identified in the query set based on the semantic-visual fusion category prototype of the existing target category, and to determine the predicted category of the target region to be identified.
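The data flow between modules 110 to 150 can be sketched as follows. This is only an illustration under stated assumptions: each module is represented by a placeholder callable, images are stood in for by short lists of floats, each category has a single support sample, and the same visual extractor is reused for the query region. The class name `RecognitionSystem`, its field names, and the toy stand-in functions are hypothetical and do not represent the encoders or the alignment network of this application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class RecognitionSystem:
    """Wires five placeholder callables in the order of modules 110-150."""

    get_text_description: Callable[[str], str]             # module 110
    encode_text: Callable[[str], Vector]                   # module 120
    extract_visual: Callable[[Sequence[float]], Vector]    # module 130
    fuse: Callable[[Vector, Vector], Vector]               # module 140
    classify: Callable[[Vector, Dict[str, Vector]], str]   # module 150

    def build_prototypes(
        self, support_set: Dict[str, Sequence[float]]
    ) -> Dict[str, Vector]:
        """Build one fused prototype per existing target category."""
        prototypes: Dict[str, Vector] = {}
        for category, sample in support_set.items():
            text = self.get_text_description(category)
            semantic = self.encode_text(text)
            visual = self.extract_visual(sample)
            prototypes[category] = self.fuse(semantic, visual)
        return prototypes

    def recognize(
        self, query_region: Sequence[float], prototypes: Dict[str, Vector]
    ) -> str:
        """Extract the query feature and pick the best-matching category."""
        return self.classify(self.extract_visual(query_region), prototypes)


# Toy stand-ins (hypothetical) so the sketch runs end to end.
system = RecognitionSystem(
    get_text_description=lambda c: f"{c}: placeholder hierarchy and attributes",
    encode_text=lambda text: [0.0, 0.0],
    extract_visual=lambda image: [float(x) for x in image],
    fuse=lambda s, v: [(a + b) / 2 for a, b in zip(s, v)],
    classify=lambda q, protos: max(
        protos, key=lambda c: sum(x * y for x, y in zip(q, protos[c]))
    ),
)

support_set = {"category_a": [1.0, 0.0], "category_b": [0.0, 1.0]}
prototypes = system.build_prototypes(support_set)
predicted = system.recognize([0.2, 0.8], prototypes)
```

Keeping each module behind a callable mirrors the modular structure described above: any stand-in can be replaced by a real encoder or network without changing the surrounding flow.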
[0173] The embodiments described above in this application, by introducing hierarchical semantic description information and using hierarchical relationships between categories for reasoning, effectively improve the ability to identify closely related or derived targets. By introducing attribute description information to guide the visual feature learning process, a semantic-visual fusion category prototype of each existing target category is constructed, enabling knowledge of known target categories to be transferred to newly emerging closely related targets. This improves the ability to distinguish highly similar targets and the generalization performance on unknown categories. The embodiments can be adapted to various target recognition scenarios such as biometrics, industrial product recognition, vehicle recognition, and commodity recognition, and have good versatility and scalability.
[0174] Regarding the embodiments of the above system, the specific ways in which each module performs operations have been described in detail in the embodiments of the method, and will not be elaborated here.
[0175] Based on the same technical concept, in some specific embodiments of this application, a terminal is provided that includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor can perform the method described above.
[0176] Based on the same technical concept, in some specific embodiments of this application, a computer-readable storage medium is provided on which a computer program is stored; when executed by a processor, the program can perform the method described above.
[0177] Optionally, the memory is used to store programs. The memory may include volatile memory, such as random-access memory (RAM), for example static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (such as application programs and functional modules that implement the above methods), computer instructions, etc., and the aforementioned computer programs and computer instructions can be partitioned and stored in one or more memories. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by the processor.
[0178] The aforementioned computer programs, computer instructions, etc., can be stored in partitions within one or more memory locations. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by a processor.
[0179] A processor is used to execute a computer program stored in memory to implement the various steps of the methods involved in the above embodiments. For details, please refer to the relevant descriptions in the preceding method embodiments.
[0180] The processor and memory can be separate structures or integrated structures. When the processor and memory are separate structures, they can be coupled together via a bus.
[0181] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0182] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0183] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0184] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0185] The foregoing has described some specific embodiments of this application. It should be understood that this application is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the substantive content of this application. The above-described preferred features can be used in any combination without conflict.
Claims
1. A method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning, characterized in that it includes: obtaining text description information for existing target categories in the support set, wherein the text description information includes hierarchical semantic description information and attribute description information; encoding the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category; under the guidance of a preset hierarchical alignment unit, using a pre-trained visual encoder to perform feature extraction processing on target sample images of existing target categories in the support set, and determining the attribute-guided visual feature representation of the existing target category; inputting the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target category into a pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category; and, based on the semantic-visual fusion category prototype of the existing target category, identifying the target region to be identified in the query set and determining the predicted category of the target region to be identified.
2. The method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning according to claim 1, characterized in that obtaining the text description information for existing target categories in the support set includes: determining the attribute description information of the existing target categories based on the category identifier information of the existing target categories in the support set; extracting attributes from the attribute description information of the existing target category to determine the attribute set; determining the upper-level category information of the existing target categories based on the existing target categories in the support set; determining the parent node category and intermediate node category of the existing target category based on the upper-level category information of the existing target category; constructing the hierarchical semantic structure of the existing target category based on the existing target category, the parent node category of the existing target category, and the intermediate node category of the existing target category; and determining the hierarchical semantic description information of the existing target categories based on the hierarchical semantic structure of the existing target categories.
3. The method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning according to claim 2, characterized in that the step of encoding the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category includes: encoding the attribute description information using the pre-trained text encoder to determine the attribute semantic feature representation; encoding the hierarchical semantic description information using the pre-trained text encoder to determine the hierarchical semantic feature representation; and fusing the attribute semantic feature representation and the hierarchical semantic feature representation to determine the hierarchical semantic embedding representation of the existing target category.
4. The method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning according to claim 3, characterized in that the method for constructing and training the preset hierarchical alignment unit includes: determining multiple learnable hierarchical alignment units based on the attribute set of the existing target category, the parent node category of the existing target category, and the intermediate node category of the existing target category; during the visual feature extraction process, inputting the input sequence composed of the multiple learnable hierarchical alignment units and the visual features corresponding to the candidate boxes into the pre-trained visual encoder, wherein the multiple learnable hierarchical alignment units interact with the visual features corresponding to the candidate boxes through the self-attention mechanism in the pre-trained visual encoder to determine the visual features corresponding to each hierarchical alignment unit; using a local attention module, with the visual features corresponding to each hierarchical alignment unit as the query vector and the attribute semantic feature representation as the key and value, performing interaction through a multi-head cross-attention mechanism to determine the attribute semantic feature representation aligned with the visual features corresponding to each hierarchical alignment unit; and optimizing the learnable hierarchical alignment units, based on the contrastive loss between the visual features of each existing target category at the same semantic level and the attribute semantic feature representation aligned with those visual features, to determine the preset hierarchical alignment unit.
5. The method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning according to claim 1, characterized in that the step of using a pre-trained visual encoder, under the guidance of a preset hierarchical alignment unit, to perform feature extraction processing on target sample images of existing target categories in the support set and determining the attribute-guided visual feature representation of the existing target categories includes: using the pre-trained visual encoder to perform feature extraction processing on the target sample images of the existing target categories in the support set to determine the visual features of those target sample images; aligning, under the guidance of the preset hierarchical alignment unit, the visual features of the target sample images of the existing target categories in the support set to the attribute semantic feature representation at each semantic level, and determining the attribute semantic feature representation at each semantic level aligned with the visual features; and fusing the attribute semantic feature representations at each semantic level aligned with the visual features to generate the attribute-guided visual feature representation of the existing target category.
6. The method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning according to claim 1, characterized in that the method for determining the pre-trained alignment network includes: obtaining a training set, which contains existing target category data; based on the training set, obtaining the hierarchical semantic embedding representation of the existing target categories in the training set and the attribute-guided visual feature representation of the existing target categories in the training set; combining the hierarchical semantic embedding representation of the existing target categories in the training set with the attribute-guided visual feature representation of the existing target categories in the training set to determine the joint feature representation; inputting the joint feature representation into a preset alignment network to determine the predicted semantic-visual fusion category prototype, and training the preset alignment network to determine the trained alignment network; and, based on the visual cluster centers of the existing target categories in the training set and the predicted semantic-visual fusion category prototypes, using a preset loss function to optimize the trained alignment network and determine the pre-trained alignment network.
7. The method for near-object image recognition based on hierarchical semantics and attribute-guided reasoning according to claim 1, characterized in that the step of identifying the target region to be identified in the query set based on the semantic-visual fusion category prototype of the existing target category, and determining the predicted category of the target region to be identified, includes: under the guidance of the preset hierarchical alignment unit, using the pre-trained visual encoder to perform feature extraction processing on the target region to be identified in the query set, and determining the attribute-guided visual feature representation of the target region to be identified; determining the cosine similarity between the semantic-visual fusion category prototype of the existing target category and the attribute-guided visual feature representation of the target region to be identified; and taking the existing target category corresponding to the semantic-visual fusion category prototype with the maximum cosine similarity as the predicted category of the target region to be identified.
8. A near-object image recognition system based on hierarchical semantics and attribute-guided reasoning, characterized in that it includes: a text description information acquisition module, used to acquire text description information for existing target categories in the support set, wherein the text description information includes hierarchical semantic description information and attribute description information; a hierarchical semantic embedding representation acquisition module, used to encode the text description information using a pre-trained text encoder to determine the hierarchical semantic embedding representation of the existing target category; an attribute-guided visual learning module, used to perform feature extraction processing on target sample images of existing target categories in the support set under the guidance of a preset hierarchical alignment unit, and to determine the attribute-guided visual feature representation of the existing target categories; a semantic-visual fusion category prototype construction module, used to input the hierarchical semantic embedding representation and the attribute-guided visual feature representation of the existing target category into a pre-trained alignment network to determine the semantic-visual fusion category prototype of the existing target category; and a category recognition module, used to identify the target region to be identified in the query set based on the semantic-visual fusion category prototype of the existing target category, and to determine the predicted category of the target region to be identified.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the steps of the method described in any one of claims 1-7.
10. An electronic device, characterized in that it includes: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the steps of the method according to any one of claims 1-7.