Zero-shot multi-modal information retrieval method and system based on fine-grained text inversion
By extracting global and local features of images through a fine-grained text inversion network, generating pseudo-word tags, and utilizing title regularization constraints, the problem of insufficient generalization ability of image retrieval in zero-shot training is solved, and efficient and accurate multimodal combined image retrieval is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG UNIV
- Filing Date
- 2024-03-29
- Publication Date
- 2026-06-16
Figure CN118193768B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of image information retrieval technology, and specifically relates to a zero-shot multimodal information retrieval method and system based on fine-grained text inversion.
Background Technology
[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.
[0003] Compared to traditional image retrieval tasks based on unimodal queries, multimodal combined image retrieval offers a more flexible paradigm, allowing users to retrieve images using multimodal query conditions. In this paradigm, the reference image represents the user's overall retrieval needs, while the modified text reflects the user's desire to modify locally unsatisfactory attributes. However, due to the high cost of manually labeling training data, existing multimodal combined image retrieval datasets are limited in size, resulting in insufficient generalization ability of current supervised methods based on training triples (<reference image, modified text, target image>). To eliminate dependence on labeled datasets, recent research has introduced a challenging zero-shot-trained multimodal combined image retrieval task. This task aims to solve multimodal combined image retrieval tasks without relying on any training triples.
[0004] Existing research on zero-shot-trained multimodal combined image retrieval relies on pre-trained text inversion networks, which map each image to a single pseudo-word tag, thus transforming the combined image retrieval task into a standard text-based image retrieval task. While these studies have made significant progress, they overlook the fact that such coarse-grained text inversion may not accurately capture the full content of an image.
[0005] Some literature currently proposes using fine-grained text inversion networks to map image information. However, according to the inventors, this approach faces the following main challenges: 1) Images from different domains typically contain different local attributes. For example, images from the fashion domain often contain attributes such as sleeve length, waist design, and color, while images from the animal domain often contain attributes such as background, location, and fur. Therefore, effectively capturing the different local attributes of different images is the first challenge. 2) Previous zero-shot combined image retrieval research used image-related category text as real word tags, thus constraining the mapped pseudo-word tags to reside in the embedding space of real word tags. However, using image categories alone is insufficient to simultaneously normalize both subject-oriented and attribute-oriented pseudo-word tags. Therefore, normalizing both subject-oriented and attribute-oriented pseudo-word tags into the real word tag embedding space becomes another challenge.
Summary of the Invention
[0006] To address the aforementioned problems, this invention proposes a zero-shot multimodal information retrieval method and system based on fine-grained text inversion. This invention obtains subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags based on the global and local features of images, respectively, and uses different title-based semantic regularization constraints to standardize the pseudo-word tags, thereby achieving more efficient and accurate retrieval results.
[0007] According to some embodiments, the present invention adopts the following technical solution:
[0008] A zero-shot multimodal information retrieval method based on fine-grained text inversion includes the following steps:
[0009] Obtain a reference image and a modified text;
[0010] Using a pre-trained fine-grained text inversion network, the reference image is mapped to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, which are then concatenated with the modified text to obtain a combined query in text form. The cosine similarity between the representation vector of the combined query and the target image vector is used as the retrieval basis to retrieve the corresponding image.
[0011] The training process of the fine-grained text inversion network includes: training the fine-grained inversion network using training image samples, extracting image information, mapping it to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags to fully represent image information in text form, generating titles for corresponding images using an image title generation model, generating derived descriptions based on the titles, and applying semantic regularization constraints to different derived descriptions to promote the alignment of pseudo-word tags with the embedding space of real word tags.
[0012] As an alternative implementation method, the specific process of extracting image information and mapping it to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags includes:
[0013] Extract global and local block features from the image;
[0014] Map the global features of an image to subject-oriented pseudoword tags;
[0015] Local block features of an image are aggregated to obtain potential local attribute features. Irrelevant local attribute features are then filtered out through local-global correlation and mapped to attribute-oriented pseudo-word tags.
[0016] As a further implementation, the specific process of adaptively filtering irrelevant local attribute features through local-global relevance and mapping them to attribute-oriented pseudo-word tags includes:
[0017] Define a set of n learnable query embeddings, concatenate them with local image patch features, and feed them into a Transformer network for local attribute feature aggregation. Introduce a fully connected network to ensure that the dimensionality of the local image attribute features is the same as that of the global features.
[0018] Calculate the similarity between each local attribute feature and the global feature, and select the top k local attribute features with the highest similarity as valid local attribute features, where k is a positive integer;
[0019] An orthogonal constraint is introduced on the selected effective local attribute features to ensure that the extracted local attribute features are independent of each other;
[0020] The extracted local attribute features are mapped to attribute-oriented pseudo-word tags using a mapping network based on a multilayer perceptron.
[0021] As an alternative implementation, a symmetric contrastive loss is introduced during training to learn the pseudo-word tags; that is, for each unlabeled image, its pseudo-word-based text representation is expected to be more consistent with its own visual representation than with the visual representations of other images.
[0022] As an alternative implementation, the specific process of generating a title for a corresponding image using an image title generation model includes: using the image title generation model to generate a text description for the corresponding image, wherein the text description is divided into two parts: the main body and local attributes; and using a text template to generate a standard title, wherein the standard title is text that expresses the image using the main body and local attributes.
[0023] As a further implementation method, the specific process of generating derived descriptions based on the title and applying semantic regularization constraints to different derived descriptions includes:
[0024] Three types of derived descriptions are generated from the standard title by (i) replacing the subject with the subject-oriented pseudo-word tag, (ii) replacing the local attributes with the attribute-oriented pseudo-word tags, and (iii) replacing both the subject and the local attributes with the subject-oriented and attribute-oriented pseudo-word tags simultaneously;
[0025] Semantic regularization constraints are applied to the three types of derived descriptions respectively:
[0026]
[0027] Here, $t_B$, $t_S$, $t_A$, and $t_{SA}$ are the text feature representations obtained by passing the standard title and the three derived descriptions, respectively, through the text encoder; $L_{subj}$ and $L_{attr}$ guide the learning of the subject-oriented and attribute-oriented pseudo-word tags, respectively; and $L_{whole}$ regularizes the learning of all pseudo-word tags.
[0028] As an alternative implementation, the process of obtaining a combined query in text form includes: the reference image $I_r$ and the modified text $T_m$ together form the combined query used to retrieve the target image $I_t$;
[0029] Using the trained fine-grained text inversion network, the reference image $I_r$ is mapped to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, so that the image is represented in text form;
[0030] The combined query is obtained by concatenating the pseudo-word tags representing the image information with the modified text;
[0031] A text encoder is used to encode the combined query, and an image encoder is used to encode the candidate target images $I_t$; the candidate images are then ranked by the similarity between the representation vector of the combined query and each candidate image.
[0032] A zero-shot multimodal information retrieval system based on fine-grained text inversion includes:
[0033] The data acquisition module is configured to acquire a reference image and a modified text;
[0034] The mapping module is configured to use a pre-trained fine-grained text inversion network to map the reference image into subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, and concatenate them with the modified text to obtain a combined query in text form.
[0035] The retrieval module is configured to use the cosine similarity between the representation vector of the combined query and the target image vector as the retrieval basis to retrieve the corresponding image;
[0036] The fine-grained pseudo-word tagging and mapping module is configured to train the fine-grained inversion network using training image samples, extract image information, and map it into subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags to fully represent image information in text form.
[0037] The title-based ternary semantic regularization constraint module is configured to generate titles for corresponding images using an image title generation model, generate derived descriptions based on the titles, and apply semantic regularization constraints to different derived descriptions to promote the alignment of pseudo-word tags with real word tags in the embedding space.
[0038] A computer-readable storage medium for storing computer instructions, which, when executed by a processor, perform the steps in the above method.
[0039] An electronic device includes a memory and a processor, as well as computer instructions stored in the memory and running on the processor, wherein the computer instructions, when executed by the processor, perform the steps in the method described above.
[0040] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0041] This invention maps an image to one subject-oriented pseudo-word tag and several attribute-oriented pseudo-word tags, thereby realizing multimodal combined image retrieval with zero-shot training.
[0042] This invention innovatively proposes to use dynamic local attribute feature extraction to address the problem of different local attributes in images from different domains, thereby improving the accuracy and speed of local attribute extraction.
[0043] This invention utilizes title-based ternary semantic regularization constraints to facilitate the interaction between pseudo-words and real words, thereby standardizing the alignment of pseudo-word tags and real word tags in the embedding space.
[0044] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Description of the Drawings
[0045] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.
[0046] Figure 1 is a flowchart of a multimodal combined image retrieval method based on zero-shot training of a fine-grained text inversion network, provided by an embodiment of the present invention.
Detailed Implementation
[0047] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0048] It should be noted that the following detailed description is illustrative and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0049] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0050] Where there is no conflict, the embodiments and features described in this application may be combined with each other.
[0051] Embodiment 1
[0052] As shown in Figure 1, this embodiment provides a multimodal combined image retrieval method based on zero-shot training of a fine-grained text inversion network, including the following steps:
[0053] Pre-training phase:
[0054] Step 1: Extract global and local attribute features of the image, and map them into subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, respectively;
[0055] Step 2: Use the image title generation model to obtain the title of the corresponding image, and design three semantic regularization constraints based on the title;
[0056] Testing phase:
[0057] Step 3: Utilize the trained fine-grained text inversion network to achieve multimodal combined image retrieval.
[0058] Step 1 specifically includes the following steps:
[0059] Step 101: Use the multimodal pre-trained model CLIP as a feature extractor and take the output of the last layer of its visual encoder as the global feature of the image, $v_g \in \mathbb{R}^{d_1}$, where $d_1$ is the dimension of the global feature.
[0060] Step 102: Use a mapping network $\phi_s$ based on a multilayer perceptron to map the obtained global feature $v_g$ to a subject-oriented pseudo-word tag. The formula is:
[0061] $s = \phi_s(v_g)$
[0062] Here, $s$ denotes the subject-oriented pseudo-word tag.
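The mapping in Step 102 can be illustrated with a minimal NumPy sketch. The source only states that $\phi_s$ is based on a multilayer perceptron; the two-layer shape, the ReLU activation, the layer sizes, and the helper names `init_mlp` and `phi` are assumptions made for illustration, and a random vector stands in for the CLIP global feature.

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Randomly initialise a two-layer perceptron (sizes are assumptions)."""
    return {
        "W1": rng.normal(0.0, 0.02, (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "W2": rng.normal(0.0, 0.02, (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }

def phi(params, x):
    """Map feature(s) of shape (..., d_in) to pseudo-word tag(s) of shape (..., d_out)."""
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU hidden layer
    return h @ params["W2"] + params["b2"]

rng = np.random.default_rng(0)
d1, d_token = 768, 768             # assumed feature and token-embedding sizes
phi_s = init_mlp(rng, d1, 1024, d_token)
v_g = rng.normal(size=d1)          # stand-in for the CLIP global feature
s = phi(phi_s, v_g)                # subject-oriented pseudo-word tag
print(s.shape)
```

In a real system the parameters would be trained with the losses described in the later steps rather than left at their random initial values.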
[0063] Step 103: Using the multimodal pre-trained model CLIP as a feature extractor, take the penultimate-layer output of its visual encoder as the local block features of the image, $V \in \mathbb{R}^{m \times d_2}$, where $d_2$ is the dimension of the local block features and $m$ is the number of local blocks.
[0064] Step 104: Considering that images from different domains have different local attributes, assume that all open-domain real-world images have n potential local attributes, and aggregate the obtained local block features V to obtain these n potential local attributes.
[0065] First, define a set of $n$ learnable query embeddings. Next, concatenate them with the local block features $V$ and feed the result into a Transformer network for local attribute feature aggregation. In addition, to enable the subsequent filtering based on local-global correlation, a fully connected network is introduced so that the dimensionality of the local attribute features matches that of the global feature. The formula is as follows:
[0066]
[0067] Here, $[\cdot\,|\,\cdot]$ denotes the concatenation operation, FC denotes the fully connected network, and the output is the set of $n$ potential local attribute features of the image.
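The aggregation described in Step 104 might be sketched as follows. This is a rough illustration rather than the patented network: a single self-attention layer with a residual connection stands in for the Transformer, all weights are random, reading the outputs back from the query positions is an assumption, and the function name `aggregate_local_attributes` and the variable `X` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_local_attributes(queries, V, Wq, Wk, Wv, W_fc, b_fc):
    """queries: (n, d2) learnable embeddings, V: (m, d2) local block features.

    Returns (n, d1) potential local attribute features.
    """
    tokens = np.concatenate([queries, V], axis=0)    # [queries | V], shape (n+m, d2)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n+m, n+m) attention weights
    hidden = tokens + attn @ v                       # one attention layer with residual
    n = queries.shape[0]
    return hidden[:n] @ W_fc + b_fc                  # FC projects query outputs to d1

rng = np.random.default_rng(0)
n, m, d2, d1 = 4, 9, 16, 8                           # toy sizes
queries = rng.normal(size=(n, d2))
V = rng.normal(size=(m, d2))
Wq, Wk, Wv = (rng.normal(0, 0.1, (d2, d2)) for _ in range(3))
W_fc, b_fc = rng.normal(0, 0.1, (d2, d1)), np.zeros(d1)
X = aggregate_local_attributes(queries, V, Wq, Wk, Wv, W_fc, b_fc)
print(X.shape)  # (4, 8)
```

The final projection is what lets each aggregated feature be compared with the global feature by cosine similarity in the next step.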
[0068] Step 105: To ensure that local attribute features unrelated to the given image do not affect the reasoning of subsequent combined image retrieval tasks, relevant local attribute features are selected for each image, using reliable global image features as a reference. First, the similarity between each local attribute feature and the global features is calculated, and the top k local attribute features with the highest similarity are selected as valid local attribute features. The formula is as follows:
[0069]
[0070] Here, $\cos(\cdot,\cdot)$ denotes cosine similarity, $c_i$ denotes the similarity between the global feature of the image and the $i$-th local attribute feature, the top-$k$ operation returns the $k$ largest values among these similarities, and $W$ denotes the selected valid local attribute features.
[0071] Step 106: Set a similarity threshold to further ensure the quality of the retained local attribute features. The formula is as follows:
[0072] $W' = [x'_j], \quad c_j > \varepsilon \ \text{and} \ x'_j \in W$
[0073] Here, $\varepsilon$ is the global-local similarity threshold, $W'$ denotes the finally selected local attribute features, and $r \in [1, k]$ is the number of selected local attribute features.
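Steps 105 and 106 amount to a top-$k$ selection followed by a threshold filter, which can be sketched in NumPy as below. The function name `select_local_attributes` and the input name `X` for the $n$ candidate features are hypothetical, and keeping the single best feature when nothing exceeds $\varepsilon$ is an assumption made only so that $r \ge 1$, as the stated range $r \in [1, k]$ suggests.

```python
import numpy as np

def select_local_attributes(v_g, X, k, eps):
    """v_g: (d,) global feature, X: (n, d) candidate local attribute features.

    Returns W (top-k by similarity), W_prime (those above eps), and all scores c.
    """
    c = (X @ v_g) / (np.linalg.norm(X, axis=1) * np.linalg.norm(v_g) + 1e-12)
    topk = np.argsort(-c)[:k]          # indices of the k most similar features
    W = X[topk]
    keep = topk[c[topk] > eps]         # threshold filter inside the top-k set
    if keep.size == 0:                 # assumption: always retain at least one
        keep = topk[:1]
    return W, X[keep], c

v_g = np.array([1.0, 0.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
W, W_prime, c = select_local_attributes(v_g, X, k=3, eps=0.5)
print(W_prime)
```

In this toy input the first and third candidates have similarities of about 1.0 and 0.71, so they survive the threshold of 0.5, while the second candidate (similarity 0) is in the top-$k$ set but is filtered out.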
[0074] Step 107: To ensure that different local attribute features can represent different visual features, this invention introduces an orthogonal constraint to ensure that the extracted attribute features are independent of each other. Specifically, to ensure the distinguishability between local image attribute features as much as possible while avoiding interference from irrelevant local attribute features (low similarity features), thus facilitating the learning of local attribute features, this invention deploys the orthogonal loss on W, rather than on W′. The orthogonal constraint formula is as follows:
[0075]
[0076] Here, the constant matrix in the formula is the identity matrix, and $\lVert\cdot\rVert_F$ denotes the Frobenius norm.
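The formula for the orthogonal constraint is not reproduced in this text. A common form that is consistent with the description (an identity matrix, a Frobenius norm, and features in $W$ that should be mutually independent) is $\lVert W W^{\top} - I \rVert_F^2$. The sketch below assumes that form; the squared norm, the row normalization, and the function name `orthogonal_loss` are assumptions.

```python
import numpy as np

def orthogonal_loss(W):
    """Penalise overlap between the rows of W, shape (k, d).

    Rows are L2-normalised first (an assumption), so the diagonal of the
    Gram matrix is 1 and only the off-diagonal similarities are penalised.
    """
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    gram = Wn @ Wn.T
    return float(np.sum((gram - np.eye(W.shape[0])) ** 2))

print(orthogonal_loss(np.eye(3)))                            # orthogonal rows
print(orthogonal_loss(np.array([[1.0, 0.0], [1.0, 0.0]])))   # identical rows
```

Mutually orthogonal rows give a loss near zero, while duplicated rows are penalised, which is the behaviour the step asks for.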
[0077] Step 108: Use another mapping network $\phi_a$ based on a multilayer perceptron to map the obtained local attribute features $W'$ to attribute-oriented pseudo-word tags, as shown in the formula:
[0078] $[a_1, \dots, a_r] = \phi_a(W')$
[0079] Here, $a_i$ denotes the $i$-th attribute-oriented pseudo-word tag obtained from the mapping.
[0080] Step 109: After obtaining the subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, each image can be represented in text form. Specifically, this invention designs a pseudo-word-based text template in which $S^*$ is the pseudo-word corresponding to the subject-oriented pseudo-word tag $s$, and each attribute-oriented pseudo-word tag $a_i$ likewise has a corresponding pseudo-word. By inputting this template into the frozen CLIP text encoder, a pseudo-word-based text representation of each image can be obtained, denoted as $t_q$.
[0081] Step 110: To supervise the learning of the pseudo-word tags, this invention introduces a symmetric contrastive loss. Intuitively, for each unlabeled image, its pseudo-word-based text representation is expected to be more consistent with its own visual representation than with the visual representations of other images, and vice versa. This symmetric contrastive loss is as follows:
[0082]
[0083] Here, $B$ is the batch size and $\tau$ is the temperature coefficient; the two per-sample terms in the formula are the global feature $v_g$ of the $i$-th image and its pseudo-word-based text representation $t_q$, respectively.
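The loss formula itself is not reproduced in this text. A widely used symmetric contrastive objective that matches the description (batch size $B$, temperature $\tau$, image-to-text and text-to-image directions) is the two-way InfoNCE loss sketched below. Treat it as one plausible instantiation: the cosine-similarity logits, the averaging over the batch, the default `tau=0.07`, and the function names are assumptions.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(v, t, tau=0.07):
    """v: (B, d) global image features, t: (B, d) pseudo-word text features.

    Row i of v and row i of t belong to the same image (the positive pair).
    """
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / tau                                   # (B, B) scaled cosine similarities
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))     # image -> text direction
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))     # text -> image direction
    return float(i2t + t2i)

feats = np.eye(4)
aligned = symmetric_contrastive_loss(feats, feats)
shuffled = symmetric_contrastive_loss(feats, feats[::-1])
print(aligned < shuffled)
```

When every text representation matches its own image, the loss is close to zero; pairing each image with another image's text representation raises it sharply.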
[0084] Step 2 specifically includes:
[0085] Step 201: In order to align the embedding space of pseudo-word tags with that of real word tags so that the reference image and modified text embedding can be combined in the subsequent stage of inference, this invention uses image-generated text descriptions and designs three semantic regularization constraints to further regulate the interaction between pseudo-word tags and real word tags.
[0086] Step 202: Since (1) large vision-language pre-trained models have achieved significant success in image caption generation; and (2) BLIP is trained on the COCO dataset, and the text descriptions it generates basically conform to the format "[subject] + [detailed description]", which matches the pseudo-word-based text template designed above, this invention uses BLIP as the image caption generation model to generate high-quality descriptions for each image.
[0087] Step 203: To prepare for the subsequent three semantic regularization constraints, this invention uses the part-of-speech (POS) tagger of spaCy to locate the first subject word in the title, so as to divide the title into two parts: one part, $T_{subj}$, describes the subject, and the other part, $T_{attr}$, provides the local attributes. For example, for the generated title description "three dogs sitting in front of a door.", $T_{subj}$ is "three dogs" and $T_{attr}$ is "sitting in front of a door".
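The split in Step 203 can be sketched without loading a tagger by passing in tokens that are already POS-tagged. The tags below are written by hand for the example sentence using universal POS labels of the kind spaCy exposes; cutting right after the first noun and dropping punctuation are assumptions about how the "first subject word" is used, and `split_title` is a hypothetical helper name.

```python
def split_title(tagged_tokens):
    """Split a caption into (T_subj, T_attr) right after its first noun.

    tagged_tokens: list of (word, pos) pairs with universal POS labels.
    """
    tokens = [(w, p) for w, p in tagged_tokens if p != "PUNCT"]   # drop punctuation
    cut = next(
        (i + 1 for i, (_, p) in enumerate(tokens) if p in ("NOUN", "PROPN")),
        len(tokens),                                              # no noun: all subject
    )
    t_subj = " ".join(w for w, _ in tokens[:cut])
    t_attr = " ".join(w for w, _ in tokens[cut:])
    return t_subj, t_attr

# Hand-written tags for the example title from the text.
tagged = [
    ("three", "NUM"), ("dogs", "NOUN"), ("sitting", "VERB"), ("in", "ADP"),
    ("front", "NOUN"), ("of", "ADP"), ("a", "DET"), ("door", "NOUN"), (".", "PUNCT"),
]
print(split_title(tagged))  # ('three dogs', 'sitting in front of a door')
```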
[0088] Step 204: To unify the title format with the aforementioned pseudo-word-based text template, this invention uses the text template "a photo of [$T_{subj}$] with [$T_{attr}$]." to form the standard title.
[0089] Step 205: To account for both the interaction among pseudo-words and the interaction between pseudo-words and the real words in their context, this invention derives three descriptions from the standard title. The first replaces $T_{subj}$ with the subject-oriented pseudo-word tag, the second replaces $T_{attr}$ with the attribute-oriented pseudo-word tags, and the third replaces both $T_{subj}$ and $T_{attr}$ simultaneously. For example, the first derived description takes the form "a photo of [$S^*$] with [$T_{attr}$].", and the other two are formed analogously.
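The standard title and its three derived descriptions can be assembled with simple string templating, as in the sketch below. The bracketed placeholders only mark where pseudo-word tag embeddings would be substituted before text encoding; the placeholder spelling for attributes (`[A1*]`, `[A2*]`, ...), the space separator between them, the dictionary keys, and the function name `build_descriptions` are assumptions.

```python
def build_descriptions(t_subj, t_attr, num_attrs):
    """Return the standard title and the three derived descriptions."""
    s_star = "[S*]"                                                 # subject pseudo-word
    a_star = " ".join(f"[A{i}*]" for i in range(1, num_attrs + 1))  # attribute pseudo-words
    template = "a photo of {} with {}."
    return {
        "T_B": template.format(t_subj, t_attr),    # standard title (real words only)
        "T_S": template.format(s_star, t_attr),    # subject replaced
        "T_A": template.format(t_subj, a_star),    # attributes replaced
        "T_SA": template.format(s_star, a_star),   # both replaced
    }

descriptions = build_descriptions("three dogs", "sitting in front of a door", num_attrs=2)
for name, text in descriptions.items():
    print(name, "->", text)
```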
[0090] Step 206: Based on the standard title and the three derived descriptions, design three title-based semantic regularization constraints, as follows:
[0091]
[0092] Here, $t_B$, $t_S$, $t_A$, and $t_{SA}$ are the text feature representations obtained by passing the standard title and the three derived descriptions, respectively, through the CLIP text encoder. $L_{subj}$ and $L_{attr}$ guide the learning of the subject-oriented and attribute-oriented pseudo-word tags, respectively, and $L_{whole}$ regularizes the learning of all pseudo-word tags. Essentially, the goal of these constraints is to ensure that pseudo-word-based descriptions are semantically close to descriptions based on real words.
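The three constraint formulas are not reproduced in this text, so their exact form is unknown here. Since the stated goal is to keep each pseudo-word-based description semantically close to the real-word title, one simple reading is a cosine distance between $t_B$ and each of $t_S$, $t_A$, and $t_{SA}$. The sketch below assumes that reading; the distance function and the names `cosine_distance` and `semantic_regularization` are illustrative only.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_regularization(t_B, t_S, t_A, t_SA):
    """Distance of each derived description's feature from the standard title's."""
    return {
        "L_subj": cosine_distance(t_B, t_S),     # subject pseudo-word vs. real subject
        "L_attr": cosine_distance(t_B, t_A),     # attribute pseudo-words vs. real attributes
        "L_whole": cosine_distance(t_B, t_SA),   # all pseudo-words vs. full real title
    }

t_B = np.array([1.0, 0.0, 0.0])
losses = semantic_regularization(
    t_B, t_S=np.array([2.0, 0.0, 0.0]), t_A=np.array([0.0, 1.0, 0.0]), t_SA=t_B
)
print(losses)
```

Under this assumption a derived description whose feature points in the same direction as $t_B$ incurs no penalty, and an orthogonal one incurs a penalty of 1.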
[0093] Step 3 belongs to the inference stage and specifically includes:
[0094] Step 301: In the inference stage, the reference image $I_r$ and the modified text $T_m$ together form the combined query used to retrieve the target image $I_t$.
[0095] Step 302: First, use the trained fine-grained text inversion network to map the reference image $I_r$ to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, so that the image is represented in text form.
[0096] Step 303: Using the text template, obtain the combined query by concatenating the pseudo-word tags representing the image information with the modified text.
[0097] Step 304: Encode the combined query using the CLIP text encoder, and encode the candidate target images $I_t$ using the CLIP image encoder. Finally, rank the candidate images by the similarity between the representation vector of the combined query and each candidate image.
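The final ranking in Step 304 reduces to sorting candidates by cosine similarity to the query vector. The sketch below assumes the combined-query vector and the candidate image vectors have already been produced by the text and image encoders; small hand-written vectors stand in for them, and `rank_candidates` is a hypothetical helper name.

```python
import numpy as np

def rank_candidates(query_vec, candidate_vecs):
    """Rank candidates by cosine similarity to the combined-query vector.

    query_vec: (d,), candidate_vecs: (N, d). Returns (order, sorted scores),
    with the best match first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity per candidate
    order = np.argsort(-scores)        # descending similarity
    return order, scores[order]

query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]])
order, scores = rank_candidates(query, candidates)
print(order)  # [1 0 2]
```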
[0098] Embodiment 2
[0099] This embodiment provides an execution system for the method provided in Embodiment 1, namely, a multimodal combined image retrieval system based on zero-shot training of a fine-grained text inversion network, comprising:
[0100] The data acquisition module is configured to acquire a reference image and a modified text;
[0101] The mapping module is configured to use a pre-trained fine-grained text inversion network to map the reference image into subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, and concatenate them with the modified text to obtain a combined query in text form.
[0102] The retrieval module is configured to use the cosine similarity between the representation vector of the combined query and the target image vector as the retrieval basis to retrieve the corresponding image;
[0103] The fine-grained pseudo-word tagging mapping module is used to map images into subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, respectively.
[0104] In the fine-grained pseudo-word tagging mapping module, two branches are designed. In the subject-oriented pseudo-word tagging mapping, the global features of the image are first extracted, and then mapped to subject-oriented pseudo-word tags through a mapping network. In the attribute-oriented pseudo-word tagging mapping, the latent local features of the image are first extracted, and then a dynamic local attribute feature extraction module is used to retain the local features related to the image from the obtained local features, and then mapped to attribute-oriented pseudo-word tags through a mapping network.
[0105] The title-based ternary semantic regularization constraint module generates the title of the corresponding image through the image title generation model. Then, based on the generated title, three semantic constraint losses are designed to normalize the mapped pseudo-word tags into the real word tag embedding space.
[0106] The key modules of the above system are the fine-grained pseudo-word tagging and mapping module and the title-based ternary semantic regularization constraint module.
[0107] The fine-grained pseudo-word tagging mapping module aims to map image information into pseudo-word tags from both the subject and attribute perspectives, so as to comprehensively represent the image in text form.
[0108] In the title-based ternary semantic regularity constraint module, this invention designs three semantic regularity constraints using title information generated from images, thereby standardizing subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags into the real word tag embedding space.
[0109] Embodiment 3
[0110] This embodiment provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the multimodal combined image retrieval method based on zero-shot training of a fine-grained text inversion network as described above.
[0111] Embodiment 4
[0112] This embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of the multimodal combined image retrieval method based on zero-shot training of a fine-grained text inversion network as described above.
[0113] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0114] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0115] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0116] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0117] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made by those skilled in the art without creative effort within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A zero-shot multimodal information retrieval method based on fine-grained text inversion, characterized in that, Includes the following steps: Obtain reference images and modify text; Using a pre-trained fine-grained text inversion network, the reference image is mapped to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, which are then concatenated with the modified text to obtain a combined query in text form. The cosine similarity between the representation vector of the combined query and the target image vector is used as the retrieval basis to retrieve the corresponding image. The training process of the fine-grained text inversion network includes: training the fine-grained inversion network using training image samples, extracting image information, mapping it to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags to fully represent image information in text form, generating titles for corresponding images using an image title generation model, generating derived descriptions based on the titles, and applying semantic regularization constraints to different derived descriptions to promote the alignment of pseudo-word tags with the embedding space of real word tags; The specific process of generating a title for a corresponding image using an image title generation model includes: using the image title generation model to generate a text description for the corresponding image, the text description being divided into two parts: the main body and local attributes; and using a text template to generate a standard title, the standard title being text that expresses the image using the main body and local attributes. 
The specific process of generating derived descriptions based on the title and applying semantic regularization constraints to the different derived descriptions includes: generating three types of derived descriptions by replacing the subject with the subject-oriented pseudo-word tags, by replacing the local attributes with the attribute-oriented pseudo-word tags, and by simultaneously replacing both the subject and the local attributes with the subject-oriented and attribute-oriented pseudo-word tags; and applying a semantic regularization constraint to each of the three types of derived descriptions, each constraint being computed on text feature representations obtained from the text encoder. The constraints on the first two types of derived descriptions guide the learning of the subject-oriented pseudo-word tags and the attribute-oriented pseudo-word tags, respectively, and the constraint on the third type regularizes the learning of all pseudo-word tags.
2. The zero-shot multimodal information retrieval method based on fine-grained text inversion as described in claim 1, characterized in that the specific process of extracting image information and mapping it to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags includes: extracting global features and local block features from the image; mapping the global features of the image to subject-oriented pseudo-word tags; and aggregating the local block features of the image to obtain potential local attribute features, filtering out irrelevant local attribute features through local-global correlation, and mapping the remaining local attribute features to attribute-oriented pseudo-word tags.
3. The zero-shot multimodal information retrieval method based on fine-grained text inversion as described in claim 2, characterized in that the specific process of adaptively filtering irrelevant local attribute features based on local-global relevance and mapping them to attribute-oriented pseudo-word tags includes: defining a set of learnable query embeddings, which are concatenated with the local image patch features and fed into a Transformer network for local attribute feature aggregation; introducing a fully connected network to ensure that the dimensionality of the local image attribute features is the same as that of the global features; calculating the similarity between each local attribute feature and the global feature, and selecting a preset number of local attribute features with the highest similarity as the effective local attribute features, the preset number being a positive integer; introducing an orthogonal constraint on the selected effective local attribute features to ensure that the extracted local attribute features are independent of each other; and mapping the extracted local attribute features to attribute-oriented pseudo-word tags using a mapping network based on a multilayer perceptron.
4. The zero-shot multimodal information retrieval method based on fine-grained text inversion as described in claim 1, characterized in that, during training, a symmetric contrastive loss is introduced to learn the pseudo-word tags; that is, for each unlabeled image, its pseudo-word-based text representation is expected to be more consistent with its own visual representation than with the visual representations of other images.
5. The zero-shot multimodal information retrieval method based on fine-grained text inversion as described in claim 1, characterized in that the process of obtaining a combined query in text form includes: taking the reference image and the modified text together as the query used to retrieve the target image; using the trained fine-grained text inversion network to map the reference image to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, thereby representing the image in text form; obtaining the combined query by concatenating the pseudo-word tags representing the image information with the modified text; and encoding the combined query with a text encoder, encoding the candidate images with an image encoder, and ranking the candidate images by the similarity between the representation vector of the combined query and the representation of each candidate image.
6. A zero-shot multimodal information retrieval system based on fine-grained text inversion, employing the method described in any one of claims 1-5, characterized in that it comprises: a data acquisition module configured to acquire a reference image and modified text; a mapping module configured to use a pre-trained fine-grained text inversion network to map the reference image to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags, and to concatenate them with the modified text to obtain a combined query in text form; a retrieval module configured to use the cosine similarity between the representation vector of the combined query and the target image vector as the retrieval basis to retrieve the corresponding image; a fine-grained pseudo-word tagging and mapping module configured to train the fine-grained text inversion network using training image samples, extract image information, and map it to subject-oriented pseudo-word tags and attribute-oriented pseudo-word tags so as to fully represent the image information in text form; and a title-based ternary semantic regularization constraint module configured to generate titles for the corresponding images using an image title generation model, generate derived descriptions based on the titles, and apply semantic regularization constraints to the different derived descriptions to promote the alignment of the pseudo-word tags with real word tags in the embedding space.
7. A computer-readable storage medium, characterized in that it is used to store computer instructions which, when executed by a processor, perform the steps of the method according to any one of claims 1-5.
8. An electronic device, characterized in that it includes a memory, a processor, and computer instructions stored in the memory and executable on the processor, which, when executed by the processor, perform the steps of the method according to any one of claims 1-5.
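For illustration only, three operations recited in the claims can be sketched in plain Python: selecting the local attribute features most similar to the global feature (claim 3), a symmetric contrastive loss between pseudo-word-based text representations and visual representations (claim 4), and ranking candidate images by cosine similarity to the combined query (claims 1 and 5). This is a minimal sketch and not the patented implementation. The function names, the list-of-floats vector format, the InfoNCE form of the loss, and the temperature value 0.07 are all assumptions that do not appear in the claims; the encoders themselves are out of scope here, and their outputs are represented by hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_top_k(local_feats, global_feat, k):
    """Keep the k local attribute features most similar to the global feature."""
    ranked = sorted(local_feats, key=lambda f: cosine(f, global_feat), reverse=True)
    return ranked[:k]


def symmetric_contrastive_loss(text_reps, image_reps, temperature=0.07):
    """Assumed InfoNCE-style loss, averaged over text-to-image and image-to-text.

    text_reps[i] and image_reps[i] are taken to describe the same image, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    n = len(text_reps)
    sims = [[cosine(t, v) / temperature for v in image_reps] for t in text_reps]

    def one_direction(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            m = max(row)  # subtract the row maximum for numerical stability
            log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_denominator - row[i]
        return total / n

    transposed = [list(col) for col in zip(*sims)]
    return 0.5 * (one_direction(sims) + one_direction(transposed))


def rank_candidates(query_vec, candidates):
    """Return candidate image ids sorted by descending cosine similarity to the query."""
    return sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical 2-D features standing in for encoder outputs.
    kept = select_top_k([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]], [1.0, 0.0], k=2)
    order = rank_candidates(
        [1.0, 0.0], {"a": [0.0, 1.0], "b": [1.0, 0.1], "c": [0.5, 0.5]}
    )
    print(kept, order)
```

In such a sketch the loss is small when each text representation is closest to its own image and grows when the pairs are mismatched, which is the behaviour claim 4 describes in words. In practice the vectors would come from the text encoder and image encoder named in the claims, and the selection step would be followed by the orthogonal constraint and the multilayer-perceptron mapping, neither of which is modelled here.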