Multimodal information retrieval method and apparatus, terminal device, and storage medium
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
- Filing Date
- 2024-12-18
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application belongs to the field of information retrieval technology, and in particular relates to a multimodal information retrieval method, apparatus, terminal device, and storage medium.
Background Technology
[0002] With the development of artificial intelligence and machine learning technologies, multimodal information retrieval is being used more and more widely in fields such as intelligent question answering, social media, e-commerce, and healthcare. However, existing multimodal information retrieval technologies typically only consider the alignment of global features between the retrieval information and the retrieved information, resulting in poor performance in scenarios requiring refined retrieval.
Summary of the Invention
[0003] This application provides a multimodal information retrieval method, apparatus, terminal device, computer-readable storage medium, and computer program product, which can be applied to various retrieval scenarios and achieve higher retrieval accuracy.
[0004] A first aspect of this application provides a multimodal information retrieval method, comprising: acquiring retrieval information; extracting a first feature of the retrieval information and a second feature of each of a plurality of retrieved information, wherein the first feature and the second feature are determined based on global features of the corresponding information, local features of the corresponding information, and reference weights of the local features of the corresponding information in the current retrieval scenario; filtering first retrieved information that matches the retrieval information from the plurality of retrieved information based on the difference between the first feature and the second feature of each retrieved information, wherein the difference between the second feature of the first retrieved information and the first feature is less than or equal to the difference between the second feature of any second retrieved information and the first feature, and the second retrieved information is different from the first retrieved information.
[0005] In one embodiment, extracting a first feature of the retrieval information and a second feature of each of the plurality of retrieved information includes: using a multimodal model to extract features from the retrieval information and the retrieved information to obtain feature extraction results, wherein the feature extraction results include global features and local features of the retrieval information and of the retrieved information; and determining the first feature and the second feature based on the feature extraction results, wherein the multimodal model includes a feature compression module, which is used to compress the local features of the retrieval information and the local features of the retrieved information.
[0006] In one implementation, the retrieval information and the retrieved information each include an image or text, and the multimodal model further includes an image encoding module and a text encoding module. Using the multimodal model to extract features from the retrieval information and the retrieved information to obtain feature extraction results includes: when the retrieval information or the retrieved information includes an image, inputting the image into the image encoding module to obtain global image features and local image features of the image; when the retrieval information or the retrieved information includes text, inputting the text into the text encoding module to obtain global text features and local text features of the text; and compressing the local image features and / or local text features using the feature compression module to obtain compressed local image features and / or compressed local text features.
[0007] In one implementation, before using the multimodal model to extract features from the retrieval information and the retrieved information, the method further includes: acquiring a training dataset, wherein the training dataset includes image-text pair sample data, and the image-text pair sample data includes image samples and text samples; and training the multimodal model according to the training dataset and a preset loss function, wherein the preset loss function includes a first loss term and a second loss term, the value of the first loss term being related to the similarity between global image features and global text features inferred by the multimodal model for the current image-text pair sample data, and the value of the second loss term being related to the similarity between local image features and local text features inferred by the multimodal model for the current image-text pair sample data.
[0008] In one implementation, the preset loss function is expressed using the following formula:
[0009] L = L1 + αL2;
[0010] Where L represents the preset loss function, L1 represents the first loss term, L2 represents the second loss term, α represents the reference weight, and α≥0.
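As an illustration only, the weighted combination L = L1 + αL2 might be sketched as follows. The text does not specify the form of either loss term; the sketch assumes each term is a symmetric contrastive (InfoNCE-style) loss over a batch similarity matrix. The function names `contrastive_loss` and `preset_loss` and the temperature value are hypothetical and do not come from this application.

```python
import math


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss over a square similarity matrix.

    sim[i][j] is the similarity between image sample i and text sample j;
    matched image-text pairs lie on the diagonal. This loss form is an
    assumption made for illustration; the source only says each loss term
    is "related to" a feature similarity.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_sum - logits[i]  # -log softmax of the matched pair
        return total / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))


def preset_loss(global_sim, local_sim, alpha):
    """L = L1 + alpha * L2, with alpha >= 0 as stated in the text."""
    if alpha < 0:
        raise ValueError("alpha must be >= 0")
    l1 = contrastive_loss(global_sim)  # first loss term: global features
    l2 = contrastive_loss(local_sim)   # second loss term: local features
    return l1 + alpha * l2
```

With α = 0 the sketch reduces to the global term alone; larger values of α give the local-feature term more influence on training.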
[0011] In one implementation, filtering the first retrieved information that matches the retrieval information from the plurality of retrieved information based on the difference between the first feature and the second feature of each retrieved information includes: calculating the inner product between the vector of the first feature and the vector of the second feature of each retrieved information to obtain multiple inner products corresponding to the plurality of retrieved information; and determining the retrieved information corresponding to the minimum value among the multiple inner products as the first retrieved information.
[0012] In one implementation, determining the first feature and the second feature based on the feature extraction results includes: determining the first feature and the second feature using the following formula:
[0013] u1 = y0 + αo0;
[0014] u2 = f0 + αo1;
[0015] Where u1 represents the vector of the first feature, u2 represents the vector of the second feature, y0 represents the vector of the global features of the retrieval information, o0 represents the vector of the compressed local features of the retrieval information, f0 represents the vector of the global features of the retrieved information, o1 represents the vector of the compressed local features of the retrieved information, α represents the reference weight, and α≥0.
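The weighted superposition defined above can be sketched in a few lines. This is a minimal sketch, assuming that the global vector and the compressed local vector have the same dimension (which the addition in the formula implies); the function name `fuse` is hypothetical.

```python
def fuse(global_vec, local_vec, alpha):
    """Return global_vec + alpha * local_vec, element by element.

    Mirrors u1 = y0 + alpha * o0 and u2 = f0 + alpha * o1, where alpha is
    the reference weight (alpha >= 0). Equal vector lengths are assumed.
    """
    if alpha < 0:
        raise ValueError("alpha must be >= 0")
    if len(global_vec) != len(local_vec):
        raise ValueError("global and local vectors must have equal length")
    return [g + alpha * o for g, o in zip(global_vec, local_vec)]


# Illustrative values only.
y0, o0 = [1.0, 2.0], [0.5, -1.0]   # retrieval information: global, local
u1 = fuse(y0, o0, alpha=0.5)       # first feature
```

The same function would be applied to f0 and o1 of each retrieved information, with the same α, to obtain u2.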
[0016] A second aspect of this application provides a multimodal information retrieval device, comprising: a first acquisition module for acquiring retrieval information; an extraction module for extracting a first feature of the retrieval information and a second feature of each of a plurality of retrieved information, wherein the first feature and the second feature are determined based on global features of the corresponding information, local features of the corresponding information, and reference weights of the local features of the corresponding information in the current retrieval scenario; and a retrieval module for filtering first retrieved information that matches the retrieval information from the plurality of retrieved information based on the difference between the first feature and the second feature of each retrieved information, wherein the difference between the second feature of the first retrieved information and the first feature is less than or equal to the difference between the second feature of any second retrieved information and the first feature, and the second retrieved information is different from the first retrieved information.
[0017] A third aspect of this application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the above-described multimodal information retrieval method.
[0018] A fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the multimodal information retrieval method described above.
[0019] The fifth aspect of this application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps in the multimodal information retrieval method described above.
[0020] The multimodal information retrieval method provided in the first aspect of this application firstly and accurately extracts a first feature of the retrieval information and a second feature of each retrieved information. In this process, not only are the global and local features of the retrieval information and each retrieved information considered, but also the reference weight of the local features in the current retrieval scenario is fully considered. This makes the determined first and second features more closely match the retrieval requirements of the current retrieval scenario, thereby improving the flexibility and accuracy of feature extraction and enabling it to adapt to retrieval tasks of different types and complexities. Furthermore, based on the difference between the first feature of the retrieval information and the second feature of each retrieved information, the first retrieved information matching the retrieval information is quickly and accurately selected from multiple retrieved information. Moreover, by comprehensively considering the global and local features of the retrieval information and the retrieved information, this method can maintain high retrieval performance even when facing incomplete data or noise, thus enhancing the robustness of the retrieval. Therefore, this multimodal information retrieval method can significantly improve the flexibility, accuracy, robustness, and efficiency of information retrieval, resulting in a better user experience.
[0021] It is understood that the beneficial effects of the second to fifth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.
Attached Figure Description
[0022] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0023] Figure 1 is a flowchart illustrating a multimodal information retrieval method provided in one embodiment of this application;
[0024] Figure 2 is a schematic diagram of a retrieval result using a visual semantic search model for multimodal information retrieval;
[0025] Figure 3 is a schematic diagram of the structure of a multimodal model provided in one embodiment of this application;
[0026] Figure 4 is a schematic diagram of the structure of a multimodal information retrieval device provided in one embodiment of this application;
[0027] Figure 5 is a schematic diagram of the structure of a terminal device provided in one embodiment of this application.
Detailed Implementation
[0028] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.
[0029] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or a collection thereof.
[0030] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0031] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0032] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0033] The collection, storage, use, processing, transmission, provision, disclosure, and application of user personal information involved in this application embodiment all comply with the provisions of relevant laws and regulations, have obtained the user's authorization or consent, have taken necessary confidentiality measures, and do not violate public order and good morals.
[0034] As mentioned earlier, existing multimodal information retrieval technologies typically only consider the alignment of global features between the retrieval information and the retrieved information, resulting in poor performance in scenarios requiring refined retrieval.
[0035] Multimodal information retrieval technologies include unimodal information retrieval and cross-modal information retrieval. Unimodal information retrieval involves searching within a single data modality, for example, searching for relevant documents in a text database or searching for similar images in an image database. Cross-modal information retrieval involves searching across different data modalities, for example, retrieving relevant images using a text query or retrieving relevant text descriptions using an image query.
[0036] Take cross-modal information retrieval using multimodal models as an example. Visual semantic search models, as a commonly used multimodal model, can learn rich visual and linguistic knowledge through large-scale pre-training, and can achieve cross-modal searches such as image-to-text search, text-to-image search, or text-to-video search.
[0037] Existing visual semantic search models are suitable for simple category retrieval scenarios, but they suffer from retrieval errors when faced with complex textual semantic input. For example, when a user inputs the search text "apple," existing visual semantic search models can only retrieve images or videos containing "apple" from the user's database. For instance, the model might return the two apple pictures shown in Figure 2 (image a and image b). However, if a user enters the search text "apple without water droplets on its surface," the traditional visual semantic search model cannot accurately return the correct result, namely image b in Figure 2.
[0038] Analysis revealed that traditional visual semantic search models are mainly trained on pure image and text data and use only a single global feature for visual semantic embedding, without considering the alignment of fine-grained features of images and text (such as texture, combination condition description, positional relationship, etc.), resulting in poor model performance in scenarios requiring refined retrieval.
[0039] To at least partially solve the aforementioned technical problems, embodiments of this application provide a multimodal information retrieval method, apparatus, terminal device, computer-readable storage medium, and computer program product. The multimodal information retrieval method of this application can be applied to various fields, including but not limited to urban governance (e.g., retrieving and analyzing captured images of the city through text language), e-commerce (e.g., users can quickly find relevant product information and reviews by uploading product images or entering keywords), healthcare (e.g., doctors can retrieve relevant medical imaging data by inputting patient symptom descriptions), education (teachers and students can find relevant learning materials and video tutorials by searching keywords or uploading images, and can also perform interactive understanding and analysis of text and images), intelligent monitoring and security, smart homes, and autonomous driving. It should be noted that the multimodal information retrieval method of this application can be used for single-modal information retrieval as well as multimodal information retrieval, and is applicable to scenarios with various search granularities, especially maintaining high search accuracy in scenarios requiring refined searches.
[0040] As shown in Figure 1, the multimodal information retrieval method of the embodiments of this application includes the following steps S110, S120 and S130.
[0041] Step S110: Obtain search information.
[0042] In this embodiment, the retrieval information can be retrieval description information, specifically any description information describing the retrieval purpose. The retrieval information can be of any suitable modality, including but not limited to image modality, text modality, video modality, audio modality, etc. For ease of understanding, the modality of the retrieval information can be referred to as the first modality. Exemplarily, the first modality can be an image modality or a text modality. For example, if the first modality is a text modality, the retrieval information is retrieval text, such as "boy sitting in a chair" or "apple with no water droplets on its surface." Alternatively, if the first modality is an image modality, the retrieval information is a retrieval image, such as a photo of a cat, or image a or image b in Figure 2.
[0043] In this embodiment of the application, the retrieval information may include simple semantic information (e.g., category information) or complex semantic information (e.g., image texture information, combined condition description information, positional relationship information, etc.).
[0044] In this step, various suitable methods can be used to obtain search information. For example, the original search information input by the user through the input device can be obtained first, and then the search information can be determined based on the original search information. For instance, the user inputs search voice through the voice input control in the user interface. In this step, the user's search voice input can be received, converted, and preprocessed into text format to obtain the search text. Of course, the original search information input by the user can also be used directly for retrieval without processing.
[0045] Step S120: Extract the first feature of the retrieval information and the second feature of each of the plurality of retrieved information. The first feature and the second feature are determined based on the global features of the corresponding information, the local features of the corresponding information, and the reference weights of the local features of the corresponding information in the current retrieval scenario.
[0046] In this embodiment, the retrieved information can be data in a database of retrieved information. The modality of the retrieved information is the modality of the data to be retrieved. For ease of understanding, the modality of the retrieved information can be referred to as the second modality. The second modality can be any suitable modality, including but not limited to image modality, text modality, video modality, audio modality, etc. Exemplarily, the second modality can also be an image modality or a text modality. The second modality can be the same as or different from the first modality. In one example, the second modality is the same as the first modality. For example, both the retrieval information and the retrieved information are text, which corresponds to single-modal "text-to-text" information retrieval. Or, both the retrieval information and the retrieved information are images, which corresponds to single-modal "image-to-image" information retrieval. In another example, the first modality is different from the second modality. For example, the retrieval information is text and the retrieved information is an image, or the retrieval information is an image and the retrieved information is text. For simplicity, the multimodal information retrieval method of this embodiment will be described in detail below, taking the case where the retrieval information is text and the retrieved information is an image as an example. For example, the retrieval information can be the search text entered by the user in the search box, and the retrieved information can be images in a massive image library (referred to as the searched images). The number of searched images in the image library can be denoted as N (e.g., N = 100,000,000), so the number of pieces of retrieved information (searched images) is N.
[0047] In this embodiment, the first feature and the second feature are determined based on the global features of the corresponding information, the local features of the corresponding information, and the reference weights of the local features of the corresponding information in the current retrieval scenario. In the example where the retrieval information is the retrieval text and the retrieved information is an image in the image library, the first feature (comprehensive text feature) of the retrieval text can be determined based on the global features of the retrieval text, the local features of the retrieval text, and the reference weight of the local features in the current retrieval scenario. Various suitable methods can be used in advance to determine the reference weight of the local features in the current retrieval scenario. The value range of the reference weight can be [0,1]. For example, in a coarse-grained retrieval scenario, the reference weight can be set to less than 0.5, and in a fine-grained retrieval scenario, the reference weight can be set to greater than or equal to 0.5. Exemplarily, the retrieval text can be analyzed first, various suitable algorithms can be used to determine the level of retrieval granularity of the current retrieval scenario, and the corresponding reference weight can then be determined based on that level. Exemplarily, a feature weighting method can be used in this step to determine the first feature of the retrieval text and the second feature of each retrieved image. Examples of this scheme will be described later and, for the sake of brevity, are not detailed here.
[0048] In this step, various suitable methods can be used to extract the first feature of the search text (e.g., called the comprehensive text feature) and the second feature (e.g., called the comprehensive image feature) of each of the N searched images in the image library. For example, a trained multimodal model can be used to extract features from both the search text and each searched image to obtain the feature extraction result. This feature extraction result can include the text feature extraction result for the search text and the image feature extraction result for each searched image. Furthermore, the comprehensive text feature of the search text can be determined based on the text feature extraction result, and the comprehensive image feature of each searched image can be determined based on the image feature extraction result for that image. Taking the determination of the comprehensive text feature of the search text as an example, the text feature extraction result can include the global features of the search text (e.g., called the global text feature) and the local features of the search text (e.g., called the local text feature). The global text feature and the local text feature of the search text can be superimposed according to the reference weight of the local features in the current retrieval scenario to obtain the comprehensive text feature (first feature) of the search text.
[0049] Step S130: Based on the difference between the first feature and the second feature of each retrieved information, select the first retrieved information that matches the retrieval information from the plurality of retrieved information, wherein the difference between the second feature of the first retrieved information and the first feature is less than or equal to the difference between the second feature of any second retrieved information and the first feature, and the second retrieved information is different from the first retrieved information.
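The mapping from retrieval granularity to a reference weight might look like the following sketch. The text leaves the analysis algorithm open ("various suitable algorithms"), so the word-count heuristic, the threshold of two words, and the weight values 0.2 and 0.7 are all hypothetical placeholders chosen only to respect the stated ranges (below 0.5 for coarse-grained, at least 0.5 for fine-grained, within [0,1]).

```python
# Illustrative weights only: coarse-grained < 0.5, fine-grained >= 0.5.
WEIGHT_BY_LEVEL = {"coarse": 0.2, "fine": 0.7}


def granularity_level(search_text):
    """Hypothetical heuristic: queries longer than two words are treated
    as fine-grained, since they tend to carry attribute or relation
    descriptions beyond a bare category name."""
    return "fine" if len(search_text.split()) > 2 else "coarse"


def reference_weight(search_text):
    """Return the reference weight of local features for this query."""
    return WEIGHT_BY_LEVEL[granularity_level(search_text)]
```

Under these assumptions, the query "apple" would receive a weight below 0.5, while "apple with no water droplets on its surface" would receive a weight of at least 0.5. A practical system would likely replace the heuristic with a trained classifier or a rule set tuned to its own retrieval scenarios.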
[0050] In this embodiment, the second retrieved information can be any of the retrieved information other than the first retrieved information. For example, if the search information is search text and the retrieved information is an image from an image library, the first retrieved information selected can be the image in the image library that best matches the semantics described by the search text.
[0051] In this embodiment, various suitable methods can be used to determine the difference between the first feature and the second feature of each retrieved information, and the retrieved information can be sorted according to the difference determined for each retrieved information. For example, the smaller the difference, the higher the ranking. Thus, the retrieved information ranked highest can be taken as the first retrieved information. Taking the retrieval information as retrieval text and the retrieved information as images in the image library as an example, in a specific example, various suitable methods can be used to calculate a measure of the difference between the vector of comprehensive text features of the retrieval text (called the comprehensive text feature vector) and the vector of comprehensive image features of each retrieved image (called the comprehensive image feature vector). For example, the absolute value of the difference between the magnitude of the comprehensive text feature vector and the magnitude of the comprehensive image feature vector of each retrieved image can be calculated as this measure. As another example, the inner product between the comprehensive text feature vector and the comprehensive image feature vector of each retrieved image can be calculated as this measure. Then, the retrieved images in the image library can be sorted according to this measure, with the smaller the value, the higher the ranking. The system can select the top-ranked image as the best match for the search text and return that image to the user. In another specific example, a classification model can be pre-trained, and the comprehensive text feature vector and the comprehensive image feature vector of each searched image can be input into this model. The model can output a confidence score indicating the match between the search text and each searched image. Finally, the image with the highest confidence score can be selected as the best match for the search text.
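The ranking step can be sketched as follows, using the magnitude-gap example given in the text as the difference measure. This is a minimal sketch under stated assumptions: the function names and the dictionary layout are hypothetical, and the `difference` parameter is included so that another measure can be substituted.

```python
import math


def magnitude_gap(u1, u2):
    """Absolute difference between the magnitudes of two vectors,
    one of the example difference measures mentioned in the text."""
    norm1 = math.sqrt(sum(x * x for x in u1))
    norm2 = math.sqrt(sum(x * x for x in u2))
    return abs(norm1 - norm2)


def rank_by_difference(query_vec, candidates, difference=magnitude_gap):
    """Sort candidate ids so that the smallest difference ranks first.

    candidates maps an id to its comprehensive feature vector.
    """
    return sorted(
        candidates,
        key=lambda cid: difference(query_vec, candidates[cid]),
    )


# Illustrative values only.
text_vec = [3.0, 4.0]
image_vecs = {"a": [0.0, 5.0], "b": [6.0, 8.0], "c": [0.0, 6.0]}
ranking = rank_by_difference(text_vec, image_vecs)
best_match = ranking[0]  # the first retrieved information
```

Note that a magnitude gap ignores vector direction, so two unrelated vectors of equal length would tie; it is used here only because the text names it as an example.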
[0052] As mentioned earlier, existing multimodal information retrieval technologies typically only consider the alignment of global features between the retrieval information and the retrieved information, resulting in poor performance in scenarios requiring refined retrieval. However, the multimodal information retrieval method described in this application first accurately extracts the first feature of the retrieval information and the second feature of each retrieved information. In this process, not only are the global and local features of the retrieval information and each retrieved information considered, but the reference weight of local features in the current retrieval scenario is also fully considered. This makes the determined first and second features more closely match the retrieval requirements of the current retrieval scenario, thereby improving the flexibility and accuracy of feature extraction and enabling it to adapt to retrieval tasks of different types and complexities. Furthermore, based on the differences between the first feature of the retrieval information and the second feature of each retrieved information, the first retrieved information matching the retrieval information is quickly and accurately selected from the plurality of retrieved information. Moreover, by comprehensively considering the global and local features of the retrieval information and the retrieved information, this method can maintain high retrieval performance even when facing incomplete data or noise, thus enhancing the robustness of the retrieval. Therefore, this multimodal information retrieval method can significantly improve the flexibility, accuracy, robustness, and efficiency of information retrieval, resulting in a better user experience.
[0053] In one implementation, step S120 of extracting a first feature of the retrieval information and a second feature of each of the plurality of retrieved information includes the following steps: Step S121, using a multimodal model to extract features from the retrieval information and the retrieved information to obtain feature extraction results, wherein the feature extraction results include global features and local features of the retrieval information and of the retrieved information; Step S122, determining the first feature and the second feature based on the feature extraction results, wherein the multimodal model includes a feature compression module, which is used to compress the local features of the retrieval information and the local features of the retrieved information.
[0054] In this embodiment, the multimodal model has the ability to extract the global and local features of both the retrieval information and the retrieved information. Furthermore, the multimodal model includes a feature compression module, which compresses the local features of both the retrieval information and the retrieved information, ensuring that the length of the compressed local features is less than the length of the uncompressed local features. Where the retrieval information and / or the retrieved information is an image, the length of the image's local features can be the number of image sub-representations of different local regions within the corresponding image; where the retrieval information and / or the retrieved information is text, the length of the text's local features can be the number of text sub-representations of the tokens of the corresponding text. For example, when the retrieval information and / or the retrieved information is an image, the feature compression module can compress multiple image sub-representations into one sub-representation; when the retrieval information and / or the retrieved information is text, the feature compression module can compress multiple text sub-representations into one sub-representation.
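To make the effect of compression concrete, the following sketch reduces several sub-representations to a single one. The text does not state how the feature compression module works internally; mean pooling is used here purely as a stand-in assumption, and a trained module would more likely be learned jointly with the encoders. The function name is hypothetical.

```python
def compress_local_features(sub_representations):
    """Compress multiple sub-representations into one by mean pooling.

    sub_representations is a list of equal-length vectors, one per image
    region or text token. Mean pooling is an assumed stand-in for the
    feature compression module, not the method of this application.
    """
    if not sub_representations:
        raise ValueError("at least one sub-representation is required")
    dim = len(sub_representations[0])
    if any(len(v) != dim for v in sub_representations):
        raise ValueError("all sub-representations must have equal length")
    count = len(sub_representations)
    return [sum(v[k] for v in sub_representations) / count for k in range(dim)]


# Three region vectors (length 3 as local features) become one vector.
regions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
compressed = compress_local_features(regions)
```

In this example the length of the local features drops from three sub-representations to one, which is the property the paragraph above describes.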
[0055] In this embodiment, the multimodal model can be a machine learning model capable of extracting features from data of at least two modalities. For example, the multimodal model can be a trained visual semantic model. The visual semantic model may include an image feature extraction module, a text feature extraction module, and the aforementioned feature compression module. Exemplarily, various suitable training methods can be used to jointly train the various modules in the visual semantic model in advance, enabling the visual semantic model to accurately extract text and image features and to precisely align global and local features between different images, between different texts, and between images and text. For example, image-text pairs can be used to pre-train the visual semantic model, and the model can learn the ability to align features through tasks such as contrastive learning and image-text matching.
[0056] For example, the retrieval information and each piece of retrieved information can be sequentially input into a multimodal model to obtain feature extraction results for each piece of information.
[0057] In one example, when the retrieval information is search text input by a user and the retrieved information is images from an image library, the search text can be input into a trained visual semantic model to obtain its global and local features. Then, each retrieved image can be input into the trained visual semantic model to obtain its global and local features. Alternatively, before the search text is obtained, each retrieved image from the image library can be input into the trained visual semantic model in advance to obtain its global and local features; after the search text is obtained, it can then be input into the trained visual semantic model to obtain its global and local features. Next, reference weights for the local features in the current retrieval scenario can be determined based on the search text. Then, the first feature (comprehensive text feature) of the search text can be determined from the reference weights and the global and local features of the search text, and the second feature (comprehensive image feature) of each retrieved image can be determined from the reference weights and the global and local features of that image. Finally, the difference between the comprehensive image feature of each retrieved image and the comprehensive text feature of the search text can be computed. Based on these differences, the similarity between each retrieved image and the search text can be determined, and the retrieved image with the highest similarity to the search text can be output and displayed on the user interface for viewing. As mentioned above, the trained visual semantic model is able to accurately align global and local features between different images, between different texts, and between images and text. Therefore, the global features of each retrieved image output by the visual semantic model are fully aligned with the global features of the search text, and the local features of each retrieved image output by the visual semantic model are also fully aligned with the local features of the search text. Furthermore, since the multimodal model includes a feature compression module, the local features of the search text and of the retrieved images are both compressed local features, so the amount of data representing them is reduced. Thus, in scenarios requiring refined retrieval, retrieval accuracy can be significantly improved by aligning the global and local features of the search text and the retrieved images, and the computational load of the subsequent search can be significantly reduced by compressing their local features. The multimodal information retrieval method of the embodiments of this application can therefore significantly improve the accuracy and efficiency of cross-modal refined search.
[0058] Of course, in other examples where the retrieval information and the retrieved information are of the same modality, a well-trained multimodal model can also quickly and accurately extract and compress the features of both. Take the case where both the retrieval information and the retrieved information are images. The retrieval image (retrieval information) and each retrieved image (retrieved information) can be sequentially input into the trained multimodal model to obtain the global and local features of the retrieval image and of each retrieved image. For example, before the retrieval image is acquired, each retrieved image from the image library can be input into the trained visual semantic model in advance to obtain its global and local features. After the retrieval image is acquired, it can be input into the trained visual semantic model to obtain its global and local features. Then, the reference weights of the local features in the current retrieval scenario can be determined based on the retrieval image. Then, the first feature (comprehensive image feature) of the retrieval image can be determined from the reference weights and the global and local features of the retrieval image, and the second feature (comprehensive image feature) of each retrieved image can be determined from the reference weights and the global and local features of that image. Finally, the difference between the comprehensive image feature of each retrieved image and the comprehensive image feature of the retrieval image can be computed. Based on these differences, the similarity between each retrieved image and the retrieval image can be determined, and the retrieved image with the highest similarity to the retrieval image can be output and displayed on the user interface for viewing. It can be understood that the global features of each retrieved image output by the trained visual semantic model are fully aligned with the global features of the retrieval image, and the local features of each retrieved image output by the visual semantic model are also fully aligned with the local features of the retrieval image. Furthermore, the amount of data representing the local features of the retrieval image and of the retrieved images is reduced after compression by the feature compression module. Therefore, in scenarios requiring refined retrieval, the multimodal information retrieval method of this embodiment can also significantly improve the accuracy and efficiency of single-modal refined search.
[0059] In one implementation, the retrieval information and the retrieved information each include an image or text, and the multimodal model further includes an image encoding module and a text encoding module. Step S121, using the multimodal model to extract features from the retrieval information and the retrieved information to obtain feature extraction results, includes the following steps: Step S1211, if the retrieval information or the retrieved information includes an image, the image is input into the image encoding module to obtain global image features and local image features of the image; Step S1212, if the retrieval information or the retrieved information includes text, the text is input into the text encoding module to obtain global text features and local text features of the text; Step S1213, the local image features and/or local text features are compressed using the feature compression module to obtain compressed local image features and/or compressed local text features.
[0060] For example, the multimodal model in this embodiment can be a visual semantic model, which may include an image encoding module, a text encoding module, and a feature compression module. For example, as shown in Figure 3, the image encoding module can be an image encoder based on a Vision Transformer (ViT) structure. This image encoder turns an image processing task into a sequence processing task by splitting the image into fixed-size image patches, flattening the patches, and converting them into embedding vectors through a linear layer. Positional encodings are added to the embedding vectors to preserve the spatial information of the image, and the resulting vectors are fed into an encoder consisting of multiple Transformer layers to extract image features. To take the full information of the text into account, the text encoding module can be a text encoder based on a BERT (Bidirectional Encoder Representations from Transformers) structure. A BERT-based text encoder encodes the text using context from both directions, enabling a more comprehensive understanding of the text content and more accurate extraction of text features from the retrieval text or the retrieved text. The feature compression module can be a feature compression layer based on a Multi-Layer Perceptron (MLP) structure, used specifically for compressing image sub-representations and text sub-representations. Compressing the local features of the retrieval information and the retrieved information with an MLP-based feature compression layer yields a compact representation of the local features while retaining key information, thereby improving retrieval efficiency, saving storage space, and reducing computational load.
[0061] The inference process of each module of the visual semantic model shown in Figure 3 is described in detail below. During inference, features of the retrieval information and the retrieved information can be extracted using all or some of the modules of the model.
[0062] When the retrieval information and/or the retrieved information is an image, the retrieval image or the retrieved image can be represented as a tensor in $\mathbb{R}^{h \times w \times c}$, where h and w represent the height and width of the image, respectively, and c represents the number of channels of the input image. Assuming the retrieval image or the retrieved image is divided into p patches and the hidden layer has dimension d, the input x of the ViT-based image encoder can be expressed as $x \in \mathbb{R}^{(p+1) \times d}$, where "p+1" represents p image sub-representations corresponding to local image patches and 1 image category representation corresponding to the overall image. The output y of the image encoder can be expressed as $y = \mathrm{ViT}(x)$, where $y \in \mathbb{R}^{(p+1) \times d}$.
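The shape bookkeeping in this paragraph can be illustrated with a minimal NumPy sketch. It assumes square, non-overlapping patches and uses random stand-ins for the learned projection, class token, and positional encodings; the names `patchify` and `build_vit_input` are hypothetical. It shows only how an h×w×c image becomes an input of shape (p+1)×d, not the Transformer layers themselves.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (h, w, c) image into p flattened patches of shape (p, patch*patch*c)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, c) -> (rows, cols, patch, patch, c) -> (p, patch*patch*c)
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, patch * patch * c)

def build_vit_input(image: np.ndarray, patch: int, d: int, rng) -> np.ndarray:
    """Return x of shape (p + 1, d): one class token followed by p patch embeddings."""
    flat = patchify(image, patch)                      # (p, patch*patch*c)
    proj = rng.normal(size=(flat.shape[1], d)) * 0.02  # linear layer (random stand-in)
    tokens = flat @ proj                               # (p, d)
    cls = np.zeros((1, d))                             # class token (stand-in for a learned vector)
    x = np.concatenate([cls, tokens], axis=0)          # (p + 1, d)
    pos = rng.normal(size=x.shape) * 0.02              # positional encodings (random stand-in)
    return x + pos

rng = np.random.default_rng(0)
x = build_vit_input(rng.random((224, 224, 3)), patch=16, d=768, rng=rng)
print(x.shape)  # (197, 768): p = 196 patches plus 1 class token
```

Row 0 of `x` plays the role of the image category representation, and rows 1 to p play the role of the image sub-representations.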
[0063] When the retrieval information and/or the retrieved information is text, the retrieval text and/or the retrieved text can be represented by t. For example, the maximum sentence length can be set to L+1. When the text is shorter than L+1, padding tokens can be used to pad it to a length of L+1, where "L+1" represents L text sub-representations corresponding to individual local tokens and 1 text category representation corresponding to a global token. The feature representation of the text output by the BERT-based text encoder can be represented by f, so that $f = \mathrm{BERT}(t)$, where $f \in \mathbb{R}^{(L+1) \times d}$ and d represents the dimension of the hidden layer of the text representation.
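The padding step can be sketched as follows. This is a minimal illustration with hypothetical names; the token ids (101 for the global token, 0 for padding) are illustrative values, and truncating text longer than L+1 is an assumption not stated in the description.

```python
def pad_tokens(token_ids, max_len, pad_id=0, cls_id=101):
    """Prepend one global token, then truncate or pad to max_len = L + 1 positions."""
    seq = [cls_id] + list(token_ids)
    seq = seq[:max_len]                       # assumption: overly long text is truncated
    mask = [1] * len(seq)                     # 1 marks real tokens
    seq += [pad_id] * (max_len - len(seq))    # pad short text up to L + 1
    mask += [0] * (max_len - len(mask))       # 0 marks padding positions
    return seq, mask

ids, mask = pad_tokens([7592, 2088], max_len=6)
print(ids)   # [101, 7592, 2088, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

The attention mask lets an encoder ignore the padded positions, so the padding does not alter the representation of the real tokens.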
[0064] For example, the feature compression module based on the MLP structure can consist of two fully connected network layers, which can be called the MLP compression layer. The MLP compression layer can compress the p image sub-representations (e.g., denoted as y[1:]) output by the image encoder and/or the L text sub-representations (e.g., denoted as f[1:]) output by the text encoder. The output of the MLP compression layer can be represented by o. Then $o = A_1 A_2 z$, where $A_1 \in \mathbb{R}^{2 \times (2p+2L)}$ and $A_2 \in \mathbb{R}^{(2p+2L) \times (p+L)}$ are the learnable parameters of the MLP compression layer, and z represents the input to the MLP compression layer, with $z = [y[1:]; f[1:]] \in \mathbb{R}^{(p+L) \times d}$. Ultimately, the output of the MLP compression layer is $o \in \mathbb{R}^{2 \times d}$.
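The dimensions of the compression step can be checked with a short NumPy sketch. It follows the formula o = A1 A2 z literally (two linear maps applied along the token axis, with no activation function, since none appears in the formula); the random weights and the values of p, L, and d are placeholders chosen for illustration.

```python
import numpy as np

def mlp_compress(z: np.ndarray, A1: np.ndarray, A2: np.ndarray) -> np.ndarray:
    """Compute o = A1 @ A2 @ z, mapping (p + L, d) sub-representations to (2, d)."""
    return A1 @ (A2 @ z)

p, L, d = 196, 32, 768                       # illustrative sizes only
rng = np.random.default_rng(0)
y_sub = rng.normal(size=(p, d))              # image sub-representations y[1:]
f_sub = rng.normal(size=(L, d))              # text sub-representations f[1:]
z = np.concatenate([y_sub, f_sub], axis=0)   # z = [y[1:]; f[1:]], shape (p + L, d)

A2 = rng.normal(size=(2 * p + 2 * L, p + L)) * 0.01   # first layer (random stand-in)
A1 = rng.normal(size=(2, 2 * p + 2 * L)) * 0.01       # second layer (random stand-in)

o = mlp_compress(z, A1, A2)
print(o.shape)  # (2, 768)
```

The two output rows correspond to one compressed image sub-representation and one compressed text sub-representation, each keeping the hidden dimension d.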
[0065] It should be noted that during inference, for multimodal information search scenarios (such as "text-to-image search"), the input to the image encoder can be empty while the model performs inference on the retrieval text, and the MLP compression layer can compress only the text sub-representations of the retrieval text output by the text encoder. For example, multiple text sub-representations can be compressed into a single text sub-representation. In the specific application scenario of "text-to-image search," batches of image data can first be encoded offline using the ViT-based image encoder, preserving the global representation and image sub-representations of all images. When the user inputs text, text features can be extracted in real time using the BERT-based text encoder, and the MLP compression layer can then quickly compress the image and text sub-representations.
[0066] For unimodal information retrieval scenarios, the MLP compression layer can compress only the local features of the retrieval information or the retrieved information. For example, when both the retrieval information and the retrieved information are text, the input to the image encoder can be empty during inference on the retrieval text and the retrieved texts. In this case, the MLP compression layer can compress only the individual text sub-representations output by the text encoder. For example, multiple text sub-representations can be compressed into a single text sub-representation. Accordingly, the output of the multimodal model can include a text category representation and a text sub-representation. As another example, when both the retrieval information and the retrieved information are images, the input to the text encoder can be empty during inference on the retrieval image and the retrieved images. In this case, the MLP compression layer can compress only the individual image sub-representations output by the image encoder. For example, multiple image sub-representations can be compressed into a single image sub-representation. Accordingly, the output of the multimodal model can include an image category representation and an image sub-representation.
[0067] The aforementioned multimodal model enables rapid and accurate alignment and extraction of global and local features of images and / or text, and allows for the compression of local features. This improves the accuracy and efficiency of multimodal information retrieval. Furthermore, this approach is applicable not only to single-modal retrieval scenarios but also to multimodal retrieval scenarios, demonstrating greater applicability.
[0068] In one embodiment, before feature extraction is performed on the retrieval information and the retrieved information using the multimodal model in step S121, the multimodal information retrieval method of this embodiment further includes the following steps: Step S101, obtaining a training dataset, wherein the training dataset includes image-text pair sample data, and the image-text pair sample data includes image samples and text samples; Step S102, training the multimodal model according to the training dataset and a preset loss function. The preset loss function includes a first loss term and a second loss term. The value of the first loss term is related to the similarity between the global image features and global text features inferred by the multimodal model for the current image-text pair sample data, and the value of the second loss term is related to the similarity between the local image features and local text features inferred by the multimodal model for the current image-text pair sample data.
[0069] In this embodiment, the multimodal model can be a visual semantic model, specifically a visual semantic search model. The training process of the visual semantic model is described below with reference to Figure 3. First, a large amount of image-text pair data can be collected in advance as the training dataset for the visual semantic model. Then, the visual semantic model can be trained using an image-text alignment training method.
[0070] Analysis shows that traditional visual semantic models employ only category alignment training for images and text; that is, they perform contrastive learning only on image category representations (e.g., denoted as y[0]) and text category representations (e.g., denoted as f[0]). This training method causes visual semantic models to ignore the fine-grained feature information carried by image and text sub-representations, making it difficult for them to handle cross-modal search in fine-grained search scenarios. Because the visual semantic model in this embodiment adds a feature compression module (such as the MLP compression layer mentioned above), it can learn image and text sub-representations jointly and can also be trained by contrastive learning on the compressed image and text sub-representations, so that the model is also able to align local image features with local text features.
[0071] In this embodiment, the training process of the visual semantic model may include a joint training process of global feature training and local feature training for image-text pairs. For example, for contrastive learning of global features, a contrastive loss term L1 (first loss term) can be set. The value of this contrastive loss term is related to the similarity between the global image features and global text features inferred by the visual semantic model for the current image-text pair sample data. For example, the contrastive loss term L1 can be specifically represented by the following formula:
[0072]
[0073] Where k is the number of image-text pairs in each batch during training; the two vectors in the formula represent the global image feature and the global text feature of image-text pairs in the current batch, indexed by i and j, where i and j both belong to the range [1, k]; and the symbol "×" represents the vector inner product. For example, the two vectors are the image category representation and the text category representation shown in Figure 3.
[0074] For contrastive learning of image and text sub-representations, a contrastive loss term L2 (the second loss term) can be set. The value of this contrastive loss term is related to the similarity between the local image features and local text features inferred by the visual semantic model for the current image-text pair sample data. For example, the contrastive loss term L2 can be specifically expressed by the following formula:
[0075]
[0076] Where k is the number of image-text pairs in each batch during training, and the two vectors in the formula represent the compressed local image feature and the compressed local text feature of the i-th image-text pair in the current batch. For example, one is the image sub-representation vector output by the MLP feature compression layer in Figure 3 (shown as the light gray squares in Figure 3), and the other is the text sub-representation vector output by the MLP feature compression layer in Figure 3 (shown as the dark gray squares in Figure 3).
[0077] In this embodiment, the preset loss function during the training of the visual semantic model may include the aforementioned contrastive loss term L1 and contrastive loss term L2. For example, the preset loss function is expressed using the following formula:
[0078] L = L1 + αL2;
[0079] Where L represents the preset loss function, L1 represents the first loss term, L2 represents the second loss term, α represents the reference weight, and α≥0.
[0080] In this embodiment, α can be set according to the actual retrieval scenario. For example, in a refined search scenario, α can be set to be greater than 0. For example, α can be set to 0.5. In some special scenarios that do not focus on fine-grained features, such as simple category search scenarios, α can be set to 0. It can be understood that when α = 0, L = L1, and the trained visual semantic model only has the ability to align global features, which corresponds to the contrastive learning training of traditional visual semantic models.
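The combined objective L = L1 + αL2 can be sketched in NumPy as follows. The exact expressions for L1 and L2 are not reproduced in this text, so the sketch assumes a symmetric InfoNCE-style batch contrastive loss over pairwise inner products, a form widely used in image-text pre-training; the temperature parameter, the symmetrization, and all function names are assumptions rather than the formulas of this application.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(img: np.ndarray, txt: np.ndarray, temperature: float = 1.0) -> float:
    """Assumed symmetric InfoNCE over a batch of k matched pairs (img[i], txt[i])."""
    logits = (img @ txt.T) / temperature                     # (k, k) pairwise inner products
    diag = np.arange(len(img))                               # matched pairs lie on the diagonal
    i2t = -log_softmax(logits, axis=1)[diag, diag].mean()    # image -> text direction
    t2i = -log_softmax(logits, axis=0)[diag, diag].mean()    # text -> image direction
    return float(0.5 * (i2t + t2i))

def total_loss(y0, f0, o_img, o_txt, alpha: float = 0.5) -> float:
    """L = L1 + alpha * L2: global term plus weighted compressed-local term."""
    if alpha < 0:
        raise ValueError("alpha must be >= 0")
    l1 = contrastive_loss(y0, f0)        # global (category) representations
    l2 = contrastive_loss(o_img, o_txt)  # compressed local sub-representations
    return l1 + alpha * l2

rng = np.random.default_rng(0)
k, d = 4, 8
y0, f0 = rng.normal(size=(k, d)), rng.normal(size=(k, d))
o_img, o_txt = rng.normal(size=(k, d)), rng.normal(size=(k, d))
print(total_loss(y0, f0, o_img, o_txt, alpha=0.5))
```

With `alpha=0` the function returns only the global term, which matches the statement above that α = 0 reduces training to global-feature alignment.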
[0081] The above scheme employs a multi-representation vector alignment training method to jointly train the modules of the multimodal model through contrastive learning. This enables the multimodal model to quickly and accurately learn to align global and local features between images and text, between images, and between texts. Using the trained multimodal model for multimodal information retrieval can significantly improve the accuracy of information retrieval. Tests show that, compared with traditional multimodal models, the multimodal model trained using the above method has a significant advantage in cross-modal retrieval of fine-grained features, with retrieval accuracy exceeding that of traditional models by more than 10%.
[0082] In one implementation, step S122 determines the first feature and the second feature based on the feature extraction results, including step S122a: determining the first feature and the second feature using the following formula:
[0083] u1 = y0 + αo0;
[0084] u2 = f0 + αo1;
[0085] Where u1 represents the vector of the first feature, u2 represents the vector of the second feature, y0 represents the vector of the global features of the retrieval information, o0 represents the vector of the compressed local features of the retrieval information, f0 represents the vector of the global features of the retrieved information, o1 represents the vector of the compressed local features of the retrieved information, α represents the reference weight, and α≥0.
[0086] It can be understood that the first feature is the comprehensive feature of the retrieval information, and the second feature is the comprehensive feature of the retrieved information. For example, in a refined retrieval scenario of "text-to-image search," the reference weight α of the local features can be 0.5. The retrieval information is the retrieval text, and the retrieved information is each image in the image database. During inference, the visual semantic model shown in Figure 3 can be used to perform feature extraction and local feature compression on the retrieval text and each retrieved image. Specifically, the image encoder is first used to encode a batch of retrieved images offline, retaining the image category representation f0 and multiple image sub-representations for all images. When the user inputs the retrieval text, the BERT-based text encoder can extract text features in real time, obtaining the text category representation y0 and multiple text sub-representations. Then, the MLP feature compression layer can be used to quickly compress the multiple image sub-representations and multiple text sub-representations, obtaining the compressed image sub-representation o1 and the compressed text sub-representation o0, respectively. Afterwards, the comprehensive text feature vector u1 of the retrieval text and the comprehensive image feature vector u2 of each retrieved image can be calculated using the above formula.
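The fusion formulas u1 = y0 + αo0 and u2 = f0 + αo1 can be illustrated with a small NumPy sketch. The function name `fuse` and the toy vectors are hypothetical; the same function covers the query (one vector) and a batch of N retrieved items (an N×d matrix) through broadcasting.

```python
import numpy as np

def fuse(global_feat, local_feat, alpha: float = 0.5) -> np.ndarray:
    """Comprehensive feature: global feature plus alpha times compressed local feature."""
    if alpha < 0:
        raise ValueError("alpha must be >= 0")
    return np.asarray(global_feat, dtype=float) + alpha * np.asarray(local_feat, dtype=float)

# Query side: u1 = y0 + alpha * o0
y0 = np.array([1.0, 0.0, 2.0])
o0 = np.array([0.0, 2.0, 2.0])
u1 = fuse(y0, o0, alpha=0.5)
print(u1)  # [1. 1. 3.]

# Candidate side: u2 = f0 + alpha * o1, computed for N = 2 retrieved items at once
f0 = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
o1 = np.array([[2.0, 0.0, 0.0], [4.0, 4.0, 4.0]])
U2 = fuse(f0, o1, alpha=0.5)
print(U2.shape)  # (2, 3)
```

Because the candidate-side features depend only on the stored representations and α, they can be computed ahead of time for a fixed α and reused across queries.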
[0087] The method described above for determining the comprehensive features of the retrieval information and the retrieved information has simple execution logic and requires little computation, thus further improving retrieval efficiency. Furthermore, this calculation method takes the actual retrieval scenario into account, so that the retrieval results better match that scenario and are more accurate, and the solution is more widely applicable.
[0088] In one implementation, step S130, selecting the first retrieved information that matches the retrieval information from the multiple retrieved information based on the difference between the first feature and the second feature of each retrieved information, includes the following steps: step S131, calculating the inner product between the vector of the first feature and the vector of the second feature of each retrieved information to obtain multiple inner products corresponding to the multiple retrieved information; step S132, determining the retrieved information corresponding to the minimum value among the multiple inner products as the first retrieved information.
[0089] Taking the refined retrieval scenario of "text-to-image search" as an example, during the retrieval process, the method in step S122a above can be used to obtain the comprehensive text feature vector u1 of the search text and the comprehensive image feature vector u2 of each of the N images to be searched in the image database. Then, the inner product of the comprehensive text feature vector u1 and the comprehensive image feature vector u2 of each image to be searched can be calculated to obtain N inner products. After that, the minimum inner product among the N inner products can be determined. The image corresponding to the minimum inner product can be used as the retrieval result, output and displayed on the user interface.
[0090] It can be understood that the vector inner product accurately reflects how consistent two vectors are in direction, and thus accurately reflects the similarity between the comprehensive text feature vector u1 and the comprehensive image feature vector u2 of each retrieved image. Furthermore, this method reduces the computational resources required for feature comparison, especially when searching massive databases, resulting in faster searches. It can therefore further improve the efficiency and accuracy of multimodal information retrieval.
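Steps S131 and S132 can be sketched as a single matrix-vector product followed by a selection. The sketch below follows step S132 as written and selects the minimum inner product by default; the `pick="max"` option is a hypothetical addition for setups in which a larger inner product is read as higher similarity. The function name and the toy vectors are illustrative only.

```python
import numpy as np

def select_match(u1: np.ndarray, U2: np.ndarray, pick: str = "min"):
    """Return (index, scores), where scores[n] is the inner product of u1 and U2[n]."""
    scores = U2 @ u1                      # step S131: N inner products in one product
    if pick == "min":
        idx = int(np.argmin(scores))      # step S132 as written in the description
    elif pick == "max":
        idx = int(np.argmax(scores))      # hypothetical alternative selection rule
    else:
        raise ValueError("pick must be 'min' or 'max'")
    return idx, scores

u1 = np.array([1.0, 0.0])                 # comprehensive feature of the query
U2 = np.array([[2.0, 0.0],                # comprehensive features of N = 3 candidates
               [0.5, 1.0],
               [-1.0, 3.0]])
idx, scores = select_match(u1, U2)
print(scores)  # [ 2.   0.5 -1. ]
print(idx)     # 2
```

Computing all N inner products as one matrix-vector product is what keeps the comparison cheap when the candidate features are stored as a single N×d array.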
[0091] This application also provides a multimodal information retrieval device. As shown in Figure 4, the multimodal information retrieval device 400 includes:
[0092] The first acquisition module 410 is used to acquire retrieval information;
[0093] The extraction module 420 is used to extract the first feature of the retrieval information and the second feature of each of the multiple retrieved information, wherein the first feature and the second feature are determined based on the global feature of the corresponding information, the local feature of the corresponding information, and the reference weight of the local feature of the corresponding information in the current retrieval scenario;
[0094] The retrieval module 430 is used to filter out first retrieved information that matches the retrieval information from multiple retrieved information based on the difference between the first feature and the second feature of each retrieved information, wherein the difference between the second feature of the first retrieved information and the first feature is less than or equal to the difference between the second feature of any second retrieved information and the first feature, and the second retrieved information is different from the first retrieved information.
[0095] In one embodiment, the extraction module 420 includes:
[0096] The feature extraction submodule is used to extract features from the retrieval information and the retrieved information using a multimodal model to obtain the feature extraction results, which include the global features and local features of the retrieval information and of the retrieved information.
[0097] The first determining submodule is used to determine the first feature and the second feature based on the feature extraction results. The multimodal model includes a feature compression module, which is used to compress the local features of the retrieval information and the local features of the retrieved information.
[0098] In one implementation, the retrieval information and the retrieved information each include an image or text. The multimodal model further includes an image encoding module and a text encoding module, and the feature extraction submodule includes:
[0099] The image encoding unit is used to input the image into the image encoding module when the retrieval information or the retrieved information includes an image, so as to obtain the global image features and local image features of the image;
[0100] The text encoding unit is used to input the text into the text encoding module when the retrieval information or the retrieved information includes text, so as to obtain the global text features and local text features of the text.
[0101] The compression alignment unit is used to compress local image features and / or local text features using the feature compression module to obtain compressed local image features and / or local text features.
[0102] In one embodiment, the multimodal information retrieval device 400 of this application further includes:
[0103] The second acquisition module is used to acquire the training dataset, which includes image-text pair sample data, and the image-text pair sample data includes image samples and text samples.
[0104] The training module is used to train the multimodal model based on the training dataset and a preset loss function. The preset loss function includes a first loss term and a second loss term. The value of the first loss term is related to the similarity between the global image features and global text features inferred by the multimodal model for the current image-text pair sample data. The value of the second loss term is related to the similarity between the local image features and local text features inferred by the multimodal model for the current image-text pair sample data.
[0105] In one implementation, the preset loss function is expressed using the following formula:
[0106] L = L1 + αL2;
[0107] Where L represents the preset loss function, L1 represents the first loss term, L2 represents the second loss term, α represents the reference weight, and α≥0.
[0108] In one embodiment, the retrieval module 430 includes:
[0109] The vector calculation unit is used to calculate the inner product between the vector of the first feature and the vector of the second feature of each retrieved information, so as to obtain multiple inner products corresponding to multiple retrieved information.
[0110] The second determining unit is used to determine the retrieved information corresponding to the minimum value among the multiple inner products as the first retrieved information.
[0111] In one implementation, the first determining submodule is configured to:
[0112] The first and second features are determined using the following formula:
[0113] u1 = y0 + αo0;
[0114] u2 = f0 + αo1;
[0115] Where u1 represents the vector of the first feature, u2 represents the vector of the second feature, y0 represents the vector of the global features of the retrieval information, o0 represents the vector of the compressed local features of the retrieval information, f0 represents the vector of the global features of the retrieved information, o1 represents the vector of the compressed local features of the retrieved information, α represents the reference weight, and α≥0.
[0116] As shown in Figure 5, this embodiment also provides a terminal device 500, including: at least one processor 510 (only one processor is shown in Figure 5), a memory 520, and a computer program 530 stored in the memory 520 and executable on the at least one processor 510. When the processor 510 executes the computer program 530, it implements the steps of the multimodal information retrieval method described above.
[0117] In this embodiment, the terminal device may include, but is not limited to, a processor and a memory. Figure 5 is merely an example of a terminal device and does not constitute a limitation on the terminal device, which may include more or fewer components than illustrated, combine certain components, or use different components. The processor may be a Central Processing Unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
[0118] It should be noted that the information interaction and execution process between the above-mentioned devices / modules are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.
[0119] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional modules is merely an example. In practical applications, the functions described above can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. The functional modules in the embodiments can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules can be implemented in hardware or as software functional modules. Furthermore, the specific names of the functional modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0120] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, can implement the steps in the above-described multimodal information retrieval method.
[0121] This application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps in the above-described multimodal information retrieval method.
[0122] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A multimodal information retrieval method, characterized by comprising: acquiring retrieval information; extracting a first feature of the retrieval information and a second feature of each of a plurality of pieces of retrieved information, wherein the first feature and the second feature are determined based on the global features of the corresponding information, the local features of the corresponding information, and the reference weights of the local features of the corresponding information in the current retrieval scenario; and filtering, from the plurality of pieces of retrieved information and based on the difference between the first feature and the second feature of each piece of retrieved information, first retrieved information that matches the retrieval information, wherein the difference between the second feature of the first retrieved information and the first feature is less than or equal to the difference between the second feature of any second retrieved information and the first feature, and the second retrieved information is different from the first retrieved information.
2. The multimodal information retrieval method according to claim 1, characterized in that extracting the first feature of the retrieval information and the second feature of each of the plurality of pieces of retrieved information comprises: performing feature extraction on the retrieval information and the retrieved information using a multimodal model to obtain feature extraction results, wherein the feature extraction results include global features and local features of the retrieval information and of the retrieved information; and determining the first feature and the second feature based on the feature extraction results; wherein the multimodal model includes a feature compression module configured to compress the local features of the retrieval information and the local features of the retrieved information.
3. The multimodal information retrieval method according to claim 2, characterized in that the retrieval information and the retrieved information each include an image or text, and the multimodal model further includes an image encoding module and a text encoding module; and performing feature extraction on the retrieval information and the retrieved information using the multimodal model to obtain the feature extraction results comprises: if the retrieval information or the retrieved information includes an image, inputting the image into the image encoding module to obtain global image features and local image features of the image; if the retrieval information or the retrieved information includes text, inputting the text into the text encoding module to obtain global text features and local text features of the text; and compressing the local image features and / or the local text features using the feature compression module to obtain compressed local image features and / or compressed local text features.
4. The multimodal information retrieval method according to claim 2, characterized in that, before performing feature extraction on the retrieval information and the retrieved information using the multimodal model, the method further comprises: obtaining a training dataset, wherein the training dataset includes image-text pair sample data, and the image-text pair sample data includes image samples and text samples; and training the multimodal model based on the training dataset and a preset loss function, wherein the preset loss function includes a first loss term and a second loss term, the value of the first loss term is related to the similarity between the global image features and the global text features inferred by the multimodal model for the current image-text pair sample data, and the value of the second loss term is related to the similarity between the local image features and the local text features inferred by the multimodal model for the current image-text pair sample data.
5. The multimodal information retrieval method according to claim 4, characterized in that the preset loss function is expressed by the following formula: L = L1 + αL2; where L represents the preset loss function, L1 represents the first loss term, L2 represents the second loss term, α represents the reference weight, and α ≥ 0.
6. The multimodal information retrieval method according to any one of claims 1-5, characterized in that filtering the first retrieved information that matches the retrieval information from the plurality of pieces of retrieved information based on the difference between the first feature and the second feature of each piece of retrieved information comprises: calculating the inner product between the vector of the first feature and the vector of the second feature of each piece of retrieved information to obtain a plurality of inner products corresponding to the plurality of pieces of retrieved information; and determining the retrieved information corresponding to the minimum value among the plurality of inner products as the first retrieved information.
7. The multimodal information retrieval method according to any one of claims 2-5, characterized in that determining the first feature and the second feature based on the feature extraction results comprises: determining the first feature and the second feature using the following formulas: u1 = y0 + αo0; u2 = f0 + αo1; where u1 represents the vector of the first feature, u2 represents the vector of the second feature, y0 represents the vector of the global features of the retrieval information, o0 represents the vector of the compressed local features of the retrieval information, f0 represents the vector of the global features of the retrieved information, o1 represents the vector of the compressed local features of the retrieved information, α represents the reference weight, and α ≥ 0.
8. A multimodal information retrieval apparatus, characterized by comprising: a first acquisition module configured to acquire retrieval information; an extraction module configured to extract a first feature of the retrieval information and a second feature of each of a plurality of pieces of retrieved information, wherein the first feature and the second feature are determined based on the global features of the corresponding information, the local features of the corresponding information, and the reference weights of the local features of the corresponding information in the current retrieval scenario; and a retrieval module configured to filter, from the plurality of pieces of retrieved information and based on the difference between the first feature and the second feature of each piece of retrieved information, first retrieved information that matches the retrieval information, wherein the difference between the second feature of the first retrieved information and the first feature is less than or equal to the difference between the second feature of any second retrieved information and the first feature, and the second retrieved information is different from the first retrieved information.
9. A terminal device, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the multimodal information retrieval method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the multimodal information retrieval method according to any one of claims 1 to 7.
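For illustration only, the arithmetic recited in claims 5 to 7 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names (fuse_features, inner_product, select_first_retrieved, combined_loss) are hypothetical, the global vectors and compressed local vectors are assumed to have already been produced by some multimodal model and to share the same dimension (which the addition in claim 7 requires), and the selection step follows the wording of claim 6 literally by taking the minimum inner product.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fuse_features(global_vec: Vector, local_vec: Vector, alpha: float) -> List[float]:
    """Claim 7 form: u = global + alpha * compressed_local, with alpha >= 0.

    Both inputs are assumed (not stated in the source) to be plain lists of
    floats of equal length.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if len(global_vec) != len(local_vec):
        raise ValueError("global and compressed local vectors must have equal length")
    return [g + alpha * l for g, l in zip(global_vec, local_vec)]


def inner_product(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x * y for x, y in zip(a, b))


def select_first_retrieved(u1: Vector, candidates: Sequence[Vector]) -> int:
    """Claim 6 as worded: return the index of the candidate whose inner
    product with u1 is the minimum. Ties resolve to the lowest index."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = [inner_product(u1, u2) for u2 in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


def combined_loss(l1: float, l2: float, alpha: float) -> float:
    """Claim 5 form: L = L1 + alpha * L2, with alpha >= 0.

    l1 and l2 are assumed to be already-computed scalar loss terms; how they
    are derived from feature similarity is not specified here.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return l1 + alpha * l2
```

With alpha set to 0, fuse_features returns the global feature unchanged and combined_loss returns the first loss term alone, so in this sketch the reference weight directly controls how much the local features contribute in a given retrieval scenario.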