Context-aware dynamic latent space transformation interactive image retrieval method
By constructing a context-aware dynamic latent space transformation mechanism, the feature representations of images and text are dynamically adjusted, solving the problem that the feature space cannot flexibly adapt to user intent in existing methods, and achieving more efficient and accurate image retrieval results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG GONGSHANG UNIVERSITY
- Filing Date
- 2026-05-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing interactive image retrieval methods rely on a fixed multimodal feature space, which makes it difficult to flexibly adapt to changes in user intent and to fully capture fine-grained information, thus affecting the accuracy of retrieval results.
By constructing a context-aware dynamic latent space transformation mechanism, a conditional feature transformation matrix is generated using a lightweight network and a large language model. This dynamically adjusts the text and image feature representations, learns the relevance after feature transformation, and achieves dynamic alignment of cross-modal features.
The method improves the accuracy, robustness, and efficiency of retrieval by dynamically adjusting the feature distribution according to user intent. It moves beyond the limitations of static representation, shifting from static matching to a dynamically evolving semantic space.
Figure CN122240870A_ABST
Description
Technical Field
[0001] This invention relates to the field of image retrieval, and more particularly to an interactive image retrieval method based on context-aware dynamic latent space transformation.
Background Technology
[0002] Search technology plays a vital role in daily life, especially in this era of information overload, where users need to quickly find the information they need through retrieval. With the widespread adoption of the internet and mobile devices, the amount of information has increased dramatically, particularly on e-commerce platforms, where the growing demand for product images and other information has led to a growing need for more accurate and efficient search technologies.
[0003] Traditional image retrieval methods typically rely on users providing textual descriptions for their queries, with the retrieval system then matching these descriptions against images and other information in a database. These methods have significant drawbacks. They usually depend on a single-round text query, requiring the user to provide a complete and precise description in a single attempt in order to retrieve the target image from a large pool of candidate images. Such a detailed and comprehensive description is very difficult for users to supply in one attempt, so the retrieved results usually fall short of the user's actual goal.
[0004] To address this limitation, interactive text-to-image retrieval has recently been proposed. This method supports multi-turn interactions between the user and the system, allowing the user to progressively refine the query, clarify their intent, and guide the retrieval process toward the target image. In this way, the system can optimize the retrieval results in each round of dialogue, enabling users to obtain the information they need more precisely. Ideally, the retrieval space should also change with the user's intent during the interaction. For example, if the user emphasizes "the color of the dog" in a certain round of dialogue, then the latent feature space should emphasize color features more. If the feature space does not adjust dynamically to semantic changes, the model can only restate the query on a fixed representation manifold. Existing interactive text-to-image retrieval methods fall short in this regard. While multi-turn dialogues help users express their needs more precisely, existing methods typically rely on a fixed multimodal feature space for retrieval, which makes it difficult for them to adapt flexibly to gradual changes in user intent and to fully capture the fine-grained information in user needs. This static paradigm limits the effective use of contextual information, hinders the model from adapting accurately to semantic changes in the dialogue, and thus reduces the accuracy of retrieval results.
Summary of the Invention
[0005] The purpose of this invention is to address the shortcomings of existing technologies by proposing a context-aware dynamic latent space transformation-based interactive image retrieval method. This method dynamically adjusts the feature representations of text and images based on user query feedback, amplifying the parts of the features that are relevant to the user's intent, thereby improving the accuracy of interactive image retrieval.
[0006] The objective of this invention is achieved through the following technical solution: an interactive image retrieval method based on context-aware dynamic latent space transformation, comprising the following steps:
[0007] Construct an image and text feature interaction model, which includes:
[0008] The text features of the user's initial query and the initial text features of the historical dialogue content are extracted separately using a text encoder.
[0009] The initial visual features of the image are extracted using an image encoder;
[0010] A large language model is used to condense the dialogue context from the historical dialogue content, and dialogue context features are extracted from it;
[0011] A lightweight network is used to process the dialogue context features to obtain a feature transformation matrix. Another lightweight network is used to process the concatenated features of the initial query text and context features to obtain the feature transformation magnitude.
[0012] The initial text features and initial visual features of the historical dialogue content are transformed using the feature transformation matrix and feature transformation magnitude, respectively. Finally, the model outputs the transformed text query features and image features.
[0013] The common space learning algorithm is used to learn the correlation between two modalities after feature transformation, and the model is trained in an end-to-end manner.
[0014] Users execute queries in different rounds. In rounds following the initial query, the trained model is used to map text and images to a shared common feature space, and interactive text-to-image retrieval is performed in this common feature space.
[0015] Furthermore, the step of extracting the text features of the user's initial query and the initial text features of the historical dialogue content through the text encoder includes:
[0016] The initial query input by the user and the subsequent dialogue history used for the query are encoded using the BLIP text encoder. All dialogue content up to the turn to be queried in the dialogue history is concatenated to form the query text, which includes the question and answer of the current turn and the user's initial query. The concatenated query text is then encoded using the BLIP model's text encoder.
[0017] Furthermore, the process of using a large model to condense the dialogue context from historical dialogue content and extract dialogue context features includes:
[0018] The large language model is used to generate each round of dialogue context containing the dynamic changes in the user's search intent through prompt words, which serves as the transformation condition text; the BLIP text encoder is used to encode the transformation condition text to obtain the dialogue context features.
[0019] Furthermore, the step of using a lightweight network to process the dialogue context features to obtain the feature transformation matrix includes: processing the current dialogue context features using two lightweight multilayer perceptron modules to obtain two vectors, respectively; performing a dimensional transformation on the two vectors to obtain two d×r matrices, where d represents the dimension of the feature space and r represents the rank of the matrix; normalizing each of the two matrices along its columns; and constructing the complete transformation matrix from the two normalized matrices.
[0020] Furthermore, the process of using a lightweight network to process the dialogue context features to obtain the feature transformation matrix includes two lightweight multilayer perceptron modules: each consisting of two linear layers and a GeLU activation function, with the state size of the middle hidden layer set to 2d and the state size of the output of the second linear layer set to dr, where r represents the rank of the matrix and is a constant much smaller than d.
[0021] Furthermore, processing the concatenated features of the initial query text and context features using another lightweight network to obtain the feature transformation magnitude includes:
[0022] The dialogue context features and the initial query text features are concatenated. A lightweight multilayer perceptron module consisting of three linear layers and two GeLU activation functions is used to process the concatenated features. The output is then normalized using the Sigmoid function to obtain the feature transformation magnitude in the range of 0 to 1.
[0023] Furthermore, performing feature transformation on the initial text features and initial visual features using the feature transformation matrix and feature transformation magnitude, respectively, includes:
[0024] For each round of dialogue, a different transformation matrix and transformation magnitude are obtained. The feature to be transformed is multiplied by the transformation matrix and the transformation magnitude to obtain an increment related to the dialogue context. Finally, the increment is added back to the original feature through a residual connection, so that the text and visual representations are dynamically aligned according to the constantly changing user intent.
[0025] Furthermore, the method of using a common space learning algorithm to learn the correlation between two modalities after feature transformation includes:
[0026] The transformed text features are aligned with their corresponding context-transformed image features through a context-guided contrastive loss L, which is specifically expressed as follows:
[0027]
[0028] Here, the loss is scaled by a temperature parameter, B represents the number of samples in the batch, and the similarity function computes cosine similarity.
[0029] According to another aspect of the specification, an interactive image retrieval device based on context-aware dynamic latent space transformation is also provided, including a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it implements the aforementioned interactive image retrieval method based on context-aware dynamic latent space transformation.
[0030] According to another aspect of the specification, a computer-readable storage medium is also provided, on which a program is stored, which, when executed by a processor, implements the aforementioned interactive image retrieval method based on context-aware dynamic latent space transformation.
[0031] The beneficial effects of this invention are:
[0032] This invention proposes a context-driven dynamic feature space transformation mechanism. By designing a context-aware low-rank projector, it directly and adaptively reshapes the semantic geometry at the feature space level, rather than relying on the static embedding representation in traditional methods. Specifically, this method, for the first time, explicitly decomposes the feature space transformation into two decoupled processes: "transformation direction" and "transformation intensity." A conditionalized feature transformation matrix is generated using the context-aware low-rank projector to characterize the semantic transformation direction guided by user intent. Simultaneously, an adaptive regulator is introduced to dynamically control the transformation amplitude based on contextual changes, thereby achieving fine-tuning of the feature space. Based on this, image and text features are aligned and compared in a conditional common feature space, enabling the model to explicitly model cross-modal semantic relationships under different contextual conditions. Furthermore, by introducing low-rank constraints, this invention decomposes the full matrix transformation in the original high-dimensional feature space into a combination of two low-dimensional matrices, significantly improving the expressive power of features with almost no increase in time overhead, making it highly efficient for large-scale retrieval scenarios. Compared to existing methods that only perform matching in a fixed feature space, this invention can dynamically adjust the feature distribution according to the user's intent, making the semantic space more focused on the dimensions relevant to the current retrieval target. This fundamentally breaks through the limitations of static representation and realizes a paradigm shift from "static matching" to "dynamic semantic space evolution," significantly improving the accuracy and robustness of retrieval.
[0033] This invention eliminates the need for complex concept detection processes, is suitable for interactive image retrieval tasks, and has good versatility and practical application value.
Description of the Drawings
[0034] Figure 1 is an overall framework diagram of the context-aware dynamic latent space transformation interactive image retrieval method provided in an embodiment of the present invention;
[0035] Figure 2 is a performance comparison chart for different dialogue sources provided in an embodiment of the present invention;
[0036] Figure 3 is a performance comparison chart for integration into existing frameworks provided in an embodiment of the present invention;
[0037] Figure 4 is a visualization of the dynamic space transformation provided in an embodiment of the present invention;
[0038] Figure 5 is a schematic diagram of an interactive image retrieval device provided in an embodiment of the present invention.
Detailed Implementation
[0039] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.
[0040] For retrieval tasks, the features of each modality are fundamental, so feature quality is particularly important. BLIP is a widely used vision-language model in the field of image-text retrieval. It consists of two main modules: a text encoder and an image encoder. The text encoder is based on the BERT model and learns rich semantic information in text through pre-training on large-scale unlabeled corpora (such as Wikipedia and BookCorpus). The image encoder adopts a modern visual model such as the Vision Transformer (ViT) and learns the visual features of images through pre-training on large-scale image datasets (such as COCO and VisualGenome). Through the joint work of these two modules, BLIP achieves deep cross-modal understanding and matching between images and text. Therefore, we use the BLIP model pre-trained on these datasets to extract cross-modal features of images and text, thereby improving the accuracy and effectiveness of image-text retrieval.
(1) In interactive image retrieval, as shown in Figure 1, the initial query is a description. For subsequent queries, to accurately capture and preserve the user's complete query intent, the system relies not only on the content of the current dialogue round but also on the user's initial query and the previous dialogue history. Integrating all dialogue content into a unified query text keeps the contextual information comprehensive and consistent, which improves retrieval accuracy. Text feature extraction is the core step in this system, specifically:
[0041] (1-1) Both the initial query input by the user and the dialogue history used for subsequent queries are encoded using BLIP's text encoder. Specifically, encoding the initial query yields the initial query features, expressed as follows:
[0042]
[0043] (1-2) In interactive image retrieval, the content of each round of user dialogue is not used as a separate query. As shown in Figure 1, for a given round, the system concatenates the user's initial query with all dialogue content up to that round into a complete text query containing the entire dialogue history. Each round of dialogue includes a question from the system and a reply from the user. The final query text is constructed by joining these dialogue messages, so it includes the questions and answers up to and including the current round as well as the user's initial query, and fully reflects the context of the dialogue. Specifically, it can be represented as:
[0044]
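The concatenation in step (1-2) can be sketched as follows. This is a minimal illustration only: the function name `build_query_text`, the space separator, and the ordering (initial query first, then each question followed by its answer) are assumptions, since the text states only that the initial query and every round's question and answer are joined into one query text.

```python
from typing import List, Tuple


def build_query_text(initial_query: str,
                     dialogue: List[Tuple[str, str]],
                     turn: int) -> str:
    """Join the initial query with all (question, answer) pairs up to `turn`.

    The separator and ordering are assumptions made for this sketch.
    """
    if turn < 0 or turn > len(dialogue):
        raise ValueError("turn must be between 0 and len(dialogue)")
    parts = [initial_query]
    for question, answer in dialogue[:turn]:
        parts.append(question)  # question posed by the system
        parts.append(answer)    # reply given by the user
    return " ".join(parts)


# Hypothetical two-round dialogue used only for illustration.
dialogue = [
    ("What color is the dog?", "Black."),
    ("Where is the dog?", "On the grass."),
]
query_round_0 = build_query_text("A dog playing outside.", dialogue, turn=0)
query_round_2 = build_query_text("A dog playing outside.", dialogue, turn=2)
```

With `turn=0` the query text is just the initial description; with `turn=2` it contains the description followed by both question-answer pairs, which is the text that would then be passed to the text encoder.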
[0045] (1-3) Once the concatenated query text is obtained, the next step is to encode it using the text encoder of the BLIP model. The text encoder converts the query text into a high-dimensional text feature vector. This feature contains the semantic information of the user's query text, which helps the system understand the user's intent. Specifically, it is represented as follows:
[0046]
[0047] (2) In addition to text features, the images must also be encoded into feature vectors. BLIP's image encoder, based on the ViT model, can effectively extract visual features from images. Encoding the candidate images yields a visual feature vector for each image. These feature vectors capture the visual content and semantic information in the image, and together with the text features, they constitute a cross-modal feature representation, providing the foundation for image-text matching tasks. The specific representation is as follows:
[0048]
[0049] (3-1) To ensure that the model can dynamically adjust the direction of feature transformation according to the user's intent during interactive image retrieval, we introduce a dialogue context generation mechanism. Unlike traditional static feature processing methods, our goal is to guide the feature transformation process through dynamically generated conditions. In this mechanism, we use a large language model (such as GPT-5) to generate the dialogue context for each round of dialogue using designed prompts. This serves as conditional text for feature transformation. These texts are generated based on the historical context of the current dialogue, reflecting dynamic changes in user intent and helping the model focus on relevant content during feature transformation. The generated conditional text is used as semantic conditions for subsequent feature transformations, ensuring accurate alignment of image and text features in each round of dialogue, as shown below:
[0050]
[0051] Table 1 shows the text prompts designed in this invention, which can condense all the content of the dialogue history into a concise dialogue context.
[0052]
[0053] (3-2) After obtaining the transformation condition text for this round, we encode it in the same way as in step (1), where the condition features are transformation conditions generated based on the current dialogue context. This has a strong contextual dependency and is a summary and distillation of the dialogue history. Ultimately, we obtain the transformation condition features for this round, which are used to subsequently generate our transformation matrix and transformation magnitude, as shown below:
[0054]
[0055] (4) In interactive image retrieval tasks, ensuring dynamic adaptation between feature transformations and user intent is crucial. To achieve this, we introduce a conditional transformation matrix to adjust the direction of feature transformations. This matrix not only adjusts according to changes in the dialogue context but also maintains consistency with the current dialogue context through an adaptive mechanism. This method ensures that the model can flexibly respond to different user queries and dialogue scenarios, thereby achieving more accurate image-text matching in multi-turn dialogues. After obtaining the transformation conditional features from step (3), we can generate our conditional transformation matrix, which acts as a hypernetwork that can change according to changes in the input conditions, ensuring that the direction of transformation is always related to the content of the dialogue context. The specific steps are as follows:
[0056] (4-1) To improve computational efficiency and avoid excessive computational overhead, we designed two lightweight MLP (Multilayer Perceptron) modules. Each module consists of two linear layers and a GeLU (Gaussian Error Linear Units) activation function, which helps enhance the nonlinear expressive power of the model, thereby improving the expressive power of feature transformation. Although the two MLPs are consistent in structure and input conditions, their parameters are independent of each other, thus enabling them to learn different mapping functions. One MLP projects features onto a context-dependent low-rank subspace to characterize the direction of semantic transformation, while the other MLP is responsible for remapping them back to the original feature space, thereby achieving effective modulation of the feature space. This decoupled design avoids the constraint of representational power, allowing the model to perform context-aware feature transformations more flexibly. The state size of the intermediate hidden layer is set to 2d to expand the network's representation space, enabling it to better capture complex contextual information. The state size of the output of the second linear layer is dr, where r is a constant, much smaller than d, used to control the rank of the low-rank matrix, thereby effectively reducing computational complexity and the risk of overfitting. By inputting the current dialogue context features obtained in step (3), we can obtain two vectors a and b, which will serve as the basis for generating the subsequent transformation matrix, as shown below:
[0057]
[0058] (4-2) To generate the low-rank transformation matrix, a dimensional transformation (reshaping) operation is applied to the vectors a and b. Specifically, the two vectors are converted into two matrices of dimension d×r, where d represents the dimension of the feature space and r represents the rank of the matrix. Choosing a low-rank matrix not only reduces the consumption of computational resources but also improves the model's adaptability to changing contexts in multi-turn dialogues. In this way, the transformation matrix can reflect changes in the current dialogue context and adjust the feature space accordingly, thereby improving the matching of images and text.
[0059] (4-3) To ensure the numerical stability of the transformation matrix and prevent gradient explosion or gradient vanishing, column-wise normalization is applied to the two matrices. Normalization standardizes the numerical range of each column, so that each column of the transformation matrix maintains a stable numerical scale during training. This improves the stability of model training, accelerates convergence, and reduces fluctuations during training. After normalization, the matrices can be used more effectively for the feature projection and reprojection operations, as shown below:
[0060]
[0061] Finally, the complete transformation matrix is obtained from the two normalized matrices and applied to the feature space of images and text. One normalized matrix projects the original features into a low-rank subspace, thereby compressing the feature space and highlighting the relevant feature dimensions. The other normalized matrix then reprojects the features from the low-rank space back to the high-dimensional space, forming a context-conditioned directional transformation. This transformation acts on the semantic manifold along the axes defined by the normalized matrices, ensuring that the deformation direction of the features matches the current dialogue context and further enhancing the semantic alignment of the multimodal features. In each round of dialogue, this transformation matrix is adjusted dynamically to reflect the user intent and dialogue context, thereby achieving more precise image-text matching.
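Steps (4-1) to (4-3) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the weights are random rather than trained, the names `TwoLayerMLP`, `low_rank_factors`, and `apply_direction` are hypothetical, and the order in which the two normalized factors are composed (project with A, then map back with the transpose of B) is an assumption because the original formula is not reproduced in this text.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


class TwoLayerMLP:
    """Linear(d -> 2d), GeLU, Linear(2d -> d*r); weights are random here."""

    def __init__(self, d, r, rng):
        self.w1 = rng.normal(0.0, 0.02, size=(d, 2 * d))
        self.b1 = np.zeros(2 * d)
        self.w2 = rng.normal(0.0, 0.02, size=(2 * d, d * r))
        self.b2 = np.zeros(d * r)

    def __call__(self, context):
        return gelu(context @ self.w1 + self.b1) @ self.w2 + self.b2


def normalize_columns(m, eps=1e-8):
    # scale every column to (approximately) unit L2 norm
    return m / (np.linalg.norm(m, axis=0, keepdims=True) + eps)


def low_rank_factors(context, mlp_a, mlp_b, d, r):
    a = mlp_a(context)  # vector of length d*r
    b = mlp_b(context)  # same input, independent parameters
    return normalize_columns(a.reshape(d, r)), normalize_columns(b.reshape(d, r))


def apply_direction(x, A, B):
    # project into the rank-r subspace, then map back to d dimensions;
    # equal to x @ (A @ B.T) without ever forming the d x d matrix
    return (x @ A) @ B.T


d, r = 16, 4
rng = np.random.default_rng(0)
mlp_a, mlp_b = TwoLayerMLP(d, r, rng), TwoLayerMLP(d, r, rng)
context = rng.normal(size=d)   # stand-in for the dialogue context feature
A, B = low_rank_factors(context, mlp_a, mlp_b, d, r)
x = rng.normal(size=d)         # stand-in for a text or image feature
delta = apply_direction(x, A, B)
```

Applying the two factors in sequence costs on the order of 2dr multiplications per feature instead of d² for a full matrix, which is where the efficiency of the low-rank design comes from when r is much smaller than d.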
[0062] (5) Step (4) yields the transformation matrix, which controls the direction of the feature transformation. The magnitude of the transformation is equally important. The transformation magnitude is obtained by learning the semantic differences between the initial query and the current dialogue context. The specific method is as follows:
[0063] We concatenate the dialogue context features obtained in step (3) and the initial query features obtained in step (1) to obtain a feature vector of dimension 2d that contains both the context features and the initial query features. We also design another lightweight MLP consisting of three linear layers and two GeLU activation functions. The concatenated features are input into this MLP, and the output is then normalized by a sigmoid function to obtain a value in the range of 0 to 1, which serves as the transformation magnitude. It is specifically expressed as follows:
[0064]
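The magnitude branch of step (5) can be sketched as follows. This is an illustrative sketch only: the class name `MagnitudeMLP`, the hidden width, the random weights, and the choice of a single scalar output are all assumptions; the text specifies only three linear layers, two GeLU activations, a sigmoid, and a 2d-dimensional concatenated input.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class MagnitudeMLP:
    """Three linear layers with GeLU after the first two, then a sigmoid."""

    def __init__(self, d, hidden, rng):
        sizes = [2 * d, hidden, hidden, 1]  # hidden width and scalar output are assumed
        self.weights = [rng.normal(0.0, 0.02, size=(m, n))
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]

    def __call__(self, context, query):
        h = np.concatenate([context, query], axis=-1)  # 2d-dimensional input
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            if i < last:
                h = gelu(h)
        return sigmoid(h)  # squashed into the open interval (0, 1)


d = 16
rng = np.random.default_rng(0)
gate = MagnitudeMLP(d, hidden=d, rng=rng)
context_feature = rng.normal(size=d)  # stand-in for the dialogue context feature
query_feature = rng.normal(size=d)    # stand-in for the initial query feature
alpha = gate(context_feature, query_feature)
```

A value of `alpha` near 0 leaves the features almost unchanged, while a value near 1 applies the context-dependent increment at close to full strength.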
[0065] (6) For the t-th round of dialogue, a different transformation matrix and transformation magnitude are obtained. The features to be transformed are multiplied by the transformation matrix and the transformation magnitude to yield a context-dependent increment, which is then added back to the original features through a residual connection. This process dynamically aligns the text and visual representations with the evolving user intent: the low-rank projection provides a direction-aware transformation of the feature space, and the adaptive gating adjusts the modulation intensity according to contextual differences. Together, these components ensure a stable and expressive evolution of the multimodal feature space across dialogue rounds, as detailed below:
[0066]
[0067] Here, layer normalization is applied, which helps maintain feature stability and training efficiency.
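The residual update of step (6) can be sketched as follows. Because the original formula is not reproduced in this text, the exact placement of the layer normalization (here applied to the sum of the feature and its increment, without learnable scale or shift) is an assumption, as are the function names and the random stand-in values for the factors A and B and the magnitude alpha.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # normalize over the feature dimension; no learnable affine parameters
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def transform(features, A, B, alpha):
    # context-dependent increment: low-rank direction scaled by the magnitude
    delta = alpha * ((features @ A) @ B.T)
    # residual connection followed by layer normalization (assumed placement)
    return layer_norm(features + delta)


rng = np.random.default_rng(0)
d, r = 16, 4
A = rng.normal(size=(d, r))
A /= np.linalg.norm(A, axis=0, keepdims=True)  # column-normalized factor
B = rng.normal(size=(d, r))
B /= np.linalg.norm(B, axis=0, keepdims=True)
alpha = 0.3                                    # stand-in magnitude in (0, 1)

text_feature = rng.normal(size=d)              # one query feature
image_features = rng.normal(size=(5, d))       # five candidate images
text_t = transform(text_feature, A, B, alpha)
images_t = transform(image_features, A, B, alpha)
```

The same factors and magnitude are applied to the text feature and to every candidate image feature, so both modalities are moved into the same context-conditioned space for that round.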
[0068] (7) After step (6), we obtain the text query features and image features after the transformation in the t-th round. We use the common space learning algorithm to learn the correlation between the two modalities after feature transformation. Finally, we train the model in an end-to-end manner. The specific operation is as follows:
[0069] In each training batch, dialogue samples are randomly drawn from different dialogue rounds, and each sample contains different dialogue content. Within these sample-specific contexts, the text and visual features are transformed accordingly. We define a context-guided contrastive loss that aligns the transformed text features with their corresponding context-transformed image features, specifically as follows:
[0070]
[0071] Here, the loss is scaled by a temperature parameter, B represents the number of samples in the batch, and the similarity function computes cosine similarity. Unlike traditional contrastive objectives that assume a single, static embedding space, this formulation aligns features within a set of context-adaptive subspaces, each dynamically adjusted by its own reconstructed dialogue condition.
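One common form of such a loss is a softmax-based (InfoNCE-style) contrastive objective, sketched below. This is an assumption rather than the exact loss of the invention, whose formula is not reproduced in this text. The sketch takes a B×B matrix of cosine similarities as input, where entry (i, j) is assumed to be the similarity between the i-th transformed text feature and the j-th image feature transformed under the i-th sample's context; the function name and the temperature value 0.07 are also assumptions.

```python
import numpy as np


def contrastive_loss(similarity, temperature=0.07):
    """Text-to-image contrastive loss over a (B, B) cosine-similarity matrix.

    Row i holds the similarities of text i to all images in the batch;
    the matching image for text i sits on the diagonal.
    """
    logits = similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


batch = 4
uniform = contrastive_loss(np.zeros((batch, batch)))  # no pair is preferred
aligned = contrastive_loss(np.eye(batch))             # matching pairs score highest
```

When every pair is equally similar, the loss equals log B; it approaches zero as each text feature becomes much more similar to its own image than to the other images in the batch. A symmetric image-to-text term could be added in the same way.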
[0072] (8) Training the feature transformation module enables the model to adjust the feature representations of text and images flexibly according to the dialogue content. With the module trained in step (7), text and images can be mapped to a shared common feature space designed to increase the relevance of image and text features to the current dialogue context. This mapping not only captures the semantic information in images and text more effectively, but also, guided by the dialogue content, amplifies the parts of the feature space related to user intent, thereby improving the accuracy and effectiveness of retrieval. By dynamically adjusting the feature space in this way, the system adaptively optimizes the matching of images and text in each round of dialogue. The specific operation is as follows:
[0073] (8-1) We use the same method in step (2) to encode all the images in the candidate library through an image encoder to obtain all the initial image features.
[0074] (8-2) For the user's first query, the initial query text features are obtained by encoding the user's initial query through a text encoder using the method in step (1).
[0075] (8-3) Calculate the similarity between the initial text query and all candidate images, sort the candidate images according to the similarity, and return the top-k search results.
[0076] (8-4) After the initial query is completed, the first round of interaction begins. Relevant questions are posed through the large model, and the user answers as the responder. All dialogue content and the initial query are concatenated to obtain the query text for this round, and a text encoder is used to encode it to obtain the initial query features for this round.
[0077] (8-5) In this stage, the feature transformation module defined in step (6) and trained in step (7) plays a crucial role. We use this module to transform the initial query text features and all candidate image features of this round, mapping them to the same common feature space. This space is a conditional space generated from the current dialogue context, which ensures semantic alignment between text and image features. In this way, the text query features and image features of this round are jointly mapped to a new space that highlights the features relevant to the current dialogue content while suppressing features irrelevant to the dialogue context. The feature transformation better aligns the semantic information between images and text, thereby improving the accuracy of subsequent retrieval.
[0078] (8-6) Calculate the cosine similarity between the current text query and all candidate images in the common feature space of this round. Then, sort all candidate images in descending order of cosine similarity and return the top-k images as the retrieval results, thereby realizing interactive cross-modal retrieval from text to image. Each subsequent round of interactive retrieval follows the same principle.
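The ranking in steps (8-3) and (8-6) can be sketched as follows. The function name `top_k_by_cosine` and the toy two-dimensional features are hypothetical and serve only to illustrate cosine-similarity ranking; in the method itself the inputs would be the transformed query and image features of the current round.

```python
import numpy as np


def top_k_by_cosine(query, candidates, k):
    """Return the indices and scores of the k candidates most similar to the query."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q                                   # cosine similarity per candidate
    order = np.argsort(-scores, kind="stable")[:k]   # descending by similarity
    return order, scores[order]


# Toy candidate "image" features, one row per image.
candidates = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.1])
top_ids, top_scores = top_k_by_cosine(query, candidates, k=2)
```

Here candidate 0 points almost exactly in the query direction and candidate 2 is the next closest, so they are returned in that order as the top-2 results.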
[0079] Intra-domain comparison: The effectiveness of the proposed method is evaluated in a multi-round interactive text-image retrieval task. In this task, the text module of the feature extraction network BLIP is first fine-tuned to enable it to extract features from dialogue-style text queries. The feature extraction network remains fixed during the training of other modules. Furthermore, this invention uses Recall@1, Recall@5, and Recall@10 accuracy as retrieval metrics. The publicly available dataset VisDial is used as our evaluation dataset. As shown in Table 2, the proposed method achieves state-of-the-art performance across all retrieval metrics and rounds. Moreover, the performance improvement becomes more significant with increasing round number, validating the effectiveness of dynamic feature changes in interactive retrieval. Figure 2 Performance evaluations were also conducted on various dialogue variants, each originating from a different dialogue model. It can be seen that our method achieves the highest performance across all dialogue variants, demonstrating that the framework can effectively adapt to multiple dialogue formats. Notably, our proposed context-aware spatial modulator can be integrated as a plug-and-play module into existing interactive retrieval models, such as... Figure 3 As shown, when this invention is integrated into existing methods, all retrieval performance is improved, fully demonstrating the plug-and-play capability of the proposed method.
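The Recall@k metric used above can be computed as sketched below: it is the fraction of queries whose target image appears among the top-k retrieved results. The function name and the toy rankings are hypothetical and are not results from the evaluation.

```python
from typing import List, Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                target_ids: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose target image is within the top-k results."""
    hits = sum(1 for ranked, target in zip(ranked_ids, target_ids)
               if target in ranked[:k])
    return hits / len(target_ids)


# Toy rankings for three queries (best match first) and their target images.
ranked: List[List[int]] = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
targets = [1, 1, 2]
recall_1 = recall_at_k(ranked, targets, k=1)
recall_2 = recall_at_k(ranked, targets, k=2)
recall_3 = recall_at_k(ranked, targets, k=3)
```

In this toy example only the third query ranks its target first, two of the three targets appear within the top 2, and all targets appear within the top 3.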
[0080] In addition, as shown in Figure 4, we present a t-SNE visualization example with 100 images of dogs. A sample image is marked with a red border, and its local area is magnified to illustrate the spatial transformation. When the input dialogue context is "black dog," the feature space is transformed into a new space that emphasizes color distinction, so that images with the "black" attribute are brought closer together in the transformed space. Similarly, when the dialogue context is "dog on grass," the space is adjusted to prioritize the scene, grouping images in the grassy environment together. This demonstrates how the input context dynamically transforms the representation of images, highlighting the model's ability to adjust its feature space according to the context.
[0081] Table 2 Comparison of Multi-round Search Performance
[0082]
[0083] Corresponding to the aforementioned embodiment of the interactive image retrieval method based on context-aware dynamic latent spatial transformation, the present invention also provides an embodiment of an interactive image retrieval device based on context-aware dynamic latent spatial transformation.
[0084] Referring to Figure 5, the present invention provides a context-aware interactive image retrieval device based on dynamic latent spatial transformation, comprising a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it implements the context-aware interactive image retrieval method based on dynamic latent spatial transformation described in the above embodiment.
[0085] An embodiment of the context-aware dynamic latent spatial transformation interactive image retrieval device provided by this invention can be applied to any device with data processing capabilities, such as a computer. The device embodiment can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by the processor of the data processing device on which it resides loading the corresponding computer program instructions from non-volatile memory into memory for execution. From a hardware perspective, Figure 5 shows a hardware structure diagram of a device with data processing capabilities on which the interactive image retrieval device based on context-aware dynamic latent spatial transformation provided by the present invention resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 5, the data processing device in the embodiment may also include other hardware depending on its actual function, which will not be described in detail here.
[0086] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0087] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0088] This invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements a context-aware, dynamic latent spatial transformation-based interactive image retrieval method as described in the above embodiments.
[0089] The computer-readable storage medium can be an internal storage unit of any data processing device described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of any data processing device, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc., equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units and external storage devices of any data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.
[0090] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the aforementioned context-aware dynamic latent space transformation interactive image retrieval method.
[0091] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and embodiments are to be considered exemplary only, and the true scope and spirit of this application are indicated by the claims.
[0092] It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application. This application is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.
Claims
1. A context-aware, dynamic latent spatial transformation-based interactive image retrieval method, characterized in that it includes the following steps: constructing an image and text feature interaction model, in which: the text features of the user's initial query and the initial text features of the historical dialogue content are extracted separately using a text encoder, and the initial visual features of the image are extracted using an image encoder; a large model is used to condense the dialogue context from the historical dialogue content, and dialogue context features are extracted; a lightweight network is used to process the dialogue context features to obtain a feature transformation matrix; another lightweight network is used to process the concatenation of the initial query text features and the context features to obtain a feature transformation magnitude; the initial text features of the historical dialogue content and the initial visual features are transformed using the feature transformation matrix and the feature transformation magnitude, respectively; and the model outputs the transformed text query features and image features; using a common space learning algorithm to learn the correlation between the two modalities after feature transformation, and training the model in an end-to-end manner; and, as the user executes queries in different rounds, in rounds following the initial query, using the trained model to map text and images to a shared common feature space and performing interactive text-to-image retrieval in this common feature space.
2. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 1, characterized in that extracting the text features of the user's initial query and the initial text features of the historical dialogue content using a text encoder includes: encoding the initial query input by the user and the subsequent dialogue history used for the query with the BLIP text encoder, wherein all dialogue content up to the turn to be queried in the dialogue history is concatenated to form the query text, which includes the question and answer of the current turn and the user's initial query, and the concatenated query text is then encoded using the text encoder of the BLIP model.
3. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 1, characterized in that using a large model to condense the dialogue context from the historical dialogue content and extracting dialogue context features includes: prompting a large language model to generate, for each round, a dialogue context that reflects the dynamic changes in the user's search intent, which serves as the transformation condition text; and encoding the transformation condition text with the BLIP text encoder to obtain the dialogue context features.
4. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 1, characterized in that using a lightweight network to process the dialogue context features to obtain the feature transformation matrix includes: processing the current dialogue context features using two lightweight multilayer perceptron modules to obtain two vectors respectively; performing a dimensionality transformation on the two obtained vectors to obtain two d×r matrices, where d represents the dimension of the feature space and r represents the rank of the matrix; normalizing the two matrices along the column dimension to obtain two normalized matrices; and constructing the complete transformation matrix from the normalized matrices.
5. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 4, characterized in that the two lightweight multilayer perceptron modules used to process the dialogue context features to obtain the feature transformation matrix each consist of two linear layers and a GeLU activation function, wherein the size of the intermediate hidden layer is set to 2d and the output size of the second linear layer is d×r, where r represents the rank of the matrix and is a constant much smaller than d.
6. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 1, characterized in that using another lightweight network to process the concatenation of the initial query text features and the context features to obtain the feature transformation magnitude includes: concatenating the dialogue context features and the initial query text features; processing the concatenated features with a lightweight multilayer perceptron module consisting of three linear layers and two GeLU activation functions; and normalizing the output with the Sigmoid function to obtain a feature transformation magnitude in the range of 0 to 1.
7. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 1, characterized in that transforming the initial text features of the historical dialogue content and the initial visual features using the feature transformation matrix and the feature transformation magnitude, respectively, includes: obtaining a different transformation matrix and transformation magnitude for each round of dialogue; multiplying the feature to be transformed by the transformation matrix and the transformation magnitude to obtain an increment related to the dialogue context; and adding the increment back to the original feature through a residual connection, so that the text and visual representations are dynamically aligned according to the constantly changing user intent.
8. The context-aware, dynamic latent spatial transformation-based interactive image retrieval method according to claim 1, characterized in that using a common space learning algorithm to learn the correlation between the two modalities after feature transformation includes: aligning the transformed text features with their corresponding context-conditioned image features through a context-guided contrastive loss L, the contrastive loss L being defined in terms of a temperature parameter, the number of samples B in the batch, and a function for calculating cosine similarity.
9. A context-aware, dynamic latent spatial transformation-based interactive image retrieval device, comprising a memory and one or more processors, wherein the memory stores executable code, characterized in that, when the processor executes the executable code, it implements the interactive image retrieval method based on context-aware dynamic latent spatial transformation as described in any one of claims 1-8.
10. A computer-readable storage medium having a program stored thereon, characterized in that, when the program is executed by a processor, it implements the interactive image retrieval method based on context-aware dynamic latent spatial transformation as described in any one of claims 1-8.
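The modules recited in claims 4 to 8 can be illustrated with the following forward-pass sketch in NumPy. It is only one possible reading under stated assumptions and not a definitive implementation: the low-rank form U·Vᵀ of the transformation matrix, unit-norm columns as the column normalization, the hidden sizes and scalar output of the magnitude network, the temperature value 0.07, the text-to-image InfoNCE form of the contrastive loss, and all function names are assumptions introduced here. The weights are random and untrained, and a single shared context is used for the whole toy batch purely to check shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # toy feature dimension d and rank r (r much smaller than d)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_mlp(sizes):
    # Random, untrained weights; a real model would learn these end to end.
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers):
    # Linear layers with a GELU after every layer except the last.
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = gelu(x)
    return x

# Claim 5: two MLPs, each with two linear layers, hidden size 2d, output size d*r.
mlp_u = init_mlp([d, 2 * d, d * r])
mlp_v = init_mlp([d, 2 * d, d * r])
# Claim 6: three linear layers and two GELUs (hidden sizes, scalar output assumed).
mlp_a = init_mlp([2 * d, 2 * d, d, 1])

def transform_matrix(context):
    u = mlp(context, mlp_u).reshape(d, r)
    v = mlp(context, mlp_v).reshape(d, r)
    u = u / np.linalg.norm(u, axis=0, keepdims=True)  # unit-norm columns (assumed)
    v = v / np.linalg.norm(v, axis=0, keepdims=True)
    return u @ v.T  # assumed low-rank construction, rank at most r

def magnitude(context, query):
    z = mlp(np.concatenate([context, query]), mlp_a)[0]
    return 1.0 / (1.0 + np.exp(-z))  # Sigmoid keeps the magnitude in (0, 1)

def transform(feats, t, alpha):
    # Claim 7: context-dependent increment added back through a residual.
    return feats + alpha * (feats @ t.T)

def contrastive_loss(text, image, tau=0.07):
    # Assumed text-to-image InfoNCE over cosine similarities; row i matches row i.
    tn = text / np.linalg.norm(text, axis=1, keepdims=True)
    im = image / np.linalg.norm(image, axis=1, keepdims=True)
    logits = tn @ im.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy batch of B = 4 text/image pairs sharing one dialogue context.
context, query = rng.normal(size=d), rng.normal(size=d)
texts, images = rng.normal(size=(4, d)), rng.normal(size=(4, d))
t = transform_matrix(context)
alpha = magnitude(context, query)
loss = contrastive_loss(transform(texts, t, alpha), transform(images, t, alpha))
```

Because the transformation matrix is built from two d×r factors, it has d×r parameters per factor rather than d×d, which is consistent with r being a constant much smaller than d.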