A conversational mixed-modal retrieval method, system, device, and medium
By combining intent deconstruction, vector augmentation, and visual auditing agents, the problems of low recall and poor interactive experience in existing multimodal retrieval technologies are solved, achieving efficient and accurate retrieval and adaptive iterative optimization for complex multimodal queries.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI UNIV OF TECH
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing information retrieval technologies suffer from low recall and poor relevance when dealing with complex multimodal queries, especially when faced with vague, abstract descriptions and visual feature constraints from users. They also lack the ability to self-check search results, resulting in a poor user experience.
A composite query strategy is adopted, which uses an intent deconstruction agent to generate structured hard constraints and unstructured soft semantics. The query vector is enriched by a vector augmentation agent. A visual audit agent is used to verify the consistency between the text and images. Under closed-loop control, the search conditions are dynamically adjusted to achieve iterative re-search.
It improves the visual accuracy and intent fidelity of search results, avoids the occurrence of "zero results", maintains the recall continuity of exploratory recommendations, and enhances the robustness and interactive experience of multimodal retrieval.
Figure CN122240861A_ABST
Description
Technical Field
[0001] This invention relates to the field of computer information processing technology, and specifically to a conversational mixed-modal retrieval method, system, device, and medium.
Background Technology
[0002] With the rapid development of the mobile internet and digital content industry, the data modalities in massive information databases are becoming increasingly rich. Users' methods of acquiring information are undergoing a profound transformation from simple, structured keyword searches to unstructured, natural language-based conversational interactions. In vertical fields with strong visual reliance, such as fashion e-commerce, home design, and digital copyright image libraries, users often express complex search needs through lengthy, vague, and highly subjective natural language descriptions, sometimes even combined with reference images.
[0003] Current mainstream information retrieval technologies are based on two architectures: the traditional retrieval architecture built on keyword parsing and inverted indexes, and the vector semantic retrieval architecture built on deep learning dual-stream encoders. Although vector retrieval has made progress in handling cross-modal semantic similarity, in large-scale industrial applications it still faces the following technical challenges when dealing with complex long-tail search intents:
First, users' everyday queries are usually short and conversational and contain many abstract descriptions of style or atmosphere, whereas the target objects in the underlying database are usually described by structured lists of objective physical parameters. When a traditional dual-stream model directly computes the cosine similarity between the two embedding vectors, the recall and relevance of long-tail queries often drop significantly, because the two feature distributions differ in density in the high-dimensional multimodal space and their semantic spaces are not aligned.
Second, existing systems typically process complex queries with attribute constraints through a pipeline: a candidate set is first retrieved by vector similarity, and the extracted attributes are then used for hard filtering with Boolean logic. In multi-round natural interactions, however, some user constraints have soft boundaries, and the intent recognition module may be slightly biased. The veto power of Boolean logic easily leads to "zero results," blocking the user's exploratory search path. Such systems also cannot handle tolerant matching scenarios with high recommendation value, such as "the core style matches, but a secondary attribute such as color differs slightly."
Third, existing systems generally lack fine-grained consistency verification at the visual level. Most current multimodal retrieval systems employ an open-loop control architecture: after calculating similarity scores from global features of text or images, the system presents the Top-K results directly to the user. The retrieval system cannot "pre-check" its own results, and cannot reason about whether the recalled images actually satisfy the fine-grained visual constraints that the user emphasized in the text. This loss of local features often exposes irrelevant search results directly to the end user, degrading the user experience.
[0004] In summary, while existing retrieval schemes are usable for specific, well-defined structured query tasks, their overall retrieval performance falls short of application requirements when handling complex multimodal composite intents, including those involving high-order semantics and subjective visual feature constraints. Therefore, a conversational mixed-modal retrieval method is needed that can deconstruct complex intents, provide generative semantic enhancement, and achieve closed-loop error correction through agent introspection.
Summary of the Invention
[0005] The present invention proposes a conversational mixed-modal retrieval method, system, device, and medium, which can solve at least one of the technical problems described in the background art.
[0006] To achieve the above objectives, the present invention adopts the following technical solution:
A conversational mixed-modal retrieval method includes the following steps:
S1. Receive a user session request containing unstructured natural language text and optional image information, and perform standardized preprocessing on the user session request;
S2. Using an intent deconstruction agent, decouple the user session request along multiple dimensions based on a preset instruction template to generate a composite query strategy that includes structured hard constraints, unstructured soft semantics, and negative constraints;
S3. Using a vector augmentation agent, generate a virtual feature augmentation description based on the unstructured soft semantics, and convert the feature augmentation description into a text embedding vector; when the user session request contains the optional image information, fuse and encode the text embedding vector and the visual feature block sequence of the image information in a multimodal cross-attention space to generate an enhanced query vector;
S4. Using a retrieval execution agent, perform similarity recall in a preset target object vector data area based on the enhanced query vector to obtain an initial candidate object set; and perform elastic weighted scoring and sorting on the initial candidate object set based on the structured hard constraints and a preset attribute distance matrix to obtain a candidate list to be audited;
S5. Using a visual auditing agent, dynamically generate visual question-and-answer prompts for the target objects in the candidate list to be audited based on the composite query strategy, call a visual-language model to perform visual reasoning verification, and calculate the image-text matching confidence of each candidate object;
S6. When the image-text matching confidence does not meet the preset consistency conditions, trigger the closed-loop control module, analyze the audit failure distribution, dynamically adjust the composite query strategy or the elastic weighting parameters, and return to iterative re-retrieval until the preset exit conditions are met or the maximum number of iterations is reached.
[0007] As a preferred embodiment of the conversational hybrid modal retrieval method of the present invention, the step of generating a composite query strategy in step S2 includes: inputting the standardized preprocessed user session request into the fine-tuned large language model; guiding the large language model through a preset structured thought chain instruction template to extract the objective attributes that must be satisfied as the structured hard constraints; extracting natural language word clusters describing style, scene and atmosphere as the unstructured soft semantics; and extracting a feature set with exclusive negative prefixes as the negative constraints.
[0008] As a preferred embodiment of the conversational hybrid modal retrieval method of the present invention, the step of generating the enhanced query vector in step S3 includes: using a generative language model to perform feature divergence on the unstructured soft semantics to generate extended text containing core attributes and applicable scenario features, and extracting the text embedding vector through a text encoder; when the user session request does not contain image information, the text embedding vector is directly used as the enhanced query vector; when the user session request contains image information, the local block feature sequence of the image information is extracted through a visual encoder; using the text embedding vector as the query matrix, and the local block feature sequence as the key matrix and value matrix, the attention weight of the text intent to the local region of the image is calculated through a cross-attention mechanism, and the enhanced query vector is generated through residual connection and normalization processing.
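The cross-attention fusion described above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: it assumes a single attention head, identity query/key/value projections, and a layer normalization without learned gain or bias, and the function names `cross_attention_fuse` and `enhanced_query_vector` are hypothetical.

```python
import math


def cross_attention_fuse(text_vec, patch_feats, eps=1e-6):
    """Fuse one text embedding (query) with image patch features (keys/values).

    Assumptions: single head, identity W_q/W_k/W_v, no learned norm parameters.
    """
    d = len(text_vec)
    scale = math.sqrt(d)
    # Scaled dot-product score of the text query against every patch key.
    scores = [sum(q * k for q, k in zip(text_vec, patch)) / scale
              for patch in patch_feats]
    # Numerically stable softmax over the patches.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the patch values.
    context = [sum(w * patch[j] for w, patch in zip(weights, patch_feats))
               for j in range(d)]
    # Residual connection followed by layer normalization.
    resid = [q + c for q, c in zip(text_vec, context)]
    mean = sum(resid) / d
    var = sum((x - mean) ** 2 for x in resid) / d
    fused = [(x - mean) / math.sqrt(var + eps) for x in resid]
    return fused, weights


def enhanced_query_vector(text_vec, patch_feats=None):
    """Text-only requests use the text embedding directly; otherwise fuse."""
    if not patch_feats:
        return list(text_vec)
    fused, _ = cross_attention_fuse(text_vec, patch_feats)
    return fused
```

The returned attention weights show which local image regions the text intent attends to; a production model would add learned projections and multiple heads.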
[0009] As a preferred embodiment of the conversational hybrid modal retrieval method of the present invention, the step S4 of performing elastic weighted scoring and ranking includes: for any candidate object in the initial candidate object set, calculating the cosine similarity between its corresponding target feature representation and the enhanced query vector to obtain a vector similarity score; retrieving the attribute metadata of the candidate object, calculating the normalized distance between the attribute metadata and the structured hard constraint based on a preset attribute distance matrix, and converting it into an attribute soft matching score; wherein, the attribute distance matrix defines the tolerance for deviation between different attribute categories; determining whether the text or tag of the candidate object contains words in the negation constraint, and if so, generating a negation constraint penalty term; weighting and summing the vector similarity score and the attribute soft matching score according to preset weights, and subtracting the negation constraint penalty term to obtain the comprehensive score of the candidate object, and sorting it in descending order according to the comprehensive score.
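As a rough sketch of the elastic weighted scoring described above, the composite score can be written as a weighted sum of the vector similarity and the attribute soft matching score, minus a negative constraint penalty. The weights (0.6 / 0.4), the penalty value, the candidate record layout, and the distance matrix entries below are illustrative assumptions, not values from the patent.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def soft_attribute_score(hard_constraints, metadata, distance_matrix):
    """Average of (1 - normalized distance) over the requested attributes."""
    if not hard_constraints:
        return 1.0
    total = 0.0
    for attr, wanted in hard_constraints.items():
        actual = metadata.get(attr)
        if actual == wanted:
            dist = 0.0
        else:
            pairs = distance_matrix.get(attr, {})
            # Unknown pairs are treated as maximally distant (assumption).
            dist = pairs.get((wanted, actual), pairs.get((actual, wanted), 1.0))
        total += 1.0 - dist
    return total / len(hard_constraints)


def composite_score(query_vec, cand, hard, negatives, distance_matrix,
                    w_vec=0.6, w_attr=0.4, penalty=0.5):
    sim = cosine(query_vec, cand["vector"])
    attr = soft_attribute_score(hard, cand["attributes"], distance_matrix)
    text = " ".join([cand.get("text", "")] + cand.get("tags", [])).lower()
    # Penalize candidates whose text or tags contain a rejected feature.
    neg = penalty if any(term.lower() in text for term in negatives) else 0.0
    return w_vec * sim + w_attr * attr - neg


def rank_candidates(query_vec, candidates, hard, negatives, distance_matrix):
    scored = [(composite_score(query_vec, c, hard, negatives, distance_matrix), c)
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With a distance of 0.2 between "red" and "burgundy", a burgundy item is only mildly penalized instead of being vetoed, which is the behavior the soft matching score is meant to provide.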
[0010] As a preferred embodiment of the conversational hybrid modal retrieval method of the present invention, the step of performing visual reasoning verification in step S5 includes: parsing the composite query strategy, generating a hard blocking question based on the negation constraint to determine whether it contains exclusive features; generating a consistency scoring question based on the unstructured soft semantics to evaluate style and detail matching degree; inputting the image of the candidate object into the visual-language model in combination with the hard blocking question and the consistency scoring question respectively to obtain the reasoning judgment result; performing a weighted calculation on the reasoning judgment result to generate the image-text matching confidence score, and when the judgment result of the hard blocking question is that it contains negation features, the image-text matching confidence score is directly set to zero.
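The audit logic above can be illustrated with a small sketch in which the visual-language model is replaced by an injected callable. The prompt wording, the `ask_vlm(image, question)` interface, and the assumption that scoring questions return a number between 0 and 1 are all hypothetical; the patent does not specify them.

```python
def build_audit_prompts(strategy):
    """Derive hard blocking and consistency scoring questions from the strategy."""
    hard = [f"Does the image contain {term}? Answer yes or no."
            for term in strategy["negative_constraints"]]
    soft = [f"On a scale from 0 to 1, how well does the image match '{phrase}'?"
            for phrase in strategy["soft_semantics"]]
    return hard, soft


def match_confidence(image, strategy, ask_vlm, weights=None):
    """Return the image-text matching confidence for one candidate image."""
    hard, soft = build_audit_prompts(strategy)
    # A single positive answer to a hard blocking question zeroes the score.
    for question in hard:
        if ask_vlm(image, question).strip().lower().startswith("yes"):
            return 0.0
    if not soft:
        return 1.0
    # Clamp each answer to [0, 1] before the weighted average.
    scores = [min(1.0, max(0.0, float(ask_vlm(image, q)))) for q in soft]
    weights = weights or [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

Checking the blocking questions first also avoids spending model calls on scoring questions for candidates that are already excluded.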
[0011] As a preferred embodiment of the conversational hybrid modal retrieval method of the present invention, the step S6 of dynamically adjusting the composite query strategy or elastic weighting parameters includes: statistically analyzing the reasons for the failure of candidate objects that failed the current round of verification; if the proportion of attribute mismatch exceeds a first set threshold, increasing the calculation weight of the attribute soft matching score in the elastic weighted scoring to ensure that the next round of retrieval adheres to structured attributes; if the proportion of style mismatch exceeds a second set threshold, calling the large language model to rewrite the unstructured soft semantics and supplementing the missing features fed back in the visual audit into the rewritten soft semantics; if the retrieval recall quantity for consecutive set rounds is lower than a preset minimum quantity threshold, automatically identifying and removing the constraint condition with the lowest current weight to perform a downgraded search.
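The three adjustment rules above can be sketched as one function over a small state dictionary. The thresholds, the weight step, the `patience` parameter (the number of consecutive low-recall rounds), and the state layout are illustrative assumptions; `rewrite` stands in for the large language model call and defaults to simple string concatenation.

```python
from collections import Counter


def adjust_strategy(state, failures, recall_counts,
                    attr_threshold=0.5, style_threshold=0.5,
                    min_recall=5, patience=2, weight_step=0.1, rewrite=None):
    """Return an adjusted copy of the retrieval state for the next round."""
    new_state = {
        "w_attr": state["w_attr"],
        "soft_semantics": state["soft_semantics"],
        "constraints": dict(state["constraints"]),
    }
    reasons = Counter(f["reason"] for f in failures)
    total = len(failures)
    # Rule 1: mostly attribute mismatches -> weight attributes more heavily.
    if total and reasons["attribute"] / total > attr_threshold:
        new_state["w_attr"] = min(1.0, state["w_attr"] + weight_step)
    # Rule 2: mostly style mismatches -> rewrite the soft semantics with the
    # missing features reported by the visual audit.
    if total and reasons["style"] / total > style_threshold:
        missing = sorted({m for f in failures for m in f.get("missing", [])})
        if missing:
            rewrite = rewrite or (lambda text, feats: text + ", " + ", ".join(feats))
            new_state["soft_semantics"] = rewrite(state["soft_semantics"], missing)
    # Rule 3: recall too low for several consecutive rounds -> drop the
    # constraint with the lowest weight (downgraded search).
    recent = recall_counts[-patience:]
    if (len(recent) == patience and all(n < min_recall for n in recent)
            and new_state["constraints"]):
        weakest = min(new_state["constraints"], key=new_state["constraints"].get)
        del new_state["constraints"][weakest]
    return new_state
```

Returning a copy rather than mutating the input keeps each round's state available for later inspection.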
[0012] A conversational mixed-modal retrieval system, comprising:
A multimodal request receiving and preprocessing module, used to receive user session requests containing unstructured natural language text and optional image information and to perform standardized preprocessing;
An intent deconstruction agent module, used to decouple the user session request along multiple dimensions based on a preset instruction template, and to generate a composite query strategy that includes structured hard constraints, unstructured soft semantics, and negative constraints;
A vector augmentation agent module, used to generate virtual feature augmentation descriptions based on the unstructured soft semantics and convert them into text embedding vectors, and, when image information is present, to fuse the text embedding vectors and the visual feature block sequence of the image information in the multimodal cross-attention space to generate an enhanced query vector;
A retrieval execution agent module, used to perform similarity recall based on the enhanced query vector, and to perform elastic weighted scoring and ranking in combination with the structured hard constraints and the attribute distance matrix to obtain a candidate list to be audited;
A visual auditing agent module, used to dynamically generate visual question-and-answer prompts, call the visual-language model to perform inference verification on the candidate list to be audited, and calculate the image-text matching confidence;
An adaptive closed-loop control module, used to analyze the failure distribution when the confidence does not meet the conditions, dynamically adjust the query strategy or scoring weights, and trigger the retrieval execution agent module to perform iterative re-retrieval.
In another aspect, the present invention also discloses a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the steps of the method described above.
[0013] In another aspect, the present invention also discloses a computer device, including a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the steps of the method described above.
[0014] The beneficial effects of this invention are:
This invention introduces a vector augmentation agent that combines intent deconstruction with hypothetical feature generation by a large model. The agent automatically expands the user's vague and abstract "short text soft semantics" into "long text feature clusters" that include specific materials, details, and scenes. At the same time, it uses a cross-attention mechanism to dynamically anchor key local features of the image, which enriches the semantic information density of the query vector and effectively addresses the misalignment between the underlying data objects and the user's subjective intent in the multi-dimensional feature space.
By introducing an elastic weighted ranking mechanism and an attribute distance matrix during the retrieval execution phase, this invention reduces the rigidity of the traditional Boolean search architecture. When faced with user memory bias, natural language error tolerance, or retrieval space shrinkage caused by the superposition of multiple attributes, the system imposes a smooth deduction penalty on "non-fatal attribute deviations" instead of a veto, avoiding the dead end of "zero results" and maintaining the recall continuity of exploratory recommendation.
This invention introduces a "visual auditing agent" and an adaptive iteration mechanism. Before the results are presented to the user, the system automatically checks the consistency between the text and the images using the fine-grained visual reasoning capabilities of the visual-language model. When the check fails, the system analyzes the failure distribution, feeds the findings back, and adjusts the search conditions and weight parameters for a self-healing retry. This closed feedback loop filters out off-topic results, improving the visual accuracy and intent fidelity of the search results.
Attached Figure Description
[0015] Figure 1 is a schematic diagram of the overall architecture of the conversational mixed-modal retrieval method, system, device, and medium of the present invention.
[0016] Figure 2 is the main flowchart of the conversational mixed-modal retrieval method provided in an embodiment of the present invention.
Figure 3 is a schematic diagram of the internal structure of the intent deconstruction agent and the state machine of the large-model instruction flow, provided in an embodiment of the present invention.
Figure 4 is a diagram illustrating the tensor dimension changes and the feature fusion principle of the cross-modal cross-attention mechanism in the vector augmentation agent, provided in an embodiment of the present invention.
Figure 5 is a diagram of the discrimination logic tree of the visual auditing agent and the closed-loop adaptive feedback workflow, provided in an embodiment of the present invention.
Figure 6 is a schematic diagram of the hardware structure of an electronic device for implementing the mixed-modal retrieval method, provided in an embodiment of the present invention.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are some embodiments of the present invention, but not all embodiments.
[0018] Example 1
Referring to Figure 1, this embodiment provides a conversational mixed-modal retrieval system. In terms of deployment, the logical architecture of this system preferably adopts containerized microservice orchestration: each core functional unit is independently encapsulated, and the units communicate internally through lightweight remote procedure calls or message queue protocols.
[0019] Specifically, as shown in Figure 1, the logical layer of this system mainly includes the following core service components:
(1) Access Gateway Service: As the main entry point connecting this system with external clients, it receives concurrent multimodal requests and performs access authentication, traffic shaping, and dynamic request routing and distribution.
(2) Central Control Service: As the global state maintenance center of the system, it maintains the multi-round session state machine. This service does not execute specific algorithm calculations itself; instead, based on the feedback state of the current interaction round, it schedules its subordinate agents and collects their results, coordinating the entire process flow.
(3) Intelligent Agent Cluster Service: This is the core computing cluster of the present invention, including an intent deconstruction agent, a vector augmentation agent, a retrieval execution agent, and a visual auditing agent. Each agent is allocated its own compute and memory resource pools, supporting elastic scaling in high-concurrency scenarios.
(4) Model Inference Microservice: It encapsulates the underlying hardware operators of the Large Language Model (LLM) and the Visual-Language Model (VLM), exposes standard inference API interfaces to the upper layers, and improves hardware concurrency throughput through dynamic batching.
[0020] Furthermore, to support millisecond-level hybrid retrieval under both hard attribute constraints and soft semantic features, this invention abandons the traditional single relational table structure and constructs a three-part heterogeneous storage system of "high-dimensional vector - relational attribute - session cache", as detailed below:
1. High-dimensional vector storage area configuration
In this embodiment, the system employs a distributed vector database for persistent storage of the multimodal feature vectors of target objects. For massive amounts of target object data, the system constructs a graph-based approximate nearest neighbor (ANN) index in the vector database. By building a nearest neighbor graph in the high-dimensional space, this index structure supports millisecond-level vector similarity retrieval even with billions of data points.
The target object vector data area is generated during the system's offline data construction phase. Specifically, the system first performs unified feature extraction on the target object's image, text description, and tag information, and generates a fixed-dimensional feature vector through a preset multimodal encoding model. The feature vector, along with the object identification information, is then written into the vector database, and an approximate nearest neighbor index structure is constructed. This data area serves as the retrieval space during the vector recall phase of system operation.
To balance retrieval recall against memory consumption, this embodiment defines the following construction parameters for the graph index:
Maximum number of connections (M): This parameter determines the number of connected edges for each node in the graph structure. It needs to be configured to a reasonable value so that the connectivity density of the underlying graph paths meets the requirements of high-dimensional feature traversal.
[0021] Search depth: This parameter controls the maximum length of the dynamic candidate list during index construction. It needs to be configured to a reasonable value to ensure the convergence quality of the graph structure during the construction phase.
[0022] Distance metric algorithm: configured as cosine similarity. Since the image-text fusion vectors stored in this system are all normalized in the feature extraction stage, cosine similarity reduces to an inner product, which can be accelerated by hardware instruction sets.
The data schema of the high-dimensional vector storage area includes at least: an identifier field for mapping the primary key of the target object, a floating-point vector field for storing the fixed-dimensional enhanced features, and a Boolean filter flag field for logical soft deletion and fast listing / delisting filtering.
2. Relational metadata storage area design
The system uses a relational database to store the structured attribute metadata of the target objects to support "soft attribute matching" and filtering operations. Its core data table structure includes:
(1) Wide table for object attributes: Because target objects in different categories have sparse attributes, this system does not use a traditional fixed-column model with strict validation. Instead, it uses a binary data type that supports flexible nested structures to store key-value pairs. For example, features such as color, material, and style are stored as standardized attribute key-value sets. To improve the attribute lookup efficiency of the retrieval execution agent, the system creates an inverted path index on the path keys of this nested field, thereby reducing the query time complexity for attribute key-value pairs.
(2) Semantic synonym mapping table: used to smooth out differences in natural language expression during the preprocessing stage. The table structure includes at least: an original semantic word field, a standard normalized word field, and a confidence weight field.
[0023] 3. Session state cache design
To support contextual semantic inheritance and visual audit loop retries in multi-turn dialogues, the system uses an in-memory key-value database to maintain the global session state. The data structure is a list that stores the user request history of the most recent set number of rounds, the agent recognition status, and intermediate review results. A preset expiration time is configured to automatically clean up inactive sessions and release computing resources.
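The session cache behavior (a bounded per-session list plus an expiration time) can be imitated in-process for illustration. The sketch below is a stand-in for the in-memory key-value database, not a client for any particular product; the class name `SessionCache`, the default of five rounds, and the 30-minute expiration are assumptions.

```python
import time
from collections import deque


class SessionCache:
    """In-process stand-in for the session state cache (illustrative only)."""

    def __init__(self, max_rounds=5, ttl_seconds=1800, clock=time.monotonic):
        self.max_rounds = max_rounds
        self.ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def append(self, session_id, record):
        """Store one round's record and refresh the session's expiration."""
        self._evict_expired()
        entry = self._store.get(session_id)
        if entry is None:
            # deque(maxlen=...) drops the oldest round automatically.
            entry = {"rounds": deque(maxlen=self.max_rounds)}
            self._store[session_id] = entry
        entry["rounds"].append(record)
        entry["expires_at"] = self._clock() + self.ttl

    def history(self, session_id):
        """Return the retained rounds, oldest first, or [] if expired."""
        self._evict_expired()
        entry = self._store.get(session_id)
        return list(entry["rounds"]) if entry else []

    def _evict_expired(self):
        now = self._clock()
        expired = [sid for sid, e in self._store.items() if e["expires_at"] <= now]
        for sid in expired:
            del self._store[sid]
```

Each record could hold the request text, the parsed strategy, and intermediate audit results for that round, so a follow-up query can inherit earlier constraints.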
[0024] Example 2
Referring to Figure 2, this embodiment provides a conversational mixed-modal retrieval method, which specifically includes the following steps:
S1. Receive a user session request containing unstructured natural language text and optional image information, and perform standardized preprocessing on the user session request;
S2. Using an intent deconstruction agent, decouple the user session request along multiple dimensions based on a preset instruction template, and generate a composite query strategy that includes structured hard constraints, unstructured soft semantics, and negative constraints;
S3. Using a vector augmentation agent, generate a virtual feature augmentation description based on the unstructured soft semantics, and convert the feature augmentation description into a text embedding vector; when the user session request includes the optional image information, fuse and encode the text embedding vector and the visual feature block sequence of the image information in the multimodal cross-attention space to generate an enhanced query vector;
S4. Using a retrieval execution agent, perform similarity recall in the preset target object vector data area based on the enhanced query vector to obtain an initial candidate object set; and, based on the structured hard constraints and a preset attribute distance matrix, perform elastic weighted scoring and sorting on the initial candidate object set to obtain a candidate list to be audited;
S5. Using a visual auditing agent, dynamically generate visual question-and-answer prompts for the target objects in the candidate list to be audited based on the composite query strategy, call a visual-language model to perform visual reasoning verification, and calculate the image-text matching confidence of each candidate object;
S6. When the image-text matching confidence does not meet the preset consistency conditions, trigger the closed-loop control module, analyze the audit failure distribution, dynamically adjust the composite query strategy or the elastic weighting parameters, and perform iterative re-retrieval until the preset exit conditions are met or the maximum number of iterations is reached.
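The control flow of steps S1 to S6 can be summarized as a loop skeleton. This is a sketch under stated assumptions: each agent is passed in as a plain callable under a hypothetical key, the consistency condition is modeled as "at least `min_pass` candidates reach `threshold`", and the default weights are placeholders.

```python
def retrieve_with_audit(request, agents, threshold=0.7, min_pass=3, max_iters=3):
    """Run S1-S6 once, re-retrieving until enough candidates pass the audit."""
    cleaned = agents["preprocess"](request)                  # S1
    strategy = agents["deconstruct"](cleaned)                # S2
    weights = {"w_vec": 0.6, "w_attr": 0.4}                  # assumed defaults
    passed = []
    iteration = 0
    for iteration in range(1, max_iters + 1):
        query_vec = agents["augment"](strategy, cleaned)                # S3
        candidates = agents["retrieve"](query_vec, strategy, weights)   # S4
        audited = [(c, agents["audit"](c, strategy)) for c in candidates]  # S5
        passed = [c for c, conf in audited if conf >= threshold]
        # Exit when the consistency condition holds or iterations run out.
        if len(passed) >= min_pass or iteration == max_iters:
            break
        failed = [c for c, conf in audited if conf < threshold]
        strategy, weights = agents["adjust"](strategy, weights, failed)  # S6
    return passed, iteration
```

Keeping the agents as injected callables mirrors the description of a central control service that schedules agents without executing their algorithms itself.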
[0025] When the access gateway service receives a user session request from the client, step S1 is triggered. Because the user input contains unstructured natural language text and an optional reference image, this system incorporates a multimodal preprocessing pipeline responsible for converting heterogeneous raw data into a unified standard tensor format. Specifically:
1. Text data cleaning and standardization logic
The system first applies regular expressions to the input text to filter out meaningless special symbols, keeping only standard characters, letters, and numbers. Then, a preset word segmentation component is called to perform lexical segmentation, and a stop word list is loaded to filter out function words that carry no retrieval meaning. Finally, based on the semantic synonym mapping table in the relational metadata storage area, entity recognition and vocabulary normalization are performed to replace the colloquial words entered by the user with standard category words, eliminating ambiguity for the large language model in later steps.
2. Image data augmentation and normalization logic
If the user session request contains a reference image, the preprocessing pipeline first verifies the file header information and converts non-standard image formats into the three-channel RGB format. Next, it uses bilinear interpolation or an anti-aliased scaling algorithm to normalize the image resolution to the fixed size required by the visual encoding model. Finally, it performs tensor normalization to map the image pixel values to standard statistical intervals and generate a normalized image tensor input matrix.
In this embodiment, the standardized preprocessing is executed based on a preset set of multimodal data processing rules, which includes text cleaning rules, vocabulary normalization rules, and image standardization parameters. The text cleaning rules include character filtering patterns, a stop word dictionary, and word segmentation model configuration parameters. The vocabulary normalization rules are implemented on a pre-built semantic synonym mapping table and uniformly map colloquial expressions to standard semantic tags within the system. The image standardization parameters include model input specifications such as image size, color space, and pixel normalization intervals. The rule set can be configured during the system deployment phase for the specific business domain and stored in the system configuration center or metadata storage area for use by the preprocessing pipeline.
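A minimal sketch of the text cleaning rules follows. Whitespace splitting stands in for the word segmentation component, and the stop word set and synonym entries are hypothetical examples; in the described system these would come from the configured rule set and the semantic synonym mapping table.

```python
import re

# Hypothetical rule set; a deployed system would load these from configuration.
STOP_WORDS = {"a", "an", "the", "for", "of", "to", "some", "me", "i", "want"}
SYNONYMS = {"tee": "t-shirt", "sneakers": "athletic shoes"}


def preprocess_text(text, stop_words=STOP_WORDS, synonyms=SYNONYMS):
    """Clean, segment, filter, and normalize a raw user query."""
    # 1. Replace every character that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # 2. Segment into tokens (whitespace split is a simplification).
    tokens = cleaned.split()
    # 3. Drop stop words that carry no retrieval meaning.
    tokens = [t for t in tokens if t not in stop_words]
    # 4. Map colloquial words to standard category words.
    return [synonyms.get(t, t) for t in tokens]
```

For languages without whitespace word boundaries, step 2 would be replaced by a dedicated segmenter, while the remaining steps stay the same.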
[0026] In step S2, the core carrier is the intent deconstruction agent. Faced with complex and ambiguous natural language requests from users, traditional rule-based or dictionary-based entity extraction methods often fail. This embodiment uses a large language model fine-tuned on vertical-domain instructions as the inference unit to perform the analysis autonomously, specifically including:
1. Model fine-tuning strategy: To balance deployment cost and inference latency, the system preferably uses low-rank adaptation (LoRA) to fine-tune an open-source base large language model. During training, the backbone parameters of the base model are frozen, and trainable low-rank decomposition matrices are injected only into the query and key matrices of the self-attention layers. The training data consists of a large number of domain-specific intent instruction samples, enabling the model to distinguish between "physical attributes" and "atmospheric features".
2. Dynamic instruction template design: During inference, the intent deconstruction agent employs a hybrid instruction template strategy that combines a structured chain of thought with few-shot prompts. The instruction template guides the model to output standardized structured data. Specifically, the output schema is strictly divided into three decoupled dimensions: the first dimension is the structured hard constraints that must be satisfied (denoted $C_{hard}$), used to extract the physical attributes of objectively existing entities; the second dimension is the unstructured soft semantics (denoted $S_{soft}$), used to extract natural language word clusters that express style and atmosphere; the third dimension is the explicit rejection constraints (denoted $C_{neg}$), used to identify and extract feature sets carrying an exclusionary prefix.
In this embodiment, the preset instruction template is a structured prompt template designed for multimodal retrieval scenarios. It constrains the output format of the large language model and improves the stability of intent deconstruction. The instruction template includes at least a task description field, a structured chain-of-thought guidance field, and an output format constraint field. The task description field describes the current retrieval parsing task; the structured chain-of-thought guidance field guides the model to identify physical attributes, stylistic semantics, and negation constraints step by step; and the output format constraint field restricts the model to emitting the deconstruction results in a structured data format. The instruction template can be predefined during system initialization and can be extended and updated for different business domains using few-shot examples.
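As an illustration of the three template fields and the three-dimensional output schema, the sketch below assembles a prompt and parses the model's reply, including a regular-expression repair pass and a keyword fallback of the kind the fault-tolerance mechanism of this embodiment calls for. It is a sketch under stated assumptions: the class `CompositeQuery`, the functions `build_prompt` and `parse_output`, the JSON key names, and the toy fallback pattern are all hypothetical and are not taken from the patent.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CompositeQuery:
    hard_constraints: Dict[str, str] = field(default_factory=dict)
    soft_semantics: List[str] = field(default_factory=list)
    negative_constraints: List[str] = field(default_factory=list)

def build_prompt(user_text: str, few_shot: List[str]) -> str:
    """Assemble task, chain-of-thought and output-format fields plus few-shot examples."""
    parts = [
        "TASK: Deconstruct the retrieval request below.",
        "STEPS: 1) list physical attributes; 2) list style or atmosphere phrases; "
        "3) list features the user rejects.",
        'OUTPUT: JSON with keys "hard_constraints", "soft_semantics", '
        '"negative_constraints".',
        *few_shot,
        "REQUEST: " + user_text,
    ]
    return "\n".join(parts)

def parse_output(raw: str, user_text: str) -> CompositeQuery:
    """Parse the model output; try a regex repair, then degrade to keyword rules."""
    candidates = [raw]
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # capture the JSON body
    if match:
        candidates.append(match.group(0))
    for text in candidates:
        try:
            data = json.loads(text)
            return CompositeQuery(
                hard_constraints=data.get("hard_constraints", {}),
                soft_semantics=data.get("soft_semantics", []),
                negative_constraints=data.get("negative_constraints", []),
            )
        except (json.JSONDecodeError, AttributeError):
            continue  # not valid JSON, or not a JSON object
    # Degraded mode: a toy stand-in for the local rule tree.
    negs = re.findall(r"\b(?:not|no)\s+(?:an?\s+)?(\w+(?:\s\w+)?)", user_text.lower())
    return CompositeQuery(soft_semantics=[user_text], negative_constraints=negs)

reply = 'Here you go: {"hard_constraints": {"color": "red"}, ' \
        '"soft_semantics": ["beach"], "negative_constraints": ["halter"]} Done.'
query = parse_output(reply, "red dress for the beach, not a halter top")
```

The repair pass only strips text surrounding the JSON object; anything it cannot recover is routed to the fallback, so the caller always receives a `CompositeQuery`.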
[0027] 3. Parsing self-healing and fault-tolerant degradation mechanism: To ensure system stability, the intent deconstruction agent is configured with a JSON parser and an automatic anomaly repair mechanism. When the format of the model output is incomplete, the system automatically applies regular-expression capture groups to attempt a repair. If the repair still fails, the system triggers the fault-tolerant degradation mechanism and falls back to a keyword extraction mode based on a local rule tree, ensuring that the retrieval process is not interrupted.
In step S3, the vector augmentation agent addresses the "semantic information asymmetry" between short user queries and the rich documents that describe the target objects. This embodiment constructs a highly robust unified enhanced query vector by introducing generative feature augmentation and a multimodal cross-attention mechanism.
1. Hypothetical feature generation and text embedding: For the unstructured soft semantics parsed in step S2, the vector augmentation agent first adopts a strategy derived from the hypothetical document embedding technique. Specifically, the agent calls the generation interface of the large language model with a preset feature expansion instruction. This instruction requires the large language model to infer, from the input soft semantic description, a set of structured phrases and feature labels covering likely core materials, tailoring details, and applicable scenarios, denoted as the extended text sequence $T_{ext}$. The extended text sequence $T_{ext}$ is then passed through a predefined text encoder in a forward pass to obtain the dense text embedding vector $E_t \in \mathbb{R}^{d}$ (where $d$ is the feature dimension). This pre-expansion compensates for the difference in information density between the user's colloquial short text and the long text of the target object, so that the query vector is pushed toward the corresponding distribution cluster in the multimodal feature space.
2. Cross-modal cross-attention feature fusion mechanism: When the user session request does not contain an image, the system directly uses the text embedding vector $E_t$ as the enhanced query vector. When the user enters both text and a reference image $I_{ref}$, simple concatenation or weighted summation of the vectors often fails to handle semantic conflicts between the modalities. This embodiment therefore designs a text-guided cross-attention feature fusion layer. First, a visual encoder divides the reference image into $N$ image patches and extracts a visual feature matrix $F_v \in \mathbb{R}^{N \times d}$, in which each row is a local patch feature $v_i \in \mathbb{R}^{d}$; at the same time, the text embedding vector $E_t$ generated in the above step is retrieved. A cross-attention layer is then constructed, using the text vector as the query source and the image feature matrix as the key and value source. Its projection expressions are as follows:
[0028] $$Q = E_t W_Q$$
[0029] $$K = F_v W_K, \qquad V = F_v W_V$$
[0030] where the query matrix $Q$ represents the projected features of the text intent; the key matrix $K$ represents the matching identifiers of the local image features; the value matrix $V$ represents the actual visual content of the local image features; $E_t$ is the text embedding vector; $F_v$ is the visual feature matrix; and $W_Q$, $W_K$, and $W_V$ are learnable linear projection parameter matrices of equal dimension, used to map the textual and visual features into the same cross-attention subspace. Next, the attention weight matrix $A$ of the text query over each local patch of the image is calculated:
$$A = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right)$$
[0031] where the attention weight matrix $A$ represents the distribution of the textual intent's attention over the local regions of the image (one weight per image patch, i.e. dimension $1 \times N$ for $N$ patches); $\mathrm{softmax}(\cdot)$ is the normalized exponential function, used to transform the correlation inner products into a probability distribution; $Q$ is the query matrix; $K^{\top}$ is the transpose of the key matrix; and $d$ is the feature channel dimension, whose square root $\sqrt{d}$ serves as a scaling factor to prevent the vanishing-gradient problem caused by excessively large dot-product values. For example, when the soft semantic text emphasizes "collar design", the model adaptively assigns higher weights to the patches corresponding to the collar area of the image through the dot-product similarity calculation, while suppressing the weights of color or background areas. Finally, after obtaining the weighted visual features, the system generates the final enhanced query vector through a residual connection and layer normalization:
[0032] $$E_q = \mathrm{LayerNorm}\big(E_t + \mathrm{Dropout}(A V)\big)$$
where $E_q$ is the final enhanced query vector; $\mathrm{LayerNorm}(\cdot)$ is the layer normalization function, used to keep the distribution of the output feature space stable; $E_t$ is the original text embedding vector, which enters through the residual connection as the dominant basis; $\mathrm{Dropout}(\cdot)$ is the random deactivation (dropout) function, used to prevent overfitting during model training; and $AV$ is the matrix product of the attention weight matrix $A$ and the value matrix $V$, representing the dynamically incorporated, fine-grained weighted visual features that are consistent with the text. This formula ensures that the final enhanced query vector is always anchored in the user's explicitly expressed textual intent while dynamically incorporating visual details.
Traditional hybrid retrieval architectures typically employ a serial approach of "vector recall + Boolean hard filtering". However, because attribute annotations in target object databases are often sparse or missing, and because the attribute constraints in user intent are often non-rigid, Boolean hard filtering can easily truncate the recall results to "zero results". To address this technical issue, the retrieval execution agent of step S4 in this embodiment adopts a scoring architecture that combines single-path parallel recall with elastic distance metrics.
1. Vector-driven initial screening and recall mechanism: The retrieval agent first feeds the enhanced query vector $E_q$ generated in step S3 into a high-dimensional vector storage area and performs an approximate nearest neighbor (ANN) search over the underlying graph-structured index, quickly estimating the cosine distance between the target objects in the database and the enhanced query vector. The top-ranked target object identifiers (IDs) by similarity are recalled to form an initial candidate object set $\mathcal{O}_{init}$. This step focuses on broad semantic and stylistic matching and imposes no hard Boolean rule blocking.
2. Attribute distance matrix and soft matching calculation: For any candidate object $o$ in the initial candidate object set $\mathcal{O}_{init}$, the system performs parallel lookups of its wide attribute table in the relational database to extract its actual attribute vector $a_o$, and calculates the soft matching degree $M(C_{hard}, a_o)$ between $a_o$ and the structured hard constraints $C_{hard}$ obtained from intent deconstruction. In this embodiment, the attribute distance matrix is an attribute similarity mapping matrix pre-constructed during the system deployment phase, used to describe the semantic distance between different attribute values. The matrix is constructed by hierarchically dividing attribute categories based on business domain knowledge, assigning smaller distance values to attribute values at the same or adjacent levels, and larger distance values to mutually exclusive or conflicting attributes. For example, in the color attribute, "red" and "burgundy" are defined as adjacent categories with a smaller distance value, while "red" and "green" are defined as conflicting categories with a larger distance value. The attribute distance matrix can be initialized through manual rule construction, statistical learning methods, or domain knowledge graphs, and adjusted dynamically based on feedback data during system operation. This embodiment uses a normalized distance function $D(c_i, a_{o,i})$. If the constraint value $c_i$ exactly matches the actual value $a_{o,i}$ (for example, "red" is required and the item is "red"), the distance is 0. If the two values belong to the same predefined subcategory (for example, "red" is required but the actual value is "burgundy"), a small distance is assigned, granting a slight tolerance in the judgment. If the attributes conflict (for example, a "long" style is required but the actual item is a "short" style), the distance takes its maximum value. Through this continuous distance mapping, the soft matching degree is defined as the average hit rate of the constraint attributes, in which each attribute contributes according to its distance-based degree of match:
$$M(C_{hard}, a_o) = \frac{1}{n} \sum_{i=1}^{n} \big(1 - D(c_i, a_{o,i})\big)$$
where $M(C_{hard}, a_o)$ is the overall soft matching score between the actual attributes $a_o$ of candidate object $o$ and the structured hard constraints $C_{hard}$; $n$ is the total number of attribute constraints in the structured hard constraint set; $\sum_{i=1}^{n}$ denotes traversal and summation over each constraint attribute in that set; $c_i$ is the target expected value of the $i$-th attribute constraint, i.e. the preset constraint condition; $a_{o,i}$ is the actual value of the $i$-th attribute of candidate object $o$; and $D(\cdot,\cdot)$ is the normalized distance function based on the attribute distance matrix, which outputs the quantified semantic deviation between the target value and the actual value.
3. Construction of the comprehensive elastic scoring formula: After obtaining the vector similarity and the attribute soft matching degree, the system calculates the final overall score $S(o)$ of candidate object $o$. At the same time, the system takes the negation constraints $C_{neg}$ extracted in step S2; if the text description or OCR-recognized tags of the candidate object contain a negation constraint word, a penalty mechanism is triggered. The elastic scoring formula is as follows:
$$S(o) = \alpha \cdot \mathrm{Sim}(E_q, e_o) + \beta \cdot M(C_{hard}, a_o) - \gamma \cdot P(o, C_{neg})$$
where $S(o)$ is the final overall score of candidate object $o$; $\alpha$ and $\beta$ are the preset similarity weight and attribute weight, respectively; $\mathrm{Sim}(E_q, e_o)$ is the cosine similarity between the enhanced query vector $E_q$ and the candidate feature vector $e_o$; $M(C_{hard}, a_o)$ is the soft matching evaluation function, calculated from the attribute distance matrix, between the structured hard constraints $C_{hard}$ and the actual attributes $a_o$ of the candidate object; $\gamma$ is the negation penalty coefficient, usually set to a value large enough to suppress the ranking; and $P(o, C_{neg})$ is the penalty function, whose value is 1 if candidate object $o$ hits any word in the negation constraints $C_{neg}$ and 0 otherwise. This formula implements an elastic filtering logic of "deducting points and demoting for deviations in non-key attributes, and penalizing negation constraints severely", which preserves the continuity of recall results in long-tail retrieval.
4. Sorting, extraction, and list generation: The system sorts the candidate pool in descending order of the final overall score calculated by the above formula, then takes a set number of top-ranked candidate objects to generate the candidate list to be audited. The objects in this list are the candidates with the best combined performance in semantic similarity and attribute matching. The system passes the list to the visual audit agent for the next step of image-text consistency verification.
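Steps S3 and S4 can be traced end to end in a few lines of plain Python. The sketch below is illustrative rather than the patented implementation: it assumes identity projection matrices (so the learnable projections are omitted), disables dropout as at inference time, and uses hypothetical values for the distance table `DIST` and the weights `alpha` and `beta`. The default `gamma` of 100.0 mirrors the "normal value" quoted in the worked example later in this description. All function names are assumptions.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]

def softmax(xs: Sequence[float]) -> Vector:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def fuse(text_vec: Vector, patches: List[Vector]) -> Vector:
    """Text-guided cross-attention with identity projections and no dropout."""
    d = len(text_vec)
    logits = [sum(q * k for q, k in zip(text_vec, p)) / math.sqrt(d) for p in patches]
    weights = softmax(logits)  # one attention weight per image patch
    attended = [sum(w * p[j] for w, p in zip(weights, patches)) for j in range(d)]
    return layer_norm([t + a for t, a in zip(text_vec, attended)])  # residual + LN

def soft_match(constraints: Dict[str, str], attrs: Dict[str, str],
               dist: Dict[Tuple[str, Optional[str]], float]) -> float:
    """Mean of (1 - distance). Assumption: missing or unlisted pairs get distance 1."""
    if not constraints:
        return 1.0
    total = 0.0
    for name, want in constraints.items():
        have = attrs.get(name)
        distance = 0.0 if have == want else dist.get((want, have), 1.0)
        total += 1.0 - distance
    return total / len(constraints)

def elastic_score(sim: float, match: float, hits_negative: bool,
                  alpha: float = 0.6, beta: float = 0.4, gamma: float = 100.0) -> float:
    """Weighted similarity plus soft match, minus the negation penalty."""
    return alpha * sim + beta * match - gamma * (1.0 if hits_negative else 0.0)

# Hypothetical distance table and constraints.
DIST = {("red", "burgundy"): 0.2, ("long", "short"): 1.0}
want = {"color": "red", "length": "long"}
match = soft_match(want, {"color": "burgundy", "length": "long"}, DIST)
good = elastic_score(0.8, match, hits_negative=False)
vetoed = elastic_score(0.9, 1.0, hits_negative=True)
fused = fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

A burgundy item therefore loses a little score instead of being filtered out, while an item that hits a negation constraint falls far below every other candidate regardless of its similarity.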
[0033] Most existing multimodal retrieval systems are open-loop architectures: after finding the target object based on text or image features, they present it directly to the user and lack the ability to self-check local fine-grained features. In step S5, this embodiment introduces large-model pre-auditing and closed-loop feedback:
1. Dynamic visual question answering (VQA) task generation: The visual audit agent does not use a general image description model. Instead, it dynamically generates a set of targeted visual question-and-answer instructions based on the user's specific intent. The agent parses the unstructured soft semantics and negation constraints from step S2 to generate audit tasks. For negation constraints, the agent generates a binary (Yes / No) question used for "hard interception"; for unstructured soft semantics, the agent generates a "consistency scoring" question.
2. Visual reasoning and image-text matching confidence calculation: The system uses a visual-language model (VLM) to perform multimodal reasoning on the images of the top-ranked products in the candidate list to be audited. For each candidate object $o$ with image $I_o$, the visual-language model outputs the answers and scores for the dynamic question-and-answer instructions above. On this basis, the system calculates the overall audit pass rate of the target object, i.e., the image-text matching confidence. The calculation formula is as follows:
$$\mathrm{Conf}(o) = \prod_{c \in C_{neg}} \Big(1 - \mathbb{1}\big[\mathrm{VLM}(I_o, c) = \text{Yes}\big]\Big) \cdot \frac{1}{m} \sum_{j=1}^{m} \mathrm{Score}_{\mathrm{VLM}}(I_o, s_j)$$
[0034] where $\mathrm{Conf}(o)$ is the image-text matching confidence of candidate object $o$; $\prod_{c \in C_{neg}}$ denotes the product taken over all extracted negation constraints $c$; $\mathbb{1}[\cdot]$ is the indicator function, whose value is 1 when the condition in brackets is true and 0 otherwise; $\mathrm{VLM}(I_o, c) = \text{Yes}$ indicates that the visual-language model judges that image $I_o$ does contain the negation constraint feature $c$; $m$ is the total number of soft semantic features; and $\mathrm{Score}_{\mathrm{VLM}}(I_o, s_j)$ is the normalized score (range 0 to 1) output by the visual-language model for how well image $I_o$ conforms to soft semantic feature $s_j$. The first term of this formula is a "veto mechanism": once the model determines that the image hits any negation constraint feature, the indicator function evaluates to 1, the corresponding factor becomes (1 - 1) = 0, and the confidence of the candidate object is therefore zero. The second term is the average conformity score over the soft semantics. This mechanism ensures that the final output both satisfies the hard exclusion criteria and maintains stylistic and visual consistency.
3. Adaptive iterative mechanism and closed-loop control: If the proportion of target objects in the candidate list whose image-text matching confidence is lower than the preset consistency threshold exceeds the set proportion, the system determines that "semantic drift" or "soft constraint failure" has occurred in this round of retrieval. The system then blocks direct output and triggers the closed-loop control module. The closed-loop control module receives the audit failure feedback signals, analyzes the distribution of failure causes, and executes a dynamic parameter adjustment strategy according to the failure type to initiate iterative re-retrieval.
(A) Dynamic weight enhancement strategy: If the audit failure is mainly due to a "structured hard attribute mismatch", such as searching for long skirts but recalling short skirts, the vector retrieval is not sensitive enough to this attribute in the semantic space. The control module increases the weight of the attribute soft matching term in the elastic scoring formula of step S4 (that is, it increases the attribute weight $\beta$), so that the next round of re-retrieval adheres more closely to the structured attributes.
(B) Semantic feedback rewriting strategy: If the audit failure is mainly due to "soft semantic style inconsistency", such as results that are not retro enough, the constraint imposed by the feature expansion description is insufficient. The control module calls the intent deconstruction agent to reflect on and rewrite the soft semantics, explicitly adding the missing features reported by the visual audit stage, for example by dynamically adding the instruction "strengthen the description weight of retro visual elements" to the original prompt, and thereby generates a new round of enhanced query vectors.
(C) Constraint relaxation and degradation strategy: If the system detects that the number of recalled results drops sharply to 0 after a set number of iterations, the user's constraints conflict with one another at the logical level, such as requiring "silk material" and "extremely low price" at the same time. When the control module detects such a sharp change in the results, it automatically identifies and removes the constraint with the lowest weight, performs a "degraded search", and generates an explanatory prompt in the final output, preventing the dialogue from reaching a dead end.
Example 3
Referring to Figures 3-5, to illustrate more intuitively the collaboration and data flow among the agents in a real-world complex task, this embodiment uses a typical "long-tail complex multimodal query" scenario to describe the system's state transitions.
Scenario: A user enters natural language text into a smart e-commerce terminal: "Please find me a red maxi dress suitable for taking photos at the beach, definitely not a halter top," and simultaneously uploads a landscape image of "blue sky, white clouds, and sandy beach" as a reference image for the atmosphere.
Phase 1: Intent deconstruction and initialization. After receiving the request, the access gateway triggers preprocessing, and the intent deconstruction agent calls the fine-tuned large language model and outputs a composite query strategy. Structured hard constraints extracted: the target object's category (long dress), color (red), and length attribute (long). Unstructured soft semantics extracted: natural language descriptions such as suitable for seaside shooting, beach atmosphere, and photogenic. Negation constraints extracted: the explicitly rejected halter and sleeveless designs.
Phase 2: Vector augmentation and cross-modal fusion. The vector augmentation agent generates a feature label set from the soft semantics: "Lightweight chiffon fabric, an oversized skirt that flows gracefully in the sea breeze; the highly saturated red stands out vividly against the blue ocean background." Then, through the cross-modal attention layer, the embedding vector of this text is fused with the local visual features of the user-uploaded beach scenery image to generate an enhanced query vector. At this point, the search focus in the multidimensional feature space shifts toward the distribution region of "high saturation" and "ethereal feel".
Phase 3: Initial search and elastic ranking. The retrieval agent performs graph-based vector recall and applies elastic scoring with the attribute distance matrix. Because some products in the underlying database lack detailed "sleeve length" or "style" attributes, and because some visually appealing "red halter dresses" match the "beach vacation" vector distribution in their visual features, these items achieve very high overall scores and occupy prominent positions in the initial candidate list.
Phase 4: Visual pre-audit and interception. The visual audit agent triggers verification before the results are displayed. For the negation constraints, the agent dynamically generates a visual question-and-answer instruction: "Does this garment contain a halter or sleeveless design? Yes / No". The visual-language model (VLM) reasons over the images of the top-ranked items and finds that 3 of the Top-5 items are indeed halter dresses, answering Yes for them. This triggers the veto mechanism, reducing the image-text matching confidence of these 3 items to zero, far below the preset threshold. The system determines that this round of retrieval has resulted in a "constraint failure", blocks the output, and triggers closed-loop feedback.
Phase 5: Closed-loop correction and re-retrieval. The closed-loop control module analyzes the audit failure logs and identifies the core cause as "high-frequency hits of negation constraints". The system executes a dynamic weight enhancement strategy: in the next round's elastic scoring formula, the penalty coefficient γ of the negation constraint is multiplied, for example from a normal value of 100.0 to a more severe value of 500.0. At the same time, a strong isolation penalty for the sleeve type is dynamically applied to the attribute distance matrix. The updated parameters are then used to trigger a second round of retrieval.
Phase 6: Final verification and output. In the second-round candidate list, the previously high-scoring halter dresses are effectively eliminated, and several long dresses that are "red", have a "seaside style", and feature "puff sleeves" or "French square necklines" rise in the rankings. The visual audit agent verifies them again; the overall audit pass rate of the candidates improves and meets the preset consistency output threshold. The system extracts the final ranking results and outputs them to the front end, highlighting the message on the interface: "Halter styles have been filtered out for you, and we are now showing you flowing, full-skirted designs suitable for seascape photography."
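The audit arithmetic of Phases 4 and 5 can be replayed in code. This is a minimal sketch under stated assumptions: the soft-semantic scores, the 0.5 consistency threshold, the 0.5 failure proportion, the 1.5 attribute-weight factor, and the failure-type labels are hypothetical, and only the veto behaviour and the change of γ from 100.0 to 500.0 come from the example above. The function names are likewise illustrative.

```python
from typing import Dict, List

def match_confidence(neg_hits: List[bool], soft_scores: List[float]) -> float:
    """Veto product over negation constraints times the mean soft-semantic score."""
    veto = 1.0
    for hit in neg_hits:
        veto *= 0.0 if hit else 1.0  # any hit zeroes the whole product
    mean_soft = sum(soft_scores) / len(soft_scores) if soft_scores else 1.0
    return veto * mean_soft

def needs_retry(confidences: List[float], threshold: float, max_fail_ratio: float) -> bool:
    """True when the share of below-threshold candidates exceeds the set proportion."""
    if not confidences:
        return True  # assumption: an empty list also triggers the control loop
    failed = sum(1 for c in confidences if c < threshold)
    return failed / len(confidences) > max_fail_ratio

def adjust(params: Dict[str, float], failure: str) -> Dict[str, float]:
    """Return updated scoring parameters for the next retrieval round."""
    new = dict(params)
    if failure == "negative_hit":
        new["gamma"] = params["gamma"] * 5.0   # e.g. 100.0 -> 500.0
    elif failure == "hard_attribute_mismatch":
        new["beta"] = params["beta"] * 1.5     # strategy (A); factor is illustrative
    return new

# Top-5 of the first round: three halter dresses answer Yes to the veto question.
top5 = [
    match_confidence([True], [0.9, 0.8]),
    match_confidence([True], [0.9, 0.9]),
    match_confidence([False], [0.8, 0.6]),
    match_confidence([True], [0.7, 0.9]),
    match_confidence([False], [0.9, 0.7]),
]
retry = needs_retry(top5, threshold=0.5, max_fail_ratio=0.5)
next_params = adjust({"alpha": 0.6, "beta": 0.4, "gamma": 100.0}, "negative_hit")
```

With three of five candidates vetoed, the failure share is 0.6, which exceeds the assumed 0.5 proportion, so the loop fires and the next round runs with the larger penalty coefficient.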
[0035] This embodiment is not limited to the application forms in the specific vertical fields mentioned above. Its core multi-agent closed-loop retrieval architecture admits equivalent replacements and can be extended to various scenarios according to actual needs:
(1) Equivalent replacement of the underlying base model: The generative model on which the intent deconstruction agent relies is not limited to a specific open-source series (such as Llama or Qwen) and can be equivalently replaced by commercial large models such as GPT-4 and Claude. The multimodal model used by the visual audit agent can likewise be replaced by BLIP-2, MiniGPT-4, or a lightweight visual classification network according to the available computing power. Any such model that can map features into the feature space and follow natural language instructions falls within the protection scope of this invention.
(2) Expansion of cross-domain application scenarios:
Smart home and space design search: The user inputs "Nordic minimalist living room, no leather sofa". The system accurately identifies material textures through the visual audit agent, blocks and excludes leather furniture, and finally outputs a high-confidence fabric or solid wood combination scheme.
[0036] Digital copyright image library and compliance review search: The user inputs "Business meeting scenario, no specific ethnic groups or prohibited symbols should appear." The system uses the visual-language model (VLM) to conduct fine-grained compliance audits of ethnic characteristics or identifying symbols, ensuring the commercial compliance of the search results.
[0037] Video stream segment semantic retrieval: Long videos are processed by extracting frames and treated as a collection of images in a time series. By applying the cross-attention enhancement and visual auditing logic of this invention, complex spatiotemporal semantic searches such as "finding a close-up shot of the protagonist running in the rain while wearing red" can be achieved. (3) Cloud-edge collaborative evolution of deployment architecture: In edge computing scenarios where terminal computing power is limited or privacy requirements are extremely high, an "edge-cloud collaborative" architecture can be adopted. Preprocessing and intent deconstruction are performed locally on the user terminal device to protect privacy, while the vector retrieval and heavy visual auditing modules, which consume a lot of computing power, are deployed on the cloud cluster. The two communicate and interact through encrypted tensor sequences.
[0038] Furthermore, the present invention also provides an electronic device for implementing the above-described mixed-modal retrieval method. Figure 6 is a schematic diagram of the hardware structure of the electronic device. As shown in Figure 6, the electronic device includes at least: a processor, a memory, a communication interface, and multimedia input / output components interconnected via a system bus. The processor can be a central processing unit, a graphics processing unit, a tensor processor, or a combination thereof; it controls the overall operation of the electronic device and executes computer programs stored in the memory to implement all steps and logical calculations of Embodiments 1 to 3 above, such as multi-agent collaboration, vector augmentation, elastic scoring, and the visual audit closed loop. The memory may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device or flash memory. The memory stores the application code or instruction sets implementing the above methods. The communication interface is used to receive multimodal query requests sent by clients via a network (such as the Internet or a local area network) and to return the search results.
[0039] Furthermore, the present invention also provides a computer-readable storage medium on which a computer program is stored. When the computer program is configured to be executed by a processor, it is capable of implementing all the steps of the method provided in any of the above embodiments of the present invention. The storage medium may include various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory, a random access memory, a magnetic disk, or an optical disk. It is understood that the systems, devices, and storage media provided in the embodiments of the present invention correspond to the methods provided in the embodiments of the present invention, and explanations, examples, and beneficial effects of related content can be referred to the corresponding parts of the above methods.
[0040] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of this application is generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).
[0041] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.
[0042] The various embodiments in this specification are described in a related manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions of the method embodiments.
[0043] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A conversational mixed-modal retrieval method, characterized in that it comprises the following steps:
S1. Receiving a user session request containing unstructured natural language text and optional image information, and performing standardized preprocessing on the user session request;
S2. Using an intent deconstruction agent, decoupling the user session request in multiple dimensions based on a preset instruction template to generate a composite query strategy that includes structured hard constraints, unstructured soft semantics, and negative constraints;
S3. Using a vector augmentation agent, generating a virtual feature augmentation description based on the unstructured soft semantics, and converting the feature augmentation description into a text embedding vector; when the user session request contains optional image information, fusing and encoding the text embedding vector and the visual feature block sequence of the optional image information in a multimodal cross-attention space to generate an enhanced query vector;
S4. Using a retrieval execution agent, performing similarity recall in a preset target object vector storage area based on the enhanced query vector to obtain an initial candidate object set; and performing elastic weighted scoring and sorting on the initial candidate object set based on the structured hard constraints and a preset attribute distance matrix to obtain a candidate list to be audited;
S5. Using a visual audit agent, dynamically generating, based on the composite query strategy, visual question-and-answer prompts for the target objects in the candidate list to be audited, calling a visual-language model to perform visual reasoning verification, and calculating the image-text matching confidence of each candidate object;
S6. When the image-text matching confidence does not meet a preset consistency condition, triggering a closed-loop control module, analyzing the distribution of audit failures, dynamically adjusting the composite query strategy or the elastic weighting parameters, and returning to perform iterative re-retrieval until a preset exit condition is met or the maximum number of iterations is reached.
2. The conversational mixed-modal retrieval method of claim 1, wherein the step of generating the composite query strategy in step S2 includes:
inputting the standardized, preprocessed user session request into a fine-tuned large language model;
guiding the large language model with a preset structured chain-of-thought instruction template to extract the objective attributes that must be satisfied as the structured hard constraints;
extracting natural language word clusters describing style, scene, and atmosphere as the unstructured soft semantics;
extracting the feature set carrying an exclusionary negation prefix as the negative constraints.
3. The conversational mixed-modal retrieval method of claim 1, wherein the step of generating the enhanced query vector in step S3 includes:
using a generative language model to expand the features of the unstructured soft semantics, generating extended text containing core attributes and applicable scenario features, and extracting the text embedding vector with a text encoder;
when the user session request does not contain image information, using the text embedding vector directly as the enhanced query vector;
when the user session request contains image information, extracting the local block feature sequence of the image information with a visual encoder; using the text embedding vector as the query matrix and the local block feature sequence as the key matrix and value matrix; calculating the attention weight of the text intent over the local regions of the image by a cross-attention mechanism; and generating the enhanced query vector by residual connection and normalization.
4. The conversational mixed-modal retrieval method of claim 3, wherein the step of performing the elastic weighted scoring and sorting in step S4 includes: for any candidate object in the initial candidate object set, calculating the cosine similarity between its corresponding target feature representation and the enhanced query vector to obtain a vector similarity score; retrieving the attribute metadata of the candidate object, calculating the normalized distance between the attribute metadata and the structured hard constraints based on the preset attribute distance matrix, and converting the normalized distance into an attribute soft matching score, wherein the attribute distance matrix defines the deviation tolerance between different attribute categories; determining whether the text or tags of the candidate object contain words from the negative constraints and, if so, generating a negative constraint penalty term; and computing a weighted sum of the vector similarity score and the attribute soft matching score according to preset weights, subtracting the negative constraint penalty term to obtain the comprehensive score of the candidate object, and sorting the candidate objects in descending order of comprehensive score.
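The scoring in claim 4 can be sketched as follows. The weights (`w_vec`, `w_attr`), the penalty value, the contents of `ATTR_DISTANCE`, and the conversion of distance to score as one minus the mean distance are all illustrative assumptions; the application does not state concrete values or a specific conversion.

```python
import math

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical attribute distance matrix: normalized distances in [0, 1],
# keyed by attribute and then by (wanted value, actual value).
ATTR_DISTANCE = {
    "color": {("red", "burgundy"): 0.2, ("red", "blue"): 1.0},
}

def attribute_score(metadata, hard_constraints, distance=ATTR_DISTANCE):
    # Soft matching: 1 minus the mean normalized distance (assumed conversion).
    if not hard_constraints:
        return 1.0
    dists = []
    for attr, wanted in hard_constraints.items():
        have = metadata.get(attr)
        if have == wanted:
            dists.append(0.0)
        else:
            # Unknown pairs are treated as maximally distant.
            dists.append(distance.get(attr, {}).get((wanted, have), 1.0))
    return 1.0 - sum(dists) / len(dists)

def composite_score(cand, query_vec, hard, negatives,
                    w_vec=0.6, w_attr=0.4, penalty=0.5):
    s_vec = cosine(cand["vector"], query_vec)
    s_attr = attribute_score(cand["metadata"], hard)
    text = cand.get("text", "").lower()
    # Penalty applies if any negated term appears in the candidate text.
    neg = penalty if any(t.lower() in text for t in negatives) else 0.0
    return w_vec * s_vec + w_attr * s_attr - neg

def rank(cands, query_vec, hard, negatives):
    # Descending order of comprehensive score.
    return sorted(cands,
                  key=lambda c: composite_score(c, query_vec, hard, negatives),
                  reverse=True)
```

Because near-miss attributes such as "burgundy" for "red" only lower the score rather than filtering the candidate out, the list stays populated even when no object satisfies every hard constraint exactly.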
5. The conversational mixed-modal retrieval method of claim 4, wherein the step of performing visual reasoning verification in step S5 includes: parsing the composite query strategy and generating, based on the negative constraints, a hard blocking question for determining whether the image contains an excluded feature; generating, based on the unstructured soft semantics, a consistency scoring question for evaluating how well the style and details match; combining the image of each candidate object with the hard blocking question and the consistency scoring question, respectively, and inputting them into the visual-language model to obtain reasoning and judgment results; and weighting the reasoning and judgment results to generate the image-text matching confidence, wherein, when the judgment result for the hard blocking question is that the image contains a negated feature, the image-text matching confidence is set directly to zero.
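A minimal sketch of the audit logic in claim 5 follows. The visual-language model is represented by a hypothetical callable `ask(image, question) -> str`; the prompt wording, the numeric 0-to-1 rating format, and the equal default weights are assumptions made for illustration, not details from the application.

```python
def audit_confidence(image, negatives, soft_semantics, ask, weights=None):
    # Hard blocking: any confirmed negated feature forces the confidence to zero.
    for feature in negatives:
        question = f"Does the image contain {feature}? Answer yes or no."
        if ask(image, question).strip().lower().startswith("yes"):
            return 0.0
    if not soft_semantics:
        return 1.0  # nothing left to score (assumed convention)
    weights = weights or {}
    acc, total_w = 0.0, 0.0
    for desc in soft_semantics:
        question = f"Rate from 0 to 1 how well the image matches: {desc}."
        # Clamp the model's numeric answer into [0, 1].
        rating = min(1.0, max(0.0, float(ask(image, question))))
        w = weights.get(desc, 1.0)
        acc += w * rating
        total_w += w
    # Weighted average of the consistency ratings.
    return acc / total_w if total_w else 0.0
```

Running the blocking questions first means a candidate that shows an excluded feature is rejected without spending further model calls on style scoring.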
6. The conversational mixed-modal retrieval method of claim 5, wherein the step of dynamically adjusting the composite query strategy or the elastic weighting parameters in step S6 includes: analyzing the reasons why candidate objects failed the current round of verification; if the proportion of attribute mismatches exceeds a first set threshold, increasing the weight of the attribute soft matching score in the elastic weighted scoring so that the next round of retrieval adheres more closely to the structured attributes; if the proportion of style mismatches exceeds a second set threshold, invoking the large language model to rewrite the unstructured soft semantics and adding the missing features reported by the visual audit to the rewritten soft semantics; and if the number of retrieved items stays below a preset minimum threshold for consecutive rounds, automatically identifying and removing the constraint with the lowest weight and performing a downgraded search.
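The three adjustment rules of claim 6 can be sketched as a single function. The thresholds, the weight step, the number of consecutive rounds (`patience`), and the `rewrite` callable standing in for the large language model are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievalState:
    w_attr: float = 0.4                                  # weight of the attribute soft matching score
    soft_semantics: list = field(default_factory=list)
    constraints: dict = field(default_factory=dict)      # constraint name -> weight

def adjust_state(state, failure_reasons, missing_features, result_counts, rewrite,
                 attr_threshold=0.5, style_threshold=0.5,
                 min_results=3, patience=2, step=0.1):
    total = len(failure_reasons)
    if total:
        attr_ratio = failure_reasons.count("attribute") / total
        style_ratio = failure_reasons.count("style") / total
        # Rule 1: too many attribute mismatches, so raise the attribute weight.
        if attr_ratio > attr_threshold:
            state.w_attr = min(1.0, state.w_attr + step)
        # Rule 2: too many style mismatches, so rewrite the soft semantics
        # and append the features the visual audit reported as missing.
        if style_ratio > style_threshold:
            rewritten = rewrite(state.soft_semantics)
            state.soft_semantics = rewritten + [
                f for f in missing_features if f not in rewritten]
    # Rule 3: results stayed too few for consecutive rounds, so drop the
    # lowest-weight constraint and fall back to a downgraded search.
    recent = result_counts[-patience:]
    if (len(recent) == patience and all(n < min_results for n in recent)
            and state.constraints):
        weakest = min(state.constraints, key=state.constraints.get)
        del state.constraints[weakest]
    return state
```

Dropping only one constraint per call keeps the relaxation gradual, so the search widens step by step instead of discarding the user's intent at once.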
7. A conversational mixed-modal retrieval system, characterized in that it comprises: a multimodal request receiving and preprocessing module, used to receive a user session request containing unstructured natural language text and optional image information and to perform standardized preprocessing; an intent deconstruction agent module, used to decouple the user session request along multiple dimensions based on a preset instruction template and to generate a composite query strategy comprising structured hard constraints, unstructured soft semantics, and negative constraints; a vector augmentation agent module, used to generate a virtual feature augmentation description based on the unstructured soft semantics and convert it into a text embedding vector, and, when image information is present, to fuse the text embedding vector and the visual feature block sequence of the image information in a multimodal cross-attention space to generate an enhanced query vector; a retrieval execution agent module, used to perform similarity recall based on the enhanced query vector and to perform elastic weighted scoring and sorting in combination with the structured hard constraints and an attribute distance matrix to obtain a candidate list to be audited; a visual audit agent module, used to dynamically generate visual question-answering prompts, call a visual-language model to perform reasoning verification on the candidate list to be audited, and calculate the image-text matching confidence; and an adaptive closed-loop control module, used to analyze the failure distribution when the confidence does not meet the condition, dynamically adjust the query strategy or scoring weights, and trigger the retrieval execution agent module to perform iterative re-retrieval.
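How the modules of claim 7 might be wired into one closed loop is sketched below. Each module is passed in as a plain callable; the consistency condition used here (at least one candidate at or above `min_confidence`), the default weights, and the fallback to the last audited list are assumptions for illustration only.

```python
def run_retrieval(request, deconstruct, augment, retrieve, audit, adjust,
                  min_confidence=0.7, max_iters=3):
    # Intent deconstruction produces the composite query strategy.
    strategy = deconstruct(request)
    weights = {"vector": 0.6, "attribute": 0.4}  # hypothetical starting weights
    fallback = []
    for iteration in range(1, max_iters + 1):
        # Vector augmentation: build the enhanced query vector.
        query_vec = augment(strategy, request.get("image"))
        # Retrieval execution: recall plus elastic weighted ranking.
        candidates = retrieve(query_vec, strategy, weights)
        # Visual audit: one confidence value per candidate.
        audited = [(c, audit(c, strategy)) for c in candidates]
        passed = [c for c, conf in audited if conf >= min_confidence]
        if passed:
            return passed, iteration
        # Keep the latest non-empty list so the caller never gets zero results.
        fallback = [c for c, _ in audited] or fallback
        # Closed-loop control: adjust strategy and weights, then retry.
        strategy, weights = adjust(strategy, weights, audited)
    return fallback, max_iters
```

The function returns both the result list and the number of rounds used, which makes it easy to observe whether the loop exited on the consistency condition or on the iteration cap.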
8. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it causes the processor to perform the steps of the method according to any one of claims 1 to 6.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the computer program is executed by the processor, the processor performs the steps of the method according to any one of claims 1 to 6.