A retrieval-augmented generation method and system based on dynamic information extraction from user queries
By dynamically generating structured information extraction templates and performing multimodal large language model analysis, the problem that the RAG system cannot handle visual information and understand complex relationships is solved, thus improving the logic and accuracy of the answers.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INSPUR TIANYUAN COMM INFORMATION SYST CO LTD
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-19
AI Technical Summary
Existing retrieval-augmented generation (RAG) technology cannot effectively process visual information such as charts and graphs, and cannot understand complex relationships between entities, resulting in poor logic and low accuracy in the generated answers.
A user query-based dynamic information extraction method is adopted. Through preliminary retrieval, dynamic template generation, template filling and optimization, and response generation steps, a multimodal large language model is used to analyze candidate information blocks, perform logical consistency verification and conflict resolution, and generate structured information extraction templates.
It significantly improves retrieval accuracy and the quality of generated answers, achieves deep understanding of multimodal document content, mines logical relationships between entities, and improves the relevance and accuracy of answers.
Figure CN122240849A_ABST
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and information retrieval technology, specifically a retrieval-augmented generation method and system based on dynamic information extraction from user queries.
Background Technology
[0002] Retrieval-augmented generation (RAG) is a key technique in current large language model applications. Its core idea is to retrieve relevant information from external knowledge bases and provide it as context to the large language model, thereby reducing factual errors (hallucinations) and enabling the model to use up-to-date knowledge in its responses.
[0003] Existing RAG technology typically employs the following process: the source document is cut into text chunks, an encoder model is used to convert the text chunks into vectors and store them in a vector database; when a user asks a question, the question is also converted into a vector, a similarity search is performed in the database, the top-K most relevant text chunks are retrieved as context, and finally, the question and the text chunks are input into a large language model to generate the answer.
[0004] However, existing technologies have at least the following drawbacks: 1. Information Flattening Processing: Traditional RAG treats documents as plain text streams, ignoring the inherent structural information within the document. Especially for complex documents containing numerous charts, flowcharts, organizational charts, and data tables (such as technical reports and financial research reports), directly discarding these visual elements or simply performing optical character recognition (OCR) will lead to the loss of key structured information and logical relationships.
[0005] 2. Context fragmentation: Text segmentation based on fixed length or simple rules can easily cut off semantically complete paragraphs, resulting in incomplete contextual information retrieved and affecting the quality of the final generated answer.
[0006] 3. Weak Relationship Understanding: Vector similarity retrieval primarily focuses on semantic proximity but struggles to accurately capture explicit and complex logical relationships between entities (e.g., "A causes B," "C is a component of D"). When user questions involve deep logical reasoning, the model struggles to construct a logically rigorous answer from the retrieved text fragments alone. Existing RAG systems typically perform only a simple "retrieve-then-generate" pass (System 1, fast thinking), lacking in-depth causal inference and logical self-consistency verification of the retrieved content. When documents contain implicit causal chains or contradictory data, the model is prone to hallucinations and cannot perform multi-step reflection and self-correction as human experts do.
[0007] Therefore, how to enable RAG systems to deeply understand the textual and visual information in multimodal documents, just like human experts, and to utilize their inherent structure and logical relationships, is a technical problem that urgently needs to be solved in the current technology field.
Summary of the Invention
[0008] The technical objective of this invention is to address the above-mentioned shortcomings by providing a retrieval-augmented generation method and system that extracts information dynamically based on user queries. This method and system address the problems that RAG systems cannot effectively process visual information such as charts and cannot understand complex relationships between entities, which result in poor logic and low accuracy in generated answers, and thereby improve retrieval accuracy and the quality of generated answers.
[0009] The technical solution adopted by this invention to solve its technical problem is: A retrieval-augmented generation method based on dynamic information extraction from user queries, the implementation of which includes:
(1) Preliminary retrieval step: When a user query is received, one or more candidate information blocks related to the user query are retrieved and recalled from a knowledge base containing multimodal information blocks;
(2) Dynamic template generation step: Based on the user's query intent, a structured information extraction template that defines the structure of the information to be extracted is dynamically generated;
(3) Template filling and optimization step: Using a multimodal large language model, the candidate information blocks are analyzed, conflicts in multi-source data are resolved based on the model confidence score or document timestamp, logical consistency verification is performed, and the confirmed information is filled into the structured information extraction template to form a filled template;
(4) Response generation step: Based on the structured information in the filled template, the final response to the user query is generated.
[0010] This method provides an enhanced retrieval and generation method that can deeply understand the content of multimodal documents, mine and utilize the logical relationships between entities, thereby significantly improving retrieval accuracy and the quality of generated answers. This invention aims to solve the problems of RAG systems' inability to effectively process visual information such as charts and graphs, and their inability to understand complex relationships between entities, resulting in poor logical consistency and low accuracy in generated answers. This method addresses the problem of information extraction being disconnected from user intent in existing technologies, achieving on-demand and precise deep information mining, and significantly improving the relevance and accuracy of answers.
[0011] Furthermore, prior to the initial retrieval step, a preprocessing step is included, which parses and segments the original multimodal document and creates vector embeddings for each information block to build a vector index library for retrieval.
[0012] Furthermore, in the dynamic template generation step, the user query is analyzed using a large language model, and the structured information extraction template is generated in a structured data format (such as JSON, XML, YAML, Knowledge Graph, etc., which are machine-readable formats). The data structure of the structured information extraction template not only includes entity attribute fields, but also includes causal logic chain fields between entities, which are used to characterize the cause, process, and result of the event.
[0013] Furthermore, the candidate information block is a multimodal information block, containing text elements and / or visual elements.
[0014] Furthermore, in the template filling step, the multimodal large language model uses a multimodal chain-of-thought mechanism to analyze the visual elements in the candidate information block in order to extract the information required to fill the template.
[0015] Furthermore, the template filling step also includes an integration step, which merges the multiple filled templates obtained after filling multiple candidate information blocks into a final filled template with more complete information; the integration step includes a conflict resolution mechanism: When there is inconsistency in the information extracted from different candidate information blocks for the same template field, the system performs weighted arbitration based on the preset confidence score and / or the timestamp of the candidate information block to determine the value to be filled into the final filled template. The confidence score is output synchronously by the multimodal large language model when extracting information, and the timestamp is determined by the metadata of the candidate information block.
[0016] Furthermore, after the template filling step, a consistency verification step is also included: using a large language model to perform a reverse logical comparison between the structured information in the filled template and the original content of the candidate information block; if the comparison result shows a logical contradiction or factual hallucination, the model is triggered to generate a self-correction instruction to correct the structured information or mark it as low-confidence.
[0017] Furthermore, the template filling step also includes an iterative supplementary retrieval mechanism: if it is found that the key field in the structured information extraction template fails to obtain valid information from the current candidate information block (i.e. the field is empty), a new retrieval query vector is automatically generated based on the missing field, the supplementary retrieval step is executed, and the template filling is performed again on the supplemented and recalled information block.
[0018] Furthermore, in the response generation step, the filled template is serialized into text and input into a large language model along with the user's original query as context to generate the final response.
[0019] This invention also claims a retrieval-augmented generation system that dynamically extracts information based on user queries, comprising:
a preliminary retrieval module for performing the preliminary retrieval step;
a dynamic template generation module for performing the dynamic template generation step;
a template filling and optimization module, configured with a multimodal large language model, for performing the template filling step as well as conflict resolution and logical consistency verification; and
a response generation module for performing the response generation step.
This system can implement the method described above.
[0020] Compared with the prior art, the retrieval-augmented generation method and system of the present invention, which extracts information dynamically based on user queries, improves the efficiency of cross-modal data processing, enables efficient and unified management of multimodal data, and has the following advantages: Extremely high relevance: The information extraction process is entirely driven by user queries, and the generated structured context is tailored to the current question, which filters out irrelevant information and addresses the problem of context redundancy at its source.
[0021] High flexibility and adaptability: The system does not require predefined fixed knowledge graph patterns. Regardless of the angle or structure of the questions raised by the user, the system can dynamically generate corresponding extraction templates, demonstrating extremely high adaptability.
[0022] Deep visual information understanding: It retains the core advantages of the multimodal chain-of-thought mechanism, and can deeply analyze the complex data and logic contained in visual elements such as charts and flowcharts, and directly map them to the structure required by the question.
[0023] On-demand allocation of computing resources: Deep analysis (the multimodal chain-of-thought mechanism), which consumes the most computing resources, is shifted from preprocessing of the entire knowledge base to real-time processing of only the few relevant information blocks. This may make more efficient use of computing resources while preserving effectiveness.
[0024] Intelligent conflict resolution and timeliness management: To address potential contradictions or outdated information in multi-source data, this invention introduces an arbitration mechanism based on model confidence and document timestamps. This ensures that when faced with situations such as inconsistent predictions of the same indicator from different research reports or conflicts between old and new data, the system can automatically filter out noise and prioritize more credible and up-to-date information, significantly improving the factual accuracy of the final answer.
[0025] Self-reflection and logical error correction capabilities: This invention introduces consistency verification and iterative retrieval mechanisms. The system does not only passively receive information; it also engages in "slow thinking" (System 2) like a human expert, proactively reflecting on the logical consistency of extracted information and autonomously initiating supplementary retrieval when information is missing. This allows the RAG system to evolve from a simple "information carrier" into an analytical agent with causal inference capabilities, reducing model hallucinations.
Attached Figure Description
[0026] Figure 1 is a flowchart illustrating the overall process of the retrieval-augmented generation method based on dynamic information extraction from user queries provided in this embodiment of the invention. Figure 2 is a flowchart illustrating the implementation of the retrieval-augmented generation method based on dynamic information extraction from user queries, as described in Example 1 of this invention.
Detailed Implementation
[0027] The present invention will be further described in conjunction with the accompanying drawings and specific embodiments: As shown in Figure 1, the implementation of the retrieval-augmented generation method based on dynamic information extraction from user queries includes:
(1) Query receiving step: The query request input by the user is received;
(2) Preliminary retrieval step: When a user query is received, one or more candidate information blocks related to the user query are retrieved and recalled from a knowledge base containing multimodal information blocks;
(3) Dynamic template generation step: Based on the user's query intent, a structured information extraction template that defines the structure of the information to be extracted is dynamically generated;
(4) Template filling and optimization step: Using a multimodal large language model, the candidate information blocks are analyzed, conflicts in multi-source data are resolved based on the model confidence score or document timestamp, logical consistency verification is performed, and the confirmed information is filled into the structured information extraction template to form a filled template;
(5) Response generation step: Based on the structured information in the filled template, the final response to the user query is generated.
[0028] Prior to the initial retrieval step, a preprocessing step is included, which parses and segments the original multimodal document and creates vector embeddings for each information block to build a vector index library for retrieval.
[0029] In the dynamic template generation step, the user query is analyzed using a large language model, and the structured information extraction template is generated in a structured data format (such as JSON, XML, YAML, Knowledge Graph, etc., which are machine-readable formats). The data structure of the structured information extraction template not only includes entity attribute fields, but also includes causal logic chain fields between entities, which are used to characterize the cause, process, and result of the event.
[0030] The candidate information block is a multimodal information block, which includes text elements and / or visual elements.
[0031] In the template filling step, the multimodal large language model uses a multimodal chain-of-thought mechanism to analyze the visual elements in the candidate information block in order to extract the information required to fill the template.
[0032] The template filling step also includes an integration step and a conflict resolution mechanism. The integration step is used to merge multiple filled templates obtained after filling multiple candidate information blocks into a final filled template with more complete information.
[0033] Conflict resolution mechanism: When there is inconsistency in the information extracted from different candidate information blocks for the same template field, the system performs weighted arbitration based on the preset confidence score and / or the timestamp of the candidate information block to determine the value to be filled into the final filled template. The confidence score is output synchronously by the multimodal large language model when extracting information, and the timestamp is determined by the metadata of the candidate information block.
[0034] Following the template filling step, a consistency verification step is also included: using a large language model to perform a reverse logical comparison between the structured information in the filled template and the original content of the candidate information block; if the comparison result shows a logical contradiction or factual hallucination, the model is triggered to generate a self-correction instruction to correct the structured information or mark it as low-confidence.
[0035] The template filling step also includes an iterative supplementary retrieval mechanism: if it is found that the key fields in the structured information extraction template fail to obtain valid information from the current candidate information block (i.e., the fields are empty), a new retrieval query vector is automatically generated based on the missing fields, the supplementary retrieval step is executed, and the template filling is performed again on the supplemented and recalled information blocks.
[0036] In the response generation step, the filled template is serialized into text and used together with the user's original query as context, and then input into a large language model to generate the final response.
[0037] The specific implementation of this method includes the following steps: Step 1: Preprocessing and Basic Index Building. This step prepares for subsequent rapid retrieval.
[0038] 1.1 Multimodal Document Parsing and Chunking: Receives the original document and parses it into multimodal information blocks containing elements such as text, images, and tables.
[0039] 1.2 Basic Vector Index Construction: Vectorize the text content of each information block and the preliminary textual description of visual elements to construct a basic vector index library. The main function of this index is to perform rapid, semantically relevant preliminary retrieval.
[0040] Step 2: Preliminary Search and Candidate Set Filtering. This step is performed when the system receives a user query.
[0041] 2.1 User query vectorization: Convert user query requests into vectors.
[0042] 2.2 Candidate Information Block Recall: In the basic vector index library constructed in step one, a similarity search is performed to recall the Top-K multimodal information blocks that are most relevant to the user's query semantics, forming a candidate set. These information blocks are the raw materials for the next step of in-depth analysis.
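The recall in step 2.2 can be illustrated with a minimal sketch. It assumes the index is a plain in-memory list of (block id, vector) pairs and uses cosine similarity; the function names, block ids, and toy 3-dimensional vectors are hypothetical, and a real system would use an encoder model and a vector database instead.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_top_k(query_vec, index, k=3):
    """Return the ids of the k blocks most similar to the query vector.

    `index` is a list of (block_id, vector) pairs.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [block_id for block_id, _ in ranked[:k]]

# Toy vectors for illustration only; real embeddings come from an encoder model.
index = [
    ("block_chart", [0.9, 0.1, 0.0]),
    ("block_terminal", [0.7, 0.3, 0.1]),
    ("block_cloud", [0.1, 0.9, 0.2]),
    ("block_hr", [0.0, 0.1, 0.9]),
]
candidates = recall_top_k([1.0, 0.2, 0.0], index, k=2)
```

The returned ids form the candidate set that is passed on to the in-depth analysis of step three.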
[0043] Step 3: Query-oriented dynamic information extraction. This step is the core of this method. It replaces the traditional static knowledge graph query and enables on-demand deep information understanding.
[0044] To enable those skilled in the art to more clearly understand how this invention handles structured data interaction and multimodal logical reasoning, the following embodiments will use the currently industry-standard JSON format as a specific example of a structured information extraction template, and will elaborate on the Visual Chain-of-Thought (VCoT) as the core reasoning strategy of a multimodal large language model. It should be noted that JSON and VCoT are merely preferred implementations of this invention and should not be construed as limiting the scope of protection of this invention. Other structured formats (such as XML and YAML) or reasoning mechanisms with equivalent functionality can also be applied to this invention.
[0045] 3.1 Dynamic Information Extraction Template Generation (Form Creation): Receive contextual text information from user queries and initial recall.
[0046] Use a large language model (or rule engine) to analyze query intent to determine the information structure needed to answer the question.
[0047] A structured information extraction template is dynamically generated, defining the structure of the information to be extracted. In this embodiment, the template is instantiated as a JSON object containing predefined keys but empty values, defining the key entities, attributes, and relationships to be extracted from the candidate information blocks.
[0048] Example: If a user's query is "Compare the sales revenue and market share of product A and product B in the fourth quarter", the system-generated template might be:
{
  "comparison_summary": [
    {
      "product_name": "Product A",
      "period": "fourth quarter",
      "sales_revenue": null,
      "market_share": null
    },
    {
      "product_name": "Product B",
      "period": "fourth quarter",
      "sales_revenue": null,
      "market_share": null
    }
  ]
}
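As a rough illustration of the target data structure, the sketch below builds such an empty template in Python. In the described method a large language model (or rule engine) derives the entities and fields from the query; here they are simply passed in as arguments, and the function name `build_comparison_template` is hypothetical.

```python
import json

def build_comparison_template(products, period, metrics):
    """Return an empty extraction template: keys are fixed, metric values are None."""
    return {
        "comparison_summary": [
            {"product_name": name, "period": period, **{m: None for m in metrics}}
            for name in products
        ]
    }

template = build_comparison_template(
    ["Product A", "Product B"], "fourth quarter", ["sales_revenue", "market_share"]
)
# None is serialized as JSON null, matching the empty values in the example template.
print(json.dumps(template, indent=2))
```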
[0049] 3.2 VCoT-based template filling: Iterate through each candidate multimodal information block recalled in step two.
[0050] Submit the information block (containing text and images) along with the "empty form" generated in the previous step to a powerful multimodal large language model.
[0051] Guide the model to perform a visual chain-of-thought (Visual CoT) task using specific instructions: "You are a data analyst. Please carefully analyze the following text and charts, and strictly follow the provided JSON template format to extract the corresponding information to populate the template. Please think step by step about how you find the data from the charts." The model performs in-depth analysis on each information block and fills the extracted information into the corresponding fields of the template.
[0052] 3.3 Structured Result Integration and Conflict Resolution: The system aggregates the preliminary population results extracted from all candidate information blocks (e.g., candidate blocks 1, 2, and 3). At this point, for the same field (key) in the template, there may be multiple candidate values from different sources. The system executes the following conflict resolution process:
3.3.1 Numerical Alignment and Grouping. Group all extracted data by field. For example, for the field growth_rate, candidate block 1 (from the chart) extracts the value "30%", and candidate block 3 (from the text summary) extracts the value "28%".
[0053] 3.3.2 Attribute Weighted Evaluation. The system examines the metadata for each value: Timestamp Weight: Candidate block 1 originates from the "2023 Annual Report Final Draft" (more recent date), and candidate block 3 originates from the "2023 Q3 Quarterly Forecast" (older date). The system determines that the timestamp of the annual report has higher priority.
[0054] Model Confidence Score: In step 3.2, the model output a confidence score of 0.95 when extracting candidate block 1 (due to the explicitness of the chart data) and 0.75 when extracting candidate block 3 (due to the ambiguity of the text description). Specifically, the self-reflection capability of the large language model can be utilized. When the model detects data conflicts, it outputs a short chain of reasoning (CoT) to assist in the judgment, for example: 'Although the text mentions 28%, the chart clearly states 30%, and the chart usually represents the final settlement data, therefore we tend to favor 30%'. The system can parse this chain of reasoning to assist in arbitration.
[0055] 3.3.3 Weighted Arbitration. The system computes a combined score for each candidate value: Final score = α * time weight + β * confidence level, where α and β are weighting coefficients. In this embodiment, the system determines that "30%" from candidate block 1 is the valid value and discards "28%" from candidate block 3.
[0056] 3.3.4. Synthesize the final template. Fill the final JSON template with the optimal value after arbitration to form a logically consistent and fully populated information template.
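The weighted arbitration of steps 3.3.1 to 3.3.4 can be sketched as follows. The source gives only the form of the score (α * time weight + β * confidence level); the linear recency weight, the default coefficients α = β = 0.5, the document dates, and the names `Candidate`, `time_weights`, and `arbitrate` are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Candidate:
    value: str
    confidence: float  # reported by the model during extraction
    timestamp: date    # taken from the source block's metadata

def time_weights(candidates):
    """Assumed recency weight: linear scale, oldest -> 0.0, newest -> 1.0."""
    days = [c.timestamp.toordinal() for c in candidates]
    oldest, span = min(days), max(days) - min(days)
    return [1.0 if span == 0 else (d - oldest) / span for d in days]

def arbitrate(candidates, alpha=0.5, beta=0.5):
    """Return the candidate with the highest alpha * time_weight + beta * confidence."""
    weights = time_weights(candidates)
    scored = [(alpha * w + beta * c.confidence, c) for w, c in zip(weights, candidates)]
    return max(scored, key=lambda pair: pair[0])[1]

# Candidate values for the field growth_rate; the dates are invented for the example.
growth_rate = [
    Candidate("30%", 0.95, date(2024, 3, 1)),   # chart, annual report final draft
    Candidate("28%", 0.75, date(2023, 10, 1)),  # text, Q3 quarterly forecast
]
winner = arbitrate(growth_rate)
```

With these inputs the annual-report value scores higher on both recency and confidence, so it is the one written into the final template.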
[0057] 3.4 System 2-based consistency verification: Before generating the final template, the system initiates a "self-reflection" mode. The already populated JSON data and the original text fragments are input into the model again to perform reverse validation.
[0058] Input instruction: "Please check whether growth_rate: 30% in the JSON matches the original text 'Sales revenue increased by 30%'. Is there a logical contradiction?" Model feedback: "Upon verification, the original text states the same information, and the chart trend confirms this growth. There is no risk of hallucination. Confidence level is marked as High." If a contradiction is found, such as the original text stating a decrease but the field being filled with an increase, the system will automatically mark the field as requiring manual review or trigger a re-extraction.
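The control flow of this verification step might look like the sketch below. The actual check is performed by a large language model; here it is abstracted as a callable `check`, and the `naive_check` stand-in is only a substring match used to make the example runnable, not a form of logical verification. All names and status labels are hypothetical.

```python
def verify_template(filled, source_text, check):
    """Re-check each filled field against the source text and label the outcome.

    `check(field, value, source_text)` stands in for the model call and returns
    True when the value is supported by the source.
    """
    report = {}
    for field, value in filled.items():
        if value is None:
            report[field] = "missing"
        elif check(field, value, source_text):
            report[field] = "verified"
        else:
            report[field] = "low_confidence"  # candidate for review or re-extraction
    return report

def naive_check(field, value, source_text):
    # Placeholder only: a real system would ask the model to compare the
    # structured value with the original content.
    return str(value) in source_text

source = "Sales revenue increased by 30% year-on-year"
report = verify_template(
    {"growth_rate": "30%", "market_share": "15%", "sales_revenue": None},
    source,
    naive_check,
)
```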
[0059] 3.5 Iterative Supplementary Retrieval: The system detected that the market_share (market share) field in the integrated template was still null, which is a critical missing piece of information.
[0060] Action: The system automatically generates a new query vector based on the missing field: "2023 Smart Terminal Market Share Data".
[0061] Execution: Perform a second round of targeted retrieval in the vector database, or conduct an online search (if the system supports it).
[0062] Result: A copy of the "2023 Industry Analysis Brief" was recalled, from which "market share is approximately 15%" was extracted and filled into the template.
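A minimal sketch of this iterative loop is shown below. It assumes the template is a nested structure of dicts and lists with None marking empty fields; `retrieve_and_fill` is a hypothetical callable that stands in for generating a query from the missing fields, running the supplementary retrieval, and repeating the template filling. The round limit is also an assumption, added so the loop always terminates.

```python
def find_missing_fields(node, path=""):
    """Recursively collect the paths of fields whose value is None."""
    missing = []
    if isinstance(node, dict):
        for key, value in node.items():
            missing += find_missing_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            missing += find_missing_fields(item, f"{path}[{i}]")
    elif node is None:
        missing.append(path)
    return missing

def fill_with_retries(template, retrieve_and_fill, max_rounds=2):
    """Repeat supplementary retrieval until no field is empty or rounds run out."""
    for _ in range(max_rounds):
        missing = find_missing_fields(template)
        if not missing:
            break
        template = retrieve_and_fill(template, missing)
    return template

def fake_retrieve_and_fill(template, missing):
    # Stand-in for: build a query from the missing field, retrieve, re-extract.
    updated = dict(template)
    if "market_share" in missing:
        updated["market_share"] = "approximately 15%"
    return updated

partial = {"growth_rate": "30%", "market_share": None}
completed = fill_with_retries(partial, fake_retrieve_and_fill)
```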
[0063] Step 4: Enhanced generation of structured context.
[0064] 4.1 Constructing the final context: Combine the integrated and filled information template (which can be serialized into text) from step 3.3 with the user's original question as the final context.
[0065] 4.2 Instruction Generation: Submit the context to the large language model and issue the instruction: "Please answer the user's question clearly and logically based on the following structured information." Since the context is already highly structured and perfectly relevant to the question, the model can easily generate high-quality, highly accurate answers.
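Step four can be sketched as a simple serialization of the filled template followed by prompt assembly. The section headings inside the prompt and the function name `build_final_prompt` are assumptions; only the instruction text is taken from the description above.

```python
import json

INSTRUCTION = (
    "Please answer the user's question clearly and logically "
    "based on the following structured information."
)

def build_final_prompt(filled_template, user_query):
    """Serialize the filled template and combine it with the original query."""
    context = json.dumps(filled_template, ensure_ascii=False, indent=2)
    return (
        f"{INSTRUCTION}\n\n# Structured information\n{context}"
        f"\n\n# Question\n{user_query}"
    )

prompt = build_final_prompt(
    {"product_line": "Smart Terminal", "growth_rate": "30%"},
    "Which product line had a higher growth rate?",
)
```

Setting `ensure_ascii=False` keeps non-ASCII text (for example Chinese product names) readable in the serialized context.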
[0066] To clearly and completely describe the purpose, technical solution, and advantages of this method, specific application examples are used to further illustrate the method.
[0067] Example 1: This embodiment provides a specific implementation of a retrieval-augmented generation method that dynamically extracts information based on user queries. As an example, the implementation processes a company's annual financial report in PDF format, which contains charts and text, and answers complex questions about the performance of specific products.
[0068] The system environment and components used in this embodiment include: Document parsing: Using the open-source libraries PyMuPDF and unstructured.
[0069] Text embedding and vector storage: Chinese text embedding is performed using the bge-large-zh model, loaded through the sentence-transformers library, and ChromaDB is used as the local vector database.
[0070] Multimodal Large Language Model: The Qwen-VL-Max model from Alibaba Cloud's Tongyi Qianwen series is used. This model has powerful visual understanding, reasoning, and Chinese processing capabilities.
[0071] Workflow orchestration: Use LangChain or a custom Python script for workflow control.
[0072] As shown in Figure 2, the specific implementation steps are as follows: Step 1: Preprocessing and basic index building.
[0073] 1.1 Receiving Documents: The system receives a document titled "2023 Annual Report of a Technology Company.pdf". This document contains a textual description of the company's operations throughout the year, financial data tables, and bar charts showing the sales revenue of each product line.
[0074] 1.2 Document Parsing and Chunking: Use the PyMuPDF library to extract text blocks and image objects from a PDF page by page.
[0075] The unstructured library is used to structure the extracted content, recognizing it as elements such as "headings", "paragraphs", "images", and "tables".
[0076] Key Processing: To maintain contextual integrity, the system merges the identified image (e.g., a sales bar chart) with its immediately preceding and following paragraph text (usually the chart's title and explanatory text) to form a "multimodal information block." For example, a block might contain: [Text: "Figure 3-1: Comparison of Annual Sales Revenue by Product Line"] + [Image: Bar Chart Image Object] + [Text: "As can be seen from the chart, the 'Smart Terminal' product line performed exceptionally well, with sales revenue increasing by 30% year-on-year..."]. Plain text portions are then organized into "text information blocks" based on natural paragraphs.
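The merging rule described above can be sketched without any PDF library. The example assumes the parser has already produced a list of (kind, content) tuples in reading order; this input shape, the block dictionaries, and the function name `build_blocks` are hypothetical and do not reflect the output format of PyMuPDF or unstructured.

```python
def build_blocks(elements):
    """Group parsed elements into multimodal and text information blocks.

    An image absorbs the paragraph immediately before and after it; every
    remaining element becomes its own text block. Reading order is preserved.
    """
    blocks, used = [], set()
    for i, (kind, _) in enumerate(elements):
        if kind != "image":
            continue
        members = [
            j for j in (i - 1, i, i + 1)
            if 0 <= j < len(elements) and j not in used
            and (j == i or elements[j][0] == "paragraph")
        ]
        used.update(members)
        blocks.append((i, {"type": "multimodal", "elements": [elements[j] for j in members]}))
    for i, element in enumerate(elements):
        if i not in used:
            blocks.append((i, {"type": "text", "elements": [element]}))
    return [block for _, block in sorted(blocks, key=lambda pair: pair[0])]

elements = [
    ("paragraph", "Overview of annual operations"),
    ("paragraph", "Figure 3-1: Comparison of Annual Sales Revenue by Product Line"),
    ("image", "bar_chart.png"),
    ("paragraph", "As can be seen from the chart, the 'Smart Terminal' product line..."),
    ("paragraph", "Outlook for next year"),
]
blocks = build_blocks(elements)
```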
[0077] 1.3 Construction of basic vector indexes: Iterate through all generated information blocks.
[0078] For the "text information block", the bge-large-zh model is used directly to encode its text content into a 1024-dimensional vector.
[0079] For the "multimodal information block", the system extracts all of its text content (such as the title and description), concatenates it, and encodes it into a vector using the bge-large-zh model.
[0080] All generated vectors and their corresponding original information block contents (including references / paths to text and image objects) are stored in the ChromaDB vector database to complete the construction of the basic index.
[0081] Step 2: Preliminary search and candidate set screening.
[0082] 2.1 Receiving User Queries: The system receives a user query: "Please summarize the sales revenue of the 'Smart Terminals' and 'Cloud Services' product lines in 2023, and explain which product line had a higher growth rate?"
2.2 Candidate Information Block Recall: The system uses the bge-large-zh model to encode user queries into query vectors.
[0083] Perform a vector similarity search in the ChromaDB database to retrieve the top-3 most relevant information blocks.
[0084] The recall results may include: Candidate Block 1 (Multimodal): An information block containing a sales bar chart and its preceding and following descriptive text.
[0085] Candidate Block 2 (Text): The paragraph in the financial report that details the development of the "Smart Terminals" business.
[0086] Candidate Block 3 (text): The paragraph in the financial report regarding the "cloud services" business strategy and annual review.
[0087] Step 3: Extracting dynamic information guided by the query.
[0088] 3.1 Dynamic Information Extraction Template Generation (Form Creation): The system uses a large language model to analyze the user's query intent and dynamically generates an empty template with the following JSON structure, which includes a causal_reasoning field to support causal inference.
[0089]
{
  "product_performance": [
    {
      "product_line": "Smart Terminal",
      "year": 2023,
      "sales_revenue": null,  // To be filled: sales revenue
      "growth_rate": null     // To be filled: growth rate
    },
    {
      "product_line": "Cloud Services",
      "year": 2023,
      "sales_revenue": null,  // To be filled: sales revenue
      "growth_rate": null     // To be filled: growth rate
    }
  ],
  "comparison_conclusion": {
    "higher_growth_product": null,  // To be filled: product line with the higher growth rate
    "causal_reasoning": null        // To be filled: the causal logic chain used to draw the conclusion
  }
}
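As a minimal sketch of how such an empty template could be assembled once the query intent is known, the function below builds the same structure from a list of product lines and fields. The function name and its parameters are hypothetical; in the described method a large language model derives the entities and fields from the user query, and that step is replaced here by hard-coded arguments.

```python
import json


def build_template(product_lines, year, attributes, comparison_fields):
    """Build an empty extraction template; every value to extract starts as None."""
    return {
        "product_performance": [
            {"product_line": name, "year": year, **{attr: None for attr in attributes}}
            for name in product_lines
        ],
        "comparison_conclusion": {name: None for name in comparison_fields},
    }


# Hypothetical parsed intent for the example query.
template = build_template(
    product_lines=["Smart Terminal", "Cloud Services"],
    year=2023,
    attributes=["sales_revenue", "growth_rate"],
    comparison_fields=["higher_growth_product", "causal_reasoning"],
)
template_json = json.dumps(template, indent=2)  # None is serialized as JSON null
```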
[0090] 3.2 VCoT-based (visual chain-of-thought) template filling: The system iterates through the three candidate information blocks recalled in Step 2. Candidate block 1 (multimodal) is taken as an example.
[0091] The system constructs an input containing an image and text, and sends the following request (prompt) to the Qwen-VL-Max model:
[system] You are a meticulous financial analyst. Your task is to fill in the required data based on the text and image information provided below, strictly adhering to the given JSON template format. If a piece of information does not exist, leave the field value as null.
[0092] [user]
# JSON template
{
  "product_performance": [
    {
      "product_line": "Smart Terminal",
      "year": 2023,
      "sales_revenue": null,  // To be filled: sales revenue
      "growth_rate": null     // To be filled: growth rate
    },
    {
      "product_line": "Cloud Services",
      "year": 2023,
      "sales_revenue": null,  // To be filled: sales revenue
      "growth_rate": null     // To be filled: growth rate
    }
  ],
  "comparison_conclusion": {
    "higher_growth_product": null,  // To be filled: product line with the higher growth rate
    "causal_reasoning": null        // To be filled: the causal logic chain used to draw the conclusion
  }
}
# Information to be analyzed
## Text Information
Figure 3-1: Comparison of Annual Sales Revenue by Product Line... As can be seen from the figure, the 'Smart Terminal' product line performed exceptionally well, with sales revenue increasing by 30% year-on-year...
## Image Information
[Insert bar chart image data from candidate block 1 here]
# Your task
Please analyze the images and text step by step, and then populate the JSON template.
[0093] **The Chain of Thought:** 1. Look for information on the "smart terminal" product line.
[0094] 2. Locate the bar chart corresponding to "Smart Terminals" in the image; the top value displays "5 billion". Therefore, sales_revenue is 5 billion.
[0095] 3. The text explicitly mentions that the sales revenue of the "smart terminal" product line increased by 30% year-on-year. Therefore, growth_rate is 30%.
[0096] 4. Look for information on the "cloud services" product line.
[0097] 5. Locate the bar chart corresponding to "Cloud Services" in the image; the top value displays "3.5 billion". Therefore, sales_revenue is 3.5 billion.
[0098] 6. The text and images do not directly mention the growth rate of "cloud services". Therefore, growth_rate remains null.
[0099] 7. Comparing growth rates: the only growth rate currently known is 30% for "smart terminals", so it is provisionally taken as the product line with the higher growth rate.
[0100] **Final Output:** [The model is required to output only the filled JSON here.]
After receiving the request, the Qwen-VL-Max model performs visual analysis and text understanding, and outputs the filled JSON:
{
  "product_performance": [
    {
      "product_line": "Smart Terminal",
      "year": 2023,
      "sales_revenue": "5 billion yuan",
      "growth_rate": "30%"
    },
    {
      "product_line": "Cloud Services",
      "year": 2023,
      "sales_revenue": "3.5 billion yuan",
      "growth_rate": null
    }
  ],
  "comparison_conclusion": {
    "higher_growth_product": "Smart Terminal",
    "causal_reasoning": null
  }
}
The system repeats this process for the other candidate blocks (candidate blocks 2 and 3), which may fill in or confirm further fields such as growth_rate.
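Because the model's reply contains both the chain of thought and the filled JSON, the system has to separate the two before integration. The following sketch shows one possible way to do this, assuming the reply uses the "Final Output:" marker from the prompt above; the function name and the abbreviated sample reply are hypothetical.

```python
import json

MARKER = "Final Output:"


def parse_filled_template(response: str) -> dict:
    """Parse the first JSON object that follows the last 'Final Output:' marker."""
    tail = response.rsplit(MARKER, 1)[-1]  # whole response if the marker is absent
    start = tail.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model response")
    # raw_decode parses one JSON value starting at `start` and ignores trailing text.
    obj, _end = json.JSONDecoder().raw_decode(tail, start)
    return obj


# Abbreviated, hand-written sample reply for illustration.
response = """**The Chain of Thought:**
1. The bar for "Smart Terminal" shows 5 billion; the text states 30% growth.
2. The bar for "Cloud Services" shows 3.5 billion; no growth rate is given.
**Final Output:**
{"product_performance": [
  {"product_line": "Smart Terminal", "year": 2023,
   "sales_revenue": "5 billion yuan", "growth_rate": "30%"},
  {"product_line": "Cloud Services", "year": 2023,
   "sales_revenue": "3.5 billion yuan", "growth_rate": null}],
 "comparison_conclusion": {"higher_growth_product": "Smart Terminal",
                           "causal_reasoning": null}}
"""
filled = parse_filled_template(response)
```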
[0101] 3.3 Structured Result Integration and Conflict Resolution: The system aggregates the extraction results of all candidate blocks. When different blocks yield conflicting values for the same field (e.g., growth_rate), it executes a conflict resolution process:
Numerical grouping: Candidate block 1 (annual report chart) yields the value "30%", while candidate block 3 (Q3 forecast) yields the value "28%".
[0102] Weighted assessment:
Timestamps: Candidate block 1 is the "final draft of the annual report" (latest), and candidate block 3 is the "Q3 forecast" (older).
[0103] Confidence: The model's confidence when extracting the chart data (0.95) is higher than its confidence when extracting from the ambiguous text (0.75).
[0104] Arbitration: The system determines that "30%" is the valid value, discards "28%", and forms a preliminary integrated template.
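The weighted arbitration can be illustrated with a small scoring function. The embodiment gives the confidence values (0.95 and 0.75) and the relative age of the two sources, but not the weighting formula, so the linear combination, the equal weights, and the concrete dates below are assumptions made only for this sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    value: str
    confidence: float  # reported by the model alongside the extraction
    timestamp: date    # taken from the metadata of the source block
    source: str


def arbitrate(candidates, w_conf=0.5, w_time=0.5):
    """Pick the candidate with the highest weighted score of confidence and recency."""
    oldest = min(c.timestamp for c in candidates)
    newest = max(c.timestamp for c in candidates)
    span = (newest - oldest).days or 1  # avoid division by zero when all dates match

    def score(c):
        recency = (c.timestamp - oldest).days / span  # 0.0 = oldest, 1.0 = newest
        return w_conf * c.confidence + w_time * recency

    return max(candidates, key=score)


candidates = [
    # Dates are hypothetical; only their order (annual report is newer) is from the text.
    Candidate("30%", 0.95, date(2024, 3, 1), "annual report chart"),
    Candidate("28%", 0.75, date(2023, 10, 1), "Q3 forecast"),
]
winner = arbitrate(candidates)
```

With these inputs the annual report value wins on both criteria, which matches the arbitration result described above.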
[0105] 3.4 System 2-based consistency verification: The system initiates a "self-reflection" mode to perform reverse verification on the integrated data.
[0106] Input instruction: "Please check whether growth_rate: 30% in the JSON contradicts the original text." Model feedback: "Upon verification, the original text clearly states the same information, and the chart trend confirms the growth. There is no risk of hallucination."
[0107] 3.5 Iterative Retrieval: The system detects that the causal_reasoning field in the integrated template is still empty. This is key missing information, without which the "why" part of the question cannot be answered.
[0108] Action: The system automatically generates a new query vector based on the missing field: "The reason why the growth rate of smart terminals was higher than that of cloud services in 2023".
[0109] Execution: Perform a second round of targeted retrieval in the vector database.
[0110] Result: A copy of the "President's Address" was recalled, which stated that "the growth of smart terminals is mainly due to the explosive sales of the newly released X series mobile phones."
[0111] Additional population: Fill the causal logic into the causal_reasoning field.
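The iterative supplementary retrieval can be sketched as a loop that looks for fields that are still null and asks a retrieval-and-fill routine to complete them. The helper names, the round limit, and the stand-in callback are hypothetical; in the embodiment the callback would generate a query, search the vector database, and fill the field with the multimodal model.

```python
def missing_fields(node, path=""):
    """Yield the dotted path of every field whose value is still None."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from missing_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from missing_fields(item, f"{path}[{i}]")
    elif node is None:
        yield path


def fill_iteratively(template, retrieve_and_fill, max_rounds=3):
    """Repeat supplementary retrieval until no gaps remain or the round limit is hit."""
    for _ in range(max_rounds):
        gaps = list(missing_fields(template))
        if not gaps:
            break
        for gap in gaps:
            retrieve_and_fill(template, gap)
    return template


def fake_retrieve_and_fill(template, gap):
    # Stand-in for "generate query, retrieve, fill with the multimodal model".
    if gap == "comparison_conclusion.causal_reasoning":
        template["comparison_conclusion"]["causal_reasoning"] = (
            "The growth of smart terminals is mainly due to the explosive sales "
            "of the newly released X series mobile phones."
        )


template = {
    "product_performance": [
        {"product_line": "Smart Terminal", "year": 2023,
         "sales_revenue": "5 billion yuan", "growth_rate": "30%"},
        {"product_line": "Cloud Services", "year": 2023,
         "sales_revenue": "3.5 billion yuan", "growth_rate": "25%"},
    ],
    "comparison_conclusion": {"higher_growth_product": "Smart Terminal",
                              "causal_reasoning": None},
}
gaps_before = list(missing_fields(template))
fill_iteratively(template, fake_retrieve_and_fill)
gaps_after = list(missing_fields(template))
```

The round limit is a design choice of this sketch: it keeps the loop from running indefinitely when a field cannot be found in the knowledge base.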
[0112] Step 4: Enhanced generation of structured context.
[0113] 4.1 Construct the final context and generate the answer: The system serializes the final, integrated, and filled JSON template into clear text, combines it with the user's original question, and sends a final request to the Qwen-VL-Max model (or another model dedicated to generation):
[user]
# User question
Please summarize the sales revenue of the 'Smart Terminals' and 'Cloud Services' product lines in 2023, and explain which product line had a higher growth rate.
# Known facts (derived from document analysis)
- Product Line: Smart Terminals
  - 2023 sales revenue: 5 billion yuan
  - 2023 growth rate: 30%
- Product Line: Cloud Services
  - 2023 sales revenue: 3.5 billion yuan
  - 2023 growth rate: 25%
- Growth rate comparison: 'Smart Terminals' has the higher growth rate.
[0114] - Core reason (cause and effect chain): The growth of smart terminals is mainly due to the explosive sales of the newly released X series mobile phones.
[0115] # Your task
Based on the facts above, please answer the user's question in natural and fluent language.
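Serializing the filled template into the "Known facts" section is a mechanical step, sketched below. The function names and the exact line layout are assumptions for illustration; the field names follow the JSON template of the embodiment.

```python
def serialize_facts(template: dict) -> str:
    """Turn the filled template into a bulleted list of known facts."""
    lines = []
    for item in template["product_performance"]:
        lines.append(f"- Product Line: {item['product_line']}")
        lines.append(f"  - {item['year']} sales revenue: {item['sales_revenue']}")
        lines.append(f"  - {item['year']} growth rate: {item['growth_rate']}")
    conclusion = template["comparison_conclusion"]
    lines.append(
        f"- Growth rate comparison: '{conclusion['higher_growth_product']}' "
        "has the higher growth rate."
    )
    lines.append(f"- Core reason (cause and effect chain): {conclusion['causal_reasoning']}")
    return "\n".join(lines)


def build_final_prompt(question: str, template: dict) -> str:
    """Combine the user's question, the serialized facts, and the task instruction."""
    return "\n".join([
        "# User question",
        question,
        "# Known facts (derived from document analysis)",
        serialize_facts(template),
        "# Your task",
        "Based on the facts above, please answer the user's question "
        "in natural and fluent language.",
    ])


template = {
    "product_performance": [
        {"product_line": "Smart Terminals", "year": 2023,
         "sales_revenue": "5 billion yuan", "growth_rate": "30%"},
        {"product_line": "Cloud Services", "year": 2023,
         "sales_revenue": "3.5 billion yuan", "growth_rate": "25%"},
    ],
    "comparison_conclusion": {
        "higher_growth_product": "Smart Terminals",
        "causal_reasoning": "The growth of smart terminals is mainly due to the "
                            "explosive sales of the newly released X series mobile phones.",
    },
}
prompt = build_final_prompt(
    "Please summarize the sales revenue of the 'Smart Terminals' and 'Cloud Services' "
    "product lines in 2023, and explain which product line had a higher growth rate.",
    template,
)
```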
[0116] 4.2 Output the final answer: The model generates the final answer based on the highly structured context:
"Based on the analysis of the annual report you provided, in 2023:
Sales of smart terminal products reached 5 billion yuan.
[0117] Sales of cloud service products reached 3.5 billion yuan.
[0118] In terms of growth rate, the smart terminal product line grew faster (30%) than the cloud service product line (25%). This was mainly due to the explosive sales growth of the company's newly released X-series phones that year."
Through the above embodiment, this method decomposes a complex problem, one that requires combining text and image information for data extraction and comparison, into key steps: retrieval, dynamic template generation, multimodal chain-of-thought filling, conflict resolution, consistency verification, iterative supplementary retrieval, and final generation. It ultimately produces accurate and verifiable answers, which demonstrates the technical advantages of the present invention.
[0119] This method first preprocesses the multimodal documents to construct a basic vector index. Upon receiving a user query, the system uses this vector index for preliminary retrieval, recalling relevant multimodal document fragments. The core innovation of this method is that the system analyzes the user's query intent and dynamically generates a structured "information extraction form" or template, which defines the target information needed to answer the question. Subsequently, the system calls a multimodal large language model and uses its multimodal chain-of-thought mechanism to analyze the recalled document fragments (including text and images) in depth, extracting the required information in a "form-filling" manner. Finally, the completed form content, as a highly structured and relevant context, is submitted to the large language model to generate the final answer. This method solves the problem in existing technologies that information extraction is disconnected from user intent, achieves on-demand and precise deep information mining, and improves the relevance and accuracy of the answer.
[0120] This invention introduces a closed-loop optimization mechanism similar to the "System 2 slow thinking" of human experts. After initial form filling, the system uses a multimodal model for self-reflection and logical consistency verification, eliminating potential hallucinations and data conflicts. Simultaneously, for key fields still missing in the template, the system can autonomously initiate iterative supplementary retrieval until complete information is obtained. The verified complete form content, as a highly structured and relevant context, is submitted to a large language model to generate the final answer. This method not only solves the problem in existing technologies that information extraction is disconnected from user intent, but also achieves on-demand, precise deep information mining through reasoning verification and proactive retrieval, improving the factual accuracy and logical rigor of the answers.
[0121] This invention also provides a retrieval enhancement generation system based on dynamic information extraction from user queries, comprising:
a preliminary retrieval module for performing the preliminary retrieval step;
a dynamic template generation module for performing the dynamic template generation step;
a template filling and optimization module, configured with a multimodal large language model, for performing the template filling step as well as conflict resolution and logical consistency verification; and
a response generation module for performing the response generation step.
This system can implement the retrieval enhancement generation method based on dynamic information extraction from user queries as described in the above embodiments.
[0122] This invention also provides an electronic device, including: at least one memory and at least one processor; The at least one memory is used to store a machine-readable program; The at least one processor is used to call the machine-readable program to implement the retrieval enhancement generation method based on dynamic information extraction from user queries as described in the above embodiments.
[0123] This invention also provides a computer-readable medium storing computer instructions. When executed by a processor, the computer instructions cause the processor to perform the retrieval enhancement generation method based on dynamic information extraction according to user queries as described in the above embodiments. Specifically, a system or apparatus equipped with a storage medium storing software program code that implements the functions of any of the above embodiments can be provided, and the computer (or CPU or MPU) of the system or apparatus can read and execute the program code stored in the storage medium.
[0124] In this case, the program code read from the storage medium can itself implement the function of any of the above embodiments, and therefore the program code and the storage medium storing the program code constitute part of the present invention.
[0125] Storage media embodiments for providing program code include floppy disks, hard disks, magneto-optical disks, optical disks (such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), magnetic tapes, non-volatile memory cards, and ROMs. Alternatively, program code can be downloaded from a server computer via a communication network.
[0126] Furthermore, it should be clear that not only can the program code read by the computer be executed, but also the operating system or other components operating on the computer can be instructed based on the program code to perform some or all of the actual operations, thereby realizing the function of any of the embodiments described above.
[0127] Furthermore, it is understood that the program code read from the storage medium may be written to a memory provided on an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer. Then, based on the instructions of the program code, the CPU or other components installed on the expansion board or expansion unit execute some or all of the actual operations, thereby realizing the function of any of the embodiments described above.
[0128] The present invention has been shown and described in detail above with reference to the accompanying drawings and preferred embodiments. However, the present invention is not limited to these disclosed embodiments. Based on the above multiple embodiments, those skilled in the art will know that more embodiments of the present invention can be obtained by combining the technical means in the different embodiments above, and these embodiments are also within the protection scope of the present invention.
Claims
1. A retrieval enhancement generation method based on dynamic information extraction from user queries, characterized in that the implementation of this method includes:
(1) Preliminary retrieval step: when a user query is received, one or more candidate information blocks related to the user query are retrieved and recalled from a knowledge base containing multimodal information blocks;
(2) Dynamic template generation step: based on the user's query intent, a structured information extraction template that defines the structure of the information to be extracted is dynamically generated;
(3) Template filling and optimization step: using a multimodal large language model, the candidate information blocks are analyzed, conflict resolution of multi-source data is performed based on the model confidence score or document timestamp, logical consistency verification is performed, and the confirmed information is filled into the structured information extraction template to form a filled template;
(4) Response generation step: based on the structured information in the filled template, the final response to the user query is generated.
2. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, Prior to the initial retrieval step, a preprocessing step is also included, which parses and segments the original multimodal document and creates vector embeddings for each information block to build a vector index library for retrieval.
3. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, In the dynamic template generation step, the user query is analyzed using a large language model, and the structured information extraction template is generated in a structured data format. The data structure of the structured information extraction template includes not only entity attribute fields, but also causal logic chain fields between entities, which are used to characterize the cause, process and result of the event.
4. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, The candidate information block is a multimodal information block, which includes text elements and / or visual elements.
5. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 4, characterized in that, In the template filling step, the multimodal large language model uses a visual thinking chain method to analyze the visual elements in the candidate information block in order to extract the information required to fill the template.
6. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, The template filling step further includes an integration step, which merges the multiple filled templates obtained after filling multiple candidate information blocks into a final filled template with more complete information; the integration step includes a conflict resolution mechanism. When there is inconsistency in the information extracted from different candidate information blocks for the same template field, the system performs weighted arbitration based on the preset confidence score and / or the timestamp of the candidate information block to determine the value to be filled into the final filled template. The confidence score is output synchronously by the multimodal large language model when extracting information, and the timestamp is determined by the metadata of the candidate information block.
7. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, following the template filling step, a consistency verification step is also included: using a large language model to perform a reverse logical comparison between the structured information in the filled template and the original content of the candidate information block; if the comparison result shows a logical contradiction or factual hallucination, the model is triggered to generate a self-correction instruction to correct the structured information or mark it as a low-confidence state.
8. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, The template filling step also includes an iterative supplementary retrieval mechanism: if it is found that the key fields in the structured information extraction template fail to obtain effective information from the current candidate information block, a new retrieval query vector is automatically generated based on the missing fields, the supplementary retrieval step is executed, and the template filling is performed again on the supplemented and recalled information block.
9. The retrieval enhancement generation method based on dynamic information extraction from user queries according to claim 1, characterized in that, In the response generation step, the filled template is serialized into text and used together with the user's original query as context, and then input into a large language model to generate the final response.
10. A retrieval enhancement generation system based on dynamic information extraction from user queries, characterized in that it comprises:
a preliminary retrieval module for performing the preliminary retrieval step;
a dynamic template generation module for performing the dynamic template generation step;
a template filling and optimization module, configured with a multimodal large language model, for performing the template filling step as well as conflict resolution and logical consistency verification; and
a response generation module for performing the response generation step.
The system is capable of implementing the method described in any one of claims 1 to 9.