Knowledge retrieval reasoning method and system based on multi-agent cooperation and reinforcement learning
By constructing a domain knowledge forest and a closed-loop data refinement module, and by combining multi-agent collaboration with reinforcement learning across a generative model and an inference engine, the method addresses the hallucination and the separation of retrieval from reasoning that affect existing models in vertical domains, enabling high-quality logical reasoning, accurate retrieval, and better answers to complex questions.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: 浙江金汇数字技术有限公司
- Filing Date: 2026-02-05
- Publication Date: 2026-06-16
AI Technical Summary
Existing generative models suffer from hallucination risk, separation of retrieval from reasoning, and a lack of high-quality logical reasoning data when handling complex problems in vertical domains, making it difficult to meet application requirements for high accuracy and high compliance.
We construct a knowledge retrieval and reasoning system based on multi-agent collaboration and reinforcement learning. By building a domain knowledge forest and a closed-loop data refinement module, we adopt a two-stage result-oriented reinforcement learning strategy and combine generative models and reasoning engines to achieve a deep integration of autonomous retrieval and logical reasoning.
It significantly improves the model's retrieval accuracy and reasoning logic in complex scenarios, reduces the risk of hallucinations, increases the accuracy and interpretability of answers to complex questions, and enhances the stability and data quality of the training process.
Figure CN122221997A_ABST
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and natural language processing, and in particular to a knowledge retrieval and reasoning method and system based on multi-agent collaboration and reinforcement learning.
Technical Background
[0002] With the rapid development of large language models, they have demonstrated powerful capabilities in instruction following and intent understanding in general domains. However, existing generative models still face significant challenges when dealing with complex problems in specific vertical domains.
[0003] Most existing question-answering and reasoning methods rely on parametric knowledge stored within the model or on simple retrieval augmentation techniques. These methods have significant limitations in complex scenarios that require multi-step reasoning, heavy dependence on external evidence, and rigorous logic.
[0004] 1. Hallucinations and factual errors: Relying solely on the model's parametric knowledge can easily lead to "hallucinations," i.e., generating seemingly coherent content that contradicts facts or domain knowledge. Without the support of authoritative external data, the model struggles to guarantee the accuracy of its responses.
[0005] 2. Insufficient reasoning depth and fragmented retrieval: Existing retrieval augmentation methods are typically static (one-time retrieval), and the model lacks the ability to decide autonomously "when a retrieval is needed" and "how to use the retrieval results." This separation between the retrieval and reasoning processes prevents the model from effectively integrating external evidence into logical deduction when dealing with complex, long-chain problems.
[0006] 3. Scarcity of high-quality training data: Vertical domains often lack high-quality reasoning chain data that has undergone rigorous logical verification. Fine-tuning directly on raw data makes it difficult for the model to learn a rigorous "generate-critique-optimize" thought process.
These issues make it difficult for existing document-retrieval-augmented reasoning systems to meet high-precision and high-compliance application requirements when handling complex tasks that require multi-step logical deduction and strong evidence support, which limits their large-scale deployment in professional fields.
[0007] Given the above problems, the field of natural language processing urgently needs a technical solution that can construct high-quality reasoning data and autonomously optimize retrieval and reasoning strategies through reinforcement learning.
Summary of the Invention
[0008] To address the problems of severe model hallucination, the disconnect between retrieval and reasoning, and the lack of high-quality logical reasoning data in existing technologies, this invention provides a knowledge retrieval and reasoning method and system based on multi-agent collaboration and reinforcement learning. This invention constructs a closed-loop data refinement pipeline based on a domain knowledge forest and designs a two-stage result-oriented reinforcement learning mechanism to jointly model data synthesis and policy optimization, thereby improving the model's retrieval accuracy and the logical soundness of its reasoning in complex scenarios.
[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0010] In a first aspect, embodiments of this application provide a knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning, comprising the following steps:
[0011] S1. Construct a domain knowledge forest and generate an initial question-answer pair dataset.
[0012] S2. Generate a refined dataset through the closed-loop data refining module.
[0013] The closed-loop data refinement module adopts an iterative strategy of generation-critique-optimization. The initial question-answer pair dataset is input into the closed-loop data refinement module, which performs self-critique and answer rewriting through a large language model configured with specific prompt words, generating a refined dataset containing high-quality reasoning chains.
[0014] S3. Construct and train a knowledge retrieval reasoning network.
[0015] The knowledge retrieval reasoning network includes a generative model based on the Transformer architecture and an external inference engine. The generative model is responsible for predicting the token sequence based on the context, while the inference engine is responsible for monitoring the streamed output of the generative model and executing logical control (such as pausing generation, calling interfaces, and injecting documents).
[0016] Based on the generated refined dataset, the knowledge retrieval reasoning network is trained using a two-stage outcome-oriented reinforcement learning strategy that includes a retrieval adaptation phase and a reasoning integration phase.
[0017] During the retrieval adaptation phase, the training objective focuses on enabling the generative model to accurately predict specific tag sequences pointing to external tools when knowledge gaps are detected. The inference engine captures these sequences to achieve linkage with the retrieval environment, ensuring strong logical coupling between model output and tool invocation.
[0018] S4. Input the user query to be processed into the trained knowledge retrieval reasoning network. The network autonomously executes retrieval decisions and logical reasoning through probability prediction to obtain the final response result.
[0019] In one possible implementation, constructing the domain knowledge forest includes: determining a set of root node labels for the domain; recursively generating a set of sub-labels based on the root node labels using a large language model to form a hierarchical tree structure; and using a vector similarity matching method to integrate entities and rules from an external knowledge base into the branches of the tree structure. Based on the paths of the domain knowledge forest and combined with a set difficulty level, an initial question-answer pair dataset containing questions and initial responses is automatically synthesized using a generative large language model and an instruction generation template.
[0020] In one possible implementation, the closed-loop data refining module performs the following operations on each sample in the initial question-answer pair dataset:
[0021] First, a critique and rewriting process is performed. For the initial response, a large language model acts as the critique agent: given a prompt containing strict logical review instructions, it generates critique content, which is structured data including a strengths analysis, a weaknesses analysis, and modification suggestions. Then, based on these modification suggestions, the large language model acts as the rewriter agent: given a prompt containing correction instructions, it rewrites and optimizes the initial response to generate a refined response. If the refined response does not meet the preset standard (e.g., the critique score is below a threshold) and the maximum iteration count has not been reached, the currently generated refined response is used as the initial response, and the critique and rewriting process is repeated.
[0022] In one possible implementation, the knowledge retrieval reasoning network includes a vocabulary that extends the base language model vocabulary with predefined special tags, including "query start" <begin_of_query>, "query end" <end_of_query>, "chain of thought start" <think>, "chain of thought end" </think>, and "answer start" <answer>. The knowledge retrieval reasoning network autonomously drives the streaming decoding and generation operations of retrieval decisions, external environment linkage, and logical reasoning by predicting the probability distribution over the tokens in the vocabulary.
[0023] In one possible implementation, the knowledge retrieval reasoning network employs streaming decoding to monitor the output sequence during generation: if the model, through probability prediction, actively outputs a special marker sequence containing <begin_of_query> and <end_of_query>, this is treated as a "generation decision" operation that expresses the retrieval intent. At this point, the inference engine interrupts the current generation process, extracts the query statement, calls external tools, and, after encapsulating the returned retrieval results in a preset format, appends them to the end of the model's current input cache (KV Cache), thus completing the "environmental feedback injection" that updates the context. Subsequently, the model resumes generation based on the updated context and continues to output content wrapped in <think> tags that contains the logical deduction process.
[0024] In one possible implementation, the two phases of the two-stage result-oriented reinforcement learning strategy are executed sequentially and have a parameter inheritance relationship: Phase 1 (retrieval adaptation phase) starts in the early stage of training and ends when the accuracy of the retrieval label format generated by the model reaches a preset threshold; Phase 2 (reasoning integration phase) inherits the model weights trained in Phase 1, introduces answer rewards while maintaining retrieval format constraints, and focuses on optimizing the model's ability to integrate and infer retrieval information. The specific process is as follows:
[0025] Phase 1: Retrieval Adaptation Phase
[0026] The goal of this stage is to train the model to master the ability to invoke external retrieval tools. The reward function for stage one is defined as the sum of the retrieval reward and the format reward.
[0027] The retrieval reward $R_{retrieval}$ is defined as:
[0028] $$R_{retrieval} = \begin{cases} 0.5, & n \geq 1 \\ 0, & n = 0 \end{cases}$$
[0029] Where n represents the number of times the complete <begin_of_query> ... <end_of_query> structure appears in the model output. The format reward $R_{format}$ is set as follows: if the text output by the knowledge retrieval reasoning network contains a correct query tag pair <begin_of_query> and <end_of_query> and the enclosed content is not empty, the reward value is 0.5; otherwise, the reward value is -1.
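The stage-one reward described above can be sketched in a few lines. This is a minimal illustration, not the applicant's implementation: it assumes the retrieval reward is 0.5 when at least one complete query span appears and 0 otherwise (the example value given in the embodiment), and it reads "correct tag pair" as "every query tag is paired and encloses non-empty text". The function names are hypothetical.

```python
import re

# Matches one complete query span; DOTALL lets a query contain newlines.
QUERY_SPAN = re.compile(r"<begin_of_query>(.*?)<end_of_query>", re.DOTALL)


def retrieval_reward(output: str) -> float:
    """0.5 if the output holds at least one complete query span (n >= 1), else 0."""
    n = len(QUERY_SPAN.findall(output))
    return 0.5 if n >= 1 else 0.0


def format_reward_stage1(output: str) -> float:
    """0.5 if all query tags are paired and enclose non-empty text, else -1.

    The pairing check is an assumption about what "correct" means here.
    """
    spans = QUERY_SPAN.findall(output)
    balanced = (
        output.count("<begin_of_query>") == len(spans)
        and output.count("<end_of_query>") == len(spans)
    )
    non_empty = bool(spans) and all(s.strip() for s in spans)
    return 0.5 if balanced and non_empty else -1.0


def stage1_reward(output: str) -> float:
    """Stage-one reward: retrieval reward plus format reward."""
    return retrieval_reward(output) + format_reward_stage1(output)
```

Under these assumptions a well-formed query earns 1.0, an output with no query earns -1.0, and an empty query span earns -0.5 because it still counts toward n while failing the format check.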
[0030] Phase Two: Reasoning and Integration Phase
[0031] The goal of this stage is to train the model to make accurate inferences using the retrieved information. The reward function for stage two is defined as the sum of the answer reward and the format reward. The retrieval reward is removed in this stage.
[0032] The answer reward $R_{answer}$ is the F1 score computed between the predicted answer and the standard answer. The predicted answer is the text content following the <answer> tag, extracted from the network output sequence.
[0033] The formula is as follows:
[0034] $$\text{Precision} = \frac{IN}{PN}, \quad \text{Recall} = \frac{IN}{RN}, \quad R_{answer} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
[0035] Wherein, PN represents the number of words in the predicted answer, RN represents the number of words in the standard reference answer, and IN represents the number of overlapping words between the predicted answer and the standard reference answer.
[0036] The format reward $R_{format}$ is set as follows: regular expressions are used to match the output of the knowledge retrieval reasoning network; if the output contains a complete <think> ... </think> reasoning chain structure and the <answer> tag, the format is considered correct and the reward value is 0; if a key tag is missing, the reward value is -2.
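The stage-two reward can be sketched in the same way. The block below is an illustration under stated assumptions: words are obtained by whitespace splitting (Chinese text would need a word segmenter instead), overlap is counted as a multiset intersection, and the query tag constraints inherited from stage one are omitted for brevity. All function names are hypothetical.

```python
import re
from collections import Counter

THINK_SPAN = re.compile(r"<think>.*?</think>", re.DOTALL)
# Captures the text after <answer>, up to </answer> or the end of the output.
ANSWER_TEXT = re.compile(r"<answer>(.*?)(?:</answer>|$)", re.DOTALL)


def f1_score(predicted: str, reference: str) -> float:
    """Word-level F1 with PN, RN and IN as defined in the text."""
    pred_words = predicted.split()
    ref_words = reference.split()
    overlap = sum((Counter(pred_words) & Counter(ref_words)).values())  # IN
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)  # IN / PN
    recall = overlap / len(ref_words)      # IN / RN
    return 2 * precision * recall / (precision + recall)


def format_reward_stage2(output: str) -> float:
    """0 if a complete <think>...</think> span and an <answer> tag exist, else -2."""
    ok = THINK_SPAN.search(output) is not None and "<answer>" in output
    return 0.0 if ok else -2.0


def stage2_reward(output: str, reference: str) -> float:
    """Stage-two reward: answer reward (F1) plus format reward."""
    match = ANSWER_TEXT.search(output)
    predicted = match.group(1).strip() if match else ""
    return f1_score(predicted, reference) + format_reward_stage2(output)
```

For example, a five-word predicted answer that is fully contained in an eight-word reference gives precision 1 and recall 5/8, so the F1 term is 10/13.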
[0037] In one possible implementation, a loss calculation mechanism based on retrieval masks is employed during training: when the knowledge retrieval reasoning network performs a retrieval and obtains external document content, that content is wrapped between <doc_start> and <doc_end> and inserted into the context. When the reinforcement learning loss is calculated, the gradient weights of the text tokens between <doc_start> and <doc_end> are set to 0 (masked), so that gradients propagate backward only through the reasoning text generated by the model and are not affected by the external document text.
[0038] In one possible implementation, during the inference phase, the knowledge retrieval reasoning network receives user input, performs probability prediction using the trained parameters, and makes an autonomous judgment: if the model predicts the next token to be <begin_of_query>, the model has decided to perform a retrieval. The model then continues to generate the query terms followed by <end_of_query>. When the inference engine detects <end_of_query>, generation is paused and an external search engine or database is invoked to retrieve the Top-K related documents. The system then wraps the document content in the <doc_start> ... <doc_end> format and appends it to the end of the model context. Generation then resumes, and the model continues, based on the retrieved results, to generate <think> reasoning process </think> and finally outputs <answer> final conclusion </answer>.
[0039] In a second aspect, embodiments of this application provide a knowledge retrieval and reasoning system based on multi-agent collaboration and reinforcement learning, including:
[0040] The data synthesis module is used to build a domain knowledge forest and automatically synthesize the initial question-answer pair dataset by calling a large language model based on knowledge paths and difficulty levels.
[0041] The closed-loop data refinement module receives the initial question-answer pairs from the initial question-answer pair dataset, performs self-criticism and answer rewriting through a large language model configured with specific prompt words, and generates a refined dataset containing high-quality reasoning chains.
[0042] The model building and training module is used to construct a knowledge retrieval and reasoning network, and to train the network using a refined dataset through a two-stage result-oriented reinforcement learning strategy that includes a retrieval adaptation stage and a reasoning integration stage; during the training process, a retrieval masking mechanism is used to shield the influence of external documents on the gradient.
[0043] The knowledge retrieval reasoning network includes a generative model based on the Transformer architecture and an external inference engine. The generative model is responsible for predicting the token sequence based on the context, while the inference engine is responsible for monitoring the streamed output of the generative model and executing logical control (such as pausing generation, calling interfaces, and injecting documents).
[0044] During the retrieval adaptation phase, the training objective focuses on enabling the generative model to accurately predict specific tag sequences pointing to external tools when knowledge gaps are detected. The inference engine captures these sequences to achieve linkage with the retrieval environment, ensuring strong logical coupling between model output and tool invocation.
[0045] The reasoning execution module receives user queries to be processed, uses a trained knowledge retrieval reasoning network to autonomously execute retrieval decisions and logical reasoning through probability prediction, and obtains the final response result.
[0046] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0047] (1) Significantly improves data quality and logic: The closed-loop data refining module (corresponding generation-critique-optimization mechanism) proposed in this invention can automatically synthesize and self-correct high-quality, logically rigorous reasoning data by using domain knowledge forest as a guide in the absence of expert manual annotation, thus solving the problem of scarce training data in vertical domains.
[0048] (2) Achieving deep integration of retrieval and reasoning: Through two-stage result-oriented reinforcement learning, the model no longer passively receives retrieval results, but actively learns "when to retrieve" and "how to use retrieval results to optimize answers". Stage 1 strengthens the ability to call tools, and Stage 2 strengthens information integration and accuracy through F1 score rewards, enabling the model to have true dynamic reasoning ability.
[0049] (3) Strong training stability: The retrieval mask loss calculation mechanism introduced in this invention effectively eliminates the interference of changes in the length and content of external retrieval text on the model gradient update, ensuring that the model focuses on optimizing its own reasoning logic generation strategy, thereby improving the convergence speed and stability of the training process.
[0050] (4) High interpretability and accuracy of results: Compared with traditional end-to-end generation models, the responses generated by this system contain explicit chains of thought and clear retrieval evidence, which reduces the risk of hallucination and improves the accuracy and interpretability of answers to complex questions.
Description of the Drawings
[0051] Figure 1 is a schematic diagram of the overall framework of the method according to an embodiment of the present invention.
[0052] Figure 2 is a flowchart illustrating the closed-loop data refining module (generation-critique-optimization) in an embodiment of the present invention.
[0053] Figure 3 is a flowchart illustrating the two-stage result-oriented reinforcement learning strategy in an embodiment of the present invention.
[0054] Figure 4 is a schematic diagram illustrating the interaction between the knowledge retrieval reasoning network and the external environment, as well as the retrieval masking mechanism, in an embodiment of the present invention.
Detailed Description
[0055] To make the technical means, inventive features, objectives, and effects of the invention readily understandable, the invention is further described below with reference to specific illustrations. However, the invention is not limited to the embodiments described below.
[0056] Example: Knowledge reasoning in a specific field (such as market compliance review)
[0057] This embodiment provides a knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning. As shown in Figure 1, it includes the following steps:
[0058] S1. Construct a domain knowledge forest and generate an initial question-answer pair dataset.
[0059] This invention first constructs a hierarchical domain knowledge forest (T).
[0060] Root node generation: Initialize a set of root tags covering the core issues of the target domain. For example, in the area of compliance review, root labels may include "market access", "pricing regulations", "exclusivity restrictions", etc.
[0061] Recursive expansion: For each root tag, a large language model (Llama-3-70B) is used for recursive expansion. Specifically, the prompt "Please list the subordinate concepts and key attributes of concept X in the vertical domain" is input, and a regular expression extractor parses the numbered or delimiter-separated list text returned by the model and converts it into tree-structured child node objects. A depth-first search (DFS) strategy is adopted: each newly generated sub-label is used as the input variable for the next round of prompts and fed back into the large language model. This process is repeated until the depth of every branch from the root node to its leaf nodes reaches a preset value (3 levels in this example), for example: (1) Price specification - (1-1) Uniform pricing - (1-1-1) Impact on competition.
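The recursive expansion step can be illustrated with a short sketch. It assumes the language model is exposed as a plain function from prompt string to response string; `fake_llm` below is a hypothetical stand-in for the real model call, and the `Node` class, the `LIST_ITEM` pattern, and the parameter names are illustrative choices that do not come from the original text.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

PROMPT = ("Please list the subordinate concepts and key attributes "
          "of concept {concept} in the vertical domain")
# Strips a leading "1.", "2)", "-" or "*" from one line of list text.
LIST_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*])\s*(.+?)\s*$")


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def parse_list(text: str) -> List[str]:
    """Extract item labels from numbered or bulleted list text."""
    items = []
    for line in text.splitlines():
        m = LIST_ITEM.match(line)
        if m:
            items.append(m.group(1))
    return items


def expand(node: Node, llm: Callable[[str], str], depth: int = 1,
           max_depth: int = 3) -> Node:
    """Depth-first expansion; the root sits at level 1."""
    if depth >= max_depth:
        return node
    for label in parse_list(llm(PROMPT.format(concept=node.label))):
        child = Node(label)
        node.children.append(child)
        expand(child, llm, depth + 1, max_depth)  # recurse before the next sibling
    return node


def fake_llm(prompt: str) -> str:
    # Hypothetical model stub: returns two numbered sub-concepts.
    concept = prompt.split("of concept ")[1].split(" in the vertical")[0]
    return f"1. {concept} / A\n2. {concept} / B"
```

Calling `expand(Node("Pricing regulations"), fake_llm)` yields a tree three levels deep, with each child fully expanded before its sibling is visited.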
[0062] External rule fusion: Construct an external knowledge base S containing unstructured rule document summaries and structured entity triples. Using the vector similarity matching method, the leaf node labels in the tree structure are converted into vector representations. Top-k related content is retrieved from the external knowledge base S and attached as "knowledge attributes" to the corresponding leaf nodes to form a complete knowledge structure T.
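A minimal sketch of the external rule fusion step follows. It assumes the leaf labels and knowledge base entries have already been embedded as numeric vectors by some embedding model (which the text does not name), and it uses cosine similarity as the similarity measure. The toy vectors and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

KnowledgeBase = List[Tuple[str, Sequence[float]]]  # (text, embedding) pairs


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], kb: KnowledgeBase, k: int = 2) -> List[str]:
    """Return the k knowledge base texts most similar to the query vector."""
    ranked = sorted(kb, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def attach_knowledge(leaf_vectors: Dict[str, Sequence[float]],
                     kb: KnowledgeBase, k: int = 2) -> Dict[str, List[str]]:
    """Map each leaf label to its top-k "knowledge attributes"."""
    return {label: top_k(vec, kb, k) for label, vec in leaf_vectors.items()}


# Toy two-dimensional embeddings, for illustration only.
kb = [
    ("Rule summary: uniform pricing clauses", [1.0, 0.0]),
    ("Rule summary: foreign investment ratio", [0.0, 1.0]),
    ("Triple: (uniform pricing, affects, competition)", [0.9, 0.1]),
]
attached = attach_knowledge({"Uniform pricing": [1.0, 0.0]}, kb, k=2)
```

With real embeddings the knowledge base would normally live in a vector index rather than a Python list; the exhaustive sort here is only meant to make the matching rule explicit.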
[0063] Dynamic Update: Define the update function U(T, t) as an incremental monitoring process based on a time window. Newly published regulations or cases within the domain are periodically retrieved via web crawler, and the cosine similarity between the new text and the existing knowledge forest node vectors is calculated. If the maximum similarity is lower than a preset threshold (e.g., 0.6), the text is determined to be new knowledge, and the aforementioned recursive expansion steps are invoked to generate a new branch and insert it into the knowledge forest.
Based on the constructed knowledge forest, the complexity of the generated data is controlled by setting the following three difficulty levels:
[0064] Simple: It only involves factual questions and answers for a single leaf node in the knowledge forest, without the need for cross-node reasoning.
[0065] Medium: Involves comparisons or associations of 2 to 3 child nodes under the same parent node, and requires judgment based on the mounted external rules.
[0066] Hard: Involves the integration of multiple knowledge points across branches (under different root nodes), or complex arguments that require prioritizing among conflicting rules.
[0067] In the constructed domain knowledge forest, a random walk algorithm is used to extract the complete semantic link from the root node through intermediate nodes to the leaf node (e.g., 'market access - access restrictions - foreign investment ratio regulations'). Based on the selected path depth and node association density, the corresponding difficulty weight is matched, and then the link string is filled into the standardized prompt word template "Based on the knowledge path [Path], for the scenario [Task_Type], generate a question with a difficulty of [Difficulty] and its logically rigorous answer". The filled template is then input into the large language model to generate the initial question-answer pair dataset (DI) in batches.
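The path extraction and template filling can be sketched as follows. The tree is represented as a hypothetical adjacency dictionary, and because the text does not say how path depth and node association density map to a difficulty weight, the difficulty is simply passed in by the caller. The template wording is taken from the paragraph above, with its placeholders turned into Python format fields.

```python
import random
from typing import Dict, List

TEMPLATE = ("Based on the knowledge path [{path}], for the scenario [{task_type}], "
            "generate a question with a difficulty of [{difficulty}] "
            "and its logically rigorous answer")


def random_walk(tree: Dict[str, List[str]], root: str,
                rng: random.Random) -> List[str]:
    """Walk from the root to a leaf, choosing one child at random per level."""
    path = [root]
    while tree.get(path[-1]):
        path.append(rng.choice(tree[path[-1]]))
    return path


def build_prompt(path: List[str], task_type: str, difficulty: str) -> str:
    """Fill the standardized template with a root-to-leaf path."""
    return TEMPLATE.format(path=" - ".join(path), task_type=task_type,
                           difficulty=difficulty)


# Toy forest fragment using the example path from the text.
tree = {
    "market access": ["access restrictions"],
    "access restrictions": ["foreign investment ratio regulations"],
}
path = random_walk(tree, "market access", random.Random(0))
prompt = build_prompt(path, task_type="compliance review", difficulty="Simple")
```

Each filled prompt would then be sent to the generation model, and the returned question and answer stored as one sample of the initial dataset.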
[0068] S2. Construct a closed-loop data refining module.
[0069] To improve the logical soundness and accuracy of the data, this embodiment designs the closed-loop data refining module shown in Figure 2, which employs a "generation-critique-optimization" closed-loop strategy. For each sample s in the dataset, the closed-loop data refining module performs the following operations:
[0070] Self-criticism: The closed-loop data refinement module first inputs the initial response into a large language model (acting as the critic agent; in this embodiment, the language model is Gemini3-pro) together with the instruction: "You are a rigorous logic reviewer. Please check the logical flaws, factual errors, and lack of legal basis in the above answer. Please output in JSON format: {Advantages, Disadvantages, Specific modification suggestions, rating 0 or 1 (where 1 represents logically acceptable, and 0 represents logical flaws that need to be corrected)}". Based on this, the model generates structured criticism content c.
[0071] Rewriting and Optimization: The closed-loop data refinement module concatenates the original question, initial response, and criticism content c, inputs it into the large language model (acting as the rewriter agent; in this embodiment, the language model is Gemini3-pro), and inputs the instruction: "Please revise the initial response according to the above modification suggestions to ensure the logical chain is complete and the references are accurate." The model then generates a refined response r*.
[0072] Iteration control: A maximum number of iterations is set. If the judgment criterion (score) is 0 and the maximum number of iterations has not been reached, the refined response is used as the initial response for the next round, and the above self-criticism and rewriting process is repeated; the refined dataset is finally output.
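The control flow of the refinement loop can be sketched with the two agents passed in as plain functions. This is one reading of the iteration control, not a definitive one: the sketch stops as soon as the critic returns a rating of 1 and skips the rewrite in that case. The default `max_iters=3` is a hypothetical value, since the text does not state the limit, and `toy_critic` and `toy_rewriter` are stand-ins for the model calls.

```python
import json
from typing import Callable, Dict

Critic = Callable[[str, str], str]          # (question, response) -> JSON string
Rewriter = Callable[[str, str, Dict], str]  # (question, response, critique) -> text


def refine(question: str, response: str, critic: Critic, rewriter: Rewriter,
           max_iters: int = 3) -> str:
    """Critique and rewrite until the rating is 1 or the limit is reached."""
    for _ in range(max_iters):
        critique = json.loads(critic(question, response))
        if critique["rating"] == 1:  # logically acceptable, stop early
            break
        response = rewriter(question, response, critique)
    return response


def toy_critic(question: str, response: str) -> str:
    # Hypothetical critic: accepts any response that carries a citation marker.
    ok = "[legal basis cited]" in response
    return json.dumps({
        "advantages": "answers the question",
        "disadvantages": "" if ok else "no legal basis",
        "suggestions": "" if ok else "cite the applicable rule",
        "rating": 1 if ok else 0,
    })


def toy_rewriter(question: str, response: str, critique: Dict) -> str:
    # Hypothetical rewriter: applies the suggestion by appending a marker.
    return response + " [legal basis cited]"
```

In practice the critic's JSON may be malformed, so a production loop would likely validate the parsed structure before reading the rating.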
[0073] S3. Construct a knowledge retrieval reasoning network and conduct two-stage reinforcement learning training.
[0074] This embodiment constructs a knowledge retrieval reasoning network based on a decoder-only Transformer model (such as Llama-3-8B). On this basis, the model's tokenizer is expanded with special control tokens: "query start" <begin_of_query>, "query end" <end_of_query>, <doc_start>, <doc_end>, "chain of thought start" <think>, "chain of thought end" </think>, and "answer start" <answer>. The network is also integrated with external search tool interfaces (such as the Google Search API or an Elasticsearch interface).
[0075] The training process is divided into two phases, as shown in Figure 3:
[0076] Phase 1: Retrieval Adaptation Phase
[0077] This stage aims to teach the model "when" and "how" to invoke retrieval tools.
[0078] Reward Design:
[0079] Retrieval reward: When the knowledge retrieval reasoning network outputs a complete <begin_of_query> ... <end_of_query> structure during the reasoning process and successfully initiates a search request, a positive reward (e.g., +0.5) is given; otherwise, the reward is 0.
[0080] Format reward (Rformat): Strictly constrains the output format of the model. The model must wrap the query statement in <begin_of_query> ... <end_of_query> tags. A reward of +0.5 is given for correct formatting, and a penalty (e.g., -1.0) is applied for incorrect formatting. At this stage, the correctness of the answer is not considered; only the behavior pattern of the tool call is examined.
[0081] Phase Two: Reasoning Integration
[0082] This stage aims to teach the model to use the retrieved information to generate correct reasoning and answers.
[0083] Reward Design:
[0084] Remove retrieval rewards to prevent the model from blindly searching for rewards.
[0085] Answer reward: Introduces a reward based on the correctness of the final answer. The F1 score measures the degree of overlap between the model's predicted answer (the text following the <answer> tag) and the standard answer. The calculation formula is $F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ with $\text{Precision} = IN / PN$ and $\text{Recall} = IN / RN$, where PN is the number of words in the predicted answer, RN is the number of words in the reference answer, and IN is the number of words in their intersection.
[0086] Format reward: Inherits the query tag constraints from Phase 1 and adds reasoning tag constraints. The output is checked for a paired <think> ... </think> (chain of thought) and an <answer> tag (final answer). If the <answer> tag is missing, a heavy penalty (e.g., -2.0) is applied; if the format is correct, the reward is 0.
[0087] Retrieval masking mechanism:
[0088] As shown in Figure 4, during training, when the model calls the retrieval tool and obtains external document content (which is automatically wrapped in <doc_start> ... <doc_end> when it is inserted into the context), the tokens of the external text are masked (their gradient weights are set to 0) during the loss function calculation. This means that the external document content does not contribute to weight updates during gradient backpropagation, ensuring that the model is optimized only for its own reasoning logic and query generation capabilities and avoiding interference from the length or distribution characteristics of the external text.
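The masking rule can be illustrated on a token list without any training framework. The sketch below builds a 0/1 mask and applies it to per-token loss values. Two points are assumptions rather than statements from the text: the <doc_start> and <doc_end> tags themselves are masked along with the enclosed tokens, and the masked loss is averaged over the unmasked tokens only. In an actual reinforcement learning setup the same mask would weight the per-token policy loss terms.

```python
from typing import List, Sequence

DOC_START, DOC_END = "<doc_start>", "<doc_end>"


def retrieval_mask(tokens: Sequence[str]) -> List[int]:
    """1 for model-generated tokens, 0 for tokens inside <doc_start>...<doc_end>."""
    mask, inside = [], False
    for tok in tokens:
        if tok == DOC_START:
            inside = True
        mask.append(0 if inside else 1)  # the tags are masked too (assumption)
        if tok == DOC_END:
            inside = False
    return mask


def masked_mean_loss(token_losses: Sequence[float], mask: Sequence[int]) -> float:
    """Average the per-token losses over unmasked positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


tokens = ["<begin_of_query>", "ratio", "<end_of_query>",
          "<doc_start>", "external", "text", "<doc_end>",
          "<think>", "deduce", "</think>"]
mask = retrieval_mask(tokens)
loss = masked_mean_loss([1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0, 2.0, 2.0, 2.0], mask)
```

Here the four document positions carry large loss values yet leave the result unchanged, which is the effect the mechanism is meant to achieve.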
[0089] S4. Input the user query to be processed into the trained knowledge retrieval reasoning network to obtain the final response result.
[0090] During the inference phase, the knowledge retrieval reasoning network receives user input and uses the trained parameters to perform probability prediction, enabling autonomous judgment: if the model predicts the next token to be <begin_of_query>, the model has decided to perform a retrieval. The model then continues to generate the query terms followed by <end_of_query>. When the inference engine detects <end_of_query>, it pauses generation and calls an external search engine or database to retrieve the Top-K related documents. The document content is wrapped in the <doc_start> ... <doc_end> format and appended to the end of the model context. Generation then resumes, and the model continues, based on the retrieved results, to generate <think> reasoning process </think> and finally outputs <answer> final conclusion </answer>.
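The pause, retrieve, inject, and resume cycle can be sketched as a small control loop. The model is abstracted as a function that streams string tokens for a given context and the search backend as a function returning document strings. Both `toy_generate` and `toy_search` are hypothetical stand-ins, and `max_rounds` is an added safety limit that does not appear in the text.

```python
from typing import Callable, Iterator, List

BEGIN_Q, END_Q = "<begin_of_query>", "<end_of_query>"


def run_engine(generate: Callable[[str], Iterator[str]],
               search: Callable[[str, int], List[str]],
               user_query: str, top_k: int = 3, max_rounds: int = 4) -> str:
    """Stream tokens, pause on <end_of_query>, inject documents, then resume."""
    context = user_query
    for _ in range(max_rounds):
        chunk, paused = "", False
        for token in generate(context):
            chunk += token
            if chunk.endswith(END_Q) and BEGIN_Q in chunk:
                paused = True  # retrieval intent detected: stop consuming tokens
                break
        context += chunk
        if not paused:
            break  # the model finished without requesting retrieval
        query = chunk.rsplit(BEGIN_Q, 1)[1][: -len(END_Q)].strip()
        docs = search(query, top_k)
        context += "<doc_start>" + "\n".join(docs) + "<doc_end>"
    return context


def toy_generate(context: str) -> Iterator[str]:
    # Hypothetical model stub: asks for retrieval once, then reasons and answers.
    if "<doc_end>" not in context:
        yield from [BEGIN_Q, "foreign investment ratio", END_Q, "<think>never read"]
    else:
        yield from ["<think>", "apply the retrieved rule", "</think>",
                    "<answer>", "final conclusion", "</answer>"]


def toy_search(query: str, k: int) -> List[str]:
    # Hypothetical search stub returning a single document.
    return [f"rule text about {query}"][:k]


result = run_engine(toy_generate, toy_search, "What ratio applies?")
```

The sketch rebuilds the context as a string on every round for clarity; the text instead describes appending to the KV cache so that earlier tokens are not recomputed.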
[0091] This application also provides a knowledge retrieval and reasoning system based on multi-agent collaboration and reinforcement learning, including:
[0092] The data synthesis module is used to build a domain knowledge forest and automatically synthesize the initial question-answer pair dataset by calling a large language model based on knowledge paths and difficulty levels.
[0093] The closed-loop data refinement module receives the initial question-answer pairs from the initial question-answer pair dataset, performs self-criticism and answer rewriting through a large language model configured with specific prompt words, and generates a refined dataset containing high-quality reasoning chains.
[0094] The model building and training module is used to construct a knowledge retrieval and reasoning network, and to train the network using a refined dataset through a two-stage result-oriented reinforcement learning strategy that includes a retrieval adaptation stage and a reasoning integration stage; during the training process, a retrieval masking mechanism is used to shield the influence of external documents on the gradient.
[0095] The knowledge retrieval reasoning network includes a generative model based on the Transformer architecture and an external inference engine. The generative model is responsible for predicting the token sequence based on the context, while the inference engine is responsible for monitoring the streamed output of the generative model and executing logical control (such as pausing generation, calling interfaces, and injecting documents).
[0096] During the retrieval adaptation phase, the training objective focuses on enabling the generative model to accurately predict specific tag sequences pointing to external tools when knowledge gaps are detected. The inference engine captures these sequences to achieve linkage with the retrieval environment, ensuring strong logical coupling between model output and tool invocation.
[0097] The reasoning execution module receives user queries to be processed, uses a trained knowledge retrieval reasoning network to autonomously execute retrieval decisions and logical reasoning through probability prediction, and obtains the final response result.
[0098] In one possible implementation, the data synthesis module operates as follows: It determines the set of root node labels for the domain; based on the root node labels, it recursively generates a set of sub-labels using a large language model, forming a hierarchical tree structure; and it uses vector similarity matching to integrate entities and rules from an external knowledge base into the branches of the tree structure. Based on the paths of the domain knowledge forest and combined with the set difficulty levels, it automatically synthesizes an initial question-and-answer pair dataset containing questions and initial responses using a generative large language model and an instruction generation template.
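By way of a non-limiting illustration, one tree of the domain knowledge forest may be sketched as follows. The `expand` callable stands in for the large language model that proposes sub-labels, and `embed` stands in for an embedding model; both names, the fixed recursion depth, and the use of cosine similarity as the matching measure are assumptions made for this sketch only.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)


def build_tree(label: str, expand: Callable[[str], List[str]], depth: int) -> Node:
    """Recursively expand a root label into sub-labels down to a fixed depth."""
    node = Node(label)
    if depth > 0:
        node.children = [build_tree(sub, expand, depth - 1) for sub in expand(label)]
    return node


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def all_nodes(root: Node) -> List[Node]:
    return [root] + [n for child in root.children for n in all_nodes(child)]


def attach_entity(root: Node, entity: str,
                  embed: Callable[[str], Sequence[float]]) -> Node:
    """Attach an external entity or rule to the most similar node label."""
    best = max(all_nodes(root), key=lambda n: cosine(embed(n.label), embed(entity)))
    best.entities.append(entity)
    return best


def paths(root: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Enumerate root-to-leaf label paths, which seed question synthesis."""
    here = prefix + (root.label,)
    if not root.children:
        return [here]
    return [p for child in root.children for p in paths(child, here)]
```

A forest is then simply a list of such trees, one per root node label, and each path returned by `paths` can be combined with a difficulty level to fill an instruction generation template.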
[0099] In one possible implementation, the closed-loop data refinement module performs the following operations on each sample in the initial question-answer pair dataset:
[0100] First, a critique and rewriting process is performed. For the initial response, a large language model acts as the critique agent: given a prompt containing strict logical review instructions, it generates critique content, which is structured data comprising a strengths analysis, a weaknesses analysis, and modification suggestions. Then, based on these modification suggestions, a large language model acts as the rewriting agent: given a prompt containing correction instructions, it rewrites and optimizes the initial response to generate a refined response. If the refined response does not meet the preset standard (e.g., the critique score is below a threshold) and the maximum number of iterations has not been reached, the currently generated refined response is used as the new initial response, and the critique and rewriting process is repeated.
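By way of a non-limiting illustration, the critique and rewriting loop may be sketched as follows. The `critic` and `rewriter` callables stand in for the two prompted large language model agents; the numeric `score` field, the default threshold, and the default iteration limit are hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Critique:
    strengths: str
    weaknesses: str
    suggestions: str
    score: float  # hypothetical numeric score used for the stopping test


def refine(
    question: str,
    response: str,
    critic: Callable[[str, str], Critique],
    rewriter: Callable[[str, str, Critique], str],
    threshold: float = 0.8,
    max_iters: int = 3,
) -> Tuple[str, Critique]:
    """Critique, rewrite, then re-critique until the score passes or the budget ends."""
    critique = critic(question, response)
    for _ in range(max_iters):
        # The rewriter always acts on the latest critique's suggestions.
        response = rewriter(question, response, critique)
        # The refined response becomes the new "initial" response and is judged again.
        critique = critic(question, response)
        if critique.score >= threshold:
            break
    return response, critique
```

The sketch always performs at least one rewrite, matching the order of operations described above, and reuses each re-critique as the input to the next rewrite so that the critic is called once per iteration.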
[0101] In one possible implementation, the knowledge retrieval reasoning network includes a vocabulary that extends the base language model's vocabulary with predefined special tags, including the query start tag <begin_of_query>, the query end tag <end_of_query>, the chain-of-thought start tag <think>, the chain-of-thought end tag </think>, and the answer start tag <answer>. By predicting the probability distribution over the tokens in this vocabulary, the knowledge retrieval reasoning network autonomously drives the streaming decoding and generation operations of retrieval decisions, external environment linkage, and logical reasoning.
[0102] In one possible implementation, the knowledge retrieval reasoning network uses streaming decoding to monitor the output sequence during generation: when the model, through probability prediction, actively outputs the special tag sequence <begin_of_query> and <end_of_query>, this constitutes a "generation decision" operation that expresses the retrieval intent. At this point, the inference engine physically interrupts the current generation process, extracts the query statement, calls the external tool, encapsulates the returned retrieval results in a preset format, and appends them to the end of the model's current input cache (KV Cache), thereby completing the "environmental feedback injection" that updates the context. Subsequently, the model resumes generation based on the updated context and continues to output content, marked with <think> tags, that contains the logical deduction process.
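By way of a non-limiting illustration, the streaming monitoring performed by the inference engine may be sketched as a small buffer that is fed text chunks as they are decoded. The class name `QueryMonitor` is hypothetical. The sketch operates on decoded text, where a tag may be split across chunks; an engine that works on token ids could instead compare each new id against the id of the special tag.

```python
from typing import Optional


class QueryMonitor:
    """Scan streamed text chunks and report a query once its closing tag arrives."""

    def __init__(self, begin: str = "<begin_of_query>", end: str = "<end_of_query>"):
        self.begin, self.end = begin, end
        self.buffer = ""

    def feed(self, chunk: str) -> Optional[str]:
        """Add a chunk; return the query if a complete tag pair is buffered, else None."""
        self.buffer += chunk
        end_at = self.buffer.find(self.end)
        if end_at == -1:
            return None  # closing tag not seen yet, keep streaming
        begin_at = self.buffer.rfind(self.begin, 0, end_at)
        rest = self.buffer[end_at + len(self.end):]
        if begin_at == -1:
            self.buffer = rest  # stray closing tag: discard and continue
            return None
        query = self.buffer[begin_at + len(self.begin): end_at].strip()
        self.buffer = rest
        return query
```

When `feed` returns a query, the engine would pause decoding, call the external tool, and append the encapsulated results to the context before resuming.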
[0103] In one possible implementation, the two phases of the two-stage result-oriented reinforcement learning strategy are executed sequentially and have a parameter inheritance relationship: Phase 1 (the retrieval adaptation phase) starts at the beginning of training and ends when the accuracy of the retrieval tag format generated by the model reaches a preset threshold; Phase 2 (the reasoning integration phase) inherits the model weights trained in Phase 1, introduces the answer reward while maintaining the retrieval format constraints, and focuses on optimizing the model's ability to integrate and reason over the retrieved information. The specific process is as follows:
[0104] Phase 1: Retrieval Adaptation Phase
[0105] The goal of this phase is to train the model to master the ability to invoke external retrieval tools. The reward function for Phase 1 is defined as the sum of the retrieval reward and the format reward.
[0106] The retrieval reward is defined as:
[0107]
[0108] Here, n represents the number of complete <begin_of_query>...<end_of_query> structures in the model's output. The format reward is set as follows: if the text output by the knowledge retrieval reasoning network contains a correct query tag pair <begin_of_query> and <end_of_query> with non-empty content between the tags, the reward value is 0.5; otherwise, the reward value is -1.
[0109] Phase 2: Reasoning Integration Phase
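By way of a non-limiting illustration, the Phase 1 quantities may be computed as sketched below. The count n and the format reward (0.5 or -1) follow the description above. The retrieval reward is passed in as a hypothetical callable `retrieval_reward(n)` rather than hard-coded, and the requirements that tags be balanced and that every pair be non-empty are this sketch's interpretation of a "correct" tag pair.

```python
import re
from typing import Callable

BEGIN_Q, END_Q = "<begin_of_query>", "<end_of_query>"
QUERY_RE = re.compile(r"<begin_of_query>(.*?)<end_of_query>", re.DOTALL)


def count_queries(output: str) -> int:
    """n: number of complete <begin_of_query>...<end_of_query> structures."""
    return len(QUERY_RE.findall(output))


def format_reward_phase1(output: str) -> float:
    """0.5 if the query tags are well formed and non-empty, otherwise -1."""
    queries = QUERY_RE.findall(output)
    # Assumption: unbalanced or nested tags count as a format error.
    balanced = output.count(BEGIN_Q) == output.count(END_Q) == len(queries)
    ok = bool(queries) and balanced and all(q.strip() for q in queries)
    return 0.5 if ok else -1.0


def phase1_reward(output: str, retrieval_reward: Callable[[int], float]) -> float:
    """Sum of the retrieval reward (a function of n) and the format reward."""
    return retrieval_reward(count_queries(output)) + format_reward_phase1(output)
```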
[0110] The goal of this stage is to train the model to make accurate inferences using the retrieved information. The reward function for stage two is defined as the sum of the answer reward and the format reward. The retrieval reward is removed in this stage.
[0111] The answer reward is the F1 score calculated from the predicted answer and the standard answer. The predicted answer is the text content following the <answer> tag, extracted from the network's output sequence.
[0112] The formula is as follows:
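[0113] $$\mathrm{Precision} = \frac{IN}{PN}, \qquad \mathrm{Recall} = \frac{IN}{RN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$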
[0114] Wherein, PN represents the number of words in the predicted answer, RN represents the number of words in the standard reference answer, and IN represents the number of overlapping words between the predicted answer and the standard reference answer.
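By way of a non-limiting illustration, a word-overlap F1 score over the quantities PN, RN, and IN may be computed as sketched below. Lowercasing and whitespace tokenization are assumptions of this sketch; text without whitespace word boundaries, such as Chinese, would need a word segmenter in place of `split()`.

```python
from collections import Counter


def answer_f1(predicted: str, reference: str) -> float:
    """F1 between predicted and reference answers, using word-level overlap."""
    pred_tokens = predicted.lower().split()  # PN = len(pred_tokens)
    ref_tokens = reference.lower().split()   # RN = len(ref_tokens)
    # IN: overlapping words, counted with multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # IN / PN
    recall = overlap / len(ref_tokens)      # IN / RN
    return 2 * precision * recall / (precision + recall)
```

For example, with the predicted answer "the capital is Paris" and the reference "Paris", PN = 4, RN = 1, and IN = 1, so precision is 0.25, recall is 1, and the F1 score is 0.4.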
[0115] The format reward is set as follows: regular expressions are used to match the output of the knowledge retrieval reasoning network; if the output contains a complete <think>...</think> reasoning chain structure and a correct <answer> answer tag, the format is considered correct and the reward value is 0; if a key tag is missing, the reward value is -2.
[0116] In one possible implementation, the model building and training module uses a loss calculation mechanism based on a retrieval mask during training: when the knowledge retrieval reasoning network performs a retrieval and obtains external document content, that content is encapsulated between <doc_start> and <doc_end> tags and inserted into the context; when the reinforcement learning loss is calculated, the gradient weights of the text tokens between <doc_start> and <doc_end> are set to 0 (masked), so that gradients propagate backward only through the reasoning text generated by the model and are not affected by the external document text.
[0117] In one possible implementation, within the reasoning execution module, the knowledge retrieval reasoning network receives user input and performs probability prediction using the trained parameter distribution, thereby making autonomous decisions: if the model predicts the next token to be <begin_of_query>, the model has decided to perform a retrieval. The model then continues to generate the query terms, followed by <end_of_query>. When the inference engine detects <end_of_query>, it pauses generation and invokes an external search engine or database to retrieve the Top-K related documents. The system then encapsulates the document content in the <doc_start>...<doc_end> format and appends it to the end of the model context. Generation then resumes, and the model continues, based on the retrieved results, to generate <think>reasoning process</think> and finally outputs <answer>final conclusion</answer>.
[0118] The above description is merely a preferred embodiment of the present invention and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present invention is not limited to the technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalent features without departing from the above-described inventive concept.
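By way of a non-limiting illustration, the Phase 2 format check and the extraction of the predicted answer may be sketched with regular expressions as follows. Treating the closing </answer> tag as optional, and requiring non-empty content inside both structures, are assumptions of this sketch.

```python
import re

THINK_RE = re.compile(r"<think>.+?</think>", re.DOTALL)
# The answer text runs from <answer> to an optional </answer> or end of output.
ANSWER_RE = re.compile(r"<answer>(.+?)(?:</answer>|$)", re.DOTALL)


def format_reward_phase2(output: str) -> float:
    """0 if both the reasoning chain and the answer tag are present, otherwise -2."""
    has_think = THINK_RE.search(output) is not None
    has_answer = ANSWER_RE.search(output) is not None
    return 0.0 if has_think and has_answer else -2.0


def extract_answer(output: str) -> str:
    """Return the text following the <answer> tag, or an empty string if absent."""
    match = ANSWER_RE.search(output)
    return match.group(1).strip() if match else ""
```

The Phase 2 reward for a rollout would then be the answer reward computed on `extract_answer(output)` plus `format_reward_phase2(output)`.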
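By way of a non-limiting illustration, the retrieval mask may be sketched as a 0/1 weight per token that is applied when per-token losses are averaged. The sketch works on token strings and plain Python lists for clarity; the function names are hypothetical, and masking the <doc_start> and <doc_end> tags themselves (which are injected by the engine rather than generated by the model) is an assumption of this sketch.

```python
from typing import List, Sequence


def retrieval_mask(
    tokens: Sequence[str],
    doc_start: str = "<doc_start>",
    doc_end: str = "<doc_end>",
) -> List[int]:
    """Return 1 for model-generated tokens and 0 for injected document tokens."""
    mask, inside = [], False
    for tok in tokens:
        if tok == doc_start:
            inside = True
        mask.append(0 if inside else 1)
        if tok == doc_end:
            inside = False
    return mask


def masked_mean_loss(per_token_loss: Sequence[float], mask: Sequence[int]) -> float:
    """Average the loss over unmasked tokens only, so documents contribute nothing."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept
```

Because masked positions are multiplied by zero before the reduction, they contribute no gradient in an automatic differentiation framework, which is the effect the retrieval mask is intended to achieve.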
Claims
1. A knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning, characterized in that it comprises the following steps: S1. Constructing a domain knowledge forest and generating an initial question-answer pair dataset; S2. Generating a refined dataset through a closed-loop data refinement module, wherein the closed-loop data refinement module adopts an iterative generation-critique-optimization strategy: the initial question-answer pair dataset is input into the closed-loop data refinement module, which performs self-critique and answer rewriting through a large language model configured with specific prompts to generate a refined dataset containing high-quality reasoning chains; S3. Constructing and training a knowledge retrieval reasoning network, wherein the knowledge retrieval reasoning network includes a generative model based on the Transformer architecture and an external reasoning engine, the generative model being responsible for predicting the token sequence based on the context, and the reasoning engine being responsible for streaming monitoring of the output of the generative model and executing logical control; based on the refined dataset, the knowledge retrieval reasoning network is trained using a two-stage result-oriented reinforcement learning strategy that includes a retrieval adaptation stage and a reasoning integration stage; S4. Inputting the user query to be processed into the trained knowledge retrieval reasoning network, which autonomously executes retrieval decisions and logical reasoning through probability prediction to obtain the final response result.
2. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 1, characterized in that the construction of the domain knowledge forest includes: determining the set of root node labels for the domain; recursively generating a set of sub-labels based on the root node labels using a large language model to form a hierarchical tree structure; using a vector similarity matching method to integrate entities and rules from an external knowledge base into the branches of the tree structure; and automatically synthesizing an initial question-answer pair dataset containing questions and initial responses based on the paths of the domain knowledge forest, combined with the set difficulty levels, using a generative large language model and an instruction generation template.
3. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 1, characterized in that the closed-loop data refinement module performs the following operations on each sample in the initial question-answer pair dataset: first, a critique and rewriting process is performed, in which, for the initial response, a large language model acts as the critique agent and, given a prompt containing strict logical review instructions, generates critique content, the critique content being structured data comprising a strengths analysis, a weaknesses analysis, and modification suggestions; subsequently, based on the modification suggestions, a large language model acts as the rewriting agent and, given a prompt containing correction instructions, rewrites and optimizes the initial response to generate a refined response; if the refined response does not meet the preset standard and the maximum number of iterations has not been reached, the currently generated refined response is used as the initial response and the above critique and rewriting process is repeated.
4. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 1, characterized in that the knowledge retrieval reasoning network includes a vocabulary that extends the base language model's vocabulary with predefined special tags, including the query start tag <begin_of_query>, the query end tag <end_of_query>, the chain-of-thought start tag <think>, the chain-of-thought end tag </think>, and the answer start tag <answer>; by predicting the probability distribution over the tokens in the vocabulary, the knowledge retrieval reasoning network autonomously drives the streaming decoding and generation operations of retrieval decisions, external environment linkage, and logical reasoning.
5. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 4, characterized in that the knowledge retrieval reasoning network uses streaming decoding to monitor the output sequence during generation: when the model, through probability prediction, actively outputs the special tag sequence <begin_of_query> and <end_of_query>, this constitutes a "generation decision" operation that expresses the retrieval intent; at this point, the inference engine physically interrupts the current generation process, extracts the query statement, calls the external tool, encapsulates the returned retrieval results in a preset format, and appends them to the end of the model's current input cache, thereby completing the "environmental feedback injection" that updates the context; subsequently, the model resumes generation based on the updated context and continues to output content, marked with <think> tags, that contains the logical deduction process.
6. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 1, characterized in that the two stages of the two-stage result-oriented reinforcement learning strategy are executed sequentially and have a parameter inheritance relationship: the retrieval adaptation stage starts at the beginning of training and ends when the accuracy of the retrieval tag format generated by the model reaches a preset threshold; the reasoning integration stage inherits the model weights trained in the retrieval adaptation stage and, while maintaining the retrieval format constraints, introduces the answer reward to optimize the model's ability to integrate and reason over the retrieved information; the specific process is as follows: Phase 1: Retrieval Adaptation Phase: the reward function for Phase 1 is defined as the sum of the retrieval reward and the format reward; the retrieval reward is defined in terms of n, where n represents the number of complete <begin_of_query>...<end_of_query> structures in the model's output; the format reward is set as follows: if the text output by the knowledge retrieval reasoning network contains a correct query tag pair <begin_of_query> and <end_of_query> with non-empty content between the tags, the reward value is 0.5; otherwise, the reward value is -1; Phase 2: Reasoning Integration Phase: the reward function for Phase 2 is defined as the sum of the answer reward and the format reward; the answer reward is the F1 score calculated from the predicted answer and the standard answer, the predicted answer being the text content following the <answer> tag, extracted from the network's output sequence; the formula is as follows: $\mathrm{Precision} = IN / PN$, $\mathrm{Recall} = IN / RN$, $F1 = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$, wherein PN represents the number of words in the predicted answer, RN represents the number of words in the standard reference answer, and IN represents the number of overlapping words between the predicted answer and the standard reference answer; the format reward is set as follows: regular expressions are used to match the output of the knowledge retrieval reasoning network; if the output contains a complete <think>...</think> reasoning chain structure and a correct <answer> answer tag, the format is considered correct and the reward value is 0; if a key tag is missing, the reward value is -2.
7. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 6, characterized in that a loss calculation mechanism based on a retrieval mask is used during training: when the knowledge retrieval reasoning network performs a retrieval and obtains external document content, that content is encapsulated between <doc_start> and <doc_end> tags and inserted into the context; when the reinforcement learning loss is calculated, the gradient weights of the text tokens between <doc_start> and <doc_end> are set to 0, so that gradients propagate backward only through the reasoning text generated by the model and are not affected by the external document text.
8. The knowledge retrieval and reasoning method based on multi-agent collaboration and reinforcement learning according to claim 7, characterized in that, during the inference phase, the knowledge retrieval reasoning network receives user input and performs probability prediction using the trained parameter distribution, thereby making autonomous decisions: if the model predicts the next token to be <begin_of_query>, the model has decided to perform a retrieval; the model then continues to generate the query terms, followed by <end_of_query>; when the inference engine detects <end_of_query>, it pauses generation and calls an external search engine or database to retrieve the Top-K related documents; the document content is encapsulated in the <doc_start>...<doc_end> format and appended to the end of the model context; generation then resumes, and the model continues, based on the retrieved results, to generate <think>reasoning process</think> and finally outputs <answer>final conclusion</answer>.
9. A knowledge retrieval and reasoning system based on multi-agent collaboration and reinforcement learning, characterized in that it comprises: a data synthesis module, used to build a domain knowledge forest and automatically synthesize the initial question-answer pair dataset by calling a large language model based on knowledge paths and difficulty levels; a closed-loop data refinement module, which receives the initial question-answer pairs from the initial question-answer pair dataset, performs self-critique and answer rewriting through a large language model configured with specific prompts, and generates a refined dataset containing high-quality reasoning chains; a model building and training module, used to construct a knowledge retrieval reasoning network and to train the network on the refined dataset through a two-stage result-oriented reinforcement learning strategy that includes a retrieval adaptation stage and a reasoning integration stage, wherein during training a retrieval masking mechanism is used to shield the gradient from the influence of external documents, and wherein the knowledge retrieval reasoning network includes a generative model based on the Transformer architecture and an external reasoning engine, the generative model being responsible for predicting the token sequence based on the context, and the reasoning engine being responsible for streaming monitoring of the output of the generative model and executing logical control; and a reasoning execution module, which receives user queries to be processed, uses the trained knowledge retrieval reasoning network to autonomously execute retrieval decisions and logical reasoning through probability prediction, and obtains the final response result.