A method for optimizing a temporally dependent data prediction model based on agent trajectory feedback
By collecting the interaction trajectories of intelligent agents to construct supervised samples and training a temporal gating coding model, the problem of mismatch between existing retrieval models and intelligent agent scenarios is solved, and continuous iterative optimization of the retrieval model and efficient evidence recall are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- RENMIN UNIVERSITY OF CHINA
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-26
AI Technical Summary
Existing retrieval models struggle to capture the real information needs of agents during multi-step problem solving and fail to make effective use of the agent's search, browsing, and reasoning trajectories. This results in low evidence recall and task success rates, and the models cannot be iterated and optimized continuously.
By collecting the multi-round interaction trajectories between the agent and the retrieval system, query-document supervised samples are constructed. Noise is filtered out using the inference text after browsing, the document relevance strength weights are estimated, and a temporal gating coding model is trained to form a closed-loop update mechanism.
It significantly reduces the misalignment between human retrieval training objectives and agent usage objectives, improves the purity of supervised samples and the stability of the retrieval model, enhances the contribution of high-value evidence documents, and supports continuous iterative optimization and long-term operation of the model.
Figure CN122285705A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer technology, Internet information collection, information retrieval, retrieval-augmented generation, and machine learning. More specifically, it relates to a method for optimizing a temporally dependent data prediction model based on agent trajectory feedback. The method optimizes a model that predicts query-document relevance or evidentiary value in multi-round search and browsing trajectories.
Background Technology
[0002] Information retrieval systems have long primarily served human users, and their ranking and retrieval models typically rely on human interaction logs such as clicks, dwell time, and bounce rate for training. In this paradigm, queries are usually directly submitted by users, and the validity of search results is mainly characterized by whether humans clicked or remained on the page.
[0003] With the rapid development of search agents driven by large language models, deep research agents, and retrieval-augmented generation systems, the service targets of retrieval modules are gradually shifting from human users to intelligent agents with reasoning and action capabilities. When performing complex tasks, intelligent agents often generate intermediate queries multiple times around the same task, view candidate result summaries, browse selected documents, and continuously update subsequent plans based on the evidence obtained.
[0004] However, most existing retrieval models are still built using training methods geared towards human users, making it difficult to capture the real information needs generated by the agent during multi-step problem-solving. Intermediate queries issued by the agent usually correspond to local information gaps, rather than the user's final question itself; whether a document is truly useful often requires consideration of the agent's reasoning process after browsing the document, and simply relying on whether it has been browsed is insufficient to accurately reflect the document's value.
[0005] Furthermore, agents naturally generate a large number of search, browsing, and reasoning trajectories during task execution. However, existing methods typically only use these trajectories for task solving itself, rather than using them as training data for retrieval models. This results in retrieval models being unable to continuously iterate and optimize as agents are applied, thus limiting evidence recall, task success rate, and overall execution efficiency.
[0006] Therefore, how to automatically extract high-quality supervision signals from the interaction trajectory between the agent and the retrieval system, and use these signals to train an information retrieval model that better reflects the agent's actual usage, has become an urgent technical problem to be solved.
Summary of the Invention
[0007] The purpose of this invention is to provide an optimization method for a temporally dependent data prediction model based on agent trajectory feedback, which aims to solve problems such as the mismatch between existing retrieval training paradigms and agent search scenarios, large noise in positive and negative samples, difficulty in reflecting the strength of document contributions to task progress, and lack of continuous iterative update mechanism.
[0008] In one general aspect, a method for optimizing a temporally dependent data prediction model based on agent trajectory feedback is provided. The method is applied to tasks such as webpage evidence text collection, open encyclopedia question answering, enterprise knowledge base question answering, or retrieval-augmented generation, and includes steps S101-S106:
S101, collecting the multi-round execution trajectory formed by the interaction between the intelligent agent and the retrieval system during task execution;
S102, constructing initial query-document supervised samples based on the candidate document set returned by the search action and the subsequent browsing actions;
S103, determining the relevance of the browsed document based on the reasoning text after browsing, and performing reasoning-aware filtering on positive sample candidates to obtain valid positive samples;
S104, estimating the relevance strength weight of valid positive samples based on the length of the reasoning text after browsing, the number of evidence citations, the number of fact entries, changes in subsequent actions, or combinations thereof;
S105, training a retrieval model based on valid positive samples, negative samples, and relevance strength weights;
S106, redeploying the optimized retrieval model to the agent system to continuously collect new trajectories and form a closed-loop update.
[0009] Preferably, the browsable value score of a candidate document, the relevance determination model, the relevance strength weight, the joint value estimation, the temporal gating coding, and the training objective function are implemented using the corresponding formulas given in this specification, namely the formulas for the candidate browsable value, reasoning-evidence consistency determination, length-induced weighting, joint value estimation, temporal gating coding, and the weighted training objective.
[0010] The technical effects to be achieved by the embodiments of the present invention are as follows: First, this invention directly utilizes the real interaction trajectory of the intelligent agent as the source of retrieval supervision, which can significantly reduce the misalignment between the human retrieval training objective and the intelligent agent's usage objective. Second, this invention filters noisy positive samples using the reasoning text generated after browsing, which can improve the purity of supervised samples and the stability of the retrieval model. Third, this invention characterizes the document value difference through relevance strength weights, which can enhance the contribution of high-value evidence documents in training. Fourth, this invention supports the backflow deployment and continuous iteration of the optimized model, making it suitable for long-term operation in real-world application systems.
Attached Figure Description
[0011] The above and other objects and features of the present invention will become clearer from the following description taken in conjunction with the accompanying drawings.
[0012] Figure 1 is a schematic diagram illustrating the shift from human-oriented retrieval training to training based on agent interaction trajectories; Figure 2 is a schematic diagram illustrating the architecture of the method for optimizing a temporally dependent data prediction model based on agent trajectory feedback according to an embodiment of the present invention; Figure 3 is a schematic diagram illustrating the iterative process in which the optimized retrieval model is redeployed and a data flywheel is continuously formed, according to an embodiment of the present invention.
Detailed Implementation
[0013] The present invention will be further described below with reference to specific embodiments. It should be understood that the embodiments described are only for explaining the present invention and are not intended to limit the scope of protection of the present invention; without departing from the concept of the present invention, those skilled in the art can make equivalent substitutions or modifications to the order of steps, parameter settings, model structure, or system implementation, and all such substitutions or modifications should fall within the scope of protection of the present invention. Unless the context clearly defines otherwise, the terms "comprising," "including," and "having" in this document are open-ended expressions.
[0014] Figure 2 is a schematic diagram illustrating a method for optimizing a temporally dependent data prediction model based on agent trajectory feedback according to an embodiment of the present invention.
[0015] To achieve the aforementioned objectives, the present invention employs the technical framework shown in Figure 2.
[0016] In this invention, the intelligent agent can be a deep research agent, a search agent, a retrieval-augmented generation agent, or another system with retrieval and invocation capabilities; the retrieval system can be a dense retrieval system, a sparse retrieval system, a hybrid retrieval system, a search interface, or a combination thereof. The interaction trajectory is preferably recorded at the task or query granularity, and each trajectory records one or more search actions, browsing actions, and the corresponding reasoning text.
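The trajectory record described above can be pictured as a small data structure. The following is a minimal sketch only; the class and field names (`Candidate`, `Round`, `Trajectory`, `browsed_doc_id`, and so on) are hypothetical and are not defined in this specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    doc_id: str
    title: str
    snippet: str  # summary fragment shown before the full text is opened


@dataclass
class Round:
    reasoning: str                         # reasoning text produced in this round
    query: Optional[str] = None            # intermediate query, if a search was issued
    candidates: List[Candidate] = field(default_factory=list)
    browsed_doc_id: Optional[str] = None   # document opened in this round, if any


@dataclass
class Trajectory:
    task_id: str
    rounds: List[Round] = field(default_factory=list)
    final_answer_correct: Optional[bool] = None  # final answer feedback, if available

    def search_rounds(self) -> List[Round]:
        """Return the rounds in which the agent issued a search action."""
        return [r for r in self.rounds if r.query is not None]


# Illustrative usage with made-up content.
traj = Trajectory(task_id="t1")
traj.rounds.append(Round(
    reasoning="The founding year is still unknown.",
    query="founding year of X",
    candidates=[Candidate("d1", "X", "X was founded ...")],
))
traj.rounds.append(Round(reasoning="d1 states the founding year.", browsed_doc_id="d1"))
```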
[0017] As a preferred embodiment, the agent performs tasks using a multi-round interaction pattern of "thinking-searching / browsing-rethinking" around the user's task. In each round, the agent first analyzes the current context and identifies unresolved information gaps, then generates corresponding intermediate queries; the retrieval system returns a set of candidate documents based on these intermediate queries, which may include titles, summary fragments, document fragments, or combinations thereof; the agent then chooses whether to continue browsing the complete content of a target document based on the candidate information.
[0018] In step S101, the system collects the multi-round execution trajectory formed by the interaction between the agent and the retrieval system. Preferably, the target task is webpage evidence text collection or complex question-and-answer evidence retrieval task, such as collecting evidence text that can support answer generation from webpage encyclopedia entries, technical announcements, product descriptions, policy documents, or enterprise knowledge base documents. Each search action returns the top-K candidate documents, for example, K can be set to 10; each candidate document first returns the title and a summary fragment of a certain number of terms to simulate the usage mode of viewing the summary first and then deciding whether to open the full text in a real retrieval environment.
[0019] In step S101, each search action returns K candidate documents, and each candidate document returns a title, summary fragment, or body text fragment; the K candidate documents are the top K candidate documents ranked according to the agent's browsable value score; for the i-th candidate document in the t-th round, the agent's browsable value score is calculated as follows:
[0020] Redundancy term: where σ is the sigmoid function, cos is the cosine similarity, and ln is the natural logarithm; u is the information gap vector obtained by encoding the current task context, historical reasoning text, and unmet information needs; q is the intermediate query vector generated based on the information gap; and s is the candidate summary vector obtained by the text encoder from the candidate document title, summary fragment, or body text fragment; each α coefficient is a learnable or preset parameter; when the set of previously viewed summaries is empty, the redundancy term is 0.
[0021] When the highest browsable value score reaches the preset browsing threshold, the corresponding candidate document is browsed; when the highest browsable value score does not reach the preset browsing threshold, the intermediate query is rewritten or a new search action is initiated; when multiple documents need to be browsed, they are selected in descending order of browsable value score, and candidate documents whose similarity to the selected abstract exceeds the preset diversity threshold are removed.
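The browsing decision in paragraph [0021] can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the browsable value scores are taken as given inputs, summary similarity is assumed to be cosine similarity, only candidates that themselves reach the browsing threshold are assumed eligible, and the function names and the `max_docs` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_to_browse(
    scores: Sequence[float],
    summaries: Sequence[Sequence[float]],
    browse_threshold: float,
    diversity_threshold: float,
    max_docs: int = 3,
) -> List[int]:
    """Return indices of candidates to browse.

    An empty list means the best score is below the threshold, so the
    caller should rewrite the intermediate query or start a new search.
    """
    if not scores or max(scores) < browse_threshold:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in order:
        # Assumption: stop once the quota is filled or scores drop below the threshold.
        if len(chosen) >= max_docs or scores[i] < browse_threshold:
            break
        # Drop candidates that are too similar to an already selected summary.
        if any(cosine(summaries[i], summaries[j]) > diversity_threshold for j in chosen):
            continue
        chosen.append(i)
    return chosen
```

With scores `[0.9, 0.85, 0.4, 0.8]`, the second candidate is skipped if its summary vector nearly duplicates the first, and the third is never considered because it falls below the browsing threshold.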
[0022] In step S102, the system constructs initial supervised samples based on the temporal correspondence between search and browsing actions. For an intermediate query and its candidate document set generated by a search action, if the agent subsequently browses one of the documents, it is considered a positive candidate; the remaining unbrowsed documents in the same candidate set are considered negative candidates. Unlike traditional human click logs, in the agent scenario, browsing behavior is less dependent on sorting position, and unbrowsed documents usually reflect the agent's explicit rejection after comparison, thus serving as a relatively reliable negative signal.
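The labeling rule of step S102 can be illustrated with a short sketch. The `Sample` type and the function name are hypothetical; the rule itself (browsed documents become positive candidates, unbrowsed documents in the same candidate set become negative candidates) follows the paragraph above.

```python
from typing import List, NamedTuple, Set


class Sample(NamedTuple):
    query: str
    doc_id: str
    label: int  # 1 = positive candidate, 0 = negative candidate


def build_initial_samples(
    query: str, candidate_ids: List[str], browsed_ids: Set[str]
) -> List[Sample]:
    """Label each candidate returned for `query` by whether the agent browsed it."""
    return [Sample(query, d, 1 if d in browsed_ids else 0) for d in candidate_ids]


samples = build_initial_samples("founding year of X", ["a", "b", "c"], {"b"})
```

The positive candidates produced here are still provisional, since step S103 may later discard them or turn them into hard negatives.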
[0023] In step S103, the system performs reasoning-aware filtering on positive sample candidates. Specifically, the reasoning text generated by the agent immediately after the target document is viewed is extracted and input into the relevance determination model. The relevance determination model determines whether the reasoning text indicates that the target document is substantially helpful in answering the current question, supplementing factual evidence, planning the next search step, or forming the final answer; if so, the document is retained as a valid positive sample; if not, it is removed or converted into a hard negative sample.
[0024] The relevance determination model is a reasoning-evidence consistency determination model. The evidence coverage score characterizes the degree to which a candidate document covers the key entities, facts, and constraints of the current problem; the information gap reduction score characterizes the degree to which unmet information needs are reduced after browsing the document; the citation or fact extraction score characterizes the strength of the reasoning text's citation of the document or extraction of fact items; the subsequent-plan contribution score characterizes the degree to which the document facilitates subsequent queries, browsing, or answer integration; and the rejection or abandonment score characterizes the degree to which the reasoning text marks the document as irrelevant, duplicate, or no longer used. The relevance determination model calculates the relevance probability based on these scores: When the relevance probability reaches a preset relevance threshold, the document is retained as a valid positive sample; otherwise, it is discarded or converted into a hard negative sample. The relevance determination model is fine-tuned using manual annotation, feedback on the correctness of the final answer, or weak labels generated by a large language model. The fine-tuning objective is: In the formula, the five c terms correspond to the evidence coverage score, information gap reduction score, citation or fact extraction score, subsequent-plan contribution score, and rejection or abandonment score, respectively. The constant term in the formula is the bias term, and the five γ terms are the weight coefficients of the above five scores, respectively. y represents the relevance label, θ represents the relevance determination model parameters, λ represents the regularization coefficient, and the R term represents the regularization constraint on the relevance determination model parameters, which is used to constrain model complexity and suppress overfitting.
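One way to read the relevance determination model is as a logistic combination of the five scores. The sketch below is an assumption, since the formulas are not reproduced in this text: it assumes a sigmoid over a bias plus weighted scores, with the rejection or abandonment score entering negatively, a binary cross-entropy fine-tuning loss, and an L2 penalty standing in for the R term. All function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relevance_probability(c: Sequence[float], gamma: Sequence[float], bias: float) -> float:
    """c = (coverage, gap reduction, citation/extraction, plan contribution, rejection)."""
    # Assumption: the first four scores raise the probability, the rejection score lowers it.
    z = bias + sum(g * s for g, s in zip(gamma[:4], c[:4])) - gamma[4] * c[4]
    return sigmoid(z)


def fine_tune_loss(
    probs: Sequence[float], labels: Sequence[int], gamma: Sequence[float], lam: float
) -> float:
    """Assumed objective: mean binary cross-entropy plus lam * L2 penalty on gamma."""
    eps = 1e-12  # numerical guard against log(0)
    bce = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(probs)
    return bce + lam * sum(g * g for g in gamma)


def filter_candidate(prob: float, threshold: float) -> str:
    """Keep the sample as a valid positive, or convert it into a hard negative."""
    return "valid_positive" if prob >= threshold else "hard_negative"


gamma = [1.0, 1.0, 1.0, 1.0, 2.0]
p_helpful = relevance_probability([0.9, 0.8, 0.7, 0.6, 0.0], gamma, bias=-1.5)
p_rejected = relevance_probability([0.2, 0.1, 0.0, 0.1, 0.9], gamma, bias=-1.5)
```

The specification also allows a below-threshold sample to be discarded rather than converted; returning `"hard_negative"` here is just one of the two permitted outcomes.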
[0025] In step S104, the system estimates the relevance strength of valid positive samples based on the inference text after browsing. Preferably, the system uses the length of the inference text after browsing as a proxy signal for document value; if a document triggers longer and deeper inference after being browsed, it indicates that the document is more likely to be used by the agent for evidence integration, plan updates, or answer generation. Furthermore, the system can also combine signals such as the number of evidence citations, the number of fact entries, whether a new browsing action is triggered, and whether the number of subsequent search rounds is reduced to jointly estimate the document value.
[0026] Let l represent the number of terms in the reasoning text generated immediately after the target document is viewed, and let β represent the median length of the reasoning texts of all valid positive samples. The length-induced unnormalized strength score is:
[0027] The length-induced relevance strength weights are:
[0028] Where ε is a smoothing constant, and the mean term in the denominator represents the global mean of the unnormalized strength scores; the longer the reasoning text, the more likely the document is to be used for evidence integration, plan updates, or answer generation.
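The length-induced weighting can be sketched as below. The exact unnormalized score is not reproduced in this text, so the form `ln(1 + l / β)` is an assumption chosen only because it grows with length and is scaled by the median; the normalization by the global mean plus ε follows the description. The function name is hypothetical.

```python
import math
import statistics
from typing import List


def length_weights(lengths: List[int], eps: float = 1e-6) -> List[float]:
    """Map reasoning-text lengths to relevance strength weights with mean close to 1."""
    # beta: median reasoning length over all valid positives (guarded against 0).
    beta = max(statistics.median(lengths), 1)
    # Assumed unnormalized strength score: ln(1 + l / beta).
    raw = [math.log(1.0 + l / beta) for l in lengths]
    mean_raw = sum(raw) / len(raw)
    # Normalize by the global mean, with eps as the smoothing constant.
    return [r / (mean_raw + eps) for r in raw]


weights = length_weights([50, 100, 400])
```

Under this assumed form, a document followed by 400 terms of reasoning receives a larger weight than one followed by 50 terms, while the logarithm keeps very long reasoning texts from dominating.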
[0029] In step S104, after calculating the length-induced relevance strength weights, a joint value estimation is further performed, which includes calculating the evidence citation strength and the fact entry strength: The joint value vector is composed of length strength, evidence citation strength, fact entry strength, subsequent browsing trigger strength, and search round reduction strength, and the joint value score is calculated as follows: Define the normalized denominator: The final sample weights are: Where c represents the number of citations, f represents the number of fact entries, M represents the number of training samples, η is the learnable weight, the softplus function is used for smoothing the mapping, and the clip function is used to limit the weights to a preset range.
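A possible shape for the joint value estimation is sketched below. The formulas are not reproduced in this text, so several choices are assumptions: citation and fact strengths are taken as `ln(1 + c)` and `ln(1 + f)`, the joint score is `softplus(η · v)`, the normalizing denominator is the mean score over the M samples plus ε, and the clip range `[w_min, w_max]` is arbitrary. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable softplus: ln(1 + e^x)."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def value_vector(
    length_strength: float, citations: int, facts: int,
    triggered_browse: bool, rounds_saved: int,
) -> List[float]:
    """Five-component joint value vector (assumed log scaling for the counts)."""
    return [
        length_strength,
        math.log1p(citations),             # evidence citation strength
        math.log1p(facts),                 # fact entry strength
        1.0 if triggered_browse else 0.0,  # subsequent browsing trigger strength
        float(rounds_saved),               # search round reduction strength
    ]


def joint_weights(
    vectors: Sequence[Sequence[float]], eta: Sequence[float],
    w_min: float = 0.2, w_max: float = 5.0, eps: float = 1e-6,
) -> List[float]:
    """Final sample weights: clip(score / mean score, w_min, w_max)."""
    scores = [softplus(sum(e * v for e, v in zip(eta, vec))) for vec in vectors]
    z = sum(scores) / len(scores) + eps  # normalizing denominator over M samples
    return [min(max(s / z, w_min), w_max) for s in scores]


vectors = [
    value_vector(0.5, 0, 0, False, 0),
    value_vector(1.8, 3, 5, True, 1),
]
final_w = joint_weights(vectors, eta=[1.0, 0.5, 0.5, 0.3, 0.3])
```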
[0030] In step S105, the system trains an information retrieval model based on valid positive samples, negative samples, and relevance strength weights. Preferably, the retrieval model uses temporal gating encoding to obtain query vectors and document vectors, and is trained based on matching scores; during training, a weighted contrastive learning objective function is used, so that high-value documents that trigger deep inference have a greater impact on parameter updates.
[0031] The retrieval model employs temporal gating coding, first obtaining a trajectory memory vector based on the reasoning text after browsing: Then the query vector and document vector are obtained: Match score: The redundancy penalty term is: Where E and Enc represent text encoders, Norm represents vector normalization, $W_q$, $W_u$, $W_m$, $W_d$, $W_s$ and $W_a$ are parameter matrices, and ρ and each δ coefficient are learnable or preset parameters.
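Since the encoding formulas are not reproduced in this text, the following is only a toy illustration of what a gated fusion of a query vector with a trajectory memory vector could look like. It is not the specification's formula: the parameter matrices are replaced by a single gate weight vector, the memory update is assumed to be an exponential moving average with decay `rho`, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """L2-normalize a vector (the Norm operation)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def update_memory(memory: Sequence[float], reasoning_vec: Sequence[float], rho: float) -> List[float]:
    """Assumed memory update: exponential moving average over reasoning-text encodings."""
    return [rho * m + (1.0 - rho) * r for m, r in zip(memory, reasoning_vec)]


def gated_query(
    query_vec: Sequence[float], memory: Sequence[float],
    gate_weights: Sequence[float], gate_bias: float = 0.0,
) -> List[float]:
    """Blend query and memory with a scalar sigmoid gate, then normalize."""
    z = gate_bias + sum(w * x for w, x in zip(gate_weights, list(query_vec) + list(memory)))
    g = 1.0 / (1.0 + math.exp(-z))
    return normalize([g * q + (1.0 - g) * m for q, m in zip(query_vec, memory)])


def match_score(query_vec: Sequence[float], doc_vec: Sequence[float]) -> float:
    """Dot product of (normalized) query and document vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


mem = update_memory([0.0, 0.0], [1.0, 0.0], rho=0.5)
q_vec = gated_query([0.0, 1.0], mem, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

The point of the gate is that the same intermediate query can be encoded differently depending on what the agent has already read, which is what makes the encoding dependent on the trajectory.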
[0032] The training objective function in step S105 is: The weighted contrastive loss term is: The margin loss term is:
[0033] The normalized denominator is: Where τ is the temperature coefficient, m is the margin threshold, the weight term represents the relevance strength weight of each sample, N is the batch size, and the negative sample set includes unviewed documents within the trajectory, viewed documents judged as invalid, and documents associated with other queries within the batch.
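The two loss terms can be sketched as a weighted InfoNCE-style contrastive loss and a weighted hinge loss. This is an assumed reading, since the formulas are not reproduced in this text; in particular, how the two terms are combined is not stated here, so they are kept as separate functions. The function names and default values for `tau` and `margin` are hypothetical.

```python
import math
from typing import Sequence


def weighted_contrastive_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[Sequence[float]],
    weights: Sequence[float],
    tau: float = 0.05,
) -> float:
    """Mean over the batch of w_i * -log softmax(positive) at temperature tau."""
    total = 0.0
    for s_pos, negs, w in zip(pos_scores, neg_scores, weights):
        logits = [s_pos / tau] + [s / tau for s in negs]
        mx = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = mx + math.log(sum(math.exp(x - mx) for x in logits))
        total += w * (log_z - s_pos / tau)
    return total / len(pos_scores)


def margin_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[Sequence[float]],
    weights: Sequence[float],
    margin: float = 0.2,
) -> float:
    """Weighted hinge loss: penalize negatives scoring within `margin` of the positive."""
    total = 0.0
    for s_pos, negs, w in zip(pos_scores, neg_scores, weights):
        hinge = sum(max(0.0, margin - s_pos + s) for s in negs)
        total += w * hinge / max(len(negs), 1)
    return total / len(pos_scores)
```

Because each sample's loss is multiplied by its relevance strength weight, a positive that triggered deeper reasoning contributes a proportionally larger gradient than one with a weight near the lower clip bound.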
[0034] In step S106, the system redeploys the optimized retrieval model back to the agent system to continue providing retrieval services. As new tasks are continuously executed, the agent will continuously generate new search, browsing, and reasoning trajectories. The system then uses these new trajectories to construct training data and perform incremental updates, thus forming a closed-loop process of "deployment-collection-filtering-weighting-training-redeployment" to achieve long-term evolution of the retrieval model.
[0035] In Example 1, a deep research agent can be run in an environment that collects evidence from open encyclopedic corpora or web pages. For each complex question-answering task, the agent sequentially generates intermediate queries around multiple sub-questions. After the system returns candidate summary fragments from web encyclopedia entries, technical announcements, or product descriptions, the agent selects some documents to browse. The system transforms these trajectories into retrieval training samples and uses the inference text after browsing to filter out truly valuable documents to train a new dense retrieval system. After redeployment, the agent can recall more critical evidence in fewer steps in similar tasks.
[0036] In Example 2, this invention can be run within an enterprise knowledge base scenario. Internal enterprise policy documents, project documents, historical reports, and process manuals can be uniformly constructed into a retrieval corpus. After an employee raises a complex business question through an intelligent assistant, the intelligent agent generates queries around multiple sub-tasks and browses internal documents. The system automatically converts the interaction trajectory during this process into retrieval supervision data, thereby ensuring that the retrieval model continuously aligns with the enterprise's internal business language and knowledge distribution, improving the traceability and compliance of answers.
[0037] In Example 3, the invention can be implemented in a streaming manner within an online service environment. The system summarizes newly generated interaction trajectories according to a preset time window, periodically performs sample construction, inference filtering, sample weighting, and incremental training, and switches to a new version of the retrieval model after verification. This implementation method requires no additional manual annotation and can continuously improve retrieval capabilities by relying on data naturally generated during the normal operation of the agent.
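The windowed update cycle of Example 3 can be sketched with two small helpers. This is a minimal illustration: the window is assumed to be a fixed number of seconds, verification is reduced to comparing a single validation metric, and the function names and the `min_gain` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def bucket_by_window(timestamps: Sequence[float], window_seconds: float) -> Dict[int, List[int]]:
    """Group trajectory indices by the time window their timestamp falls into."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, ts in enumerate(timestamps):
        buckets[int(ts // window_seconds)].append(idx)
    return dict(sorted(buckets.items()))


def should_promote(new_metric: float, current_metric: float, min_gain: float = 0.0) -> bool:
    """Switch to the newly trained model only if it verifies at least as well."""
    return new_metric >= current_metric + min_gain


# Trajectories stamped in seconds, grouped into one-hour windows.
windows = bucket_by_window([0, 10, 3600, 3700, 7300], window_seconds=3600)
```

Each window's trajectories would then pass through sample construction, reasoning-aware filtering, weighting, and incremental training before the promotion check.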
[0038] In addition to the method implementation, the present invention can also be implemented as an apparatus, a computer device, or a computer-readable storage medium. The apparatus includes a trajectory acquisition unit, a sample mining unit, a relevance determination unit, a strength estimation unit, a temporal coding training unit, and a model backflow deployment unit; the computer device includes a processor and a memory, the memory storing program instructions which, when executed by the processor, carry out the steps of any of the above embodiments.
[0039] While some embodiments of the invention have been shown and described, those skilled in the art will understand that modifications may be made to these embodiments without departing from the principles and spirit of the invention as defined by the claims and their equivalents.
Claims
1. A method for optimizing a temporally dependent data prediction model based on agent trajectory feedback, characterized in that the method is used in tasks such as webpage evidence text collection, open encyclopedia question answering, enterprise knowledge base question answering, or retrieval-augmented generation to optimize the retrieval model used to predict the relevance or evidence value of intermediate queries and candidate documents. The method includes steps S101-S106:
S101, collecting the multi-round execution trajectory formed by the interaction between the intelligent agent and the retrieval system during task execution, the execution trajectory including the reasoning text, search action, browsing action, candidate document summary, browsed document content, and final answer feedback for each round;
S102, constructing initial query-document supervised samples based on the temporal correspondence between the candidate document set returned by the search action and subsequent browsing actions;
S103, determining the relevance of the viewed document based on the reasoning text generated immediately after the target document is viewed, and performing reasoning-aware filtering on the positive sample candidates to obtain valid positive samples;
S104, estimating the relevance strength weight of valid positive samples based on the length of the reasoning text after browsing, the number of evidence citations, the number of fact entries, changes in subsequent actions, or combinations thereof;
S105, training the retrieval model based on valid positive samples, negative samples, and relevance strength weights;
S106, redeploying the optimized retrieval model to the agent system to continuously collect new trajectories and form a closed-loop update.
2. The method as described in claim 1, characterized in that, in step S101, each search action returns K candidate documents, and each candidate document returns a title, summary fragment, or body text fragment; the K candidate documents are the top K candidate documents ranked according to the agent's browsable value score; for the i-th candidate document in the t-th round, the agent's browsable value score is calculated as follows: Redundancy term: where σ is the sigmoid function, cos is the cosine similarity, and ln is the natural logarithm; u is the information gap vector obtained by encoding the current task context, historical reasoning text, and unmet information needs; q is the intermediate query vector generated based on the information gap; and s is the candidate summary vector obtained by the text encoder from the candidate document title, summary fragment, or body text fragment; each α coefficient is a learnable or preset parameter; when the set of previously viewed summaries is empty, the redundancy term is 0.
3. The method as described in claim 2, characterized in that, When the highest browsable value score reaches the preset browsing threshold, the corresponding candidate document is browsed; when the highest browsable value score does not reach the preset browsing threshold, the intermediate query is rewritten or a new search action is initiated; when multiple documents need to be browsed, they are selected in descending order of browsable value score, and candidate documents whose similarity to the selected abstract exceeds the preset diversity threshold are removed.
4. The method as described in claim 1, characterized in that, In step S102, if the agent browses candidate documents within a preset time window after the search action, the intermediate query and the browsed document are taken as positive sample candidates; unbrowsed documents, browsed documents that are determined to be invalid, and documents corresponding to other queries in the same candidate set are taken as negative samples.
5. The method as described in claim 1, characterized in that the relevance determination model in step S103 is a reasoning-evidence consistency determination model. The evidence coverage score characterizes the degree to which the candidate document covers the key entities, facts, and constraints of the current problem; the information gap reduction score characterizes the degree to which unmet information needs are reduced after browsing the document; the citation or fact extraction score characterizes the strength of the reasoning text's citation of the document or extraction of fact items; the subsequent-plan contribution score characterizes the degree to which the document promotes subsequent queries, browsing, or answer integration; and the rejection or abandonment score characterizes the degree to which the reasoning text marks the document as irrelevant, duplicate, or no longer used. The relevance determination model calculates the relevance probability based on the above scores: When the relevance probability reaches a preset relevance threshold, the document is retained as a valid positive sample; otherwise, it is discarded or converted into a hard negative sample. The relevance determination model is fine-tuned using manual annotation, feedback on the correctness of the final answer, or weak labels generated by a large language model. The fine-tuning objective is: In the formula, the five c terms correspond to the evidence coverage score, information gap reduction score, citation or fact extraction score, subsequent-plan contribution score, and rejection or abandonment score, respectively. The constant term in the formula is the bias term, and the five γ terms are the weight coefficients of the above five scores, respectively. y represents the relevance label, θ represents the relevance determination model parameters, λ represents the regularization coefficient, and the R term represents the regularization constraint on the relevance determination model parameters, which is used to constrain model complexity and suppress overfitting.
6. The method as described in claim 1, characterized in that, in step S104, let l represent the number of terms in the reasoning text generated immediately after the target document is viewed, and let β represent the median length of the reasoning texts of all valid positive samples; the length-induced unnormalized strength score is: The length-induced relevance strength weights are: Where ε is a smoothing constant, and the mean term in the denominator represents the global mean of the unnormalized strength scores; the longer the reasoning text, the more likely the document is to be used for evidence integration, plan updates, or answer generation.
7. The method as described in claim 6, characterized in that, In step S104, after calculating the length-induced relevance strength weights, joint value estimation is further performed. This joint value estimation includes calculating the evidence citation strength and the fact entry strength. The joint value vector is composed of length strength, evidence citation strength, fact entry strength, subsequent browsing trigger strength, and search round reduction strength, and the joint value score is calculated as follows: Define the normalized denominator: The final sample weights are: Where c represents the number of citations, f represents the number of fact entries, M represents the number of training samples, η is the learnable weight, the softplus function is used for smoothing the mapping, and the clip function is used to limit the weights to a preset range.
8. The method as described in claim 1, characterized in that, in step S105, the retrieval model uses temporal gating coding to first obtain the trajectory memory vector based on the reasoning text after browsing: Then the query vector and document vector are obtained: Match score: The redundancy penalty term is: where E and Enc represent text encoders, Norm represents vector normalization, $W_q$, $W_u$, $W_m$, $W_d$, $W_s$ and $W_a$ are parameter matrices, and each δ coefficient is a learnable or preset parameter.
9. The method as described in claim 8, characterized in that the training objective function in step S105 is: The weighted contrastive loss term is: The margin loss term is: The normalized denominator is: Where τ is the temperature coefficient, m is the margin threshold, the weight term represents the relevance strength weight of each sample, N is the batch size, and the negative sample set includes unviewed documents within the trajectory, viewed documents judged as invalid, and documents associated with other queries within the batch.
10. A system for optimizing a temporally dependent data prediction model based on agent trajectory feedback, characterized in that, It includes a processor and a memory, wherein the memory stores program instructions that, when executed by the processor, implement the method according to any one of claims 1 to 9.