D-optimal distillation retrieval enhanced generation method based on gradient features

By using gradient-feature-based D-optimal distillation retrieval to enhance the generation method, and by combining the embedding model, teacher model and student model, we solve the problems of high online computation, reliance on the golden answer and gradient noise in the existing technology, and achieve efficient and low-cost context selection and generator-aware evidence fragment selection.

CN122240896A · Pending · Publication Date: 2026-06-19 · TIANJIN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: TIANJIN UNIV
Filing Date: 2026-03-23
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing retrieval enhancement generation methods suffer from high online computation costs, reliance on golden answers or high-quality pseudo-answers, gradient noise and instability, and potentially unstable/high-complexity log det calculations, making it difficult to select high-quality contexts and suppress redundancy and conflicts under budget constraints.

Method used

We employ a gradient-feature-based D-optimal distillation retrieval enhancement generation method. By constructing an embedding model, a teacher model, and a student model, we utilize gradient contribution features and ensemble-level information gain to select evidence fragments. Combining distillation with lightweight inference strategies, we reduce online computational overhead and suppress redundancy and conflicts.

Benefits of technology

It significantly reduces token costs under budget constraints, selects a set of evidence that is useful and non-redundant for generating the correct answer, achieves efficient context selection, and reduces online computational overhead, making it suitable for practical system deployment.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a gradient-feature-based D-optimal distillation retrieval enhancement generation method, relating to the fields of retrieval enhancement generation and large language model technology. The method includes the following steps: constructing a retrieval enhancement generation framework, including an embedding model, a teacher model, a student model, and a generator model; the teacher model acquires gradient contribution features and outputs a teacher supervision signal; the retrieval enhancement generation framework performs offline training, using a two-stage recall method to acquire distilled data for student model training, and distilling the marginal gain score and selection result output by the teacher model as a knowledge transfer signal into the student model; the retrieval enhancement generation framework performs online inference, the embedding model recalls candidate evidence fragments based on the question, the student model outputs scores for each candidate evidence fragment, and the evidence fragment selection is completed under budget constraints; a subset of evidence fragments is input into the generator model to generate the answer. This invention significantly reduces token cost and suppresses redundancy and conflicts in the selected evidence fragments.

Description

Technical Field

[0001] This invention belongs to the fields of retrieval enhancement generation and large language model technology, and particularly relates to a D-optimal distillation retrieval enhancement generation method based on gradient features.

Background Technology

[0002] The basic workflow of retrieval-augmented generation (RAG) is currently: retrieval → context concatenation → generation. However, the evidence fragments returned by retrieval are often lengthy, complex, and repetitive, so concatenating them directly can overrun the context budget and introduce noise. Context selection over these evidence fragments is therefore required. Typically, in RAG and long-context question answering (QA) systems, the retrieval unit returns a large number of candidate evidence fragments (chunks) related to the question. Limited by the model context length and inference cost, the system must select a subset of the candidate set as the generator's input context. This selection process presents the following typical technical challenges:

[0003] (1) Selection of evidence subset under budget constraints: Within a fixed context budget, the selected context should maximize the quality of downstream responses while avoiding redundancy and noise.

[0004] (2) Redundancy and conflict: Search results are often semantically similar, repetitive, or contain conflicting information, which dilutes or misleads the generator’s attention.

[0005] (3) Generator perception: It is impossible to assess the true contribution of a piece of evidence to generating the correct answer by simply calculating the similarity between the question and the evidence fragment. The ideal choice should be to utilize the generator's sensitivity to evidence.

[0006] (4) Online deployability: Each inference relies on expensive multiple generation / backpropagation, making it difficult to deploy in real-world systems.

[0007] To address these technical challenges, existing work has studied question-based similarity reranking, set-level selection, generator-aware context evaluation / selection, and data selection. These representative works improve on the challenges above, but shortcomings remain. For question-based similarity ranking, popular existing methods include BM25 / embedding top-k, cross-encoder reranking, and MMR (maximal marginal relevance). These methods typically score fragment by fragment; they model redundancy and conflict between fragments insufficiently and usually do not use generator feedback.

[0008] For set-level selection using DPP / logdet, SMART-RAG proposes a training-free context selection framework. It uses the Determinantal Point Process (DPP) to simultaneously model relevance and diversity in a determinant / logdet form, and further considers conflicting factors to mitigate the problems of redundancy and contradictory evidence. A common feature of these methods is that the objective function is set-level, but the evidence representation / kernel function is often based on embedding representations or heuristics, typically lacking direct signals with generator sensitivity.

[0009] For generator-aware context evaluation / selection, Influence Guided Context Selection recasts context quality assessment as a data valuation problem at inference time. It proposes the Contextual Influence value (CI value), defined as the performance degradation after removing a given context segment, which simultaneously reflects query-aware, list-aware, and generator-aware aspects, and it trains a surrogate model to predict the CI value at inference time to reduce computational overhead. This approach explicitly uses the generator, but computing the oracle CI value often requires multiple generations or evaluations, making it inherently computationally heavy.
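To make the cost of the leave-one-out definition concrete, the following is a minimal Python sketch of a CI-style value. It is an illustration only and not the cited work's implementation: `ci_values` and `toy_utility` are hypothetical names, and the toy utility (keyword coverage) merely stands in for an actual generator evaluation.

```python
from typing import Callable, List, Sequence


def ci_values(contexts: Sequence[str],
              utility: Callable[[List[str]], float]) -> List[float]:
    """Leave-one-out influence: utility(all) - utility(all without fragment i)."""
    full = utility(list(contexts))
    values = []
    for i in range(len(contexts)):
        reduced = [c for j, c in enumerate(contexts) if j != i]
        values.append(full - utility(reduced))  # one extra evaluation per fragment
    return values


def toy_utility(ctxs: List[str]) -> float:
    """Hypothetical stand-in for generator quality: share of gold keywords covered."""
    gold = {"paris", "1889"}
    covered = {w for c in ctxs for w in c.lower().split() if w in gold}
    return len(covered) / len(gold)


chunks = ["Paris hosts the tower", "Built in 1889", "Paris is in France"]
values = ci_values(chunks, toy_utility)
print(values)  # [0.0, 0.5, 0.0]
```

With n fragments the sketch calls the utility n + 1 times, which is the overhead the paragraph above refers to when each call is a full generation. The two fragments that both mention Paris each receive a value of zero, because either one can be removed without loss.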

[0010] For data selection in the context of information gain / Fisher / logdet, FisherSFT (ICML 2025) proposed using the Fisher information matrix to measure information gain, selecting a subset of training samples under budget constraints to improve the data efficiency of SFT, and discussed a computable approximation. This approach proves the universality of Fisher-logdet in the budget-information gain problem, but mainly focuses on the selection of training data subsets rather than the selection of contextual evidence fragments during inference.

[0011] The aforementioned schemes and representative works still suffer from drawbacks such as high online computation costs, reliance on a "golden answer" or high-quality pseudo-answers, gradient noise and instability, and potentially unstable / complex log det calculations. The main reasons are: (1) for each candidate evidence fragment, one forward pass plus one backward pass is needed to obtain the gradient with respect to the input embeddings; if the number of candidates is large, the inference latency and memory overhead become unacceptable; (2) gradient features are constructed around the loss on the golden answer; without the golden answer, the supervision signal is hard to define, which makes them unusable during inference; (3) raw gradients are sensitive to scale and easily affected by details of the prompt wording, and token averaging dilutes the truly effective tokens; (4) if matrix updates and log det are computed in the original hidden dimension, the calculation is expensive and numerically sensitive (requiring eps, stabilization, and approximation).

[0012] To address this, this invention proposes a gradient-feature-based D-optimal distillation retrieval enhancement generation method. In RAG / QA scenarios, it provides a high-quality context selection method under budget constraints, so that the selected evidence fragments maintain high answer quality while significantly reducing token costs and suppressing redundancy and conflicts. It introduces a generator-aware characterization of evidence contribution, raising the selection criterion from semantic relevance to causal contribution to generating the correct answer. Through distillation and lightweight inference strategies, it converts the expensive teacher model into a deployable student model (typically a small model), thus eliminating the need for backpropagation and a "golden answer" during the inference phase. It maintains complementarity and diversity through set-level modeling and can be combined with conflict modeling / filtering mechanisms.

Summary of the Invention

[0013] The purpose of this invention is to provide a gradient feature-based D-optimal distillation retrieval enhancement generation method to solve the problems of high online computation cost, reliance on golden answers or high-quality pseudo answers, gradient noise and instability, and potentially unstable / high complexity of log det calculation in the above-mentioned background art.

[0014] To achieve the above objectives, the present invention employs the following technical solution. This invention proposes a gradient-feature-based D-optimal distillation retrieval enhancement generation method, comprising the following steps:
S1. Construct a retrieval enhancement generation framework, which includes an embedding model, a teacher model, a student model, and a generator model. The embedding model is used to recall Top-K candidate evidence fragments from the candidate evidence fragment set according to the question; the teacher model is used to construct gradient contribution features and output teacher supervision signals based on the question, candidate evidence fragments, generator model, and golden answer during the training phase; the student model is used to score candidate evidence fragments during the inference phase; and the generator model is used to generate answers based on the finally selected subset of evidence fragments.
S2. Improve the teacher model, which obtains gradient contribution features and outputs teacher supervision signals.
S3. The retrieval enhancement generation framework performs offline training: the embedding model performs the first stage of recall for candidate evidence fragments corresponding to the question; the teacher model calculates the gradient contribution features of the recalled candidate evidence fragments; the second stage of recall selects a subset of evidence fragments under budget constraints based on the set-level log det information gain; the two-stage recall yields distilled data for training the student model; and the marginal gain scores and selection results output by the teacher model are distilled into the student model as knowledge transfer signals.
S4. The retrieval enhancement generation framework performs online inference: the embedding model recalls candidate evidence fragments based on the question, the student model outputs a score for each candidate evidence fragment, evidence fragments are selected under budget constraints, and the selected subset of evidence fragments is input into the generator model to generate the answer.

[0015] Preferably, the retrieval enhancement generation framework in S1 is specifically as follows: the embedding model is used to recall Top-K candidate evidence fragments from the candidate evidence fragment set according to the question, and adopts a dual-tower text embedding model to achieve vector recall. The teacher and student models use the same basic Transformer decoder architecture, differing only in parameter size.

[0016] Furthermore, the teacher model adopts a pre-trained large language model with a larger parameter scale than the student model; the teacher model adopts an 8B parameter scale Llama model, and the student model adopts a 1B parameter scale Llama model.

[0017] The inputs and outputs of the retrieval enhancement generation framework during the offline training phase are as follows. The input includes: the question q, the candidate evidence fragment set C, the context budget B, the generator model G, and the golden answer y. The context budget B is a preset token limit used for the selection of evidence fragments under budget constraints by the teacher and student models, rather than an output constraint on the embedding model itself. The output includes: a subset S of semantically relevant evidence fragments selected within the budget, and the marginal gain score or selection label that the teacher model outputs for each candidate evidence fragment.

[0018] The inputs and outputs of the retrieval enhancement generation framework during the online inference phase are as follows. The input includes: the question q to be answered, the candidate evidence set C, and the context budget B. The output includes: a subset S of evidence fragments selected by the student model, and the answer output by the generator model based on this subset of evidence fragments.

[0019] Preferably, the improved teacher model in S2 is as follows: the teacher model takes each candidate evidence fragment, the question q, and the golden answer y as input and, combined with the generator model G, obtains gradient contribution representations; a set-level Fisher approximation matrix is constructed from these gradient contribution representations, and the informativeness and complementarity of the selected evidence fragment set are measured by maximizing the log det information gain; under the context budget B, a cost-normalized greedy strategy progressively selects the evidence fragments with the largest marginal gain, yielding the subset of evidence fragments selected by the teacher model. The teacher model outputs not only the selected subset of evidence fragments but also the marginal gain score and / or the label of whether each candidate evidence fragment is selected; the marginal gain score and / or the selection label constitute the teacher supervision signal.

[0020] Furthermore, the gradient contribution representation is obtained as follows. For each candidate evidence fragment, construct the input template: [formula not preserved in this text]

[0021] Define the loss so that it applies only to the tokens of the golden answer y, which gives: [formula not preserved in this text]

[0022] Take the gradient on the context tokens, construct the token-level contributions, and then aggregate them into an evidence-fragment representation: [formula not preserved in this text]

[0023] Here, Pool denotes average pooling, and the input term denotes the input of the embedding representation layer.

[0024] Furthermore, the subset of evidence fragments selected by the teacher model is obtained as follows. First, define the set-level Fisher approximation matrix: [formula not preserved in this text]

[0025] Then maximize the information content of the set: [formula not preserved in this text]

[0026] Each step selects the evidence fragment that maximizes the cost-normalized marginal gain.

[0027] Preferably, in step S3, a two-stage recall is used to obtain distillation data, as follows: a pre-trained embedding model recalls the Top-K candidate evidence fragments for question q; for each question q, the Top-K candidate evidence fragments are used as input to the teacher model, and the teacher model outputs the marginal gain of each fragment [formula not preserved in this text] and whether the fragment is ultimately selected. The inputs and outputs of the teacher model are used as distillation data.

[0028] Preferably, the student model training in S3 is specifically as follows: the student model employs a regression / ranking scorer that learns a mapping from a question and a candidate evidence fragment to a score [formula not preserved in this text]; the student model scores each candidate evidence fragment, and the selected evidence fragments are obtained by sorting the fragments by score and keeping those that fit within the context budget.

[0029] Preferably, the retrieval enhancement generation framework in S4 performs online inference, specifically as follows: the embedding model recalls the Top-K candidate evidence fragments for question q, each candidate evidence fragment is scored by running only the forward pass of the student model, and the set of selected evidence fragments is input into the generator model to produce the answer.

[0030] Compared with the prior art, the beneficial effects of the present invention are: (1) Compared with context selection methods that rely solely on similarity or heuristic quality scores based on embedding representations, this invention uses a generator-aware gradient contribution representation and set-level information gain. This enables joint modeling of the validity and complementarity of evidence, and therefore favors evidence sets that are truly useful and non-redundant for generating the correct answer under the same token budget. This advantage comes from specific technical means: constructing gradients from a loss computed only on the golden answer, and using the marginal gain of log det as the set-level selection criterion, rather than sorting by per-fragment similarity.

[0031] (2) Compared with generator-aware difference estimation methods (such as the CI value), this invention is better suited to engineering deployment: through the two-stage process (recall, then fine ranking) and teacher → student distillation, the expensive gradient-log det process is placed mainly in the offline training stage, and the inference stage only needs to run the forward pass of the student model to complete selection within budget, thereby significantly reducing the online computation overhead. This advantage comes from specific technical means: distilling the marginal gain / selection decisions of the teacher model into the scorer of the student model, and adopting a budget-constrained, score-based selection strategy during inference.

Attached Figure Description

[0032] Figure 1 is a flowchart of the gradient-feature-based D-optimal distillation retrieval enhancement generation method of this invention.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0034] Example 1: The gradient feature-based D-optimal distillation retrieval enhancement generation method includes the following steps: Step 1: Construct an overall retrieval enhancement generation framework and dataset.

[0035] The overall retrieval enhancement generation framework is configured with an embedding model, a teacher model, a student model, and a generator model. The embedding model is used for candidate evidence fragment recall, the teacher model is used for offline construction of high-quality supervision signals, the student model is used for online within-budget evidence selection, and the generator model is used to output the final answer. A training set is built in advance; the dataset includes the question q, documents or evidence fragments, the context budget B, and the corresponding golden answer y.

[0036] In this embodiment, wherein: (1) The embedding model is used to represent the document or evidence fragments into vectors according to the question, and to recall the Top-K candidate evidence fragments that are semantically related to the question from the candidate evidence fragment set; Specifically, the embedding model can be implemented using the existing dual-tower text embedding model for vector recall; the embedding model itself does not directly perform the final evidence selection within the budget, and the final evidence selection under budget constraints is completed by the teacher model or the student model.

[0037] (2) The teacher model is used in the offline training phase to calculate the gradient contribution features of each candidate evidence fragment to the generation of the answer based on the question, candidate evidence fragments, generator model, and golden answer, and outputs the teacher supervision signal by combining the set-level log det information gain. The teacher model uses an 8B Llama large language model as the backbone model.

[0038] (3) The student model is used to learn the supervision signals output by the teacher model and to score candidate evidence fragments during the online reasoning stage. The student model uses a Llama large language model with a parameter scale of 1B as the backbone model.

[0039] (4) The generator model is used to receive the final selected subset of evidence fragments and generate the answer to the question.

[0040] The student model can be obtained by distilling the teacher model, but this invention is not limited to the model sizes mentioned above. The teacher and student models can employ the same basic Transformer decoder architecture, differing only in parameter size; the teacher model is used for gradient calculation and set-level selection in the offline phase, while the student model is used for lightweight forward scoring in the online phase. Because the teacher model has more parameters, it exhibits stronger representation ability and generation sensitivity, making it suitable for constructing high-quality supervision signals; because the student model has fewer parameters, its online inference cost is lower, making it suitable for deployment in practical retrieval enhancement generation systems.

[0041] Step 2: Obtain gradient contribution features and output teacher supervision signals based on the teacher model.

[0042] In this embodiment, the teacher model is a large language model based on a Transformer decoder architecture, and its backbone network can be Llama 8B. The teacher model receives a sequence input formed by concatenating a question, a single candidate evidence fragment, and an answer hint template. During the prediction of the golden answer, it only backpropagates the loss of the token corresponding to the golden answer, thereby obtaining the gradient information of the input context. Furthermore, the gradient contribution of the context token can be extracted in the embedding representation layer or the input representation layer of the teacher model, and the token-level contributions can be pooled into an evidence fragment-level representation through pooling operations.
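The pooling step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `grad_times_input_feature` is hypothetical, the gradients are assumed to have been computed already by backpropagating the golden-answer loss to the embedding layer, and the token contribution is assumed to be the elementwise product of gradient and input embedding.

```python
import numpy as np


def grad_times_input_feature(grads, embeds, context_mask):
    """Mean-pool token-level Grad x Input contributions over the context tokens.

    grads, embeds: arrays of shape (seq_len, hidden_dim)
    context_mask:  boolean array of shape (seq_len,), True on context tokens
    """
    grads = np.asarray(grads, dtype=float)
    embeds = np.asarray(embeds, dtype=float)
    mask = np.asarray(context_mask, dtype=bool)
    if not mask.any():
        raise ValueError("context_mask selects no tokens")
    token_contrib = grads * embeds            # assumed elementwise Grad x Input
    return token_contrib[mask].mean(axis=0)   # average pooling over context tokens


# Toy values: three tokens, hidden size 2; only the last two tokens are context.
phi = grad_times_input_feature(
    grads=[[1, 2], [3, 4], [5, 6]],
    embeds=[[1, 1], [2, 0], [0, 1]],
    context_mask=[False, True, True],
)
print(phi)  # [3. 3.]
```

Question and answer tokens are excluded by the mask, so only the evidence fragment's own tokens contribute to its representation.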

[0043] (i) Obtaining gradient contribution features. In existing technologies, context selection largely relies on similarity based on embedding representations or on heuristic quality scores; for example, the determinant mechanism of SMART-RAG mainly depends on similarity-based kernels to reflect diversity and does not use gradient contribution features. This invention introduces gradient contribution features, using the input gradient of the generator on the answer prediction task to construct an evidence fragment representation that directly characterizes the contribution of the evidence fragment to the answer generation loss.

[0044] In this embodiment, the gradient contribution features are obtained as follows. For each candidate evidence fragment, construct the input template from the question, the fragment, and the answer prompt: [formula not preserved in this text]

[0045] Defining the loss to apply only to the tokens of the golden answer (with the Question and Context tokens set to be ignored) gives: [formula not preserved in this text]

[0046] At the input of the embedding representation layer, gradients are taken on the context tokens to construct token-level contributions, which are then aggregated into a Grad×Input evidence-fragment representation: [formula not preserved in this text]

[0047] The result is the gradient contribution feature, i.e., the evidence-fragment representation constructed from gradients; Pool can be average (mean) pooling to reduce the dilution effect of long chunks.
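Since the formulas for this step are not available in this text, one form consistent with the surrounding description is sketched below. It is a hedged reconstruction, not the patent's own notation: the symbols c_i, x_i, E_t, g_{i,t}, and φ_i are assumed names, and the elementwise product is one common reading of Grad×Input.

```latex
% Hedged reconstruction; c_i, x_i, E_t, g_{i,t}, \phi_i are assumed symbols.
\begin{align}
x_i &= [\,\text{Question: } q;\ \text{Context: } c_i;\ \text{Answer:}\,] \\
\mathcal{L}_i &= -\sum_{t=1}^{|y|} \log p_G\!\left(y_t \mid x_i,\, y_{<t}\right) \\
g_{i,t} &= \frac{\partial \mathcal{L}_i}{\partial E_t}, \qquad t \in \mathrm{ctx}(c_i) \\
\phi_i &= \mathrm{Pool}_{t \in \mathrm{ctx}(c_i)}\!\left(g_{i,t} \odot E_t\right) \in \mathbb{R}^{d}
\end{align}
```

Here E_t is the embedding-layer input at token position t, ctx(c_i) is the set of positions occupied by the evidence fragment, and the loss sums only over golden-answer tokens, matching the statement that Question and Context tokens are ignored.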

[0048] This invention advances the selection criteria from semantic relevance to the effectiveness at the generator behavior level through gradient representation, but the teacher model is computationally expensive and requires further distillation.

[0049] (ii) Obtaining the set-level log det information gain objective (D-optimal / Fisher-like). In existing technologies, the CI value achieves query / list / generator-aware estimation by removing a context and measuring the resulting performance degradation, and then trains a surrogate predictor; in essence it is a difference estimate rather than a log det information gain objective. This invention generates set-level supervision signals on the teacher model side, and the results serve as the source for the subsequent knowledge transfer, as shown in Figure 1.

[0050] Specifically, the teacher model constructs a set-level Fisher approximation matrix based on the gradient contribution representations of the candidate evidence fragments, and measures the information content and complementarity of the selected evidence fragment set by maximizing the log det information gain. Under the context budget constraint, a cost-normalized greedy strategy progressively selects the evidence fragments with the largest marginal gain, resulting in the subset of evidence fragments selected by the teacher model. The teacher model outputs not only the final selected subset of evidence fragments but also the marginal gain score and / or selection label for each candidate evidence fragment; these marginal gain scores and / or selection labels constitute the teacher supervision signals required for the "knowledge transfer" step in Figure 1 and are used in the next step to distill and train the student model.

[0051] First, define the set-level Fisher approximation matrix: [formula not preserved in this text]

[0052] Then maximize the information content of the set: [formula not preserved in this text]

[0053] And use a cost-normalized greedy algorithm: at each step, select the evidence fragment that maximizes the cost-normalized marginal gain.
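One standard way to write the three steps above is given below. This is a hedged sketch rather than the patent's exact formulas: the regularizer λI, the feature symbol φ_i, and the cost function cost(·) are assumptions introduced for illustration.

```latex
% Hedged sketch; \lambda, \phi_i and cost(.) are assumed symbols.
\begin{align}
F(S) &= \lambda I + \sum_{i \in S} \phi_i \phi_i^{\top} \\
S^{*} &= \arg\max_{S \subseteq C} \ \log\det F(S)
        \quad \text{s.t.} \quad \sum_{i \in S} \mathrm{cost}(c_i) \le B \\
\Delta_i(S) &= \log\det F(S \cup \{i\}) - \log\det F(S)
             = \log\!\left(1 + \phi_i^{\top} F(S)^{-1} \phi_i\right) \\
i^{*} &= \arg\max_{i \notin S} \ \frac{\Delta_i(S)}{\mathrm{cost}(c_i)}
\end{align}
```

The closed form for the marginal gain follows from the matrix determinant lemma, det(F + φφᵀ) = det(F)(1 + φᵀF⁻¹φ), so each greedy step needs only a quadratic form rather than a fresh determinant. In the last line, the argmax is taken over fragments that still fit in the remaining budget.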

[0054] The log det objective used in this invention inherently suppresses redundancy, making it suitable for modeling set complementarity; the CI value, by contrast, aligns more directly with downstream performance, but its oracle is more costly to compute. The two differ in their specific technical approach to set modeling.
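The redundancy-suppressing behavior can be seen in a small NumPy sketch of cost-normalized greedy log det selection. It is an illustration under assumptions, not the patent's implementation: the function name `greedy_logdet_select`, the regularizer `lam`, and the toy features are hypothetical, and costs are assumed to be strictly positive.

```python
import numpy as np


def greedy_logdet_select(features, costs, budget, lam=1.0):
    """Greedy maximization of log det(lam*I + sum_i phi_i phi_i^T) per unit cost.

    Returns the selected indices in pick order and the marginal gain at pick time.
    """
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    f_inv = np.eye(d) / lam                  # inverse of F(S) for the empty set
    selected, gains = [], {}
    remaining, spent = set(range(n)), 0.0
    while remaining:
        best, best_ratio, best_gain = None, -np.inf, 0.0
        for i in sorted(remaining):
            if spent + costs[i] > budget:
                continue                     # does not fit in the remaining budget
            phi = features[i]
            gain = np.log1p(phi @ f_inv @ phi)   # matrix determinant lemma
            ratio = gain / costs[i]
            if ratio > best_ratio:
                best, best_ratio, best_gain = i, ratio, gain
        if best is None:
            break                            # nothing affordable is left
        phi = features[best]
        f_phi = f_inv @ phi
        f_inv -= np.outer(f_phi, f_phi) / (1.0 + phi @ f_phi)  # Sherman-Morrison
        selected.append(best)
        gains[best] = best_gain
        spent += costs[best]
        remaining.remove(best)
    return selected, gains


# Fragments 0 and 1 are duplicates with large norm; fragment 2 is weaker but new.
toy_features = [[2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
selected, gains = greedy_logdet_select(toy_features, costs=[10, 10, 10], budget=20)
print(selected)  # [0, 2]
```

After fragment 0 is picked, its duplicate's marginal gain drops from log 5 to log 1.8, below the gain log 2 of the weaker but complementary fragment 2, so the duplicate is passed over. A ranking by individual strength alone would have chosen both duplicates.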

[0055] Step 3: Obtain the student model based on knowledge distillation.

[0056] Existing CI value work also proposes surrogates to reduce inference overhead, but their supervision sources and structural designs differ (they are modeled around the CI oracle and hierarchical interactions). This invention uses high-quality marginal gain supervision signals generated offline by the gradient-based teacher model. The student model is trained by knowledge distillation, which enables it to learn the teacher model's ability to judge the importance and complementarity of candidate evidence fragments, thereby avoiding backpropagation and reliance on the golden answer in the online stage.

[0057] In this embodiment, the retrieval enhancement generation framework has an offline training phase and an online inference phase. The offline training phase includes: using the embedding model to perform a first-stage recall of candidate evidence fragments corresponding to the question; using the teacher model to calculate gradient contribution features of the recalled candidate evidence fragments; and selecting a subset of evidence under budget constraints based on the set-level log det information gain, thereby constructing distilled data to train the student model. The online inference phase includes: first, recalling Top-K candidate evidence fragments for the question to be answered using the embedding model; then, running only the student model to score the candidate evidence fragments and selecting a subset of evidence fragments within the context budget; finally, inputting the selected subset of evidence fragments into the generator model to generate the answer.

[0058] (i) Offline training phase. The inputs and outputs of the retrieval enhancement generation framework during the overall training phase are as follows. The inputs include: the question q, the candidate evidence fragment set C, the context budget B, the generator model G, and the golden answer y corresponding to the training samples. The outputs include: a subset S of evidence fragments selected within the budget B, and the marginal gain scores or selection labels that the teacher model outputs for each candidate evidence fragment.

[0059] In this embodiment, the student model is a lightweight large language model based on the Transformer decoder architecture, and its backbone network can be Llama 1B. The student model receives the question and candidate evidence fragments as input, and outputs the relevance score, regression score, or ranking score of the candidate evidence fragments through forward propagation. The training objectives of the student model include the following: (1) a regression objective that fits the marginal gain scores output by the teacher model; (2) a classification objective that fits the selected / not-selected labels output by the teacher model; (3) a learning-to-rank objective built from the candidate ranking relationships given by the teacher model.
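The three training objectives listed above could take many concrete forms; the source does not fix the loss functions. As a minimal NumPy sketch under that caveat, the following uses mean squared error, binary cross-entropy on logits, and a pairwise logistic ranking loss. All function names are hypothetical, and the student scores are treated as plain arrays rather than model outputs.

```python
import numpy as np


def regression_loss(scores, teacher_gains):
    """Mean squared error between student scores and teacher marginal gains."""
    s = np.asarray(scores, dtype=float)
    g = np.asarray(teacher_gains, dtype=float)
    return float(np.mean((s - g) ** 2))


def classification_loss(scores, selected):
    """Binary cross-entropy on sigmoid(score) against teacher selection labels."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(selected, dtype=float)
    # log(1 + exp(s)) - y * s is the numerically stable form of BCE with logits.
    return float(np.mean(np.logaddexp(0.0, s) - y * s))


def pairwise_ranking_loss(scores, teacher_gains):
    """Logistic loss over all pairs (i, j) where the teacher ranks i above j."""
    s = np.asarray(scores, dtype=float)
    g = np.asarray(teacher_gains, dtype=float)
    losses = [np.logaddexp(0.0, -(s[i] - s[j]))
              for i in range(len(s)) for j in range(len(s)) if g[i] > g[j]]
    return float(np.mean(losses)) if losses else 0.0


teacher_gains = [1.6, 0.6, 0.7]   # toy marginal gains from a teacher
teacher_labels = [1, 0, 1]        # toy selection labels from a teacher
student_scores = [1.2, 0.1, 0.9]  # toy student outputs for the same fragments
total = (regression_loss(student_scores, teacher_gains)
         + classification_loss(student_scores, teacher_labels)
         + pairwise_ranking_loss(student_scores, teacher_gains))
print(total)
```

Summing the three terms with equal weight is an arbitrary choice for the example; in practice the weights, or the choice of a single objective, would need to be tuned so that the learned score ordering matches the budgeted selection used at inference.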

[0060] To address computational redundancy and label dependency, this invention treats the teacher and student models differently. The teacher model is used in a two-stage recall process to construct distilled data, and the student model is then trained on that data, specifically: (1) Two-stage recall: first, use the embedding model to perform coarse recall on the candidate evidence fragments to obtain the Top-K candidate evidence fragments, where K is much smaller than the original total number of candidates N; (2) Teacher supervision construction: the teacher model calculates the gradient contribution representation for each Top-K candidate evidence fragment and performs budget-constrained greedy selection by log det information gain, outputting the marginal gain score, ranking information, and / or the label of whether each candidate evidence fragment is selected; (3) Student distillation training: the marginal gain scores, ranking relations, and / or selection labels output by the teacher model are used to distill the student model, so that the student model learns the evidence selection ability of the teacher model.
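The coarse recall in item (1) can be sketched as a similarity search over precomputed vectors. The example below is a minimal NumPy illustration, assuming cosine similarity and nonzero vectors; the source only specifies a dual-tower embedding model, so the similarity measure and the function name `recall_top_k` are assumptions.

```python
import numpy as np


def recall_top_k(query_vec, fragment_vecs, k):
    """Return indices of the k fragments most cosine-similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(fragment_vecs, dtype=float)
    q = q / np.linalg.norm(q)                         # assumes a nonzero query vector
    m = m / np.linalg.norm(m, axis=1, keepdims=True)  # assumes nonzero fragment vectors
    sims = m @ q
    order = np.argsort(-sims, kind="stable")          # descending; ties keep input order
    return order[:k].tolist()


# Toy 2-D embeddings: N = 4 candidate fragments, keep K = 2.
fragment_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = recall_top_k([1.0, 0.0], fragment_vecs, k=2)
print(top)  # [0, 2]
```

Only these K fragments are passed to the teacher model, which is what keeps the number of forward and backward passes manageable when N is large.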

[0061] This invention uses the teacher model (gradient-log det) to generate high-quality marginal gain supervision offline and trains the student model as the scorer used during inference, avoiding online backpropagation and dependence on the answer.

[0062] (ii) Online inference phase. The inputs and outputs of the retrieval enhancement generation framework during the inference phase are as follows. The input includes: the question q to be answered, the candidate evidence set C, and the context budget B. The output includes: a subset S of evidence fragments selected by the student model, and the answer output by the generator model based on this subset of evidence fragments.

[0063] In the online inference stage, evidence scoring requires only a forward pass of the student model, with no backpropagation and no golden answer. The stage proceeds as follows: the embedding model recalls the Top-K candidate evidence fragments, the student model outputs a score for each candidate evidence fragment, evidence fragment selection is completed within the context budget, and the selected subset of evidence fragments is input into the generator model to generate the answer.
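As a non-limiting illustration, the recall and in-budget selection steps of the online stage can be sketched as follows. The sketch assumes cosine similarity for the embedding recall, token counts as costs, and a simple rule that walks the candidates in descending score order and keeps each fragment that still fits the budget; the function names are hypothetical, and the student scores are passed in as plain numbers standing in for the student model's forward pass.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_top_k(question_vec, fragment_vecs, k):
    """Coarse recall: indices of the k fragments most similar to the question."""
    order = sorted(range(len(fragment_vecs)),
                   key=lambda i: cosine(question_vec, fragment_vecs[i]),
                   reverse=True)
    return order[:k]

def select_within_budget(candidates, scores, costs, budget):
    """Keep the highest-scoring candidates whose total cost stays within the budget.

    candidates: indices returned by recall_top_k
    scores:     mapping index -> student score (stand-in for the forward pass)
    costs:      mapping index -> cost of the fragment (e.g. token count)
    """
    chosen, spent = [], 0
    for i in sorted(candidates, key=lambda i: scores[i], reverse=True):
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen
```

The fragments returned by `select_within_budget` would then be concatenated with the question and passed to the generator model; no gradient computation appears anywhere in this path.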

[0064] Through distillation, this invention transfers knowledge from an "expensive but powerful" set-level teacher model to a cheap and usable student model, making it highly practical for engineering applications. The key lies in the consistency between the distillation objective design (regression / ranking) and the budget selection strategy.

[0065] In summary, the core process of this embodiment is as follows. First, the embedding model completes the coarse recall of candidate evidence fragments. Next, the teacher model calculates the gradient contribution features under golden answer supervision and combines them with the set-level log det objective to select high-quality evidence under budget constraints. The marginal gain scores and selection results output by the teacher model are then distilled into the student model as knowledge transfer signals. Finally, in the online inference stage, only the lightweight process of "embedding recall + student model forward scoring + in-budget evidence selection + answer generation by the generator" is retained.

[0066] The above description is only for the purpose of helping to understand the method and core essence of the present invention, but the scope of protection of the present invention is not limited thereto. For those skilled in the art, any equivalent substitutions or modifications made to the technical solution and inventive concept disclosed in the present invention within the scope of the technology disclosed in the present invention should be covered within the scope of protection of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A gradient-feature-based D-optimal distillation retrieval enhancement generation method, characterized in that it includes the following steps: S1. Construct a retrieval enhancement generation framework, which includes an embedding model, a teacher model, a student model, and a generator model; S2. Improve the teacher model, so that it obtains gradient contribution features and outputs teacher supervision signals; S3. The retrieval enhancement generation framework performs offline training: the embedding model performs the first stage of recall for the candidate evidence fragments corresponding to the question; the teacher model calculates the gradient contribution features of the recalled candidate evidence fragments, and the second stage of recall is achieved by selecting a subset of evidence fragments under budget constraints based on the set-level log det information gain; the two-stage recall is used to obtain distillation data for training the student model; and the marginal gain score and selection results output by the teacher model are distilled into the student model as knowledge transfer signals; S4. The retrieval enhancement generation framework performs online inference: the embedding model recalls candidate evidence fragments based on the question, the student model outputs a score for each candidate evidence fragment, the selection of evidence fragments is completed under budget constraints, and the selected subset of evidence fragments is input into the generator model to generate the answer.

2. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to claim 1, characterized in that the retrieval enhancement generation framework in S1 is as follows: the embedding model is used to recall the Top-K candidate evidence fragments from the candidate evidence fragment set according to the question; the embedding model adopts a dual-tower text embedding model to achieve vector recall; the teacher model and the student model use the same basic Transformer decoder architecture, differing only in parameter size.

3. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to claim 1, characterized in that the improved teacher model in S2 is as follows: the teacher model takes each candidate evidence fragment, the question, and the golden answer as input and, in combination with the generator model, obtains the gradient contribution representations; a set-level Fisher approximation matrix is constructed from these gradient contribution representations, and the informativeness and complementarity of the selected evidence fragment set are measured by maximizing the log det information gain; under the context budget constraint, a cost-normalized greedy strategy is used to progressively select the evidence fragment with the largest marginal gain, thus obtaining the subset of evidence fragments selected by the teacher model; the teacher model outputs the subset of selected evidence fragments, along with the marginal gain score and / or selection label corresponding to each candidate evidence fragment; the marginal gain score and / or selection label constitute the teacher supervision signal.

4. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to claim 3, characterized in that the gradient contribution representation is obtained as follows: for each candidate evidence fragment, an input template is constructed; a loss is defined that applies only to the tokens of the golden answer; the gradient of this loss is taken on the context tokens to construct token-level contributions, which are then aggregated into an evidence fragment representation, where Pool denotes average pooling and the gradient is taken with respect to the input of the embedding representation layer.

5. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to claim 4, characterized in that the subset of evidence fragments selected by the teacher model is obtained as follows: first, the set-level Fisher approximation matrix is defined; then, the information content of the set, measured by the log det of this matrix, is maximized; at each step, the evidence fragment with the largest marginal gain is selected.

6. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to claim 5, characterized in that, in step S3, a two-stage recall process is used to obtain the distillation data, as detailed below: a pre-trained embedding model is used to recall the Top-K candidate evidence fragments for the question; for each question, the Top-K candidate evidence fragments are used as input to the teacher model, and the teacher model outputs the marginal gain of each candidate evidence fragment and whether it is ultimately selected; the input and output of the teacher model are used as the distillation data.

7. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to claim 6, characterized in that the training of the student model in S3 is as follows: the student model employs a regression / ranking scorer, which learns a mapping from the question and a candidate evidence fragment to a score; the student model scores each candidate evidence fragment, and the selected evidence fragments are obtained by sorting the fragments by score within the context budget.

8. The gradient-feature-based D-optimal distillation retrieval enhancement generation method according to any one of claims 1-7, characterized in that the retrieval enhancement generation framework in S4 performs online inference as follows: the embedding model is used to recall the Top-K candidate evidence fragments for the question; each candidate evidence fragment is scored by running only the forward pass of the student model; and the set of selected evidence fragments is input into the generator model to generate the answer.