A Two-Stage Document Filtering and Robust Fine-Tuning Method Based on Graph Attention Networks

By employing a two-stage document filtering and robust fine-tuning method based on graph attention networks, the credibility problem of retrieval-augmented generation systems in mixed document environments is addressed. The method identifies and filters irrelevant and noisy documents, improving the accuracy and reliability of large language models in complex scenarios.

CN121997918B · Active · Publication Date: 2026-06-30 · CHANGCHUN UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHANGCHUN UNIV OF SCI & TECH
Filing Date: 2026-04-10
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing retrieval-augmented generation systems struggle to accurately identify useful documents in mixed document environments and are susceptible to interference from counterfactual information, leading to unreliable outputs. Furthermore, traditional methods are unstable when noisy or counterfactual documents make up a high proportion of the retrieved content.

Method used

A two-stage document filtering method based on graph attention networks is adopted, including constructing multi-class document pools, paragraph-level semantic graphs, training irrelevant and noisy document discrimination models, and enhancing the credibility of the large language model through joint robust fine-tuning. Irrelevant documents are first removed and then noisy documents are identified. Document discrimination training samples and question answering training samples are constructed to improve the robustness of the model.

Benefits of technology

It effectively captures the semantic relationships between queries and multiple documents, reduces noise and misleading information interference, improves the credibility and accuracy of the system's output in mixed noise environments, and avoids insufficient evidence due to over-filtering.


Figure CN121997918B_ABST

Abstract

A two-stage document filtering and robust fine-tuning method based on graph attention networks is proposed, relating to the fields of natural language processing and deep learning. It addresses the problems that existing retrieval-augmented generation systems struggle to accurately identify useful documents in mixed document environments and are susceptible to interference from counterfactual information. The method constructs a multi-class document pool comprising correct, counterfactual, noisy, and irrelevant documents, and builds a semantic graph at the paragraph level. A two-stage graph attention network sequentially filters out irrelevant and noisy documents to obtain a reference document set. Based on this reference document set, document discrimination training samples and question-answering training samples are constructed and combined into joint fine-tuning data for fine-tuning a large language model, so that the model can judge document reliability and maintain robust output in mixed document scenarios. This improves the factual accuracy and credibility of the retrieval-augmented generation system under counterfactual attacks and in noisy environments.

Description

Technical Field

[0001] This invention relates to the fields of natural language processing, information retrieval, and deep learning, specifically to a document filtering and model training method for improving the robustness of a retrieval-augmented generation (RAG) system, and more specifically to a two-stage document filtering and robust fine-tuning method based on graph attention networks.

Background Technology

[0002] Currently, RAG technology is widely used in fields such as intelligent question answering, text analysis, and knowledge services. However, as the scale and complexity of external knowledge bases continue to increase, the candidate documents retrieved by the system often contain a mixture of correct evidence, noisy fragments, topic-irrelevant content, and misleading counterfactual information. When this content is mixed into the input, it can easily cause the large language model to deviate from the true answer during reasoning, leading to factual errors and hallucinated answers that seriously affect the reliability of the system in real-world scenarios.

[0003] Existing document filtering methods largely rely on similarity retrieval, single-document scoring, or rule-based ranking strategies, which have limited ability to capture deep semantic relationships between queries and multiple documents. When the document set contains semantically similar but contradictory counterfactual information, traditional methods struggle to distinguish them, easily retaining misleading documents at higher positions. Furthermore, rules and thresholds are often sensitive to data distribution, easily affected by noise, and exhibit unstable performance. In addition, re-ranking methods based on large models are computationally expensive, require high-quality training data, and remain significantly vulnerable to counterfactual attacks.

[0004] Traditional retrieval-augmented generation (RAG) models still primarily rely on their own attention mechanisms to judge document quality when handling mixed document environments. However, large language models lack explicit perception of document credibility, making it difficult for them to make reliable choices when multiple documents contradict each other. When noisy or counterfactual documents account for a high proportion, the model is easily led by erroneous evidence and outputs unreliable answers. Therefore, to maintain stable performance in complex retrieval scenarios, a technical solution is needed that can model the overall document structure, effectively identify noisy documents, and simultaneously enhance the robustness of large language models under multi-document conditions.

[0005] To address the aforementioned issues, this invention proposes a two-stage document filtering and robust fine-tuning method based on graph attention networks.

Summary of the Invention

[0006] To address the problems that existing retrieval-augmented generation systems struggle to accurately identify useful documents in mixed document environments and are susceptible to interference from counterfactual information, this invention provides a two-stage document filtering and robust fine-tuning method based on graph attention networks.

[0007] A two-stage document filtering and robust fine-tuning method based on graph attention networks is proposed, which is implemented by the following steps:

[0008] Step 1: Construct multiple document pools and build a mixed document collection;

[0009] Step 2: Perform text preprocessing on the mixed document set to construct a paragraph-level semantic graph;

[0010] Step 3: Train an irrelevant document discrimination model and a noisy document discrimination model based on graph attention network to obtain a reference document set;

[0011] Step 4: Construct document discrimination training samples and question answering training samples based on the reference document set obtained in Step 3, and merge the document discrimination training samples and question answering training samples to obtain a joint robust fine-tuning dataset;

[0012] Step 5: Use the joint robust fine-tuning dataset obtained in Step 4 to perform joint instruction fine-tuning on the large language model, and integrate the fine-tuned large language model with the irrelevant document discrimination model and the noisy document discrimination model from Step 3 into the retrieval-augmented generation system. Input the filtered document set and the query text from the retrieval-augmented generation system into the robustly fine-tuned large language model to output the final answer.

[0013] The beneficial effects of this invention are:

[0014] The method described in this invention constructs a two-stage document filtering model based on graph attention networks, namely an irrelevant document discrimination model and a noisy document discrimination model. This model can effectively capture the semantic connections and contextual relationships between queries and multiple documents. It first eliminates irrelevant documents that are unrelated to the query, and then identifies and eliminates noisy documents that may mislead the direction of the answer, thereby reducing at the source the interference of noise and misleading information with the retrieval-augmented generation system. At the same time, this invention retains a certain number of counterfactual documents after filtering and assigns binary classification labels to them. It combines question-answering training with document reliability discrimination training to robustly fine-tune the large language model, so that the model can explicitly identify misleading documents while avoiding a shortage of usable evidence caused by over-filtering, thereby significantly improving the factuality and credibility of the system output under counterfactual attacks and mixed noise.

Attached Figure Description

[0015] Figure 1 is a flowchart illustrating the construction of multiple document pools in the two-stage document filtering and robust fine-tuning method based on graph attention networks described in this invention.

[0016] Figure 2 is a flowchart illustrating the two-stage document filtering and robust fine-tuning method based on a mixed document set in the present invention.

Detailed Implementation

[0017] With reference to Figure 1 and Figure 2, this embodiment describes a two-stage document filtering and robust fine-tuning method based on graph attention networks. The method constructs the candidate documents of a retrieval-augmented generation task, represents them as a graph, and filters them in stages. It establishes semantic relationships between the query text and multiple candidate documents at the paragraph level. After filtering, it constructs document discrimination training samples and question-answering training samples based on the retained documents, and performs joint robust fine-tuning on the large language model. This improves the accuracy and reliability of the retrieval-augmented generation system's responses even in the presence of irrelevant, noisy, and misleading documents. The method is specifically implemented through the following steps:

[0018] Step 1. Construct multiple document pools. The specific process is as follows:

[0019] Step 1.1. Obtain the raw data and the external knowledge base corpus for the retrieval-augmented generation task. The raw data shall include at least the query text, the standard answer, and the corpus documents used for retrieval.

[0020] Step 1.2. Perform vector encoding on the external knowledge base corpus to establish a similarity retrieval index;

[0021] In this implementation, each document in the knowledge base is encoded into a fixed-dimensional semantic vector using the sentence vector encoding model (e5-base-v2), and an index is built using a vector similarity retrieval library to support the retrieval of candidate documents for query text, providing a source of candidate documents for the subsequent construction of document pools of different categories.
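The indexing and retrieval described in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CosineIndex` class is a hypothetical brute-force stand-in for the unnamed vector similarity retrieval library, and the toy 3-dimensional vectors stand in for e5-base-v2 embeddings.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

class CosineIndex:
    """Hypothetical brute-force similarity index over pre-computed document vectors."""

    def __init__(self, doc_ids, doc_vectors):
        self.doc_ids = list(doc_ids)
        self.vectors = [normalize(v) for v in doc_vectors]

    def search(self, query_vector, k):
        """Return the k (doc_id, score) pairs with the highest cosine similarity."""
        q = normalize(query_vector)
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, vec in zip(self.doc_ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

# Toy vectors standing in for sentence embeddings (assumption for illustration).
index = CosineIndex(
    ["d1", "d2", "d3"],
    [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]],
)
top2 = index.search([1.0, 0.1, 0.0], k=2)
```

In practice an approximate nearest-neighbour library would replace the linear scan; the interface (vectors in, ranked document identifiers and scores out) is the part the later steps depend on.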

[0022] Step 1.3. Divide the query texts and their corresponding standard answers in the original data into a training set and a test set according to preset rules; for each query text in the training set and the test set, retrieve several documents using the similarity retrieval index built in the preceding step, then sample and organize these documents to form a correct document pool (pos), an irrelevant document pool (nsy), and a noisy document pool (neg) corresponding to each query text.

[0023] Step 1.4. Generate a counterfactual document pool (cf) based on the query text and the correct documents;

[0024] In this embodiment, semantic replacement, rewriting, or perturbation operations are performed on the correct documents using a large language model to obtain counterfactual documents that are misleading but superficially reasonable. The counterfactual document pool is then aligned with the correct, irrelevant, and noisy document pools from Step 1.3 by query text number to form the multiple document pools.

[0025] Step 2. Extract documents from the correct document pool, simulate the Top-k candidate document sequence returned by RAG during the retrieval phase, and construct a simulated attack document sequence (mixed document set) according to the preset attack type (counterfactual attack, noise attack, irrelevant attack, mixed attack) and replacement probability (attack strength) τ.

[0026] In this implementation, the top k1 correct documents for each query text are selected as the initial candidate documents, denoted as:

[0027] $D_q = \{d_1, d_2, \dots, d_{k_1}\}$

[0028] where $D_q$ is the set of initial candidate documents for query text q, and $d_i$ is the i-th document;

[0029] Then, based on the preset attack type and attack strength, replacement documents are selected from the corresponding document pool, and some of the initial candidate documents are replaced to obtain a mixed document set:

[0030] $\tilde{D}_q^{\tau} = \{\tilde{d}_1, \tilde{d}_2, \dots, \tilde{d}_{k_1}\}$

[0031] where $\tilde{D}_q^{\tau}$ is the mixed document set formed for query text q under attack strength τ, and $\tilde{d}_i$ is any document in the mixed document set. The replacement documents are drawn as follows: when the attack type is a counterfactual attack, the replacement document comes from the counterfactual document pool; when the attack type is an irrelevant attack, it comes from the irrelevant document pool; when the attack type is a noise attack, it comes from the noisy document pool; when the attack type is a mixed attack, it comes from multiple document pools.
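The construction of the mixed document set can be sketched as below. The sketch assumes that each initial candidate is replaced independently with probability τ, which is one plausible reading of "replacement probability"; the function name `build_mixed_set` and the pool keys are hypothetical.

```python
import random

# Which pools may supply replacement documents for each attack type.
POOL_FOR_ATTACK = {
    "counterfactual": ["cf"],
    "irrelevant": ["irrelevant"],
    "noise": ["noisy"],
    "mixed": ["cf", "irrelevant", "noisy"],
}

def build_mixed_set(correct_docs, pools, attack_type, tau, rng):
    """Replace each correct document with probability tau by a pool document.

    Returns (document, source_pool) pairs of the same length as correct_docs,
    so later steps can derive supervision labels from the source pool.
    """
    mixed = []
    for doc in correct_docs:
        if rng.random() < tau:
            pool_name = rng.choice(POOL_FOR_ATTACK[attack_type])
            mixed.append((rng.choice(pools[pool_name]), pool_name))
        else:
            mixed.append((doc, "correct"))
    return mixed

pools = {"cf": ["cf-1", "cf-2"], "irrelevant": ["irr-1"], "noisy": ["noi-1"]}
correct = ["pos-1", "pos-2", "pos-3", "pos-4"]
mixed = build_mixed_set(correct, pools, "mixed", tau=0.5, rng=random.Random(0))
untouched = build_mixed_set(correct, pools, "noise", tau=0.0, rng=random.Random(0))
all_noise = build_mixed_set(correct, pools, "noise", tau=1.0, rng=random.Random(0))
```

Keeping the source pool alongside each document is a design choice of this sketch: it lets the same structure serve as the label source for both filtering stages and for the later reliability labels.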

[0032] Step 3: Preprocess the mixed document set obtained in Step 2 by removing invalid characters and normalizing whitespace; obtain the semantic vector of each document using the sentence vector encoding model (e5-base-v2); at the same time, calculate the document attribute features, and concatenate the semantic vector and the attribute features to form the document node feature vector.

[0033] For any document $\tilde{d}_i$ in the mixed document set, its node feature vector is:

[0034] $x_i = [\, e(\tilde{d}_i) \,\|\, a(q, \tilde{d}_i) \,]$

[0035] where $e(\tilde{d}_i)$ is the semantic embedding of the document, $a(q, \tilde{d}_i)$ denotes the document attribute features between query text q and document $\tilde{d}_i$ (including the similarity between the query and the document, the document length, and the retrieval score), and $\|$ denotes concatenation; the node feature vector $x_i$ is used for the subsequent paragraph-level semantic graph construction.
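A minimal sketch of the node feature construction follows. The three attribute features match those named in the text; measuring length as the log of the word count is an assumption made here for scale, and `node_features` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def node_features(query_vec, doc_vec, doc_text, retrieval_score):
    """Concatenate the semantic vector with the document attribute features."""
    attributes = [
        cosine(query_vec, doc_vec),            # query-document similarity
        math.log1p(len(doc_text.split())),     # document length (log word count, assumed)
        retrieval_score,                       # score returned by the retriever
    ]
    return list(doc_vec) + attributes

x = node_features(
    [1.0, 0.0], [0.6, 0.8], "graph attention networks filter documents", 0.42
)
```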

[0036] Step 4: Construct a paragraph-level semantic graph based on the node feature vectors obtained in Step 3;

[0037] Using the query text as the query node and the documents in the mixed document set as document nodes, construct the paragraph-level semantic graph corresponding to the query text:

[0038] $G_q = (V, E)$

[0039] where $V$ is the set of nodes and $E$ is the set of edges. The node set $V$ includes the query node and the document nodes; document node $v_i$ corresponds one-to-one with a document in the mixed document set, and its node feature vector is $x_i$. Based on the similarity between document semantic vectors, k2 nearest-neighbor nodes are selected for each document node to establish document-to-document edges, and the query node is connected to the document nodes with the highest similarity to establish query-to-document edges; when two document nodes are neighbors, reciprocal bidirectional edges are established, thus obtaining a paragraph-level semantic graph for graph attention network processing.
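The graph construction can be sketched as an edge list. Several details are assumptions of this sketch rather than statements of the patent: node 0 is the query node, documents are nodes 1..n, edges are `(source, target)` pairs, query edges point from the query to documents only, and the parameter `k_query` (how many documents the query connects to) is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def build_semantic_graph(doc_vectors, query_vector, k2, k_query):
    """Return a sorted list of directed edges for the query-document graph."""
    n = len(doc_vectors)
    edges = set()
    for i in range(n):
        sims = [(cosine(doc_vectors[i], doc_vectors[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        for _, j in sims[:k2]:
            # Neighbouring documents get reciprocal bidirectional edges.
            edges.add((i + 1, j + 1))
            edges.add((j + 1, i + 1))
    q_sims = sorted(
        ((cosine(query_vector, vec), i) for i, vec in enumerate(doc_vectors)),
        reverse=True,
    )
    for _, i in q_sims[:k_query]:
        edges.add((0, i + 1))  # query-to-document edge
    return sorted(edges)

doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
graph_edges = build_semantic_graph(doc_vectors, [1.0, 0.0], k2=1, k_query=2)
```

With these toy vectors the third document is only weakly tied to the rest of the graph, which is the kind of structural signal an attention layer can use when scoring it.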

[0040] Step 5: Train an irrelevant document discrimination model based on graph attention network (first-stage filtering);

[0041] The paragraph-level semantic graph $G_q$ constructed in Step 4 is input to the first-stage graph attention network, which scores each document node to obtain the predicted probability that the node belongs to an irrelevant document. During training, irrelevant document nodes are used as positive supervision signals and the remaining document nodes as negative supervision signals. Binary cross-entropy loss is used to optimize the model parameters, and the irrelevant document determination threshold $t_A$ is determined according to a preset accuracy target.
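To make the scoring and loss concrete, the following is a minimal forward-pass sketch of a single-head graph attention layer followed by a sigmoid scorer and binary cross-entropy. It follows the standard graph attention formulation; the patent does not specify layer counts, heads, or dimensions, so all parameter values here are illustrative assumptions, and parameter updates (which an automatic differentiation framework would provide) are omitted.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gat_layer(features, edges, W, a_src, a_dst):
    """Single-head graph attention layer (forward pass only).

    edges are (source, target) pairs: the target aggregates from the source.
    A self-loop is added so every node also attends to itself.
    """
    h = [matvec(W, x) for x in features]
    neighbours = {i: [i] for i in range(len(features))}
    for src, dst in edges:
        if src not in neighbours[dst]:
            neighbours[dst].append(src)
    out = []
    for i, nbrs in neighbours.items():
        logits = [leaky_relu(dot(a_dst, h[i]) + dot(a_src, h[j])) for j in nbrs]
        peak = max(logits)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([
            sum(w / total * h[j][d] for w, j in zip(weights, nbrs))
            for d in range(len(h[i]))
        ])
    return out

def bce_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 marks an irrelevant document node."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for p, y in zip(probs, labels)
    ) / len(labels)

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # node 0 = query, nodes 1-2 = documents
edges = [(0, 1), (0, 2), (1, 2), (2, 1)]
identity = [[1.0, 0.0], [0.0, 1.0]]
# Zero attention vectors give uniform attention over each neighbourhood.
hidden = gat_layer(features, edges, identity, a_src=[0.0, 0.0], a_dst=[0.0, 0.0])
w_out, bias = [1.0, -1.0], 0.0
probs = [sigmoid(dot(w_out, h) + bias) for h in hidden[1:]]  # document nodes only
loss = bce_loss(probs, labels=[1, 0])
```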

[0042] In this embodiment, the irrelevant document discrimination model is used to score and filter each document node in the paragraph-level semantic graph;

[0043] For any document node $v_i$, the predicted probability that it belongs to an irrelevant document is denoted $p_i^{A}$. Each document node is then labeled: if the predicted probability of a document node is greater than or equal to the irrelevant document determination threshold $t_A$, the node is marked 0; the remaining document nodes are marked 1. After all document nodes are labeled, the document nodes labeled 0 and their corresponding edges are deleted, and the document nodes labeled 1 are kept. The document set obtained after the first stage of filtering is denoted as:

[0044] $D_q^{A} = \{\, \tilde{d}_i \mid p_i^{A} < t_A \,\}$

[0045] where $D_q^{A}$ is the set of documents retained after the first stage of filtering, $\tilde{d}_i$ ranges over the documents of the mixed document set, and $p_i^{A}$ and $t_A$ are the predicted probability and threshold defined above.
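The labeling and deletion rule can be sketched directly. The function name `filter_nodes` and the dictionary of probabilities are hypothetical; the rule itself (probability at or above the threshold means label 0 and removal, together with incident edges) is taken from the text.

```python
def filter_nodes(doc_nodes, edges, probs, threshold):
    """Drop document nodes whose probability reaches the threshold, plus their edges.

    doc_nodes: list of node ids; edges: list of (source, target) pairs;
    probs: dict mapping node id to predicted probability.
    """
    labels = {n: 0 if probs[n] >= threshold else 1 for n in doc_nodes}
    removed = {n for n in doc_nodes if labels[n] == 0}
    kept_nodes = [n for n in doc_nodes if labels[n] == 1]
    kept_edges = [(s, t) for s, t in edges if s not in removed and t not in removed]
    return kept_nodes, kept_edges

doc_nodes = [1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (2, 1), (2, 3), (3, 2)]
kept_nodes, kept_edges = filter_nodes(
    doc_nodes, edges, probs={1: 0.1, 2: 0.2, 3: 0.9}, threshold=0.5
)
```

The same helper applies unchanged to the second stage, with the noisy-document probabilities and its own threshold.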

[0046] Step 6: Train a noisy document discrimination model based on graph attention network (second-stage filtering);

[0047] The document nodes in the document set retained after the first stage of filtering are scored by the second-stage graph attention network to obtain the predicted probability that each document node belongs to a noisy document. During training, noisy documents are used as supervision signals for binary classification training, enabling the model to identify noisy documents that may mislead the answer direction or introduce bias. The noisy document determination threshold $t_B$ is determined according to a preset accuracy target.

[0048] In this embodiment, the noisy document discrimination model is used to score and filter each document node in the document set retained after the first stage of filtering. For any document node, the predicted probability that it belongs to a noisy document is denoted $p_i^{B}$. Each document node is then labeled: if the predicted probability is greater than or equal to the noisy document determination threshold $t_B$, the node is marked 0; the remaining document nodes are marked 1. After all document nodes are labeled, the document nodes labeled 0 and their corresponding edges are deleted, and the document nodes labeled 1 are kept. The document set obtained after the second stage of filtering is denoted as:

[0049] $D_q^{B} = \{\, \tilde{d}_i \in D_q^{A} \mid p_i^{B} < t_B \,\}$

[0050] where $D_q^{A}$ is the document set retained after the first stage of filtering and $D_q^{B}$ is the document set after the second stage of filtering (the reference document set). The reference document set mainly contains correct documents and counterfactual documents.
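Both stages fix their thresholds "according to a preset accuracy target", but the text does not say how. One plausible procedure, sketched here purely as an assumption, is to pick on held-out data the smallest threshold at which the precision of the flagged documents meets the target, which keeps as much filtering as possible while bounding wrongful removals. The name `choose_threshold` is hypothetical.

```python
def choose_threshold(probs, labels, target_precision):
    """Return the smallest candidate threshold whose precision meets the target.

    labels use 1 for documents that should be filtered out. Candidate
    thresholds are the predicted probabilities themselves. Returns None
    when no candidate reaches the target.
    """
    for t in sorted(set(probs)):
        flagged = [y for p, y in zip(probs, labels) if p >= t]
        precision = sum(flagged) / len(flagged)  # flagged is never empty here
        if precision >= target_precision:
            return t
    return None

val_probs = [0.1, 0.4, 0.6, 0.8, 0.9]
val_labels = [0, 0, 1, 0, 1]
t_relaxed = choose_threshold(val_probs, val_labels, target_precision=0.6)
t_strict = choose_threshold(val_probs, val_labels, target_precision=1.0)
```

A stricter target pushes the threshold up, so fewer documents are removed; this is one way to guard against the over-filtering the method aims to avoid.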

[0051] Step 7: Construct document discrimination training samples based on the reference document set obtained in Step 6. Generate a binary classification reliability label for each document in the reference document set: when the document comes from the correct document pool, it is labeled "Evaluation 1: Helpful"; when the document does not come from the correct document pool, it is labeled "Evaluation 2: Contains error information and is not helpful". The label of each document is output in the format "Doc i: Evaluation j" to form the document discrimination training samples.
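A sketch of how one such training sample might be assembled is shown below. The label strings and the "Doc i: Evaluation j" output format come from the text; the dictionary layout, the instruction wording, and the name `build_discrimination_sample` are assumptions for illustration.

```python
EVAL_MEANINGS = {
    1: "Evaluation 1: Helpful",
    2: "Evaluation 2: Contains error information and is not helpful",
}

def build_discrimination_sample(query, reference_docs):
    """Build one sample from (text, source_pool) pairs of the reference document set."""
    doc_lines = [f"Doc {i}: {text}" for i, (text, _) in enumerate(reference_docs, 1)]
    label_lines = [
        f"Doc {i}: Evaluation {1 if pool == 'correct' else 2}"
        for i, (_, pool) in enumerate(reference_docs, 1)
    ]
    instruction = (
        "Judge each document. "
        + EVAL_MEANINGS[1] + ". " + EVAL_MEANINGS[2] + "."
    )
    return {
        "task": "discrimination",
        "instruction": instruction,
        "input": f"Question: {query}\n" + "\n".join(doc_lines),
        "output": "\n".join(label_lines),
    }

sample = build_discrimination_sample(
    "What is the capital of France?",
    [("Paris is the capital of France.", "correct"),
     ("Lyon is the capital of France.", "cf")],
)
```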

[0052] Step 8: Construct question-answering training samples based on the mixed reference document set obtained in Step 6;

[0053] Using the query text and the aforementioned mixed reference document set as input, and the standard answer as output, a question-answering training sample is formed for training a large language model to answer questions based on multiple documents.

[0054] Step 9: Merge the question-answering training samples and the document discrimination training samples to form a joint robust fine-tuning dataset, and randomly adjust the order of the joint robust fine-tuning dataset to improve training stability.
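The question-answering samples and the merge-and-shuffle step can be sketched together. The sample layout and the names `build_qa_sample` and `build_joint_dataset` are hypothetical; a fixed seed is used here only so the shuffle is reproducible.

```python
import random

def build_qa_sample(query, reference_docs, answer):
    """Query plus reference documents as input, standard answer as output."""
    docs = "\n".join(f"Doc {i}: {text}" for i, text in enumerate(reference_docs, 1))
    return {"task": "qa", "input": f"Question: {query}\n{docs}", "output": answer}

def build_joint_dataset(qa_samples, discrimination_samples, seed=0):
    """Merge both sample types and shuffle them into one fine-tuning dataset."""
    joint = list(qa_samples) + list(discrimination_samples)
    random.Random(seed).shuffle(joint)
    return joint

qa = [
    build_qa_sample(
        "What is the capital of France?",
        ["Paris is the capital of France.", "Lyon is the capital of France."],
        "Paris",
    )
]
disc = [{
    "task": "discrimination",
    "input": "Question: What is the capital of France?\n"
             "Doc 1: Paris is the capital of France.\n"
             "Doc 2: Lyon is the capital of France.",
    "output": "Doc 1: Evaluation 1\nDoc 2: Evaluation 2",
}]
joint = build_joint_dataset(qa * 3, disc * 3, seed=42)
```

Interleaving the two tasks, rather than training on one block after the other, is what exposes the model to both abilities throughout fine-tuning.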

[0055] Step 10: Use the joint robust fine-tuning dataset obtained in Step 9 to perform joint instruction fine-tuning on the large language model. Through instruction fine-tuning, the large language model learns two types of abilities simultaneously: "answering questions based on documents" and "judging whether documents are reliable". In this way, in a mixed document environment, it can prioritize relying on credible documents to generate answers and remain robust to the misleading information generated by counterfactual documents.

[0056] Step 11: Integrate the robustly fine-tuned large language model, the irrelevant document discrimination model, and the noisy document discrimination model into the retrieval-augmented generation system. After receiving a new query, the system first retrieves a document set based on the similarity between the query text vector and the knowledge base document vectors, then sequentially performs irrelevant document filtering and noisy document filtering, and finally inputs the filtered document set and the query text into the robustly fine-tuned large language model to output the final answer, also outputting the binary classification evaluation results of the reference documents when necessary.
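The inference flow can be sketched with the three components passed in as callables. Everything below the two pipeline functions is a toy stand-in: the keyword-based scorers replace the two graph attention models and the `generate` lambda replaces the fine-tuned large language model, purely so the control flow can be run end to end.

```python
def filter_stage(query, docs, score_fn, threshold):
    """Keep documents whose predicted probability stays below the threshold."""
    probs = score_fn(query, docs)
    return [doc for doc, p in zip(docs, probs) if p < threshold]

def answer_query(query, retrieve, score_irrelevant, score_noisy, generate,
                 t_a=0.5, t_b=0.5, top_k=5):
    """Retrieve, filter irrelevant documents, filter noisy documents, then generate."""
    candidates = retrieve(query, top_k)
    stage1 = filter_stage(query, candidates, score_irrelevant, t_a)
    reference = filter_stage(query, stage1, score_noisy, t_b)
    return generate(query, reference), reference

# Toy stand-ins for the retriever, the two discrimination models, and the LLM.
corpus = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "Some say the capital might be Lyon, unclear.",
]
retrieve = lambda query, k: corpus[:k]
score_irrelevant = lambda query, docs: [0.9 if "Bananas" in d else 0.1 for d in docs]
score_noisy = lambda query, docs: [0.8 if "unclear" in d else 0.2 for d in docs]
generate = lambda query, docs: docs[0] if docs else "insufficient evidence"

answer, reference = answer_query(
    "What is the capital of France?",
    retrieve, score_irrelevant, score_noisy, generate,
)
```

Scoring a whole document list per call, rather than one document at a time, mirrors the fact that the discrimination models operate on a graph over all candidates for a query.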

[0057] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0058] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent should be determined by the appended claims.

Claims

1. A two-stage document filtering and robust fine-tuning method based on graph attention networks, characterized in that the method is implemented by the following steps:
Step 1: Construct multiple document pools and build a mixed document set;
Step 2: Perform text preprocessing on the mixed document set to construct a paragraph-level semantic graph;
Step 3: Train an irrelevant document discrimination model and a noisy document discrimination model based on a graph attention network to obtain a reference document set; the specific process is as follows:
The irrelevant document discrimination model is used to filter the document nodes in the paragraph-level semantic graph, specifically as follows: during training, irrelevant document nodes are used as positive supervision signals, and the remaining document nodes are used as negative supervision signals. Binary cross-entropy loss is used to optimize the model parameters, and the irrelevant document determination threshold is determined according to a preset accuracy target. If the predicted probability of a document node is greater than or equal to the irrelevant document determination threshold, the document node is marked as 0, and the remaining document nodes are marked as 1. After all document nodes are labeled, the document nodes with a label of 0 and their corresponding associated edges are deleted, and the document nodes with a label of 1 are retained to obtain the document set after the first stage of filtering.
The noisy document discrimination model is used to filter the document nodes in the document set after the first stage of filtering, specifically as follows: during training, noisy documents are used as supervision signals for binary classification training, so that the noisy document discrimination model can identify noisy documents that may mislead the answer direction or introduce bias, and the noisy document determination threshold is determined according to a preset accuracy target. For any document node, if the predicted probability of the document node is greater than or equal to the noisy document determination threshold, it is marked as 0, and the remaining document nodes are marked as 1. After all document nodes are labeled, the document nodes with a label of 0 and their corresponding edges are deleted, and the document nodes with a label of 1 are retained to obtain the document set after the second stage of filtering, which is used as the reference document set.
Step 4: Construct document discrimination training samples and question-answering training samples based on the reference document set obtained in Step 3, and merge the document discrimination training samples and the question-answering training samples to obtain a joint robust fine-tuning dataset; the specific process is as follows:
For each document in the reference document set, a binary classification label is generated. When the document comes from the correct document pool, it is labeled as evaluation 1; when the document comes from another document pool, it is labeled as evaluation 2, thus forming a document discrimination training sample.
Using the query text and the reference document set as input, and the standard answer as output, a question-answering training sample is formed for training a large language model to answer questions based on multiple documents;
Step 5: Use the joint robust fine-tuning dataset obtained in Step 4 to perform joint instruction fine-tuning on the large language model, and integrate the fine-tuned large language model with the irrelevant document discrimination model and the noisy document discrimination model from Step 3 into the retrieval-augmented generation system. Input the filtered document set and the query text from the retrieval-augmented generation system into the robustly fine-tuned large language model to output the final answer.

2. The two-stage document filtering and robust fine-tuning method based on graph attention networks according to claim 1, characterized in that: In Step 1, the specific process of constructing multiple document pools is as follows: Step 1.1: Obtain the original data of the retrieval-augmented generation system and the external knowledge base corpus, and perform vector encoding on the external knowledge base corpus to establish a similarity retrieval index; Step 1.2: Divide the query texts and their corresponding standard answers in the original data into a training set and a test set according to preset rules. For each query text in the training set and the test set, retrieve several documents according to the index and form a correct document pool, an irrelevant document pool, and a noisy document pool corresponding to each query text; Step 1.3: Generate a counterfactual document pool based on the query text and the correct documents in the correct document pool; align the counterfactual document pool with the correct document pool, the irrelevant document pool, and the noisy document pool according to the query number to obtain multiple types of document pools.

3. The two-stage document filtering and robust fine-tuning method based on graph attention networks according to claim 2, characterized in that: In Step 1.3, semantic substitution, rewriting, or perturbation operations are performed on the correct documents in the correct document pool using a large language model to obtain a counterfactual document pool that is misleading but superficially reasonable.

4. The two-stage document filtering and robust fine-tuning method based on graph attention networks according to claim 3, characterized in that: In Step 1, documents are extracted from the correct document pool to simulate the candidate document sequence returned by the retrieval-augmented generation system, and a mixed document set is constructed according to the set attack type and replacement probability.

5. The two-stage document filtering and robust fine-tuning method based on graph attention networks according to claim 1, characterized in that: In step two, the semantic vector of each document is obtained using a sentence vector encoding model, and the document attribute features are calculated. The semantic vector and the document attribute features are concatenated to form a document node feature vector. The query text is used as the query node, and the documents in the mixed document set are used as document nodes to construct a paragraph-level semantic graph corresponding to the query text.

6. The two-stage document filtering and robust fine-tuning method based on graph attention networks according to claim 1, characterized in that: In Step 5, after receiving a new query, the retrieval-augmented generation system first retrieves a document set based on the similarity between the query text vector and the document vectors in the knowledge base, then sequentially performs irrelevant document filtering and noisy document filtering, and finally inputs the filtered document set and the query text into the robustly fine-tuned large language model to output the final answer, and simultaneously outputs the binary classification evaluation results of the reference documents.