Multi-level quality evaluation method and system for retrieval augmented generation based on large model

By using parallel heterogeneous model evaluation and rule-guided self-questioning mechanism, the problems of insufficient accuracy and single evaluation method in RAG system evaluation are solved. It realizes the accuracy, reliability and self-correction of multi-level quality evaluation, provides direct optimization guidance, and forms an evaluation-optimization closed loop.

CN122196138A · Pending · Publication Date: 2026-06-12 · ZHONGAN WANGMAI (BEIJING) TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHONGAN WANGMAI (BEIJING) TECH CO LTD
Filing Date
2026-04-01
Publication Date
2026-06-12

Abstract

The application discloses a multi-level quality evaluation method and system for a large-model-based retrieval-augmented generation (RAG) system. The method prepares a structured test set and calls heterogeneous evaluator and supervisor large language models in parallel to perform independent multi-dimensional scoring. The standard deviation between the two sets of scores is then calculated to quantify their disagreement, and comparison with a preset threshold decides whether to trigger the subsequent process. When the disagreement is significant, rule-guided self-questioning is started: a questioner model performs a deep analysis to calibrate the score and generate targeted system optimization suggestions. Finally, a comprehensive evaluation report is output. Correspondingly, the system includes data preprocessing, parallel evaluation execution, disagreement analysis and decision, self-questioning and calibration, and comprehensive report generation modules, which cooperate in sequence. Through this multi-level closed-loop mechanism, the application realizes self-verification and continuous improvement of the accuracy and reliability of RAG system evaluation, and provides direct and actionable optimization guidance.

Description

Technical Field

[0001] This invention relates to a quality assessment method and system, and more particularly to a multi-level quality assessment method and system for a large-model-based retrieval-augmented generation (RAG) system.

Background Technology

[0002] In recent years, Large Language Models (LLMs) have demonstrated powerful generative capabilities in the field of natural language processing, but their inherent problems, such as insufficient knowledge timeliness, have become increasingly prominent. Retrieval-augmented generation (RAG) technology, by combining external knowledge bases with LLMs, partially solves these problems. It locates relevant information from massive amounts of data through a retrieval module and then passes the retrieved content to the generation module, significantly improving the accuracy and credibility of the responses. However, the quality assessment of existing RAG systems still faces challenges such as limited assessment dimensions, insufficient accuracy, and simplistic evaluation methods.

Summary of the Invention

[0003] To address the shortcomings of existing technologies, this invention discloses a multi-level quality assessment method for a large-model-based retrieval-augmented generation (RAG) system, characterized by the following steps:

[0004] S1. Test Data Preparation and Formatting: Obtain the output of the RAG system under test, construct a structured test case set, where each test case is organized in the form of a four-tuple, including: user question, answer generated by the RAG system for the question, preset standard answer, and context text retrieved and used by the RAG system when generating the answer;

[0005] S2. Parallel Heterogeneous Model Evaluation: The test case set is input into the evaluation framework, which simultaneously calls the first type of large language model as the evaluator and the second type of large language model as the supervisor to independently evaluate the same test case; by sending structured evaluation prompts to the two models, they are guided to score the generated answer according to multiple preset quality dimensions and output reasons;

[0006] S3. Disagreement Quantification Detection and Decision: For each test case, calculate a quantified value of the dispersion between the evaluator and supervisor scores in each quality dimension; compare the quantified value with a preset decision threshold, and if it exceeds the threshold, determine that there is a significant disagreement and trigger a self-questioning process;

[0007] S4. Rule-guided self-questioning and calibration: When the self-questioning process is triggered, the questioner large language model is invoked and given enhanced prompts that incorporate targeted review rules; based on the rules, the questioner model performs in-depth analysis of the dimensions with disagreements and outputs calibrated scores and optimization suggestions for specific modules of the RAG system;

[0008] S5. Evaluation Results Output: Combining the scores from step S2, the calibration results from step S4, and optimization suggestions, a readable evaluation report is generated that includes the final quality score and directions for system improvement.

[0009] The present invention also discloses an evaluation system for implementing the above method, characterized in that it comprises:

[0010] The data preprocessing and input module is used to perform step S1, receive and format test data, and output a structured test case set;

[0011] The parallel evaluation execution module is connected to the data preprocessing and input module and is used to load the first and second type large language models, and to construct and send structured evaluation prompts to perform the parallel evaluation in step S2.

[0012] The disagreement analysis and decision module is connected to the parallel evaluation execution module and has a built-in calculation unit for calculating the standard deviation and a comparator for threshold comparison to perform the disagreement quantification detection and decision in step S3.

[0013] The self-questioning and calibration module is connected to the disagreement analysis and decision module. After receiving a trigger signal, it calls the questioner model and generates enhanced prompts based on the rule knowledge base to perform the rule-guided analysis and calibration in step S4.

[0014] The integrated report generation module is used to receive and integrate the outputs of the parallel evaluation execution module and the self-questioning and calibration module, execute step S5, and generate the readable evaluation report.

Beneficial effects

[0015] 1. Significantly improved accuracy and reliability of assessments

[0016] By employing a combination of parallel heterogeneous model evaluation and standard deviation-based quantitative discrepancy detection, this approach ensures evaluation quality at both the source and process levels. The cross-validation mechanism of heterogeneous models effectively mitigates the inherent bias and "hallucination" risk of a single model, while statistically based discrepancy quantification transforms subjective evaluation differences into objective, measurable confidence indicators. This combined approach ensures that evaluation results no longer rely on a single, unreliable judgment, but are based on internal consistency checks, thereby significantly improving the accuracy and authority of the final scores and conclusions.

[0017] 2. An interpretable and self-correcting evaluation process

[0018] By introducing a rule-guided self-questioning and calibration mechanism, this solution endows the assessment system with "metacognitive" capabilities. The system no longer simply outputs scores mechanically, but can automatically invoke the rule engine for in-depth root cause analysis when internal disputes are detected. This technique not only provides credible arbitration and calibration for disputed scores, but more importantly, it makes the assessment process transparent and explainable, allowing users to understand the logic and basis behind the scores. This enables the assessment system to self-verify and self-correct, representing a significant breakthrough from traditional black-box assessments.

[0019] 3. Provide direct and actionable optimization guidance to form a closed-loop technology system.

[0020] The most practical benefit of this solution lies in its inherent coupling of quality assessment and system optimization. The structured test suite ensures comprehensive coverage of the RAG (retrieval-augmented generation) pipeline; multi-dimensional scoring directly pinpoints weaknesses; and the rule engine and root cause analysis in the self-questioning module directly map diagnostic conclusions to specific optimization suggestions for the retrieval or generation modules (such as adjusting retrieval parameters or modifying generation prompts). This transforms the assessment output from an isolated performance report into a detailed "diagnosis" and "prescription," providing developers with clear and direct action guidelines. This shifts the assessment activity from a cost center into an optimization engine driving rapid system iteration.

[0021] 4. Standardization, automation, and high efficiency of the evaluation process

[0022] By defining test data in a structured manner and designing a modular system pipeline, this solution achieves a high degree of standardization and automation throughout the entire evaluation process. From data preparation, parallel evaluation, disagreement decision-making to report generation, the entire process can be completed without human intervention. This not only significantly reduces the human and time costs of large-scale, routine quality evaluations, but also ensures the consistency of evaluation standards, making evaluation results from different periods and systems comparable, and providing strong support for team collaboration and continuous integration / continuous deployment (CI / CD).

Attached Figure Description

[0023] Figure 1 is a block diagram of the system structure of the present invention.

[0024] Figure 2 is a schematic diagram of the method flow of the present invention.

Detailed Implementation

[0025] Example 1: A multi-level quality assessment method for a large-model-based retrieval-augmented generation (RAG) system, comprising the following steps:

[0026] (1) Test data preparation and formatting: Obtain the output data of the RAG system under test and organize it into a set of structured test cases. Each test case shall contain at least the user question, the answer generated by the RAG system for the question, the preset standard answer, and the context text retrieved and used by the RAG system when generating the answer.

[0027] To achieve automated, multi-level quality assessment of large-model-based retrieval-augmented generation (RAG) systems, the first and crucial step is to establish a standardized assessment data foundation. The core of this process lies in transforming the complex, unstructured interactive information generated during the actual operation of the RAG system into a machine-readable format that the assessment framework can process accurately and efficiently and that contains everything the assessment logic needs.

[0028] Specifically, the evaluation begins with constructing a well-defined set of structured test cases. Each test case is not a simple question-answer pair, but is rigorously organized into a four-tuple data structure containing four indispensable elements: the user-posed "question," the "RAG system-generated answer" to be evaluated, the "preset standard answer" as a factual benchmark, and the "context text" actually retrieved and used by the RAG system during the generation process. This design stems from a deep understanding of how RAG works: system quality depends not only on the correctness of the final answer, but also on the quality of the retrieved information (context) and how this information is used (fidelity). Therefore, binding the question, generated result, standard answer, and used context into an atomic unit ensures that each evaluation is conducted in a closed environment with complete information and causal traceability, laying the foundation for subsequent multi-dimensional and interpretable evaluations.
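As an illustration of the four-tuple described above, the following is a minimal sketch of one possible in-memory representation. The class name `RagTestCase`, the field names `reference_answer` and `context`, and the `validate` helper are assumptions made for this example; the source only requires that the four elements be bound together in one unit.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RagTestCase:
    """One evaluation unit: question, generated answer, standard answer, context."""
    question: str            # user question
    generated_answer: str    # answer produced by the RAG system under test
    reference_answer: str    # preset standard answer (field name is an assumption)
    context: str             # text retrieved and used during generation

    def validate(self) -> None:
        # Reject a case if any of the four elements is missing or blank.
        for name, value in asdict(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"missing field: {name}")


case = RagTestCase(
    question="What is the capital of France?",
    generated_answer="Paris is the capital of France.",
    reference_answer="Paris",
    context="France's capital city is Paris.",
)
case.validate()  # passes silently because all four fields are present
```

Making the dataclass frozen keeps each case an immutable atomic unit, which matches the traceability goal stated above.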

[0029] To ensure the physical storage and convenient processing of this structured data, this method preferentially employs widely supported common data exchange formats, such as spreadsheet files (.xlsx / .csv) or JSON files. The choice of these formats is not accidental, but based on their specific engineering advantages. For example, in a spreadsheet, each row can be intuitively mapped to a test case, and the four columns correspond to the four fields of the four-tuple. This two-dimensional table structure of row-test case and column-attribute is very convenient for manual review, batch editing, and programmatic reading. In JSON format, the test set can be represented as an array of objects, with each object encapsulating a test case through clear key-value pairs (such as `"question"`, `"generated_answer"`, etc.). This nested structure is particularly adept at handling fields containing multi-paragraph text or complex structures, providing greater flexibility and expressiveness.

[0030] In practice, the evaluation framework loads these files by calling common file reading interfaces (e.g., the `read_excel` function from the `pandas` library or the `json.load` function from the standard library in a Python environment). This step acts as a highly efficient "data converter," losslessly parsing and transforming the static sequence of symbols stored in the files into data objects (such as DataFrames or lists of dictionaries) that can be directly manipulated by the program in memory. This process is silent yet crucial, enabling a smooth transition of evaluation data from persistent storage to high-speed computation, providing a unified, clean, and well-structured data input stream for all subsequent modules.
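The loading step can be sketched as follows for the JSON variant, using only `json.load` from the standard library. The key names `reference_answer` and `context`, the helper `load_test_cases`, and the file name `test_set.json` are hypothetical; only `question` and `generated_answer` appear as key names in the text. A spreadsheet-based variant would follow the same pattern with a table reader.

```python
import json
import tempfile
from pathlib import Path

# The last two key names are assumptions; the text names only the first two.
REQUIRED_KEYS = ("question", "generated_answer", "reference_answer", "context")


def load_test_cases(path):
    """Load a JSON array of test-case objects and check the four required keys."""
    with open(path, encoding="utf-8") as fh:
        cases = json.load(fh)
    if not isinstance(cases, list):
        raise ValueError("test set must be a JSON array of objects")
    for index, case in enumerate(cases):
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            raise ValueError(f"case {index} is missing keys: {missing}")
    return cases


# Round-trip demonstration with a temporary file.
sample = [{
    "question": "Who wrote Hamlet?",
    "generated_answer": "William Shakespeare wrote Hamlet.",
    "reference_answer": "William Shakespeare",
    "context": "Hamlet is a tragedy by William Shakespeare.",
}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test_set.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    loaded = load_test_cases(path)
```

Validating the keys at load time means a malformed test set fails before any model call is made.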

[0031] Adopting simplistic approaches, such as scattering test cases across multiple text files or embedding code comments in an unstructured format, introduces a series of serious practical obstacles. First, data parsing becomes extremely fragile and complex, requiring specific, error-prone extraction logic for each messy format, severely compromising the robustness of the evaluation process. Second, maintaining, expanding, and collaborating on test suites becomes exceptionally difficult; even minor changes to data structures can trigger extensive code adjustments. More importantly, this chaos hinders the standardization and scaling of evaluations, making it virtually impossible to accumulate reusable benchmark suites.

[0032] In contrast, the standardized data preparation scheme based on a universal format adopted in this invention essentially defines a universal "test interface" specification for RAG system quality assessment. Any RAG system, as long as it outputs its test data according to this four-tuple format, can be directly integrated into this framework for evaluation without modifying its core evaluation logic, achieving true "plug and play" and cross-system comparability of evaluation results. Simultaneously, it strictly adheres to the software engineering principle of separating data from logic, making the management of massive test cases, version control, and seamless integration with continuous integration / continuous deployment (CI / CD) pipelines effortless. This standardization and engineering approach, injected from the very beginning of the evaluation, is not only a prerequisite for subsequent automated parallel evaluation (step 2) and intelligent divergence analysis (steps 3 and 4), but also enhances the overall systematicity, maintainability, and industry practicality of the quality assessment method, constituting a significant manifestation of the invention's high level of inventiveness. At this point, a high-quality evaluation data foundation is ready and can be stably delivered to the next stage of the parallel evaluation engine.

[0033] (2) Parallel heterogeneous model evaluation: The structured test case set is input into the evaluation framework, which simultaneously calls the evaluator large language model and the supervisor large language model to independently evaluate the same test case; wherein, by sending structured evaluation prompts to the two models, they are guided to score the generated answer and output the scoring reasons according to multiple preset quality dimensions; the evaluator model and the supervisor model are heterogeneous models with different ability tendencies.

[0034] This step, taking advantage of the formatted test data, initiates the core phase of the evaluation process—parallel heterogeneous model evaluation. This step aims to conduct multi-perspective, cross-validation quality diagnosis of the RAG system's output through an innovative "dual-judge" mechanism. Its key lies in eliminating the inherent biases that may exist in a single model and leveraging the differences in capabilities of heterogeneous large language models to approximate a more reliable and comprehensive evaluation.

[0035] In practical implementation, after loading the test case set, the evaluation framework simultaneously initiates two independent evaluation pipelines for each test case to be evaluated. Each pipeline calls a specific large language model role: one is the evaluator model, and the other is the supervisor model. A key design of this method is that these two models are not products of the same architecture or fine-tuning of the same source, but rather intentionally selected heterogeneous models, aiming to complement each other in terms of capabilities. For example, in a specific embodiment, the evaluator role can be played by a general instruction-following model (such as Qwen2.5-32B-Instruct), which typically performs well on a wide range of dialogue and semantic understanding tasks, excelling at grasping the overall fluency, relevance, and completeness of answers. The supervisor role, on the other hand, can be a model specifically trained on large-scale code and logical reasoning data (such as Qwen2.5-coder-32B-Instruct). Such models often exhibit stronger rule-following capabilities, rigorous fact-checking, and logical chain analysis capabilities, acting like a strict technical auditor.
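The two simultaneous pipelines can be sketched with a thread pool, as one possible arrangement. The functions `call_evaluator` and `call_supervisor` are stubs standing in for the two model services and return fixed scores; their names, return shape, and the example scores are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def call_evaluator(case):
    # Stub for the evaluator model service (hypothetical); returns fixed scores.
    return {"role": "evaluator", "scores": {"answer_fidelity": 4.0}}


def call_supervisor(case):
    # Stub for the supervisor model service (hypothetical); returns fixed scores.
    return {"role": "supervisor", "scores": {"answer_fidelity": 2.0}}


def evaluate_in_parallel(case):
    """Submit the same test case to both judges concurrently and wait for both."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        evaluator_future = pool.submit(call_evaluator, case)
        supervisor_future = pool.submit(call_supervisor, case)
        # Neither judge sees the other's output, so the reviews stay independent.
        return evaluator_future.result(), supervisor_future.result()


evaluator_result, supervisor_result = evaluate_in_parallel({"question": "example"})
```

Threads are a reasonable fit here because the real calls would spend their time waiting on network responses rather than computing.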

[0036] In terms of technical implementation, the framework communicates with the two model services through standard API interfaces (such as an HTTP RESTful API). For each test case, the framework constructs two evaluation requests, sending one to the evaluator model and the other to the supervisor model. The core and innovative approach here lies in highly structured "prompt engineering." What is sent to the model is not a simple instruction, but a carefully designed evaluation specification. The prompt explicitly defines multiple quality dimensions of this evaluation (typically including "contextual relevance," "answer fidelity," and "answer relevance"), and provides clear, actionable scoring rules and grade examples for each dimension (e.g., specific criteria from 1 to 5 points). It also requires the model to provide specific reasoning based on the rules along with integer or half-integer scores. Through this structured input, the judgment criteria of human evaluation experts are essentially transformed into standardized operational instructions that a large language model can stably understand and execute.
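A structured evaluation prompt of this kind might be assembled as in the sketch below. The three dimension names come from the text; the one-line descriptions, the JSON reply format, the key names `reference_answer` and `context`, and the helper names are assumptions, and the real scoring rubric with grade examples is not reproduced here.

```python
# Dimension names follow the text; the descriptions are illustrative assumptions.
DIMENSIONS = {
    "contextual_relevance": "Does the retrieved context address the question?",
    "answer_fidelity": "Is every claim in the answer supported by the context?",
    "answer_relevance": "Does the answer address the question?",
}


def build_evaluation_prompt(case):
    """Assemble one structured evaluation prompt for a single test case."""
    lines = [
        "Score the generated answer on each dimension from 1 to 5 "
        "(integer or half-integer) and give a reason for each score.",
        "",
        "Dimensions:",
    ]
    for name, description in DIMENSIONS.items():
        lines.append(f"- {name}: {description}")
    lines += [
        "",
        f"Question: {case['question']}",
        f"Generated answer: {case['generated_answer']}",
        f"Standard answer: {case['reference_answer']}",
        f"Context: {case['context']}",
        "",
        # The reply format is an assumption chosen to make parsing simple.
        'Reply as JSON: {"<dimension>": {"score": <number>, "reason": "<text>"}}',
    ]
    return "\n".join(lines)


def is_valid_score(value):
    """Accept only integer or half-integer scores between 1 and 5."""
    return 1 <= value <= 5 and float(value * 2).is_integer()


example_case = {
    "question": "Who wrote Hamlet?",
    "generated_answer": "William Shakespeare wrote Hamlet.",
    "reference_answer": "William Shakespeare",
    "context": "Hamlet is a tragedy by William Shakespeare.",
}
prompt = build_evaluation_prompt(example_case)
```

The same prompt text would be sent to both models so that any score difference reflects the models rather than the instructions.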

[0037] Its working process and principle lie in cross-validation using the differentiated cognition of heterogeneous models. When faced with the same answer, a general-purpose evaluator model may be better at judging from a macro perspective of "whether the overall semantics are reasonable," while a code expert-type supervisor model will, like a parser, examine each fact in the answer to see if it strictly corresponds to the retrieval context and whether the logic is self-consistent. This design simulates the "back-to-back" independent review and deliberation mechanism in human review. The underlying principle is that a single model, no matter how powerful, may have some imperceptible "blind spots" or inherent style preferences (i.e., the manifestation of "model hallucination" in evaluation tasks), while the probability and direction of errors of two models with different capabilities overlap less. Therefore, the consistency of their scores can serve as a strong signal of the credibility of the evaluation results, while the discrepancies between them indicate potential problems that need to be examined in depth, providing the early warning that activates the subsequent steps.

[0038] If this step employs traditional or simplified evaluation methods, such as using only a single model or reducing the evaluation to a general "good / bad" binary classification, the aforementioned effect cannot be achieved. Single-model evaluation is prone to falling into the model's own bias and cannot self-check the credibility of the evaluation results; while general classification loses the multi-dimensional information needed for fine-tuning system defects. In contrast, the parallel heterogeneous model evaluation mechanism adopted in this invention, combined with structured prompts, not only significantly improves the overall robustness and credibility of the evaluation results through cross-validation, but more importantly, it transforms the evaluation process from a single output behavior into a diagnostic process that generates internal verification signals (consistency) and problem location clues (disagreements) by actively inducing differentiated model perspectives. This provides a solid foundation and clear input for the next step of quantitative disagreement detection and intelligent self-questioning, enabling the entire evaluation system to possess preliminary self-reflection and problem discovery capabilities, rather than merely mechanical scoring. At this point, each test case has received detailed diagnostic opinions from two independent "experts," and the evaluation process enters the next crucial stage—quantitative analysis and decision-making based on these opinions.

[0039] (3) Disagreement Quantification Detection and Decision: For each test case, collect the scoring results of the evaluator model and the supervisor model, and calculate the dispersion quantification value of the scores of the two models on each quality dimension; compare the dispersion quantification value with the preset decision threshold. If the threshold is exceeded, it is determined that there is a significant disagreement and triggers the self-questioning process.

[0040] After obtaining independent scores from two heterogeneous models—one from evaluators and one from supervisors—the evaluation process enters a crucial decision-making juncture: discrepancy quantification detection and decision-making. The purpose of this step is to transform the subjective scoring differences that may have arisen in the previous parallel evaluation into an objective, quantifiable scientific indicator. This automatically determines the credibility of the evaluation results and decides whether to initiate a more in-depth review mechanism. This marks the shift of the evaluation from "generating judgments" to a meta-evaluation stage of "examining the judgments themselves."

[0041] Specifically, the system processes each test case individually, analyzing the two sets of scores obtained across the preset quality dimensions (such as context relevance and answer fidelity). The core technical approach is to introduce standard deviation as a unified measure of dispersion. For any dimension, assuming the evaluator's score is S_a and the supervisor's score is S_b, the system first calculates the average of the two scores, μ = (S_a + S_b) / 2, and then applies the standard deviation formula (…) to obtain σ, which characterizes the degree of deviation of the two ratings around their common center. The smaller the value, the higher the consensus between the two; the larger the value, the more significant the disagreement.
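The formula itself is elided in the text. One reading that agrees with the value of approximately 0.5 the description gives for scores of 4 and 5 is the population standard deviation of the two scores; the worked form below rests on that assumption and is not taken from the source.

```latex
\mu = \frac{S_a + S_b}{2}, \qquad
\sigma = \sqrt{\frac{(S_a - \mu)^2 + (S_b - \mu)^2}{2}}
       = \frac{\lvert S_a - S_b \rvert}{2}

% Worked values under this assumption:
%   S_a = 5,   S_b = 5  =>  \sigma = 0
%   S_a = 4.5, S_b = 5  =>  \sigma = 0.25
%   S_a = 4,   S_b = 5  =>  \sigma = 0.5
%   S_a = 4,   S_b = 2  =>  \sigma = 1
```

Under this reading, σ is simply half the absolute score gap, so on a half-integer scale it can only take the values 0, 0.25, 0.5, and so on.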

[0042] The application of this mathematical tool is not arbitrary, but based on a profound consideration of the nature of the assessment task. In the context of using a 5-point scale, scores typically appear as integers or half-integers. Through theoretical derivation and statistical analysis of preliminary experiments, it can be found that when two scores are completely identical, the standard deviation is 0; when they differ by an integer grade (e.g., one 4 points and the other 5 points), the calculated standard deviation is approximately 0.5. Based on observations of a large amount of experimental data, this invention sets a fine-grained decision threshold (e.g., 0.2). The scientific basis for this threshold is that when the standard deviation is below 0.2, it usually corresponds to a score difference of less than 0.5 points, which can be considered an acceptable and normal inter-observer error between different "review experts," possibly stemming from slight differences in interpretation of details. However, once the standard deviation reaches or exceeds 0.2, it means that there is a substantial discrepancy of at least one grade in the scores. This discrepancy is often not random error, but more likely reveals deeper problems—for example, one model may have failed to correctly apply the assessment criteria, or some flaw in the answer (such as subtle factual contradictions or logical jumps) happens to touch upon different sensitive areas of the two models. Therefore, setting the threshold at this level can efficiently filter out "difficult cases" with low assessment certainty that are worth further investigation.

[0043] The system calculates the standard deviation of the score for each dimension and compares it with a preset threshold. If the standard deviation of any dimension exceeds the threshold, the system determines that there is a "significant disagreement" in the evaluation of that test case. This determination serves as a clear decision signal, triggering the subsequent "self-questioning" process. Conversely, if the standard deviations of all dimensions are below the threshold, the system considers the dual-model evaluation to have reached a reliable consensus and can directly aggregate the current scores (e.g., by taking the average) as the final credible evaluation result output.
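The decision logic of this paragraph can be sketched as below, assuming the population standard deviation (`statistics.pstdev`) and the example threshold of 0.2. The function name, the result dictionary layout, and the use of `>=` are assumptions; the text says both "exceeds" and "σ≥0.2", and on a half-integer scale σ never equals 0.2 under this reading, so the two comparisons give the same result.

```python
from statistics import mean, pstdev

THRESHOLD = 0.2  # example value given in the text


def detect_disagreement(evaluator_scores, supervisor_scores, threshold=THRESHOLD):
    """Compare two score dicts per dimension and decide whether to escalate."""
    sigmas = {
        dim: pstdev([evaluator_scores[dim], supervisor_scores[dim]])
        for dim in evaluator_scores
    }
    disputed = [dim for dim, sigma in sigmas.items() if sigma >= threshold]
    if disputed:
        # Any disputed dimension routes the whole case to self-questioning.
        return {"trigger": True, "disputed": disputed, "sigmas": sigmas}
    # Consensus: aggregate by averaging the two judges' scores.
    final_scores = {
        dim: mean([evaluator_scores[dim], supervisor_scores[dim]])
        for dim in evaluator_scores
    }
    return {"trigger": False, "final_scores": final_scores, "sigmas": sigmas}


disputed_case = detect_disagreement(
    {"contextual_relevance": 5.0, "answer_fidelity": 4.0},
    {"contextual_relevance": 5.0, "answer_fidelity": 2.0},
)
consensus_case = detect_disagreement(
    {"contextual_relevance": 5.0, "answer_fidelity": 4.5},
    {"contextual_relevance": 5.0, "answer_fidelity": 4.5},
)
```

Here `disputed_case` flags only the fidelity dimension, while `consensus_case` returns averaged scores without escalation.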

[0044] If a more rudimentary or arbitrary decision-making mechanism is adopted, such as considering agreement only when two scores are absolutely equal, or setting the trigger condition to a fixed score difference lacking statistical basis, the system will either be overly sensitive (misjudging a large number of normal errors as disagreements, leading to overload of subsequent processes) or overly insensitive (missing many truly problematic evaluation disputes). The decision-making mechanism of this invention, built upon standard deviation and empirically calibrated thresholds, is innovative in that it achieves precise measurement and intelligent routing of evaluation uncertainty. It is not merely a mathematical calculation, but an intelligent decision-making node that simulates the "review" process initiated by human experts when faced with conflicting opinions. This step transforms the originally ambiguous state of "disagreement" into a clear, actionable binary decision signal, enabling the evaluation process to be adaptive and resource-optimized—concentrating valuable in-depth analysis resources on the evaluation points most likely to have problems. An unexpected technical effect of this design is that it endows the entire evaluation system with inherent quality control and confidence perception capabilities, providing a solid guarantee for the reliability of the final evaluation results and the targeted nature of subsequent optimization suggestions. At this point, the system has accurately identified the cases that need to be "re-examined," and the evaluation process naturally enters a more insightful stage of self-questioning and calibration.

[0045] (4) Rule-guided self-questioning and calibration: When the self-questioning process is triggered, the evaluation framework calls the questioner large language model and sends it enhanced prompts containing targeted review rules; the questioner model, based on the rules, conducts in-depth analysis of quality dimensions with significant disagreements, determines the root cause of the scoring differences, calibrates the original scores and generates new scores accordingly, and generates optimization suggestions for specific modules of the RAG system based on the analysis results.

[0046] After completing the quantitative detection of discrepancies and identifying test cases with significant discrepancies, the evaluation process automatically enters its most self-correcting and diagnostic stage—rule-guided self-questioning and calibration. The core objective of this step is to transform the evaluation uncertainties exposed in the previous steps into a precise root cause diagnosis and credible arbitration, thereby achieving a "re-evaluation" and "calibration" of the evaluation results themselves, and providing direct guidance for optimizing the RAG system.

[0047] This step is triggered by a clear automated decision signal. When the disagreement analysis and decision module determines that the standard deviation of a test case's score in any dimension exceeds a preset threshold (e.g., σ≥0.2), the test case is marked and routed to the self-questioning and calibration module. The system then invokes a dedicated questioner language model. In one specific embodiment, to fully utilize its rigorous logical reasoning and rule-following capabilities, the system reuses the Qwen2.5-coder-32B-Instruct model for this role. The invocation is performed through the model's standard API interface, but the key innovation lies in the highly structured, dynamically generated enhanced prompts prepared for it. This essentially instantiates a lightweight but functionally well-defined "rule engine" and hands it to the model for execution.

[0048] The rule engine's construction and operation mechanism is as follows: The system pre-defines a structured rule knowledge base, which stores a detailed list of review points for each evaluation dimension and cross-dimensional contradiction analysis logic in a machine-readable format (such as JSON or YAML configuration files). For example, rules for the "answer fidelity" dimension might include: "1. Verify each core factual claim in the generated answer to ensure that it has clear and unambiguous corresponding support in the provided RAG context; 2. Identify whether there are any entities, data, dates, or events in the answer that are not mentioned in the context; 3. Check whether the inferences or summaries of the answer logically contradict the overall meaning of the context." The rule base also contains logical mapping relationships, such as: "If 'contextual relevance' is high but 'answer fidelity' is low, prioritize suspecting that the generation module is 'hallucinating' or has an information integration error."
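A rule knowledge base of this shape could be stored and queried as in the following sketch. The JSON layout, the key names (`review_points`, `cross_dimension`, `high`, `low`, `suspect`), the helper `select_rules`, and the numeric cutoffs for "high" and "low" scores are all assumptions for illustration; the rule wording is shortened from the examples in the text.

```python
import json

# Hypothetical JSON layout for the rule knowledge base.
RULES_JSON = """
{
  "review_points": {
    "answer_fidelity": [
      "Verify each core factual claim against the provided context.",
      "Identify entities, data, dates, or events absent from the context.",
      "Check whether inferences contradict the overall meaning of the context."
    ]
  },
  "cross_dimension": [
    {
      "high": "contextual_relevance",
      "low": "answer_fidelity",
      "suspect": "generation module: hallucination or information integration error"
    }
  ]
}
"""


def select_rules(rule_base, disputed, scores, high=4.0, low=2.5):
    """Return review points for disputed dimensions plus matching root-cause hints.

    The cutoffs `high` and `low` are assumed values, not taken from the source.
    """
    points = {dim: rule_base["review_points"].get(dim, []) for dim in disputed}
    hints = [
        rule["suspect"]
        for rule in rule_base["cross_dimension"]
        if scores.get(rule["high"], 0) >= high and scores.get(rule["low"], 5) <= low
    ]
    return points, hints


rule_base = json.loads(RULES_JSON)
points, hints = select_rules(
    rule_base,
    disputed=["answer_fidelity"],
    scores={"contextual_relevance": 4.5, "answer_fidelity": 2.0},
)
```

Keeping the rules in a data file rather than in code lets reviewers adjust review points without touching the evaluation logic.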

[0049] The process begins with the generation of dynamic prompts. After receiving an instruction containing specific disagreement dimensions, the system extracts the corresponding review points and logical analysis rules from the rule knowledge base. Then, it populates a pre-defined prompt template with the test case's four-tuple raw data (question, generated answer, standard answer, context), the original scores and reasons from the evaluator and supervisor, and the extracted targeted rules. This process constructs a unique, context-sensitive review task instruction, which might have the following structure: "You are the final arbitrator. For the following case, it is known that Model A and Model B have significant disagreements on the 'fidelity' dimension (scores of 4 and 2 respectively). Please conduct a deep analysis strictly according to the following review rules: [Insert 'fidelity' review rules extracted from the knowledge base here]. Based on the analysis of the raw materials (question: [question], generated answer: [generated answer], standard answer: [standard answer], context: [context]), determine the root cause of the disagreement, provide your calibrated score, and offer specific system optimization suggestions."
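The template-filling step can be sketched with `string.Template` from the standard library. The template wording is adapted from the example prompt above; the placeholder names, the `build_arbitration_prompt` helper, the requested output field labels, and the sample data are assumptions.

```python
from string import Template

# Wording adapted from the example prompt in the text; placeholders are assumed.
ARBITRATION_TEMPLATE = Template(
    "You are the final arbitrator. Model A and Model B disagree significantly "
    "on the '$dimension' dimension (scores of $score_a and $score_b respectively).\n"
    "Model A reason: $reason_a\n"
    "Model B reason: $reason_b\n"
    "Analyze strictly according to these review rules:\n$rules\n"
    "Question: $question\n"
    "Generated answer: $generated_answer\n"
    "Standard answer: $reference_answer\n"
    "Context: $context\n"
    "Determine the root cause of the disagreement, then reply with the lines "
    "'Calibration Score: <score>' and 'Optimization Suggestion: <text>'."
)


def build_arbitration_prompt(case, dimension, judge_a, judge_b, rules):
    """Fill the template with the case data, both judgments, and selected rules."""
    numbered_rules = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return ARBITRATION_TEMPLATE.substitute(
        dimension=dimension,
        score_a=judge_a["score"], reason_a=judge_a["reason"],
        score_b=judge_b["score"], reason_b=judge_b["reason"],
        rules=numbered_rules,
        **case,
    )


arbitration_prompt = build_arbitration_prompt(
    case={
        "question": "Who wrote Hamlet?",
        "generated_answer": "Christopher Marlowe wrote Hamlet in 1599.",
        "reference_answer": "William Shakespeare",
        "context": "Hamlet is a tragedy by William Shakespeare.",
    },
    dimension="fidelity",
    judge_a={"score": 4, "reason": "Fluent and on topic."},
    judge_b={"score": 2, "reason": "Author contradicts the context."},
    rules=["Verify each core factual claim against the provided context."],
)
```

`Template.substitute` raises `KeyError` if any placeholder is left unfilled, which guards against sending an incomplete review task.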

[0050] Upon receiving this enhanced prompt, the questioner model performs a controlled, step-by-step reasoning analysis. Instead of making a free-form overall judgment, it is guided like a rigorous auditor, examining the evidence against the given list of rules and carrying out cross-dimensional logical deductions. For example, it might check each factual point in the generated answer according to the rules, locating and verifying it within the context, or analyze why a highly relevant context might nevertheless produce a low-fidelity answer.

[0051] After the model completes its analysis, its output is parsed in a structured manner by the system. The output is typically required to follow a specific format, such as explicit fields like "Calibration Score: [Score]" and "Optimization Suggestion: [Text]". The system parses these fields to obtain the final, arbitrated calibration score and root cause diagnosis conclusions. These conclusions are directly mapped to actionable optimization suggestions. For example, if the root cause is determined to be "insufficient or misinterpreted use of relevant context by the generation module," a suggestion is generated to "optimize the prompts in the generation module, strengthen the instruction to 'respond strictly according to the provided context point by point,' and consider introducing a self-consistency check step." If the root cause is determined to be "missing key information in the retrieval context itself," a suggestion is made to "adjust the recall strategy of the retrieval module, or expand and optimize the source data of the knowledge base."
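A minimal sketch of the structured parsing step, assuming the questioner model emits the two fields named above, each on its own line. The regular expressions and the function name `parse_questioner_output` are illustrative assumptions, not a format fixed by the description.

```python
import re

# Field labels follow the text; the exact line layout is an assumption.
SCORE_RE = re.compile(r"Calibration Score:\s*([0-9]+(?:\.[0-9]+)?)")
SUGGESTION_RE = re.compile(r"Optimization Suggestion:\s*(.+)")


def parse_questioner_output(text):
    """Extract the calibrated score and the suggestion from the model's reply."""
    score_match = SCORE_RE.search(text)
    suggestion_match = SUGGESTION_RE.search(text)
    if score_match is None or suggestion_match is None:
        # Fail loudly so a malformed reply is never treated as a valid ruling.
        raise ValueError("questioner output is missing a required field")
    return {
        "calibrated_score": float(score_match.group(1)),
        "suggestion": suggestion_match.group(1).strip(),
    }
```

Raising on a missing field lets the caller decide whether to retry the model call or flag the case for manual review.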

[0052] If this in-depth, rule-guided self-questioning step is omitted and a simple average or a randomly selected score is used as the final result, the assessment system loses its core value of self-verification and in-depth diagnosis. The implementation of this step produces unexpected technical effects: it gives the automated assessment system "re-examination" and "adjudication" capabilities similar to those of human experts, significantly enhancing the authority and credibility of the final assessment conclusion. More importantly, it outputs a directly actionable system diagnostic report. This marks a shift from "reporting the score" to "diagnosing the cause and prescribing a remedy," turning the quality assessment activity itself into an engine that drives the iterative optimization of the RAG system and completes the assessment-optimization closed loop. Thus, every assessed case obtains a convincing conclusion and a clear direction for improvement.

[0053] (5) Evaluation results output: Based on the scoring results, calibration results and optimization suggestions, generate a readable evaluation report containing the final quality score and system improvement directions.

[0054] After completing the aforementioned multi-level evaluation, discrepancy detection, and self-questioning calibration chain, the method enters the final value presentation stage—evaluation result output. The purpose of this step is to aggregate and transform the complex intermediate data and intelligent analysis conclusions generated in a distributed, multi-stage process into a comprehensive, readable, and actionable decision support report for system developers or operations personnel, thereby achieving a seamless transition from automated analysis to human decision-making.

[0055] This step is not simply about writing the final score to a file, but rather about performing a structured information synthesis and report generation process. The "Result Synthesis and Output Module" acts as the main execution unit, and its input sources include: for use cases with no significant disagreement, the direct average score and original scoring reasons from the "Disagreement Detection and Decision Module"; for use cases that triggered self-questioning, the calibrated score, root cause analysis conclusions, and specific optimization suggestions from the "Self-Questioning and Calibration Module." The core technology of this module is to clean, classify, and fuse the aforementioned heterogeneous data streams based on a predefined report template engine.

[0056] Its workflow is systematic: the module first generates a standardized summary record for each test case. This record not only contains the final quality score (which may be a composite score or scores across multiple dimensions), but more importantly, it includes rich metadata and explanatory information. This information typically includes: indicators that assess whether a "self-questioning" process has been conducted, the original evaluator and supervisor scores (for traceability), key analytical points on which the calibration score was based, and the most practically valuable targeted optimization suggestions. Subsequently, the system aggregates all test case records and calculates statistical indicators for the overall dataset, such as the average score, the score distribution across dimensions, the proportion of test cases triggering self-questioning, and the categories of frequently occurring optimization suggestions.
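The aggregation over per-case summary records might look like the sketch below. The record keys (`final_score`, `self_questioned`, `suggestion_category`) and the function name `summarize` are hypothetical; the text lists the kinds of information a record carries but not its field names.

```python
from collections import Counter
from statistics import mean


def summarize(records):
    """Aggregate per-case summary records into dataset-level indicators."""
    if not records:
        raise ValueError("no test case records to summarize")
    triggered = [r for r in records if r["self_questioned"]]
    # Count how often each category of optimization suggestion occurs.
    categories = Counter(
        r["suggestion_category"]
        for r in triggered
        if r.get("suggestion_category")
    )
    return {
        "case_count": len(records),
        "average_score": mean(r["final_score"] for r in records),
        "self_questioning_rate": len(triggered) / len(records),
        "suggestion_categories": dict(categories),
    }
```

For example, two records with final scores 4.0 and 2.0, of which only the second triggered self-questioning, yield an average score of 3.0 and a self-questioning rate of 0.5.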

[0057] The report's generation principle lies in achieving transparency and interpretability in the evaluation process and results. A typical output report might be presented in the form of a structured document (such as Markdown or HTML) or a standard data file (such as JSON or Excel). It not only lists "what the score is," but also strives to explain "why the score is as it is." For example, the report might clearly state: "Of the 100 use cases, 15 triggered a self-questioning process, of which 12 had their scores revised after arbitration. A total of 8 suggestions were generated for 'optimizing the generation module's prompts,' and 5 suggestions were generated for 'adjusting search parameters'..." This presentation method reveals the internal disputes and decision-making processes of the large model's "black box" evaluation in a white-box format, greatly enhancing the credibility and acceptability of the evaluation conclusions.
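As one possible rendering of such a report, the sketch below turns dataset-level counts into a short Markdown document that states how many cases were disputed and revised. The function name `render_report`, its parameters, and the heading text are assumptions made for illustration.

```python
def render_report(total, triggered, revised, suggestion_counts):
    """Render dataset-level counts as a short Markdown report."""
    lines = [
        "# RAG Evaluation Report",
        "",
        f"Of the {total} use cases, {triggered} triggered a self-questioning "
        f"process, of which {revised} had their scores revised after arbitration.",
        "",
        "## Optimization suggestions",
    ]
    # List the most frequent suggestion categories first.
    ranked = sorted(suggestion_counts.items(), key=lambda item: -item[1])
    for category, count in ranked:
        lines.append(f"- {category}: {count}")
    return "\n".join(lines)
```

Ordering suggestion categories by frequency puts the most common defect type at the top, which is where a reader looking for the next optimization target would look first.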

[0058] If this step adopts the common output method of traditional evaluation systems—that is, only providing a final total score or a simple pass / fail indicator—it will severely undermine the value of this multi-level evaluation method. Such simplified output loses all the valuable process information: developers have no way of knowing whether the score is reliable, whether there are disputes between different evaluators, or at which specific stage of the system (retrieval, generation, fusion) the problem exists, thus making optimization impossible and only allowing for blind experimentation.

[0059] In contrast, the deep, structured output achieved through step 5 of this invention yields unexpected technical effects and commercial value. It upgrades the output of automated quality assessment from a monotonous "performance dashboard" to a comprehensive "system health diagnosis and optimization manual." This not only significantly improves the ROI of the assessment activity itself, but more importantly, it provides the R&D team with a clear, prioritized roadmap, directly embedding the assessment process into the DevOps cycle and realizing an agile development model driven by assessment and guided by data. Thus, this method completes a full closed loop from data input to knowledge output, delivering not just an evaluation, but a clear guide to system evolution.

[0060] This embodiment presents a multi-level quality assessment method for a large-model-based retrieval-augmented generation (RAG) system. Through a series of interconnected, progressively advancing steps that form a closed loop, it systematically addresses the problems of insufficient accuracy, limited dimensionality, and lack of actionable feedback in existing RAG system assessments. First, by defining a standardized four-tuple data structure and organizing the test set in a common file format, the integrity, machine readability, and process standardization of the assessment data are ensured from the outset, laying a reliable foundation for automated assessment. Furthermore, a parallel heterogeneous model assessment mechanism is introduced. This mechanism uses the capability differences between a general-purpose large language model and a code model for independent cross-validation, and combines this with structured prompts that guide multi-dimensional scoring, thus mitigating the assessment risks associated with a single model and enriching the assessment perspective.

[0061] To achieve self-verification and intelligent decision-making in the assessment, this method, based on statistical principles, quantifies the scoring differences of heterogeneous models into standard deviations and sets empirically calibrated scientific thresholds. This transforms subjective "disagreements" into objective "trigger signals," enabling precise perception and automated routing of assessment uncertainty. For identified significant disagreements, this solution further designs rule-guided self-questioning and calibration steps. The core of this step lies in constructing a lightweight rule engine composed of specific review points and logical reasoning rules, driving the questioner's large language model to perform a deep, controllable meta-assessment analysis. This step not only arbitrates and calibrates disputed scores but, more importantly, generates specific optimization suggestions directly targeting the RAG system's retrieval or generation modules by diagnosing the root causes of disagreements.

[0062] Finally, by structurally synthesizing all intermediate data and intelligent analysis conclusions, a comprehensive diagnostic report is generated, integrating final scoring, process traceability, and actionable recommendations. This transforms the evaluation activity from a single performance metric into an optimization guideline driving system iteration. In summary, this method, through a complete technical chain of "standardized data input → heterogeneous cross-validation → quantified divergence decision-making → rule-guided meta-evaluation → diagnostic report output," significantly improves the accuracy, reliability, and practicality of RAG system evaluation, and forms a virtuous cycle of "evaluation-diagnosis-optimization."

[0063] Example 2

[0064] Corresponding to the aforementioned multi-level quality assessment method (Example 1), the technical solution of the present invention can also be implemented through a dedicated system entity. This system, through modular design, solidifies each logical operation of the method into hardware and software components with clearly defined functions. The modules collaborate through well-defined interfaces and data flows, collectively forming an intelligent agent capable of automatically executing complex assessment tasks. The assessment system (Example 2) will be described in detail below.

[0065] The system's operation begins with the data preprocessing and input module. This module acts as a standardized "data gate" for the entire system. Its core function is to receive RAG system output data from external sources, which may be in various formats, and force them to be converted into a unified language required for the evaluation process. It has built-in parsers for specific file formats (such as Excel, CSV, and JSON), for example, by integrating the `pandas` library or a custom JSON deserializer to read files. Its operation involves receiving a file path or data stream containing the original test cases, validating, cleaning, and formatting the data according to a predefined four-tuple data structure specification, and finally generating a well-structured, directly indexable data object (such as a list or data frame) in memory. The existence of this module fundamentally ensures that all subsequent advanced processing is based on clean and consistent data, eliminating evaluation bias caused by input chaos, and is the primary guarantee for the stable operation of the automated pipeline.
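A minimal sketch of the validation and formatting performed by this module. To stay dependency-free it uses the standard-library `csv` and `json` modules instead of `pandas`, and the field names in `REQUIRED_FIELDS` are assumed column names for the four-tuple, not names given in the description.

```python
import csv
import io
import json

# Assumed column names for the four-tuple; the source does not fix them.
REQUIRED_FIELDS = ("question", "generated_answer", "standard_answer", "context")


def normalize_case(raw):
    """Validate one raw record and return a cleaned four-tuple dictionary."""
    case = {}
    for field in REQUIRED_FIELDS:
        value = raw.get(field)
        if value is None or not str(value).strip():
            raise ValueError(f"test case is missing field: {field}")
        case[field] = str(value).strip()
    return case


def load_cases(text, fmt):
    """Parse CSV or JSON text into a list of validated test cases."""
    if fmt == "json":
        rows = json.loads(text)
    elif fmt == "csv":
        rows = list(csv.DictReader(io.StringIO(text)))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return [normalize_case(row) for row in rows]
```

Rejecting incomplete records at this stage keeps later modules free of defensive checks, in line with the "data gate" role described above.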

[0066] The processed data is then fed into the parallel evaluation execution module. This module is the "dual-core engine" for the system's core evaluation function. Internally, it encapsulates the connection configurations for two heterogeneous large language model services; for example, it configures API clients for a general dialogue model (e.g., Qwen2.5-32B-Instruct) and a code reasoning model (e.g., Qwen2.5-coder-32B-Instruct). The key technology of this module lies in its dynamic prompt assembler. For each input test case, it calls a preset structured evaluation prompt template from the template library and fills it with the specific content of the test case (question, generated answer, etc.), generating two independent evaluation instructions containing clear scoring dimensions and details. Subsequently, it sends these two instructions simultaneously to the evaluator and supervisor models via an asynchronous call mechanism. This design not only achieves parallel evaluation to improve efficiency, but more importantly, it ensures the consistency of evaluation standards and the reliable execution of cross-validation mechanisms through heterogeneous model scheduling and prompt engineering logic embedded within the module, eliminating the potential defects of single-model evaluation from the system architecture level.
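The asynchronous dispatch to the two models can be sketched with `asyncio.gather`. The two coroutines below are stand-in stubs that return fixed scores; a real deployment would replace them with API clients for the configured models. The template text and all function names are hypothetical.

```python
import asyncio

# Hypothetical evaluation template; the dimension names follow the text.
EVAL_TEMPLATE = (
    "Score the generated answer from 1 to 5 on contextual relevance, "
    "answer fidelity, and answer relevance, and give a reason for each score.\n"
    "question: {question}\n"
    "generated answer: {generated_answer}\n"
    "standard answer: {standard_answer}\n"
    "context: {context}"
)


async def evaluator_stub(prompt):
    """Stand-in for the general dialogue model's API client."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return {"scores": {"answer_fidelity": 4}, "reason": "stub evaluator"}


async def supervisor_stub(prompt):
    """Stand-in for the code reasoning model's API client."""
    await asyncio.sleep(0)
    return {"scores": {"answer_fidelity": 2}, "reason": "stub supervisor"}


async def evaluate_case(case, evaluator=evaluator_stub, supervisor=supervisor_stub):
    """Send the same structured prompt to both models concurrently."""
    prompt = EVAL_TEMPLATE.format(**case)
    evaluator_result, supervisor_result = await asyncio.gather(
        evaluator(prompt), supervisor(prompt)
    )
    return {"evaluator": evaluator_result, "supervisor": supervisor_result}
```

A caller would run `asyncio.run(evaluate_case(case))` for one test case; because both models receive an identical prompt, any difference in scores reflects the models rather than the instructions.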

[0067] Once the two evaluation results are returned, the system proceeds to the disagreement analysis and decision-making module. This module acts as an "intelligent scheduling center," and its core is a decision circuit integrating a standard deviation calculation unit and a threshold comparator. The calculation unit receives the scores from the two models across the various dimensions and, for each dimension, computes $\sigma = \sqrt{\frac{(S_a - \mu)^2 + (S_b - \mu)^2}{2}}$, where $S_a$ is the evaluator's score, $S_b$ is the supervisor's score, and $\mu$ is their mean, to obtain a quantified disagreement index. A threshold comparator then compares this index in real time with a preset, calibrated value (e.g., 0.2). The module operates entirely deterministically and automatically: if the index is below the threshold, a "consensus reached" signal is generated and the data is routed to the output stage; if it is greater than or equal to the threshold, an instruction package containing a "trigger self-questioning" flag and the specific disagreement dimensions is generated. This module turns statistical principles into running code logic, enabling the system to quantify the confidence of an evaluation and make key routing decisions autonomously, which is a prerequisite for the subsequent in-depth analysis.
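The routing decision can be sketched as below, assuming the population form of the standard deviation (dividing by 2, the number of scores), which `statistics.pstdev` implements. The function name `route_case`, the signal strings, and the default threshold of 0.2 (taken from the example value in the text) are illustrative.

```python
from statistics import pstdev


def route_case(evaluator_scores, supervisor_scores, threshold=0.2):
    """Compute per-dimension disagreement and decide where the case goes next."""
    # Assumption: population standard deviation over the two scores.
    sigmas = {
        dim: pstdev([evaluator_scores[dim], supervisor_scores[dim]])
        for dim in evaluator_scores
    }
    # At or above the threshold counts as significant disagreement.
    disputed = sorted(dim for dim, sigma in sigmas.items() if sigma >= threshold)
    if disputed:
        return {
            "signal": "trigger_self_questioning",
            "dimensions": disputed,
            "sigmas": sigmas,
        }
    return {"signal": "consensus_reached", "dimensions": [], "sigmas": sigmas}
```

Under this assumption the index for two scores reduces to half their absolute difference, so scores of 4 and 2 give 1.0 and identical scores give 0.0.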

[0068] For cases marked as requiring review, the self-questioning and calibration module is activated. This module is the system's "deep diagnosis and arbitration unit." Internally, it maintains a configurable rule knowledge base, storing a list of review points for each evaluation dimension and cross-dimensional conflict analysis logic. Upon receiving an instruction package from the decision module, it first retrieves and combines targeted review rules from the knowledge base based on the disagreement dimension information. Then, it calls the questioner model (which can reuse the supervisor model or a specially configured model) and dynamically constructs an enhanced prompt that integrates the original test case data, the scoring reasons from both sides, and the specific review rules, sending it to the model. The questioner model performs deep reasoning within this rule framework, and its returned text results are processed by the module's parser, extracting structured fields such as "calibration score" and "optimization suggestions." This module, by combining the "rule engine" with the "large model reasoning capability," implements a repeatable and interpretable meta-evaluation process within the system, completing the final ruling and root cause diagnosis of the disputed points.

[0069] Finally, the evaluation results of all paths are aggregated into the comprehensive report generation module. This module is the system's "value aggregation and presentation terminal." It receives consensus results from the parallel evaluation module or calibration results and suggestions from the self-questioning module, and integrates, sorts, and summarizes all information according to a customizable report template. Its process includes calculating overall performance indicators (such as average score), statistically analyzing high-frequency problem types, and categorizing and outputting specific optimization suggestions. Ultimately, it generates a well-structured and highly readable comprehensive evaluation report (such as in JSON, HTML, or PDF format). This module not only outputs a simple score but also delivers a complete diagnostic report covering performance baselines, confidence indicators, root causes of problems, and specific optimization paths. It transforms the intelligent analysis results of all the aforementioned modules into practical knowledge that developers can directly use for decision-making and action, ultimately closing the complete "evaluation-optimization" chain.

[0070] In summary, this assessment system, through the precise coordination and data flow of the five core modules mentioned above, fully reproduces and solidifies the entire innovative process of multi-level quality assessment methods at the entity level. It is not simply a matter of coding methods, but rather a modular architecture that encapsulates key technologies such as heterogeneous model collaboration, quantitative divergence decision-making, and rule-guided meta-assessment into independent, reusable system components. These components collectively achieve the invention's objective of improving the accuracy, reliability, and operability of assessments, forming an automated RAG system quality assessment solution with industrial-grade reliability and practicality.

[0071] The multi-level quality assessment system for large-model-based retrieval-augmented generation provided by this invention implements the multi-level assessment method as an efficient and reliable automated entity through a modular, pipelined hardware and software architecture. Starting with the data preprocessing and input module, the system lays the foundation for standardized and repeatable assessment processes through mandatory structured data transformation. Subsequently, the parallel evaluation execution module, through an embedded heterogeneous large model scheduler and dynamic prompt assembler, achieves cross-validation and multi-dimensional coverage of the assessment, mitigating the inherent bias of single-model assessment at the system level.

[0072] To achieve intelligent quality control, the system's core disagreement analysis and decision-making module integrates a statistical calculation unit and a threshold comparator, transforming subjective scoring differences into objective decision signals. This gives the system the "sensory nerves" to identify and assess uncertainty autonomously. For identified disputes, the self-questioning and calibration module, acting as a built-in "arbitration and diagnostic center," is activated. Through a configurable rule knowledge base, it drives the questioner model to conduct an in-depth, controlled meta-evaluation analysis, not only arbitrating the scores but also generating root-cause diagnostic conclusions and optimization recommendations.

[0073] Ultimately, the comprehensive report generation module, acting as a value aggregation terminal, synthesizes the data, signals, and intelligent analysis conclusions generated throughout the entire process into a structured system health diagnostic report. Through the precise collaboration and closed-loop data flow of these five core modules, the system materializes the complete technology chain of "standardized input - heterogeneous verification - quantitative decision-making - rule-based review - diagnostic output." This not only comprehensively addresses the pain points of insufficient accuracy, limited dimensions, and ambiguous feedback in RAG system assessments, but also upgrades quality assessment activities into an intelligent system with self-verification, self-interpretation, and continuous optimization capabilities through a highly automated approach. This provides powerful and reliable tool support for the industrial deployment and iteration of RAG technology.

[0074] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification merely illustrate the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the claimed invention. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims

1. A multi-level quality assessment method for a retrieval-augmented generation (RAG) system based on a large model, characterized in that it includes the following steps:
S1. Test data preparation and formatting: obtain the output of the RAG system under test and construct a structured test case set, where each test case is organized as a four-tuple including: the user question, the answer generated by the RAG system for the question, a preset standard answer, and the context text retrieved and used by the RAG system when generating the answer;
S2. Parallel heterogeneous model evaluation: the test case set is input into the evaluation framework, which simultaneously calls a first type of large language model as the evaluator and a second type of large language model as the supervisor to independently evaluate the same test case; by sending structured evaluation prompts to the two models, they are guided to score the generated answer according to multiple preset quality dimensions and to output reasons;
S3. Disagreement quantification, detection, and decision: for each test case, calculate a quantification value of the dispersion between the evaluator's and the supervisor's scores in each quality dimension; compare the quantification value with a preset decision threshold, and if it exceeds the threshold, determine that there is a significant disagreement and trigger a self-questioning process;
S4. Rule-guided self-questioning and calibration: when the self-questioning process is triggered, a questioner large language model is invoked with enhanced prompts that incorporate targeted review rules; the questioner model, based on the rules, performs in-depth analysis of the dimensions with disagreements and outputs calibrated scores and optimization suggestions for specific modules of the RAG system;
S5. Evaluation results output: combining the scores from step S2, the calibration results from step S4, and the optimization suggestions, a readable evaluation report is generated that includes the final quality score and directions for system improvement.

2. The method according to claim 1, characterized in that, in step S2, the preset multiple quality dimensions include at least: "contextual relevance" for evaluating the relevance of the retrieved context to the question, "answer fidelity" for evaluating the consistency of the generated answer with the context and the standard answer, and "answer relevance" for evaluating the degree of matching between the generated answer and the question.

3. The method according to claim 1, characterized in that, in step S2, the first type of large language model is a general instruction-following model, and the second type of large language model is a code model that has been trained on large-scale code data and has enhanced logic and fact-checking capabilities.

4. The method according to claim 1, characterized in that, in step S3, the standard deviation is used as the quantification value of the dispersion. Specifically, for each quality dimension, the standard deviation $\sigma$ of the evaluator's score $S_a$ and the supervisor's score $S_b$ is calculated using the following formula: $\sigma = \sqrt{\frac{(S_a - \mu)^2 + (S_b - \mu)^2}{2}}$, where $\mu$ is the average of the two scores; the preset decision threshold is set based on the scoring characteristics of the 5-point scale and is used to distinguish between acceptable score fluctuations and substantive disagreements that need to be reviewed.

5. The method according to claim 1, characterized in that, in step S4, the targeted review rules are pre-installed in a structured form in the rule knowledge base, including a list of low-score review points for each quality dimension and cross-dimensional logical contradiction analysis guidelines; the enhanced prompts are generated by dynamically combining the test case data, the original scores, and the relevant rules extracted from the rule knowledge base.

6. The method according to claim 1, characterized in that, in step S4, the generated optimization suggestions directly correspond to the RAG system defect types identified by the in-depth analysis, including suggestions for optimizing the parameters or ranking strategies of the retrieval module, and suggestions for optimizing the prompt constraints of the generation module to suppress hallucinations.

7. An evaluation system for implementing the method according to any one of claims 1 to 6, characterized in that it includes:
a data preprocessing and input module, used to perform step S1, receive and format test data, and output a structured test case set;
a parallel evaluation execution module, connected to the data preprocessing and input module and used to load the first and second types of large language models and to construct and send structured evaluation prompts to perform the parallel evaluation in step S2;
a disagreement analysis and decision-making module, connected to the parallel evaluation execution module and having a built-in calculation unit for calculating the standard deviation and a comparator for threshold comparison, to perform the disagreement quantification, detection, and decision in step S3;
a self-questioning and calibration module, connected to the disagreement analysis and decision-making module, which, after receiving a trigger signal, calls the questioner model and generates enhanced prompts based on the rule knowledge base to perform the rule-guided analysis and calibration in step S4;
a comprehensive report generation module, used to receive and integrate the outputs of the parallel evaluation execution module and the self-questioning and calibration module, execute step S5, and generate the readable evaluation report.

8. The system according to claim 7, characterized in that the first and second types of large language models configured in the parallel evaluation execution module are selected in accordance with the features described in claim 3; the self-questioning and calibration module integrates a rule knowledge base and a prompt assembler; the rule knowledge base is used to store the structured review rules as described in claim 5, and the prompt assembler is used to dynamically generate the enhanced prompts according to the disagreement dimension.

9. A non-volatile storage medium, characterized in that the non-volatile storage medium includes a stored program, wherein the program, when running, controls the device where the non-volatile storage medium is located to execute the method of claim 1.

10. A terminal device, characterized in that the terminal device includes: a processor, a memory, a communication interface, and a bus; the processor, the memory, and the communication interface are connected through the bus and communicate with each other; the memory stores executable program code; the processor reads the executable program code stored in the memory and runs a program corresponding to the executable program code, so as to execute the method as described in claim 1.