Systems and methods for interacting with a retrieval-augmented generation system

Lexical and semantic evaluation metrics are introduced into a retrieval-augmented generation system to produce a quality score for each generated response. This addresses the inaccurate responses that existing systems produce in knowledge-intensive applications, enabling fast and reliable evaluation of response quality and improving the user experience.

CN122309646A · Pending · Publication Date: 2026-06-30 · CENT FOR PERCEPTUAL & INTERACTIVE INTELLIGENCE (CPII) LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CENT FOR PERCEPTUAL & INTERACTIVE INTELLIGENCE (CPII) LTD
Filing Date: 2025-02-14
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing retrieval-augmented generation (RAG) systems struggle to generate accurate and reliable responses in knowledge-intensive applications, particularly in education and healthcare. General-purpose machine learning models that have not been trained on domain-specific data may produce hallucinated responses, and existing evaluation methods are too inefficient for real-time settings.

Method used

We employ lexical and semantic metrics to evaluate the reliability of responses by calculating the lexical and semantic similarity between the text output and the retrieved information. We also provide an intuitive user interface for interaction.

Benefits of technology

It improves response reliability and user experience, and provides fast and accurate response quality assessment, making it suitable for real-time chat applications, especially in knowledge-intensive fields such as education and healthcare.


Abstract

A computer-implemented method for interacting with a retrieval-augmented generation system. The method includes: receiving a text prompt; retrieving information associated with the text prompt, at least in part based on the text prompt; generating input data, at least in part based on the text prompt and the retrieved information; and generating an output, at least in part by applying the input data to a machine learning model of the retrieval-augmented generation system. The machine learning model is used to determine the output using prompt engineering, at least in part based on the input data. The method further includes: generating a quality score for the output with reference to the input data; and outputting an indication of the output and the quality score.

Description

Technical Field

[0001] This disclosure relates to systems and methods for interacting with retrieval-augmented generation systems.

Background

[0002] Retrieval-Augmented Generation (RAG) is a known technique that leverages information retrieval and natural language processing (e.g., generation) to provide output based on input. The quality of the output can be determined based on its accuracy and / or relevance to the input.

Summary of the Invention

[0003] According to a first aspect of this disclosure, a computer-implemented method for interacting with a retrieval-augmented generation system is provided. The computer-implemented method includes: receiving a text prompt; retrieving information associated with the text prompt, at least in part based on the text prompt; generating input data, at least in part based on the text prompt and the retrieved information; and generating an output, at least in part by applying the input data to a machine learning model of the retrieval-augmented generation system. The machine learning model is used to determine the output, at least in part based on the input data, using prompt engineering. The computer-implemented method further includes: generating a quality score for the output with reference to the input data; and outputting an indication of the output and the quality score.

[0004] For example, a text prompt might correspond to a query (e.g., a question), while the output might correspond to a response to that query (e.g., an answer). For instance, this computer-implemented method could support a real-time chatbot application.

[0005] In one embodiment of the first aspect, the indication of the quality score includes: the quality score; a rating derived from the quality score; an indication having a color corresponding to the quality score; and / or an indication having a color corresponding to the rating derived from the quality score.

[0006] In one embodiment of the first aspect, retrieving information associated with the text prompt includes retrieving an electronic file (e.g., an electronic document) that includes text associated with the text prompt.

[0007] In one embodiment of the first aspect, the output includes text output.

[0008] In one embodiment of the first aspect, the quality score is related to: the lexical similarity between the text output and related text of the retrieved information; and / or the semantic similarity between the text output and related text of the retrieved information. Optionally, the quality score may be associated with one or more other metrics.

[0009] In one embodiment of the first aspect, the computer-implemented method further includes: generating a lexical score associated with the lexical similarity. The quality score may be generated at least in part based on the lexical score.

[0010] In one embodiment of the first aspect, the lexical score is generated at least in part based on calculating a Jaccard similarity or an F1 score.

[0011] In one embodiment of the first aspect, generating the lexical score includes: filtering (removing) stop words from the text output and the related text of the retrieved information; performing lemmatization on the filtered output and the filtered related text of the retrieved information; and generating the lexical score at least in part by calculating an F1 score associated with the lemmatized and filtered output and the lemmatized and filtered related text of the retrieved information.

[0012] In one embodiment of the first aspect, the F1 score is calculated based on the following formula:

[0013] $$F_1 = \frac{2 \cdot \frac{W}{R} \cdot \frac{W}{S}}{\frac{W}{R} + \frac{W}{S}}$$

[0014] where W corresponds to the number of overlapping words between the lemmatized and filtered output and the lemmatized and filtered related text of the retrieved information, R corresponds to the number of words in the lemmatized and filtered output, and S corresponds to the number of words in the lemmatized and filtered related text of the retrieved information.

[0015] In one embodiment of the first aspect, the computer-implemented method further includes: outputting an indication of the lexical score.

[0016] In one embodiment of the first aspect, the indication of the lexical score includes: the lexical score; a rating derived from the lexical score; an indication having a color corresponding to the lexical score; and / or an indication having a color corresponding to the rating derived from the lexical score.

[0017] In one embodiment of the first aspect, the computer-implemented method further includes generating a semantic score associated with the semantic similarity. The quality score may be generated at least in part based on the semantic score.

[0018] In one embodiment of the first aspect, the computer-implemented method further includes: generating the semantic score at least in part by calculating the cosine similarity between an embedding of the text output and an embedding of the related text of the retrieved information.

[0019] In one embodiment of the first aspect, the computer-implemented method further includes: outputting an indication of the semantic score.

[0020] In one embodiment of the first aspect, the indication of the semantic score includes: the semantic score; a rating derived from the semantic score; an indication having a color corresponding to the semantic score; and / or an indication having a color corresponding to the rating derived from the semantic score.

[0021] In one embodiment of the first aspect, the computer-implemented method further includes: generating a lexical score associated with the lexical similarity, and generating a semantic score associated with the semantic similarity; and the quality score may be generated at least in part based on the lexical score and the semantic score.

[0022] In one embodiment of the first aspect, the quality score is generated at least in part based on the following formula:

[0023] w1 · (lexical score) + w2 · (semantic score)

[0024] where w1 is the weight of the lexical score and w2 is the weight of the semantic score.

[0025] In one embodiment of this disclosure, the quality score = w1 · (lexical score) + w2 · (semantic score). Optionally, w1 + w2 = 1. Optionally, w2 is greater than or equal to w1.

[0026] In one embodiment of this disclosure, the computer-implemented method further includes providing a user interface associated with the retrieval-augmented generation system. The user interface may include a graphical user interface (GUI).

[0027] In one embodiment of this disclosure, the text prompt is received via the user interface.

[0028] In one embodiment of this disclosure, the computer-implemented method further includes displaying the output and the indication of the quality score in the user interface.

[0029] In one embodiment of this disclosure, the computer-implemented method further includes: outputting an indication of the lexical score and an indication of the semantic score, and displaying the output, the indication of the quality score, the indication of the lexical score, and the indication of the semantic score in the user interface.

[0030] In one embodiment of this disclosure, the machine learning model includes a language model, such as a Large Language Model (LLM).

[0031] In one embodiment of this disclosure, the machine learning model includes a generative language model.

[0032] In one embodiment of this disclosure, the machine learning model includes a unimodal machine learning model.

[0033] According to a second aspect of this disclosure, a system is provided. The system includes one or more processors; and a memory for storing a computer program executable by the one or more processors. The computer program includes instructions for performing or facilitating the performance of the computer-implemented method according to the first aspect. The system may further include a display for displaying a user interface and related information and data.

[0034] According to a third aspect of this disclosure, a carrier medium carrying computer-readable instructions is provided for causing one or more processors to perform or facilitating the performance of a computer-implemented method according to the first aspect. In one example, the carrier medium comprises a computer-readable medium. In another example, the carrier medium is a non-transitory computer-readable medium storing a computer program executable by one or more processors. The computer program includes instructions for performing or facilitating the performance of the computer-implemented method according to the first aspect.

[0035] According to a fourth aspect of this disclosure, a computer program including instructions is provided, which, when executed by a computer, cause the computer to perform a computer-implemented method according to the first aspect.

[0036] Other features and aspects will become apparent from the following detailed description and accompanying drawings. Any feature described herein, as appropriate and applicable, may be combined with other features relating to any other aspect or embodiment herein.

Brief Description of the Drawings

[0037] Embodiments of this disclosure will be described below by way of example with reference to the accompanying drawings.

[0038] Figure 1 is a flowchart illustrating a method for interacting with a retrieval-augmented generation system according to one embodiment;

[0039] Figure 2 is a flowchart illustrating another method for interacting with a retrieval-augmented generation system according to one embodiment;

[0040] Figure 3 is a screenshot showing the user interface of a retrieval-augmented generation system according to one embodiment;

[0041] Figure 4 is a schematic diagram illustrating exemplary indications associated with a quality score of a response provided by a retrieval-augmented generation system according to one embodiment;

[0042] Figure 5 is a schematic diagram illustrating an interaction (hover) with a score component in the user interface of a retrieval-augmented generation system according to one embodiment;

[0043] Figure 6 is a schematic diagram illustrating an interaction (click) with a score component in the user interface of a retrieval-augmented generation system according to one embodiment;

[0044] Figure 7 is a schematic diagram illustrating an interaction (cursor leave) with a score component in the user interface of a retrieval-augmented generation system according to one embodiment; and

[0045] Figure 8 is a block diagram of a data processing system according to one embodiment.

Detailed Description

[0046] Reliable responses are crucial for the practical deployment of Retrieval-Augmented Generation (RAG) systems. However, ensuring that the machine learning models (e.g., large language models, LLMs) of RAG systems consistently generate factual and reliable responses is a significant challenge, especially in knowledge-intensive question-answering domains such as education and healthcare. In these cases, general machine learning models (e.g., general LLMs) may not be trained with domain-specific knowledge, increasing the risk of generating hallucinated responses. This problem also exists in RAG systems, where the retrieval context used to generate responses may be incomplete or missing from the provided knowledge base. Therefore, neither open-domain LLMs nor RAG systems can guarantee consistently accurate and reliable responses in these critical applications.

[0047] The embodiments disclosed herein provide a scoring method that can be applied to improve the reliability of knowledge-intensive RAG applications. In some embodiments, the system and method can combine lexical and semantic metrics to evaluate the quality of the response by performing sentence-level comparisons between the source document and the response output. In some embodiments, the system and method are designed to actively involve users in the evaluation process. In some embodiments, the system and method are intuitive (user-friendly), which can enhance the user experience and / or improve response reliability.

[0048] Figure 1 shows a method 100 for interacting with a retrieval-augmented generation system according to one embodiment. Method 100 is a computer-implemented method and can be executed using one or more processors.

[0049] Method 100 includes, at 102, receiving a text prompt. In one embodiment, the text prompt can be received via a user interface associated with a retrieval-augmented generation system or its machine learning model. For example, the text prompt may correspond to a query (e.g., presented in the form of a question or sentence). In one embodiment, the text prompt may include a text string.

[0050] Method 100 includes, in 104, retrieving information associated with a text prompt, at least in part, based on the text prompt. For example, information associated with a text prompt may be retrieved in response to receiving the text prompt. Information may be retrieved from a database. In one embodiment, 104 includes retrieving an electronic file (e.g., an electronic document) containing text associated with the text prompt. The electronic file or its text may represent knowledge used to generate a response to a query based on the text prompt.

[0051] Method 100 includes, at 106, generating input data based at least in part on the text prompt and the retrieved information. The input data can be applied to a machine learning model for processing.
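One way step 106 could look in practice is sketched below. The description does not give a prompt template, so the `RetrievedPassage` type, the `build_input_data` function, and the instruction wording are all hypothetical names and choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedPassage:
    """A piece of retrieved text and the file it came from (hypothetical type)."""
    source: str
    text: str


def build_input_data(text_prompt: str, passages: list[RetrievedPassage]) -> str:
    """Combine the text prompt and the retrieved passages into one model input."""
    if not text_prompt.strip():
        raise ValueError("text_prompt must not be empty")
    # Number each passage so a later scoring step can refer back to it.
    context = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    # The instruction wording is illustrative; the source specifies no template.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {text_prompt.strip()}\n"
        "Answer:"
    )
```

Keeping the retrieved text in the input data is what later allows the quality score to be computed "with reference to the input data".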

[0052] Method 100 includes, at 108, generating an output at least in part by applying the input data to a machine learning model of the retrieval-augmented generation system. The machine learning model is used to determine the output, at least in part based on the input data, using prompt engineering. The output may include text output. For example, the output may correspond to a response to a query (e.g., in the form of an answer). For example, the machine learning model may include a language model such as an LLM. For example, the machine learning model may include a generative language model. For example, the machine learning model may include a unimodal machine learning model.

[0053] Method 100 includes, in 110, generating a quality score for the output, referencing the input data. The quality score may reflect the accuracy and / or relevance of the output relative to the input data. The quality score may be related at least to: lexical similarity between the text output and relevant text of the retrieved information, semantic similarity between the text output and relevant text of the retrieved information, or both. The quality score may be generated at least in part based on the lexical score associated with lexical similarity and / or the semantic score associated with semantic similarity.

[0054] Method 100 includes, in 112, outputting an indication of the quality score. The indication of the quality score may include the quality score, a rating derived from the quality score, an indication with a color corresponding to the quality score, and / or an indication with a color corresponding to the rating derived from the quality score.

[0055] In one embodiment, method 100 may further include providing a user interface, such as a graphical user interface, associated with the machine learning model. At 102, the text prompt may be received through the user interface. In one embodiment, method 100 may further include displaying the output and the indication of the quality score in the user interface.

[0056] Those skilled in the art will understand that method 100 is merely an example embodiment and that method 100 may be modified (e.g., to include additional steps / operations) to provide other embodiments.
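The flow of method 100 can be summarized as a short skeleton. This is a minimal sketch, not the disclosed implementation: `run_method_100`, `ScoredResponse`, and the three injected callables (`retrieve`, `generate`, `score`) are hypothetical names, and the retriever, model, and scorer are left as parameters because the description does not fix them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScoredResponse:
    output: str           # text output of the model (step 108)
    quality_score: float  # quality score for the output (step 110)


def run_method_100(
    text_prompt: str,                     # step 102: received text prompt
    retrieve: Callable[[str], str],       # returns text related to the prompt
    generate: Callable[[str], str],       # stands in for the machine learning model
    score: Callable[[str, str], float],   # compares output with retrieved text
) -> ScoredResponse:
    retrieved = retrieve(text_prompt)                                  # step 104
    input_data = f"Context:\n{retrieved}\n\nQuestion: {text_prompt}"  # step 106
    output = generate(input_data)                                      # step 108
    quality = score(output, retrieved)                                 # step 110
    return ScoredResponse(output, quality)                             # step 112
```

Passing the three stages in as callables keeps the skeleton independent of any particular retriever or LLM API, which matches the black-box setting the description targets.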

[0057] Figure 2 shows a method 200 for interacting with a retrieval-augmented generation system according to one embodiment. Method 200 is a computer-implemented method and can be executed using one or more processors.

[0058] Method 200 includes receiving a text prompt in step 202. Step 202 in method 200 may be similar to or the same as step 102 in method 100. For the sake of brevity, the details will not be repeated here.

[0059] Method 200 includes, in step 204, retrieving information associated with the text prompt, at least in part, based on the text prompt. Step 204 in method 200 may be similar to or the same as step 104 in method 100. For the sake of brevity, the details will not be repeated here.

[0060] Method 200 includes, in step 206, generating input data based at least in part on text prompts and retrieved information. Step 206 in method 200 may be similar to or the same as step 106 in method 100. For the sake of brevity, the details will not be repeated here.

[0061] Method 200 includes, in step 208, generating an output at least in part by applying the input data to a machine learning model of the retrieval-augmented generation system. Step 208 in method 200 may be similar to or the same as step 108 in method 100. For the sake of brevity, the details will not be repeated here.

[0062] Method 200 includes, in step 210A, generating a lexical score relating to the lexical similarity between the text output and the relevant text of the retrieved information. In one embodiment, the lexical score is generated at least in part based on calculating a Jaccard similarity score or an F1 score. In one embodiment, the lexical score is generated at least in part by: filtering (removing) stop words from the text output and the relevant text of the retrieved information; performing lemmatization on the filtered output and the filtered relevant text of the retrieved information; and generating the lexical score at least in part by calculating an F1 score relating to the lemmatized and filtered output and the lemmatized and filtered relevant text of the retrieved information. In one example, the F1 score may be calculated based on the following formula:

[0063] $$F_1 = \frac{2 \cdot \frac{W}{R} \cdot \frac{W}{S}}{\frac{W}{R} + \frac{W}{S}}$$

[0064] where W corresponds to the number of overlapping words between the lemmatized and filtered output and the lemmatized and filtered related text of the retrieved information, R corresponds to the number of words in the lemmatized and filtered output, and S corresponds to the number of words in the lemmatized and filtered related text of the retrieved information.
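Assuming the standard definitions precision = W/R and recall = W/S (the usual ROUGE-1 convention, taken here as an assumption), the F1 score reduces to a single ratio in W, R and S. The values W = 3, R = 4 and S = 6 in the worked line are made up for illustration.

```latex
\begin{align*}
F_1 &= \frac{2 \cdot \frac{W}{R} \cdot \frac{W}{S}}{\frac{W}{R} + \frac{W}{S}}
     = \frac{2W^2 / (RS)}{W(R + S) / (RS)}
     = \frac{2W}{R + S} \qquad (W > 0) \\
\text{e.g. } W = 3,\ R = 4,\ S = 6:\quad
F_1 &= \frac{2 \cdot 3}{4 + 6} = 0.6
\end{align*}
```

The closed form is symmetric in R and S, so the F1 value does not depend on which of the two sentences is treated as the reference.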

[0065] Method 200 includes, in step 210B, generating a semantic score relating to the semantic similarity between the text output and the relevant text of the retrieved information. In one embodiment, the semantic score is generated at least in part by calculating the cosine similarity between an embedding of the text output and an embedding of the relevant text of the retrieved information.

[0066] Method 200 includes, in step 212, generating a quality score for the output, with reference to the input data, based at least in part on the lexical score and the semantic score. In one embodiment, the quality score is generated at least in part based on the following formula:

[0067] w1 · (lexical score) + w2 · (semantic score)

[0068] Where w1 is the weight of the lexical score and w2 is the weight of the semantic score. For example, quality score = w1 (lexical score) + w2 (semantic score). For example, w1 + w2 = 1. For example, w2 is greater than or equal to w1.
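The weighted sum of step 212 can be sketched as follows. The function name `quality_score` is hypothetical; the default weights 0.2 and 0.8 are the values the description gives as one example, and the two checks on the weights enforce the constraints that the description presents only as optional.

```python
import math


def quality_score(lexical: float, semantic: float,
                  w1: float = 0.2, w2: float = 0.8) -> float:
    """Return w1 * lexical + w2 * semantic for scores in [0, 1]."""
    for name, value in (("lexical", lexical), ("semantic", semantic)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    # Constraints described as optional: w1 + w2 = 1 and w2 >= w1.
    if not math.isclose(w1 + w2, 1.0):
        raise ValueError("weights must sum to 1")
    if w2 < w1:
        raise ValueError("w2 must be greater than or equal to w1")
    return w1 * lexical + w2 * semantic
```

With weights that sum to 1 and both inputs in [0, 1], the result also stays in [0, 1], which makes a fixed threshold on the overall score meaningful.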

[0069] Method 200 includes, in step 214, outputting an indicator of quality score, an indicator of lexical score, and an indicator of semantic score. The indicator of lexical score may include the lexical score, a rating derived from the lexical score, an indicator with a color corresponding to the lexical score, and / or an indicator with a color corresponding to the rating derived from the lexical score. The indicator of semantic score may include the semantic score, a rating derived from the semantic score, an indicator with a color corresponding to the semantic score, and / or an indicator with a color corresponding to the rating derived from the semantic score.

[0070] In one embodiment, method 200 may further include providing a user interface, such as a graphical user interface, associated with the machine learning model. In 202, text prompts may be received through the user interface. In one embodiment, method 200 further includes displaying the output, the indication of the quality score, the indication of the lexical score, and the indication of the semantic score in the user interface.

[0071] Those skilled in the art will understand that method 200 is merely an exemplary embodiment, and method 200 can be modified (e.g., including additional steps / operations, omitting one or more steps / operations) to provide other embodiments. For example, a quality score can be generated based on lexical scores or semantic scores (not both), and one of the corresponding operations 210A and 210B can be omitted. For example, the specific way the score is calculated can be modified (e.g., using different formulas or equations). For example, specific values used in the calculation can be modified (e.g., using different values). For example, 210A and 210B can be performed simultaneously or sequentially.

[0072] In one embodiment, a system and method are provided for enhancing the reliability of a retrieval-augmented generation system, which incorporates a confidence scoring model that combines lexical and semantic metrics to evaluate response quality. This embodiment can be considered an exemplary implementation of methods 100 and 200.

[0073] The rapid development of large language models (LLMs) has improved the capabilities of these models. However, the application of LLMs in education and professional fields has raised questions about the reliability and validity of their generated responses. It is well known that generative large language models are prone to hallucination and may generate false information.

[0074] Retrieval-augmented generation (RAG) can improve the performance of large language models and reduce hallucinations by combining external facts with parametric knowledge. However, RAG systems can still generate incorrect or contradictory results (e.g., generated content). While some RAG systems can generate natural language text as output, it is difficult to evaluate the factual correctness and credibility of the output. An evaluation metric is needed to indicate whether the generated content is likely to be incorrect or irrelevant.

[0075] In this regard, not every metric is applicable depending on the use case, as various limitations may exist. If a commercial black-box large language model is used for generation, the end user may only receive the text output. This means that the logits and attention weights from the backend LLM cannot be used to evaluate the generation quality (or output quality). Furthermore, existing confidence evaluation methods for black-box large language models, such as linguistic confidence and self-consistency, may not be suitable for real-time chatbot applications. In particular, multiple generation, evaluation, and potential regeneration may be required to complete a single confidence evaluation. This can be time-consuming and inefficient, and therefore undesirable in environments where users expect fast (e.g., real-time or instant) responses.

[0076] Based on the above, this embodiment provides a relatively simple method for accurately and promptly evaluating responses (output quality). This embodiment can be applied to real-time chat applications and utilizes text output generated from LLM responses. Furthermore, this embodiment provides an intuitive user interface for interacting with RAG or its LLM. This embodiment can enhance the user experience and / or provide reliable responses.

[0077] The reliability of the generated response (as provided by RAG or its LLM) can be assessed using uncertainty and confidence metrics. This is crucial for determining the likelihood of erroneous content. Several existing methods for assessing response validity are discussed below.

[0078] An existing method utilizes Contrastive Semantic Similarity (CSS) to extract meaningful semantic relationships between text pairs, where the Metric for Evaluation of Translation with Explicit Ordering (METEOR) is used as an evaluation metric for the LLM-generated response to better capture semantic similarity. This method can be used to estimate uncertainty.

[0079] Some existing methods involve calculating prediction uncertainty and generating multiple samples for similarity comparison to determine confidence levels. For example, existing methods suggest using a Natural Language Inference (NLI) classifier to predict the logits of entailment and contradiction between generated samples, using spectral clustering to group similar samples, computing the graph Laplacian for the clustering, and deriving confidence levels from an eccentricity estimate based on the graph Laplacian.

[0080] Some existing methods are based on confidence elicitation that does not use logits. These methods include verbalized confidence, consistency-based confidence, and hybrid approaches. Verbalized confidence generated by the model often suffers from overconfidence, even when the model's wording explicitly expresses uncertainty. Consistency-based confidence may outperform verbalized methods, but it may require more time and resources than other techniques. Verbalized and consistency-based confidence methods can complement each other when combined, further improving calibration and performance. However, these prompt-based confidence elicitation methods do not indicate the factual sources behind the model's answer, and are therefore unsuitable for retrieval-based generation in some cases.

[0081] In the context of addressing the hallucination problem in open-domain LLMs, five classes of existing methods for estimating factual confidence have been identified and evaluated in one example: trained probes, sequence probabilities, verbalization, alternative label probabilities, and output consistency. In this example, the trained probe method is considered the most reliable estimator of LLM factual confidence, demonstrating broad applicability across diverse models and out-of-domain data. While the trained probe method yields encouraging results as a confidence estimator, it requires access to multiple layers of the model's internal state, making it incompatible with many existing industry-leading black-box LLMs, which only provide the final text output via an API.

[0082] One existing method for detecting hallucinations in RAG utilizes mechanistic interpretability, decoupling the parametric knowledge of the LLM from its use of external context, thus allowing the computation of both external context scores and parametric knowledge scores. However, this method also requires access to the internal state of the LLM in order to access the model's attention heads.

[0083] Unlike these existing methods, the methods and systems in this embodiment do not rely on the logits or internal state of the LLM. Instead, the methods and systems in this embodiment introduce a confidence scoring model that combines lexical and semantic metrics to evaluate response quality, enabling users to interact with the RAG system more reliably and / or effectively.

[0084] In this embodiment, a method is provided to calculate a confidence score by weighted combination of scores from lexical and semantic metrics. This method determines, at least based on lexical and semantic metrics, whether two sentences (the response sentence and the source sentence obtained from the text output and the text from the retrieved information, respectively) are related.

[0085] In this embodiment, a lexical metric is used to measure the lexical overlap between the two sentences. The greater the overlap between the response and source sentences, the more likely they are to be related. Methods that can be used to calculate a score from this overlap include Jaccard similarity and the ROUGE-1 F1 score.
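For comparison with the F1 score, Jaccard similarity can be sketched as the size of the intersection of the two word sets divided by the size of their union. The function name `jaccard_similarity` is hypothetical, the inputs are assumed to be already tokenized, and returning 0.0 for two empty inputs is a convention chosen here, not something the description states.

```python
def jaccard_similarity(response_tokens: list[str], source_tokens: list[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| for the sets of words in the two sentences."""
    a, b = set(response_tokens), set(source_tokens)
    if not a and not b:
        return 0.0  # assumed convention: two empty inputs are treated as unrelated
    return len(a & b) / len(a | b)
```

Because it works on sets, Jaccard similarity ignores how often a word repeats, whereas a count-based F1 score does not.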

[0086] In some cases, stop words can inflate the score by increasing the overlap count, while inflected forms of the same word can lower the score because they are not counted as matches. To address this, in this embodiment, a stop word list is used to filter out (remove or discard) common stop words, the filtered text is then lemmatized using the spaCy package, and finally the lexical overlap is calculated.

[0087] In this embodiment, the F1 score can be calculated based on the following formula:

[0088] $$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

[0089] where the number of overlapping words is represented by W, the number of words in the response sentence is represented by R, the number of words in the source sentence is represented by S, precision is obtained as W / R, and recall is obtained as W / S.
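The lexical score described above (stop-word filtering, lemmatization, then F1 over the overlap) can be sketched as below. To keep the example self-contained, a tiny hand-written stop-word set and lemma table stand in for a full stop-word list and the spaCy lemmatizer; `STOP_WORDS`, `LEMMAS`, `preprocess`, and `lexical_f1` are hypothetical names, and counting overlap with clipped word counts is an assumption in the style of ROUGE-1.

```python
import re
from collections import Counter

# Tiny illustrative stand-ins; a real system would use a full stop-word list
# and a proper lemmatizer (the description mentions the spaCy package).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "in", "of", "and", "to"}
LEMMAS = {"universities": "university", "located": "locate", "founded": "found"}


def preprocess(sentence: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then map each word to its lemma."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def lexical_f1(response: str, source: str) -> float:
    """F1 over the word overlap of the preprocessed response and source."""
    resp, src = Counter(preprocess(response)), Counter(preprocess(source))
    w = sum((resp & src).values())  # W: overlapping words (clipped counts)
    r = sum(resp.values())          # R: words in the response sentence
    s = sum(src.values())           # S: words in the source sentence
    if w == 0:
        return 0.0                  # also covers an empty response or source
    precision, recall = w / r, w / s
    return 2 * precision * recall / (precision + recall)
```

For instance, comparing "CUHK is a public university." with the longer source sentence about the Chinese University of Hong Kong gives W = 2, R = 3 and S = 10 under this toy preprocessing, so a short but well-supported response still receives a modest lexical score.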

[0090] Turning to the semantic metric: since different words can express the same meaning, the semantic similarity between the response and source sentences must be compared in addition to their word overlap. This comparison can be performed by measuring the cosine similarity between the embeddings of the two sentences. The cosine similarity can be normalized to a value between 0 and 1 before it is multiplied by its weight.
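A minimal sketch of the semantic score follows. It takes the two sentence embeddings as plain sequences of floats, since the description does not name an embedding model. The linear mapping from [-1, 1] onto [0, 1] is one possible normalization and is an assumption; `cosine_similarity` and `semantic_score` are hypothetical names.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector has no direction
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def semantic_score(u: Sequence[float], v: Sequence[float]) -> float:
    """Normalize cosine similarity from [-1, 1] onto [0, 1] (assumed mapping)."""
    return (cosine_similarity(u, v) + 1.0) / 2.0
```

Other normalizations, such as clipping negative values to 0, would also yield a value between 0 and 1; the choice affects how orthogonal embeddings are scored.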

[0091] In this embodiment, both lexical and semantic metrics are useful for determining the relationship between the response and source sentences. Therefore, both lexical and semantic scores are considered when determining whether two sentences are sufficiently relevant. In this embodiment, a simple weighted sum (i.e., weighted lexical score plus weighted semantic score) is applied to obtain the total score (also known as the confidence score or quality score). In one example, the weights sum to 1. In some cases, because it requires exact word overlap, the lexical metric may output a lower score, which tends to drag down the overall score. Accordingly, in one example, the semantic score is weighted at 0.8 and the lexical score is weighted at 0.2.

[0092] In some cases, to ensure that clearly irrelevant sentences are excluded without removing too many sentences, a threshold on the overall score can be set. In this example, the threshold is set to 0.5. Sentences with an overall score below the threshold are considered not directly supported by the source document. These sentences may be hallucinated output generated by the LLM of the RAG system, and users should pay special attention to them.
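Applying the threshold can be sketched as a simple filter over scored response sentences. The sentence labels and scores below are placeholders chosen for illustration, not results from the embodiment.

```python
from typing import List, Tuple

THRESHOLD = 0.5  # example threshold on the overall score


def flag_unsupported(
    scored_sentences: List[Tuple[str, float]],
    threshold: float = THRESHOLD,
) -> List[str]:
    """Return the sentences whose overall score falls below the threshold."""
    return [sentence for sentence, score in scored_sentences if score < threshold]


# Placeholder sentences with hypothetical overall scores.
scored = [
    ("Sentence A", 0.91),
    ("Sentence B", 0.32),
    ("Sentence C", 0.50),
]
print(flag_unsupported(scored))
```

In this sketch a score exactly equal to the threshold is not flagged, since only sentences below the threshold are treated as lacking direct support.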

[0093] In this embodiment, the overall score is converted into a score range of 1 to 5 using the following formula:

[0094]

[0095] Where α = 1.2 and the result is rounded to the nearest integer. Because of the power function, lower scores are penalized more heavily in the rating.

[0096] To turn the scores and metrics into a more complete tool that helps users evaluate the effectiveness of generated content (the output provided by the RAG system or its LLM), this embodiment provides a user interface design that can be readily implemented on the system interface.

[0097] In this embodiment, to further verify the functionality of the response evaluation method, the user interface was implemented in a system called CPIIChatDoc Master, an AI-based tool that helps users analyze documents and websites to provide answers to questions. Figure 3 shows a screenshot of the CPIIChatDoc Master user interface.

[0098] The screenshot in Figure 3 shows the user interface used in an English-language context. CPIIChatDoc Master first displays the greeting, "Hello, welcome! How can I help you?" The user then asks, "What is CUHK?" By extracting from "Chinese_University_of_Hong_kong.pdf" the information that "The Chinese University of Hong Kong is a public research university located in Sha Tin, New Territories, Hong Kong," CPIIChatDoc Master replies, "The Chinese University of Hong Kong is a public research university located in Sha Tin, New Territories, Hong Kong." The user then continues with the following questions: "When was the Chinese University of Hong Kong founded?", "How many faculties does the Chinese University of Hong Kong (CUHK) have?", and "Who are the award recipients related to the faculty and staff of the Chinese University of Hong Kong (CUHK)?"

[0099] In this embodiment, based on the rating range of the three scores (lexical score, semantic score, and overall score), which is 1 to 5 in this example, five colors were selected to represent five different confidence levels of the response. Figure 4 shows examples of these indicators (including ratings and color codes).

[0100] The user interface is designed to facilitate interaction with the user. In this embodiment, the user can interact with the score panel.

[0101] The interactions in this embodiment include the following user actions: hovering to expand / view, clicking to view, and leaving the cursor to collapse.

[0102] For example, a user can hover the cursor over the "Confidence Score" indicator A in the score panel to expand it and view the "Semantic Score" indicator B and the "Lexical Score" indicator C, which also displays the relationship between the confidence score and these two component scores. When the user hovers the cursor over each score, a corresponding tooltip appears indicating the name of that score, as shown in Figure 5.

[0103] For example, users can click on the score component to view corresponding tooltips that explain the definitions of terms, as shown in Figure 6.

[0104] For example, users can move the cursor away from the score or the score panel to collapse the score component, as shown in Figure 7.
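The three interactions (hover to expand, click to view a definition, leave to collapse) can be modeled as a small state object. This is a framework-independent sketch: the class name, field names, and tooltip wording are hypothetical and do not describe the actual implementation of the interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScorePanel:
    """Hypothetical model of the score panel's interaction state."""

    expanded: bool = False         # whether the semantic and lexical indicators are visible
    tooltip: Optional[str] = None  # text of the tooltip currently shown, if any

    def hover(self, indicator: str) -> None:
        """Hovering expands the panel and shows the indicator's name."""
        self.expanded = True
        self.tooltip = indicator

    def click(self, indicator: str) -> None:
        """Clicking shows a tooltip explaining the indicator's definition."""
        self.expanded = True
        self.tooltip = f"Definition of {indicator}"

    def leave(self) -> None:
        """Moving the cursor away collapses the panel and hides the tooltip."""
        self.expanded = False
        self.tooltip = None


panel = ScorePanel()
panel.hover("Confidence Score")
print(panel)
panel.click("Lexical Score")
print(panel)
panel.leave()
print(panel)
```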

[0105] In this embodiment, the user interface design emphasizes an intuitive user experience and adherence to established heuristic principles, with a particular focus on flexibility and efficiency. The user interface, especially the score panel, employs responsive button styles that react to hover actions, providing users with immediate feedback and enhancing interactivity. Furthermore, tooltips and information prompts are integrated to clarify complex concepts related to the functionality, so that users can easily find key information without confusion. In keeping with the heuristic principles of flexibility and efficiency, the interface displays three layers of information (the three scores) within a single, compact display element. This allows information to be presented in a concise format that integrates seamlessly into quotation boxes or panels, providing valuable context and insights to the user without interrupting the ongoing chatbot conversation.

[0106] Based on the above embodiments, an experiment (case study) was conducted. In this experiment, a chatbot named CUHK Chatbot was developed based on the CUHK Wikipedia page (in PDF format). The chatbot answered questions about the Chinese University of Hong Kong, and its answers reliably aligned with the source document. As shown in Table 1, each response-source pair demonstrated consistency with the motivation and design of the above embodiments. Each pair was evaluated using a lexical score, a semantic score, and a final confidence score, each rated from 1 to 5.

[0107] As shown in Table 1, responses are generally correct when the confidence score is high. The scores can point to the relevant source when a reasonable source exists, but a high lexical score still requires very similar word choices. The scores should therefore serve as an auxiliary tool for end users, helping them locate the relevant source passages and determine whether an answer is correct, reliable, or credible.

[0108] Table 1: Response-source sentence pairs and their corresponding lexical scores, semantic scores, and confidence scores in a sample experiment.

[0109]

[0110]

[0111] The response sentences and source sentences in Table 1 are presented in English to reflect the response sentences and source sentences in real-world scenarios during model application. For ease of understanding, the translations of Table 1 are presented in Table 2 below.

[0112] Table 2: Response-source sentence pairs and their corresponding lexical scores, semantic scores, and confidence scores in an example experiment (translation from Table 1).

[0113]

[0114]

[0115] The above embodiments provide a confidence scoring method aimed at improving the reliability of responses in RAG systems, particularly suitable for knowledge-intensive fields such as education and healthcare. This embodiment provides a response quality assessment metric and an interactive user interface design. In this embodiment, by utilizing lexical and semantic metrics for sentence-level comparison, the assessment metric can effectively evaluate response quality based on the source document. Furthermore, the method and user interface provided in this embodiment are intuitive, easy to implement, and user-friendly, enabling users to reliably evaluate response content. The method and system in this embodiment improve user experience by providing clear indications of response reliability.

[0116] Figure 8 shows a data processing system 800 according to one embodiment. Only the main components of the data processing system 800 are shown in Figure 8. The data processing system 800 can be used to perform data processing operations, such as any of the operations disclosed herein. For example, the data processing system 800 can be used to partially or completely perform any of the disclosed methods (e.g., method 100, method 200, and the methods discussed with reference to Figures 3 to 7). The data processing system 800 is used to support interaction with the retrieval enhancement generation system, and can be used to operate the retrieval enhancement generation system.

[0117] The data processing system 800 includes the components required to receive, store, and execute appropriate computer instructions, commands, and/or code. The data processing system 800 includes a processor 802 and a memory 804. The processor 802 may include one or more of the following components: central processing unit (CPU), microcontroller unit (MCU), graphics processing unit (GPU), neural processing unit (NPU), video processing unit (VPU), tensor processing unit (TPU), logic circuitry, Raspberry Pi chip, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), and digital and/or analog circuitry configured to interpret program instructions, execute program instructions (e.g., program instructions relating to any methods or operations disclosed herein), and/or process signals, information, and/or data. The memory 804 may include one or more volatile memories (such as RAM, DRAM, SRAM, etc.), one or more non-volatile memories (such as ROM, PROM, EPROM, EEPROM, FRAM, MRAM, FLASH, SSD, NAND, NVDIMM, etc.), or any combination thereof. Appropriate computer instructions, commands, code, information, and/or data are stored in the memory 804. For example, computer instructions for performing or facilitating the performance of the method steps or operations disclosed herein may be stored in the memory 804. The processor 802 and the memory 804 may be integrated together or disposed separately (and operatively connected).

[0118] Optionally, the data processing system 800 also includes one or more input devices 806. Examples of input devices 806 include: keyboards, mice, styluses, image scanners, microphones, haptic / touch input devices (e.g., touchscreens), image / video input devices (e.g., cameras), etc. Input devices 806 can be used to receive input from users (e.g., location information input). Input devices 806 can provide a user interface, such as a graphical user interface, for interacting with users (e.g., receiving text prompts).

[0119] Optionally, the data processing system 800 also includes one or more output devices 808. Examples of output devices 808 include displays (e.g., monitors, screens, projectors, etc.), speakers, headphones, earphones, printers, etc. The display may include an LCD display, an LED / OLED display, or other suitable display, and may or may not have touch-sensitive functionality. The output device 808 can provide a user interface, such as a graphical user interface, for interacting with a user (e.g., displaying information or data to the user, such as outputs or scores).

[0120] The data processing system 800 may also include one or more disk drives 812, which may include one or more of the following: solid-state drives, hard disk drives, optical disk drives, flash drives, tape drives, etc. A suitable operating system may be installed in the data processing system 800, for example, in the disk drive 812 or in the memory 804. The memory 804 and the disk drive 812 may be operated by the processor 802.

[0121] The data processing system 800 may also include a communication device 810 for establishing a communication connection with one or more computing devices, such as a server, database, personal computer, terminal, tablet computer, mobile phone, watch, internet-connected device (e.g., Internet of Things device), or other computing device. The communication device 810 may include one or more of the following: modem, network interface card (NIC), integrated network interface, NFC transceiver, ZigBee transceiver, Wi-Fi transceiver, radio frequency transceiver, cellular (2G, 3G, 4G, 5G, 6G or similar) transceiver, optical port, infrared port, USB connection, or other wired or wireless communication interface. A transceiver can be implemented through one or more devices (an integrated transmitter and receiver, a separate transmitter and receiver, etc.). The communication link can be wired or wireless, and is used for transmitting commands, instructions, information, and/or data.

[0122] Processor 802 and memory 804 (optionally including input device 806, output device 808, communication device 810, and disk drive 812, if present) can be directly or indirectly connected together in any of the following ways: bus, peripheral component interconnect (PCI) such as PCI Express, universal serial bus (USB), optical bus, or other similar structures. In one embodiment, at least some of these components can be wirelessly connected, for example, via a network such as the Internet, cloud computing network, edge computing network, etc.

[0123] Those skilled in the art will understand that the data processing system 800 is only one example embodiment, and the data processing system 800 can be modified (e.g., adding additional components, deleting one or more components, including alternative components, etc.) to provide other embodiments.

[0124] While not strictly necessary, the embodiments described with reference to the figures can be implemented as an application programming interface (API), or as a series of libraries for developers to use, or can be included in another software application, such as a terminal or computer operating system or a portable computing device operating system. Typically, since program modules include routines, programs, objects, components, and data files that help perform specific functions, those skilled in the art will understand that the functionality of a software application can be distributed among multiple routines, objects, or components to achieve the same functionality required herein. Furthermore, when the methods and systems are implemented entirely or partially by a computing system, any suitable computing system architecture can be employed. This may include stand-alone computers, networked computers, dedicated or non-dedicated hardware devices. When the terms "computing system" and "computing device" are used, these terms are intended to include any suitable computer or information processing hardware configuration capable of implementing the described functionality.

[0125] In one embodiment, a carrier medium is provided carrying computer-readable instructions intended to cause or facilitate the execution of a computer-implemented method according to an embodiment (as disclosed herein). The carrier medium may include a computer-readable medium, such as a non-transitory computer-readable storage medium for storing a computer program executable by one or more processors. The computer program includes instructions for performing or facilitating the execution of the computer-implemented method according to the embodiment.

[0126] In one embodiment, a computer program is provided that includes instructions that, when executed by a computer, cause the computer to perform a computer-implemented method according to one embodiment (as disclosed herein).

[0127] Those skilled in the art will understand that the described and / or illustrated embodiments can be changed and / or modified to provide other embodiments. Therefore, the described and / or illustrated embodiments should be considered exemplary in all respects and not restrictive.

[0128] Unless otherwise stated, terms of degree such as “usually,” “approximately,” “basically,” or similar terms are used herein to take into account one or more of the following factors: manufacturing tolerances, degradation, trends, tendencies, imperfect actual conditions, etc.

Claims

1. A computer-implemented method for interacting with a retrieval enhancement generation system, characterized in that it comprises: receiving a text prompt; retrieving information associated with the text prompt, at least in part based on the text prompt; generating input data, at least in part based on the text prompt and the retrieved information; generating an output, at least in part by applying the input data to a machine learning model of the retrieval enhancement generation system, wherein the machine learning model is used to determine the output using prompt engineering, at least in part based on the input data; generating a quality score for the output with reference to the input data; and outputting the output and an indication of the quality score.

2. The computer-implemented method according to claim 1, characterized in that the indication of the quality score comprises: the quality score; a rating derived from the quality score; an indicator with a color corresponding to the quality score; and/or an indicator with a color corresponding to the rating derived from the quality score.

3. The computer-implemented method according to claim 2, characterized in that retrieving the information associated with the text prompt comprises: retrieving electronic files that include text associated with the text prompt; and the output comprises a text output.

4. The computer-implemented method according to claim 3, characterized in that the quality score is related to: the lexical similarity between the text output and related text of the retrieved information; and/or the semantic similarity between the text output and the related text of the retrieved information.

5. The computer-implemented method according to claim 4, characterized in that the computer-implemented method further comprises: generating a lexical score associated with the lexical similarity; and the quality score is generated at least in part based on the lexical score.

6. The computer-implemented method according to claim 5, characterized in that the lexical score is generated at least in part based on calculating a Jaccard similarity or an F1 score.

7. The computer-implemented method according to claim 6, characterized in that generating the lexical score comprises: filtering stop words from the text output and from the related text of the retrieved information; lemmatizing the filtered text output and the filtered related text of the retrieved information; and generating the lexical score at least in part by calculating an F1 score relating the lemmatized and filtered text output to the lemmatized and filtered related text of the retrieved information.

8. The computer-implemented method according to claim 7, characterized in that the F1 score is calculated based on the following formula: $$F_1 = \frac{2 \cdot (W/R) \cdot (W/S)}{(W/R) + (W/S)}$$ wherein W corresponds to the number of overlapping words between the lemmatized and filtered text output and the lemmatized and filtered related text of the retrieved information, R corresponds to the number of words in the lemmatized and filtered text output, and S corresponds to the number of words in the lemmatized and filtered related text of the retrieved information.

9. The computer-implemented method according to claim 8, characterized in that the computer-implemented method further comprises: outputting an indication of the lexical score; and the indication of the lexical score comprises: the lexical score; a rating derived from the lexical score; an indicator with a color corresponding to the lexical score; and/or an indicator with a color corresponding to the rating derived from the lexical score.

10. The computer-implemented method according to claim 9, characterized in that the computer-implemented method further comprises: generating a semantic score associated with the semantic similarity; and the quality score is generated at least in part based on the semantic score.

11. The computer-implemented method according to claim 10, characterized in that generating the semantic score comprises: generating the semantic score at least in part by calculating the cosine similarity between an embedding of the text output and an embedding of the related text of the retrieved information.

12. The computer-implemented method according to claim 11, characterized in that the computer-implemented method further comprises: outputting an indication of the semantic score; and the indication of the semantic score comprises: the semantic score; a rating derived from the semantic score; an indicator with a color corresponding to the semantic score; and/or an indicator with a color corresponding to the rating derived from the semantic score.

13. The computer-implemented method according to claim 4, characterized in that the computer-implemented method further comprises: generating a lexical score associated with the lexical similarity, and generating a semantic score associated with the semantic similarity; and the quality score is generated at least in part based on the lexical score and the semantic score.

14. The computer-implemented method according to claim 13, characterized in that the quality score is generated at least in part based on the following formula: w1 × (lexical score) + w2 × (semantic score), where w1 is the weight of the lexical score and w2 is the weight of the semantic score.

15. The computer-implemented method according to claim 14, characterized in that quality score = w1 × (lexical score) + w2 × (semantic score); and w1 + w2 = 1.

16. The computer-implemented method according to claim 15, characterized in that the computer-implemented method further comprises: providing a user interface associated with the retrieval enhancement generation system, wherein the text prompt is received through the user interface; and outputting an indication of the lexical score and an indication of the semantic score, wherein the user interface displays the output, the indication of the quality score, the indication of the lexical score, and the indication of the semantic score.

17. The computer-implemented method according to claim 1, characterized in that the computer-implemented method further comprises: providing a user interface associated with the retrieval enhancement generation system, wherein the text prompt is received through the user interface; and displaying the output and the indication of the quality score in the user interface.

18. The computer-implemented method according to claim 1, characterized in that the machine learning model includes a language model.

19. The computer-implemented method according to claim 1, characterized in that the machine learning model includes a generative language model.

20. A system, characterized in that it comprises: one or more processors; and a memory for storing a computer program executable by the one or more processors, wherein the computer program includes instructions for performing or facilitating the performance of the computer-implemented method according to claim 1.