Question answering method and device based on large language model, equipment and medium

By introducing a multi-round optimization method with Chosen reinforcement constraints and Rejected weakening constraints into the large language model, the problems of output quality fluctuation and insufficient accuracy of the large language model in question answering scenarios are solved, and higher accuracy and stability are achieved, especially in scenarios with high accuracy requirements such as mathematical derivation and code generation.

CN121936631B · Active · Publication Date: 2026-06-26 · NEW H3C TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NEW H3C TECH CO LTD
Filing Date: 2026-03-26
Publication Date: 2026-06-26

Application Information

IPC: G06N20/00

CPC: G06N20/00

AI Tagging

Technology Topics

Code generation; Linguistic model

Technical Efficacy Phrases

reliable answer; reliable answer information

AI Technical Summary

Technical Problem

Existing large language models suffer from inconsistent output quality and insufficient accuracy in question-answering scenarios, especially in scenarios with high requirements for accuracy and stability, such as mathematical problems and code generation, where logical breaks and security vulnerabilities exist.

Method used

By using the preferred and unpreferred answer information output by the first and second language models respectively, the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters are determined. The large language model is optimized to maintain the generation confidence of high-quality answers and suppress the generation tendency of low-quality answers. Multiple rounds of optimization iteration are adopted until the termination condition is met.

Benefits of technology

It improves the accuracy and stability of responses from large language models in scenarios such as mathematical derivation and code generation, significantly reduces the generation of logical breaks and security vulnerabilities, alleviates fluctuations in output quality, and improves the reliability of responses.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a question-answering method, apparatus, device and medium based on a large language model. The application introduces Chosen reinforcement constraints and Rejected weakening constraints to optimize the large language model. The Chosen reinforcement constraint parameters maintain the large language model's stable ability to generate Chosen answers, effectively maintaining or even improving its generation quality. The Rejected weakening constraint parameters effectively suppress the generation of incorrect or unreliable low-quality answers, avoiding a reduction in generation quality. The two work together to strengthen the large language model's tendency to generate high-quality answers and to keep the quality of its output answers stable, so that the model can stably output accurate and reliable answers in mathematical reasoning, code generation and other question-answering scenarios, effectively solving the problems of output quality fluctuation and insufficient accuracy in existing question-answering scenarios.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to question-answering methods, apparatus, devices and media based on large language models.

Background Technology

[0002] Currently, in various question-and-answer scenarios such as intelligent customer service, assisted diagnosis, and code generation, the accuracy of answers generated by Large Language Models (LLMs) still has issues. For example, in mathematical question-and-answer scenarios, LLMs may skip crucial steps and only provide answers that are formally correct but logically broken. Similarly, in code generation question-and-answer scenarios, LLMs may generate seemingly feasible code that contains security vulnerabilities. Even for the same question, the output quality of LLMs is extremely inconsistent; sometimes they provide rigorous, accurate, high-quality answers, while at other times they output low-quality answers that are self-contradictory or factually incorrect.

[0003] Therefore, there is an urgent need for a new question-answering method based on a large language model to solve the problems of fluctuating output quality and insufficient accuracy in existing question-answering scenarios.

Summary of the Invention

[0004] In view of this, embodiments of this application provide a question-answering method, apparatus, device, and storage medium based on a large language model to solve the problems of fluctuating output quality and insufficient accuracy in existing question-answering scenarios.

[0005] This application provides a question-answering method based on a large language model, the method comprising:

[0006] For each question in the current optimization round, perform the following operations:

[0007] Obtain the first preferred (Chosen) answer information output by the first large language model based on the question, and obtain the second Chosen answer information output by the second large language model based on the question. In the initial optimization round, the first large language model is the specified large language model, and the second large language model is the first large language model after fine-tuning. In each subsequent optimization round, the first large language model is the second large language model before optimization in the previous round, and the second large language model is the second large language model after optimization in the previous round. Based on the first Chosen answer information and the second Chosen answer information, determine the Chosen reinforcement constraint parameter corresponding to the question; the Chosen reinforcement constraint parameter is used to maintain or improve the quality of the Chosen answer information;

[0008] Obtain the first Rejected answer information output by the first large language model based on the question, and obtain the second Rejected answer information output by the second large language model based on the question; determine the Rejected weakening constraint parameter corresponding to the question based on the first Rejected answer information and the second Rejected answer information; the Rejected weakening constraint parameter is used to suppress the generation of Rejected answer information;

[0009] Based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each question in the current optimization round, optimize the second large language model; if the optimization termination condition is met, the optimized second large language model is taken as the target large language model; if the optimization termination condition is not met, proceed to the next optimization round.

[0010] This application also provides a question-answering device based on a large language model, the device comprising:

[0011] A determination module, used to perform the following operations for each question in the current optimization round:

[0012] Obtain the first preferred (Chosen) answer information output by the first large language model based on the question, and obtain the second Chosen answer information output by the second large language model based on the question. In the initial optimization round, the first large language model is the specified large language model, and the second large language model is the first large language model after fine-tuning. In each subsequent optimization round, the first large language model is the second large language model before optimization in the previous round, and the second large language model is the second large language model after optimization in the previous round. Based on the first Chosen answer information and the second Chosen answer information, determine the Chosen reinforcement constraint parameter corresponding to the question; the Chosen reinforcement constraint parameter is used to maintain or improve the quality of the Chosen answer information;

[0013] Obtain the first Rejected answer information output by the first large language model based on the question, and obtain the second Rejected answer information output by the second large language model based on the question; determine the Rejected weakening constraint parameter corresponding to the question based on the first Rejected answer information and the second Rejected answer information; the Rejected weakening constraint parameter is used to suppress the generation of Rejected answer information;

[0014] An optimization module, used to optimize the second large language model based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each question in the current optimization round;

[0015] An iteration module, used to take the optimized second large language model as the target large language model if the optimization termination condition is met, and otherwise to proceed to the next optimization round.

[0016] This application also provides an electronic device, including a processor and a machine-readable storage medium storing machine-executable instructions, wherein the machine-executable instructions, when executed by the processor, cause the processor to perform the steps of the above method.

[0017] This application also provides a machine-readable storage medium storing machine-executable instructions that, when executed, implement the steps of the above method.

[0018] As can be seen from the above technical solution, in this embodiment, the Chosen reinforcement constraint parameters are determined from the first and second Chosen answer information output by the first and second large language models, respectively, for the same question. Similarly, the Rejected weakening constraint parameters are determined from the first and second Rejected answer information output by the first and second large language models, respectively, for the same question. The Chosen reinforcement constraint parameters suppress any drop in the optimized second large language model's confidence in generating known high-quality Chosen answers relative to the unoptimized first large language model; they therefore preserve the large language model's stable ability to generate Chosen answers, maintaining or even improving its generation quality. The Rejected weakening constraint parameters suppress any abnormal increase in the optimized second large language model's tendency to generate low-quality Rejected answers relative to the unoptimized first large language model, which suppresses erroneous or unreliable low-quality answers and avoids a decline in generation quality. Together, the two parameters improve the accuracy of the answers generated by the large language model and keep the quality of its output stable, enabling it to output accurate and reliable answers.

[0019] Furthermore, the method provided in this application can significantly reduce the generation of logically broken or security-vulnerable answers in question-and-answer scenarios such as mathematical derivation and code generation, which require high accuracy and stability. It can also effectively alleviate the problem of quality fluctuations when answering the same question multiple times, thereby improving the reliability of the answers. In other words, it effectively solves the problems of output quality fluctuations and insufficient accuracy in existing question-and-answer scenarios.

Attached Figure Description

[0020] Figure 1 is a flowchart of the method provided in the embodiments of this application;

[0021] Figure 2 is a flowchart of optimizing the second large language model provided in the embodiments of this application;

[0022] Figure 3 is another flowchart of the method provided in the embodiments of this application;

[0023] Figure 4 is a schematic diagram of the structure of the apparatus provided in the embodiments of this application;

[0024] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0025] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, and to make the above-mentioned objectives, features and advantages of the embodiments of this application more apparent and understandable, the technical solutions in the embodiments of this application will be further described in detail below with reference to the accompanying drawings.

[0026] Before introducing the method provided in the embodiments of this application, the existing technical problems will be explained:

[0027] As mentioned above, in practical applications, the answers generated by large language models not only have obvious shortcomings in terms of accuracy, but also exhibit significant instability in output quality even when faced with the same question.

[0028] In related technologies, Direct Preference Optimization (DPO) is commonly used to optimize large language models. In DPO, pre-labeled preferred and rejected answers are used to optimize the large language model's ability to distinguish between chosen and rejected responses. This improves the large language model's ability to generate chosen answers, thus optimizing it towards a greater preference for generating chosen answers. In other words, it increases the large language model's tendency to generate high-quality answers, thereby improving the accuracy of its generated responses.

[0029] It should be clarified here that a Chosen answer is a high-quality answer: the most accurate and logically rigorous response among multiple answers to the same question. A Rejected answer, on the other hand, is a low-quality answer that lacks accuracy or contains errors. The distinction between Chosen and Rejected answers is not based on subjective preference; it is verifiable and determined by quantifiable verification standards. These standards include the completeness of mathematical derivations, whether code passes test cases, and whether medical recommendations conform to clinical guidelines. Whether an answer is treated as high-quality or low-quality is determined by how well it satisfies these verification standards.

[0030] However, the above optimization methods focus too much on improving the ability of large language models to distinguish between Chosen and Rejected answers, and pay too much attention to the differences between the two. This has led to instability in the quality of generated Chosen answers. For example, the quality of generated Chosen answers may degrade due to a lack of confidence or uncertainty in the generated Chosen answers.
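To make this concern concrete, the following is a minimal sketch of the standard DPO objective for a single (Chosen, Rejected) pair, written on sequence-level log-probabilities. It is not taken from this application; the function name `dpo_loss`, the argument names, and the value of `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (Chosen, Rejected) pair.

    All arguments are sequence-level log-probabilities. The names and the
    default beta are illustrative assumptions, not values from the patent.
    """
    # How much the policy moved each answer relative to the reference model.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), evaluated in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss depends only on the margin between the two gains, so it can fall
# even when the Chosen log-probability itself drops (here from -10 to -12),
# as long as the Rejected log-probability drops further.
unchanged = dpo_loss(-10.0, -20.0, -10.0, -20.0)
chosen_degraded = dpo_loss(-12.0, -30.0, -10.0, -20.0)
```

In this toy comparison `chosen_degraded` is smaller than `unchanged`, which illustrates how an objective that only rewards the Chosen/Rejected gap can leave the confidence in Chosen answers unprotected.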

[0031] Therefore, improving the accuracy of responses generated by large language models while maintaining the stability of output response quality remains a pressing issue that needs to be addressed.

[0032] Based on this, embodiments of this application provide a question-answering method, apparatus, electronic device, and storage medium based on a large language model to solve the problems of fluctuating output quality and insufficient accuracy in existing question-answering scenarios.

[0033] The method provided in the embodiments of this application is described in detail below:

[0034] See Figure 1, which is a flowchart of the method provided in an embodiment of this application. As shown in Figure 1, the process includes the following steps:

[0035] S101, for each question in the current optimization round, perform the following steps S102 and S103.

[0036] S102, obtain the first preferred Chosen answer information output by the first large language model based on the question, and obtain the second preferred Chosen answer information output by the second large language model based on the question; based on the first Chosen answer information and the second Chosen answer information, determine the Chosen reinforcement constraint parameters corresponding to the question; the Chosen reinforcement constraint parameters are used to maintain or improve the quality of the Chosen answer information.

[0037] In this embodiment, the optimization of the large language model involves multiple optimization training rounds (optimization training rounds can be simply referred to as optimization rounds; for ease of description, this application uses optimization rounds for illustration). Each optimization round uses multiple problems, and the problems used in different optimization rounds are different. In the initial optimization round, the first large language model is the specified large language model, and the second large language model is the first large language model after fine-tuning. In each optimization round after the initial optimization round, the first large language model is the second large language model before optimization in the previous optimization round, and the second large language model is the second large language model after optimization in the previous optimization round.

[0038] In this embodiment, the Chosen reinforcement constraint parameter ensures that the second large language model's confidence in generating Chosen answers does not degrade compared with the first large language model. For a known high-quality Chosen answer, if the second large language model's confidence in generating it is significantly lower than that of the first large language model, the optimized second large language model has degraded in the quality of generated Chosen answers relative to the unoptimized first large language model. Therefore, constraining the confidence in generating Chosen answers so that it does not degrade preserves the large language model's stable ability to generate Chosen answers, that is, it maintains or even improves the quality of the answers generated by the large language model.

[0039] S103, obtain the first Rejected answer information output by the first large language model based on the question, and obtain the second Rejected answer information output by the second large language model based on the question; based on the first Rejected answer information and the second Rejected answer information, determine the Rejected weakening constraint parameter corresponding to the question; the Rejected weakening constraint parameter is used to suppress the generation of Rejected answer information.

[0040] In this embodiment, the Rejected weakening constraint parameter ensures that the second large language model's confidence in generating Rejected answers does not increase compared with the first large language model. For a known low-quality Rejected answer, if the second large language model's confidence in generating it is significantly higher than that of the first large language model, the optimized second large language model has an abnormally increased tendency (generation probability) to generate Rejected answers relative to the unoptimized first large language model. Therefore, constraining the confidence in generating Rejected answers so that it does not increase suppresses the large language model's erroneous preference for Rejected answers, prevents it from generating incorrect or unreliable answers, and thus maintains the quality of its generated answers.

[0041] The specific implementation methods for determining the Chosen strengthening constraint parameters and the Rejected weakening constraint parameters will be explained later, and will not be elaborated here.

[0042] S104, based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each question, optimize the second large language model. If the optimization termination condition is met, the optimized second large language model is taken as the target large language model. If the optimization termination condition is not met, proceed to the next optimization round.

[0043] The specific implementation of step S104 will be explained later, and will not be elaborated here.

[0044] It should be noted that in this embodiment, in the initial optimization round, the base large language model is used as the first large language model, and the base large language model after supervised fine-tuning (SFT) (i.e., the SFT large language model) is used as the second large language model. After one round of optimization through steps S101 to S104, the optimized second large language model is obtained, and it is then determined whether the optimization termination condition is met. If it is met, the optimized second large language model obtained at that point is used as the target large language model, which generates answers to user questions in question-answering scenarios. If it is not met, the process proceeds to the next optimization round: multiple questions are obtained for the next round, the optimized second large language model from the current round is used as the second large language model in the next round, and the unoptimized second large language model from the current round is used as the first large language model in the next round. The next round then becomes the current optimization round and the process returns to step S101, and so on, until the optimization termination condition is met.
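The round-to-round rotation of the two models described above can be sketched as a small driver loop. This is only an illustrative outline: `optimize_round` (standing in for steps S101 to S104) and `should_stop` (standing in for the optimization termination condition) are hypothetical callbacks supplied by the caller, and returning the latest model when the question batches run out is an added assumption.

```python
from typing import Callable, Iterable, List, TypeVar

Model = TypeVar("Model")


def run_optimization_rounds(
    base_model: Model,
    sft_model: Model,
    question_batches: Iterable[List[str]],
    optimize_round: Callable[[Model, Model, List[str]], Model],
    should_stop: Callable[[Model, int], bool],
) -> Model:
    """Rotate the (first, second) model pair across optimization rounds.

    optimize_round and should_stop are hypothetical placeholders for
    steps S101-S104 and for the termination check, respectively.
    """
    # Initial round: first = base model, second = SFT model.
    first, second = base_model, sft_model
    for round_index, questions in enumerate(question_batches):
        optimized = optimize_round(first, second, questions)
        if should_stop(optimized, round_index):
            return optimized  # target large language model
        # Next round: the pre-optimization second model becomes the first
        # model, and the optimized model becomes the second model.
        first, second = second, optimized
    # Assumption: if the batches run out, return the latest optimized model.
    return second
```

Passing the models as opaque values keeps the sketch independent of any particular training framework; only the rotation rule itself comes from the description.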

[0045] The specific implementation of determining whether the termination optimization condition is met will be illustrated with examples later, and will not be elaborated here.

[0046] This completes the process shown in Figure 1.

[0047] As can be seen from the flowchart shown in Figure 1, in this embodiment, the Chosen reinforcement constraint parameters are determined from the first and second Chosen answer information output by the first and second large language models, respectively, for the same question. The Rejected weakening constraint parameters are determined from the first and second Rejected answer information output by the first and second large language models, respectively, for the same question. The Chosen reinforcement constraint parameters suppress any drop in the optimized second large language model's confidence in generating known high-quality Chosen answers compared with the unoptimized first large language model; they therefore preserve the large language model's stable ability to generate Chosen answers, maintaining or even improving its generation quality. The Rejected weakening constraint parameters suppress any abnormal increase in the optimized second large language model's tendency to generate low-quality Rejected answers compared with the unoptimized first large language model, which suppresses erroneous or unreliable low-quality answers and avoids a decline in generation quality. Together, the two parameters improve the accuracy of the answers generated by the large language model and keep the quality of its output stable, enabling it to output accurate and reliable answers.

[0048] Furthermore, the method provided in this application can significantly reduce the generation of logically broken or security-vulnerable answers in question-and-answer scenarios such as mathematical derivation and code generation, which require high accuracy and stability. It can also effectively alleviate the problem of quality fluctuations when generating the same question multiple times, thereby improving the reliability of the answers. In other words, it effectively solves the problems of output quality fluctuations and insufficient accuracy in existing question-and-answer scenarios.

[0049] The following section elaborates on determining the Chosen strengthening constraint parameters and the Rejected weakening constraint parameters:

[0050] Each question is input into the first large language model multiple times, so that the model generates multiple answers to the question. Each answer has a corresponding probability and perplexity (PPL).

[0051] The probability corresponding to an answer is the conditional probability that the first large language model generates that answer for the given input (the question), determined by the joint conditional probability of all tokens in the answer. This probability indicates the first large language model's tendency to generate the answer in the current context. A higher probability indicates that the answer more closely matches the language distribution learned by the first large language model and that the model is more inclined to generate it; conversely, a lower probability indicates a weaker tendency to generate it.

[0052] The perplexity (PPL) corresponding to an answer is calculated from the predicted probabilities of all tokens in the answer. For example, the average negative log-likelihood of all tokens in the answer is calculated, and the exponential of that average is taken as the PPL of the answer. The PPL of an answer quantifies the model's uncertainty when generating it. The lower the PPL, the lower the uncertainty and the higher the first large language model's confidence in generating the answer. Conversely, the higher the PPL, the greater the uncertainty and the lower the confidence.
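The two quantities described above can be sketched in a few lines, assuming the per-token natural-log probabilities of an answer are already available from the model. The function names are illustrative and do not appear in this application.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log of the joint conditional probability of all tokens in an answer."""
    return sum(token_log_probs)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """PPL = exp(average negative log-likelihood over the answer's tokens)."""
    if not token_log_probs:
        raise ValueError("an answer must contain at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Four tokens, each predicted with probability 0.5: the joint probability
# is 0.5 ** 4 and the perplexity equals 2.
log_probs = [math.log(0.5)] * 4
answer_probability = math.exp(sequence_log_prob(log_probs))
answer_ppl = perplexity(log_probs)
```

Because PPL averages over tokens, it is comparable across answers of different lengths, whereas the joint probability shrinks as an answer gets longer.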

[0053] Through manual annotation, a Chosen answer and a Rejected answer are determined from the multiple answers output by the first large language model for this question, and are denoted as the first Chosen answer and the first Rejected answer. Similarly, a Chosen answer and a Rejected answer are selected from the multiple answers output by the second large language model for this question, and are denoted as the second Chosen answer and the second Rejected answer.

[0054] It should be noted that the corresponding probability can serve as a reference when selecting Chosen and Rejected answers, but the final judgment of answer quality must be verified against the quantifiable standards mentioned above.

[0055] Among them, the first Chosen answer and the first PPL corresponding to the first Chosen answer are recorded as the first Chosen answer information.

[0056] The second Chosen answer and the second PPL corresponding to the second Chosen answer are recorded as the second Chosen answer information.

[0057] The first rejected answer and the third PPL corresponding to the first rejected answer are recorded as the first rejected answer information.

[0058] The second rejected answer and the fourth PPL corresponding to the second rejected answer are recorded as the second rejected answer information.

[0059] Thus, the first Chosen answer information, the second Chosen answer information, the first Rejected answer information, and the second Rejected answer information are obtained.
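One way to hold the four pieces of answer information for a question is a pair of small record types, sketched below. The class and field names are illustrative assumptions; the description only specifies that each piece of answer information pairs an answer with its PPL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerInfo:
    """An answer together with its perplexity (hypothetical container)."""
    answer: str
    ppl: float


@dataclass(frozen=True)
class QuestionRecord:
    """The four pieces of answer information collected for one question."""
    question: str
    first_chosen: AnswerInfo     # first model, carries the first PPL
    second_chosen: AnswerInfo    # second model, carries the second PPL
    first_rejected: AnswerInfo   # first model, carries the third PPL
    second_rejected: AnswerInfo  # second model, carries the fourth PPL


# Made-up values, used only to show the shape of a record.
record = QuestionRecord(
    question="Solve 2x + 3 = 7.",
    first_chosen=AnswerInfo("2x = 4, so x = 2.", ppl=3.0),
    second_chosen=AnswerInfo("Subtract 3: 2x = 4. Divide by 2: x = 2.", ppl=3.5),
    first_rejected=AnswerInfo("x = 5.", ppl=9.0),
    second_rejected=AnswerInfo("x = 4.", ppl=7.0),
)
```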

[0060] Subsequently, the Chosen strengthening constraint parameters and the Rejected weakening constraint parameters for the problem are determined using the methods provided in the following embodiments.

[0061] As an example, if the second PPL is greater than the first PPL, then the Chosen reinforcement constraint parameter is the difference between the first PPL and the second PPL. If the second PPL is less than or equal to the first PPL, then the Chosen reinforcement constraint parameter is a set first value.

[0062] Specifically, if the second PPL is greater than the first PPL, the second large language model has higher perplexity and lower certainty (i.e., lower confidence) in generating Chosen answers than the first large language model, reflecting a degradation in generation quality during the optimization process. Therefore, this degree of degradation needs to be quantified by the Chosen reinforcement constraint parameter and incorporated into the total loss value, so that such confidence decline is suppressed in subsequent optimization and the model's stable ability to generate high-quality answers is maintained or even improved. In this case, the difference between the first PPL and the second PPL is determined as the Chosen reinforcement constraint parameter.

[0063] If the second PPL is less than or equal to the first PPL, the perplexity of the second large language model in generating Chosen answers is lower than or equal to that of the first large language model, and its confidence in generating Chosen answers is correspondingly higher or unchanged. This reflects that the large language model has not degraded in generation quality during the optimization process. Therefore, the set first value (e.g., 0) can be used as the Chosen reinforcement constraint parameter.

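[0064] For example, the Chosen reinforcement constraint parameter can be calculated using the following Formula 1:

[0065] $$C_{\text{chosen}} = \begin{cases} \mathrm{PPL}_2 - \mathrm{PPL}_1, & \mathrm{PPL}_2 > \mathrm{PPL}_1 \\ v_1, & \mathrm{PPL}_2 \le \mathrm{PPL}_1 \end{cases}$$ (Formula 1)

[0066] where $C_{\text{chosen}}$ is the Chosen reinforcement constraint parameter and $v_1$ is the set first value (e.g., 0);

[0067] $\mathrm{PPL}_2$ is the second PPL, corresponding to the second Chosen answer of the second large language model;

[0068] $\mathrm{PPL}_1$ is the first PPL, corresponding to the first Chosen answer of the first large language model.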
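The rule for the Chosen reinforcement constraint parameter in [0061] to [0063] can be sketched as a short function. The function name is illustrative. Taking the difference as the second PPL minus the first PPL, so that the value is positive when used as a penalty in the total loss, is an assumption about the sign convention, as is the default first value of 0 (given in the text only as an example).

```python
def chosen_reinforcement_parameter(first_ppl: float, second_ppl: float,
                                   first_value: float = 0.0) -> float:
    """Penalty for a rise in Chosen-answer perplexity after optimization.

    first_ppl:  PPL of the first model's Chosen answer (first PPL).
    second_ppl: PPL of the second model's Chosen answer (second PPL).
    The sign convention (second minus first) is an assumption.
    """
    if second_ppl > first_ppl:
        # Confidence in the Chosen answer degraded: quantify by how much.
        return second_ppl - first_ppl
    # No degradation: use the set first value.
    return first_value


degraded = chosen_reinforcement_parameter(first_ppl=3.0, second_ppl=3.5)
stable = chosen_reinforcement_parameter(first_ppl=3.0, second_ppl=2.5)
```

With the first value set to 0, this is the hinge `max(0, second_ppl - first_ppl)`, so the term only contributes when the optimized model has become less confident in a known good answer.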

[0069] As an example, if the third PPL is greater than the fourth PPL, the rejected weakening constraint parameter is the difference between the third PPL and the fourth PPL. If the third PPL is less than or equal to the fourth PPL, the rejected weakening constraint parameter is the set second value.

[0070] Specifically, when the third PPL is greater than the fourth PPL, the second large language model has lower perplexity in generating Rejected answers than the first large language model. In other words, its confidence in generating these low-quality answers is higher, reflecting that the large language model has mistakenly amplified its preference for inferior content during the optimization process. Since Rejected answers are known to be low-quality, an increase in confidence for them increases the risk of outputting incorrect and/or unreliable answers. Therefore, this abnormal amplification needs to be quantified by the Rejected weakening constraint parameter and incorporated into the total loss value, so that such confidence increases are suppressed in subsequent optimization and the tendency to generate low-quality answers is effectively suppressed. In this case, the difference between the third PPL and the fourth PPL is determined as the Rejected weakening constraint parameter.

[0071] When the third PPL is less than or equal to the fourth PPL, the perplexity of the second large language model for Rejected answers has not decreased (i.e., its confidence has not increased, or has decreased further), indicating that the model has not developed an increased preference for inferior content during the optimization process. Therefore, the set second value (e.g., 0) can be used as the Rejected weakening constraint parameter.

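[0072] For example, it can be expressed by the following Formula 2:

[0073] $$C_{rejected} = \max\left(0,\ PPL_{ref}(y_l) - PPL_{\theta}(y_l)\right)$$ (Formula 2)

[0074] where $C_{rejected}$ is the Rejected weakening constraint parameter;

[0075] $PPL_{ref}(y_l)$ is the third PPL, i.e., the PPL of the first Rejected answer under the first large language model;

[0076] $PPL_{\theta}(y_l)$ is the fourth PPL, i.e., the PPL of the second Rejected answer under the second large language model.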
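
The Rejected rule mirrors the Chosen rule with the direction reversed: a penalty arises only when the second model becomes more confident in the Rejected answer. The following Python sketch illustrates this under stated assumptions; `rejected_constraint` is a hypothetical helper name and the PPL values are invented for the example.

```python
def rejected_constraint(third_ppl, fourth_ppl, second_value=0.0):
    """Penalize only a drop in Rejected perplexity under the second model.

    third_ppl:  PPL of the Rejected answer under the first (reference) model.
    fourth_ppl: PPL of the Rejected answer under the second (policy) model.
    """
    if third_ppl > fourth_ppl:
        return third_ppl - fourth_ppl
    return second_value  # set second value, e.g. 0

# Hypothetical PPL pairs (third PPL, fourth PPL) for three questions.
pairs = [(12.0, 9.0), (12.0, 12.0), (12.0, 15.0)]
penalties = [rejected_constraint(third, fourth) for third, fourth in pairs]
```

Only the first pair yields a non-zero penalty, because only there did the second model's perplexity on the Rejected answer fall below that of the first model.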

[0077] It should be noted that, in the embodiments of this application, the purpose of optimizing the large language model is to: increase the tendency to generate Chosen answers and decrease the tendency to generate Rejected answers (that is, to optimize the large language model to increase the probability of generating Chosen answers and decrease the probability of generating Rejected answers), and maintain the stability of the quality of generated answers (that is, to optimize the large language model so that the perplexity of Chosen answers does not increase and the perplexity of Rejected answers does not decrease).

[0078] In this embodiment, Chosen strengthening constraint parameters and Rejected weakening constraint parameters are determined in the manner described above. When optimizing the model, Chosen strengthening constraint parameters and Rejected weakening constraint parameters are introduced. The two work together to make the model optimize in the aforementioned direction during iterative optimization.

[0079] The above provides a detailed explanation of how to determine the Chosen strengthening constraint parameters and the Rejected weakening constraint parameters.

[0080] The following section elaborates on optimizing the second language model based on the Chosen strengthening constraint parameters and Rejected weakening constraint parameters corresponding to each problem in the current optimization round:

[0081] See Figure 2, which is a flowchart of optimizing the second large language model provided in an embodiment of this application. As shown in Figure 2, the process includes the following steps:

[0082] S201, for each problem, determine the DPO loss value for that problem based on the direct preference optimization DPO objective loss function, and determine the hyperparameters related to the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters corresponding to that problem based on the DPO loss value.

[0083] In this embodiment, the DPO objective loss function is shown in Formula 3 below:

[0084] $$L_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)$$ (Formula 3)

[0085] where $\pi_{\theta}(y_w \mid x)$ is the probability that the second large language model generates the second Chosen answer $y_w$ when the question $x$ is used as input;

[0086] $\pi_{ref}(y_w \mid x)$ is the probability that the first large language model generates the first Chosen answer $y_w$ when the question $x$ is used as input;

[0087] $\pi_{\theta}(y_l \mid x)$ is the probability that the second large language model generates the second Rejected answer $y_l$ when the question $x$ is used as input;

[0088] $\pi_{ref}(y_l \mid x)$ is the probability that the first large language model generates the first Rejected answer $y_l$ when the question $x$ is used as input;

[0089] $\pi_{ref}$ is the first large language model, i.e., the reference model;

[0090] $\pi_{\theta}$ is the second large language model, i.e., the policy model; the current optimization round optimizes this policy model;

[0091] $y_w$ is the preferred (Chosen) answer and $y_l$ is the dispreferred (Rejected) answer;

[0092] β is a temperature parameter used to control the strength of the KL divergence constraint;

[0093] $\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)}$ is the implicit reward in the DPO objective loss function, used to measure how much more the second large language model prefers the Chosen answer than the first large language model does;

[0094] when this implicit reward is greater than 0, the second large language model prefers (or tends to generate) the Chosen answer more than the first large language model;

[0095] when this implicit reward is less than 0, the first large language model prefers (or tends to generate) the Chosen answer more than the second large language model;

[0096] $\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)}$ is the implicit reward difference between the Chosen answer and the Rejected answer;

[0097] the larger the implicit reward difference, the stronger the large language model's ability to distinguish between Chosen and Rejected answers;

[0098] σ is the sigmoid activation function, which maps the above implicit reward difference to the interval (0, 1);

[0099] β is a parameter used to control the optimization intensity and prevent the model from deviating excessively from the reference distribution.
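
As an illustration of the DPO objective loss for a single question, the following Python sketch computes the two implicit rewards and the loss from sequence log-probabilities. The function names, the value β = 0.1 and the log-probability values are assumptions chosen for the example; in practice the log-probabilities would come from the two models.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, ref_logp_w, policy_logp_l, ref_logp_l, beta=0.1):
    """Per-question DPO loss from log-probabilities log pi(y | x).

    Suffix _w is the Chosen answer, _l the Rejected answer.
    Returns (loss, implicit reward of Chosen, implicit reward of Rejected).
    """
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    margin = reward_w - reward_l  # implicit reward difference
    return -log_sigmoid(margin), reward_w, reward_l

# Hypothetical values: the policy model favors the Chosen answer more than the
# reference model does, and the Rejected answer less.
loss, reward_w, reward_l = dpo_loss(-10.0, -12.0, -15.0, -13.0)

# When both models agree exactly, the margin is 0 and the loss equals log 2.
neutral_loss, _, _ = dpo_loss(-10.0, -10.0, -15.0, -15.0)
```

A positive implicit reward difference pushes the loss below log 2, and a negative one pushes it above, which is what makes the loss value usable as a signal of how well the policy model already separates the two answers.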

[0100] After determining the DPO loss value for this problem using the aforementioned DPO objective loss function, the specific implementation method for determining the hyperparameters can be as follows:

[0101] If the DPO loss value is greater than a set threshold (e.g., 1), the hyperparameter is set to a specified first parameter value (e.g., 0). This first parameter value indicates that the current optimization round focuses on optimizing the DPO loss value. If the DPO loss value is less than or equal to the threshold, the hyperparameter is set to a specified second parameter value (which can be set according to actual conditions, e.g., 0.1). This second parameter value indicates that the current optimization round takes into account optimizing the DPO loss value, Chosen reinforcement constraint parameters, and Rejected weakening constraint parameters, especially taking into account the difference between optimizing the DPO loss value and the sum of the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters.
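
The threshold rule for the hyperparameter can be sketched as follows. The function name `select_gamma` is hypothetical; the defaults reuse the example values given above (threshold 1, first parameter value 0, second parameter value 0.1).

```python
def select_gamma(dpo_loss_value, threshold=1.0, first_param=0.0, second_param=0.1):
    """Pick the hyperparameter from the per-question DPO loss value."""
    if dpo_loss_value > threshold:
        return first_param   # focus on the DPO loss alone
    return second_param      # also weigh the two constraint parameters

gammas = [select_gamma(v) for v in (1.5, 1.0, 0.3)]
```

A loss exactly equal to the threshold falls into the second branch, matching the "less than or equal to" wording of the rule.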

[0102] S202. Based on the hyperparameters, the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters corresponding to the problem, and the DPO loss value of the problem, determine the total loss value of the problem.

[0103] In this embodiment, a specified operation is performed on the hyperparameter, the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the problem, and the DPO loss value of the problem; the result of the specified operation is the total loss value of the problem.

[0104] For example, the total loss value of the problem can be calculated using the following Formula 4:

[0105] $$L_{total} = L_{DPO} + \gamma \cdot f\left(C_{chosen} + C_{rejected}\right)$$ (Formula 4)

[0106] where $L_{total}$ is the total loss value of the problem;

[0107] $L_{DPO}$ is the DPO loss value;

[0108] $\gamma$ is the hyperparameter;

[0109] when $L_{DPO}$ is greater than 1, $\gamma = 0$, and the current optimization round focuses on optimizing the original DPO loss value;

[0110] when $L_{DPO}$ is less than or equal to 1, $\gamma > 0$ and takes an empirical value, such as 0.1; in this case, the current optimization round takes into account the original DPO loss value, the Chosen reinforcement constraint parameter, and the Rejected weakening constraint parameter;

[0111] $C_{chosen}$ is the Chosen reinforcement constraint parameter corresponding to the problem;

[0112] $C_{rejected}$ is the Rejected weakening constraint parameter corresponding to the problem;

[0113] $f\left(C_{chosen} + C_{rejected}\right)$ takes a value between 0 and 1 and is the result of processing the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the problem with the activation function $f$.

[0114] In this embodiment, the hyperparameter is adjusted by the DPO loss value. This allows for the optimization process where, when the DPO loss value is large (γ=0), the relative distinction between Chosen and Rejected answers is learned first. When the DPO loss value is small (γ>0), Chosen reinforcement constraint parameters and Rejected weakening constraint parameters are introduced to prevent the degradation of Chosen answer confidence and the enhancement of Rejected answer preference, thus ensuring the stability of high-quality output.
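
The staged behavior described above can be illustrated with a short Python sketch of one possible total loss. The exact form of Formula 4 is not recoverable from this text, so the additive combination, the use of a sigmoid as the activation function, and all names below are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(dpo_loss_value, c_chosen, c_rejected,
               threshold=1.0, gamma_when_small=0.1, activation=sigmoid):
    """Assumed form: L_total = L_DPO + gamma * activation(C_chosen + C_rejected)."""
    # gamma is 0 while the DPO loss is large, and a small positive value otherwise.
    gamma = 0.0 if dpo_loss_value > threshold else gamma_when_small
    return dpo_loss_value + gamma * activation(c_chosen + c_rejected)

# Large DPO loss: the constraint term is switched off.
early = total_loss(1.5, c_chosen=2.0, c_rejected=3.0)

# Small DPO loss: the constraint term contributes gamma * activation(...).
late_clean = total_loss(0.5, c_chosen=0.0, c_rejected=0.0)
late_drift = total_loss(0.5, c_chosen=2.0, c_rejected=3.0)
```

With these assumptions, a question whose Chosen perplexity rose or whose Rejected perplexity fell (`late_drift`) receives a larger total loss than one with no drift (`late_clean`), but only once the DPO loss itself has dropped to the threshold or below.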

[0115] The above provides a detailed explanation of how the second language model is optimized based on the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters corresponding to each problem.

[0116] The following section elaborates on determining whether the termination optimization condition is met:

[0117] The above determination of whether the termination optimization condition is met includes the following steps:

[0118] First, after the current optimization round ends, the optimized second language model is used to generate corresponding answers for each test question in the evaluation set, and anthropomorphic evaluation indicators for each answer are obtained.

[0119] In this embodiment, the anthropomorphic evaluation of any answer is used to indicate the degree of anthropomorphism when the optimized second language model generates that answer.

[0120] For example, the anthropomorphism evaluation indicator is represented by an anthropomorphism score. Specifically, the evaluation set contains multiple test questions, and the optimized second large language model answers each test question in the evaluation set one by one. Then, for each answer, a pre-trained evaluation model evaluates the degree of anthropomorphism of the answer along key dimensions such as "empathy" and "naturalness of refusing to answer", yielding a first anthropomorphism score for the answer.

[0121] For each response, human experts assess its anthropomorphism based on the same criteria (i.e., key dimensions such as "empathy" and "naturalness of refusing to respond"), obtaining a second anthropomorphism score. The first anthropomorphism score output by the evaluation model and the second anthropomorphism score obtained by humans are then weighted according to a preset weight (e.g., 50% each), and the weighted result is used as the anthropomorphism evaluation index for the response.

[0122] Secondly, based on the anthropomorphic evaluation indicators of each answer, the anthropomorphic comprehensive evaluation indicators of the optimized second language model are determined.

[0123] For example, the average value of the anthropomorphic evaluation indicators corresponding to each answer is determined as the aforementioned anthropomorphic comprehensive evaluation indicator. The anthropomorphic comprehensive evaluation indicator is used to indicate the degree of anthropomorphism of the optimized second language model. The higher the degree of anthropomorphism, the more the optimized second language model resembles a 'well-informed, natural, and clearly defined' human conversationalist in question-and-answer scenarios.

[0124] Finally, if the anthropomorphic comprehensive evaluation index reaches the preset threshold, the optimization termination condition is determined to be met; if the anthropomorphic comprehensive evaluation index does not reach the preset threshold, the optimization termination condition is determined to be unmet.
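
The three steps above (weighting the two scores per answer, averaging over the evaluation set, and comparing with the threshold) can be sketched as follows. The function names, the 0-to-1 score scale, the example scores and the threshold of 0.48 are all hypothetical; only the equal 50% weighting and the use of the average follow the examples in the text.

```python
def answer_indicator(model_score, human_score, model_weight=0.5):
    """Weighted anthropomorphism indicator for one answer."""
    return model_weight * model_score + (1.0 - model_weight) * human_score

def should_terminate(score_pairs, threshold):
    """Average the per-answer indicators and compare with the preset threshold.

    score_pairs: list of (evaluation-model score, human-expert score).
    Returns (termination flag, comprehensive indicator).
    """
    indicators = [answer_indicator(m, h) for m, h in score_pairs]
    overall = sum(indicators) / len(indicators)
    return overall >= threshold, overall

# Hypothetical scores for three test questions.
pairs = [(0.4, 0.6), (0.5, 0.5), (0.3, 0.5)]
done, overall = should_terminate(pairs, threshold=0.48)
```

In this example the comprehensive indicator stays below the threshold, so optimization would continue into another round.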

[0125] In this embodiment, the degree of anthropomorphism is also an important indicator of model quality. In actual evaluation, it was observed that with standard DPO training the model's anthropomorphic comprehensive evaluation index decreased from the initial 34% to 30%, indicating quality degradation. After adopting the optimization scheme proposed in this application, which introduces the Chosen reinforcement constraint and the Rejected weakening constraint, the index improved to 48%. Therefore, terminating optimization once the anthropomorphic comprehensive evaluation index of the optimized second large language model reaches the threshold ensures that optimization stops in a timely manner when the model reaches the expected level of human-computer interaction quality. This avoids a decrease in anthropomorphism caused by over-optimization, which would degrade the quality of generated answers, and thereby preserves the overall quality of generated answers in dimensions such as empathy and naturalness.

[0126] The above provides a detailed explanation of how to determine whether the conditions for terminating optimization are met.

[0127] To illustrate the method provided in this application in more detail, the solution is described below with reference to Figure 3 by way of specific embodiments.

[0128] See Figure 3, which is another flowchart of the method provided in an embodiment of this application.

[0129] As shown in Figure 3, the process may include the following steps:

[0130] To illustrate the methods provided in the embodiments of this application in more detail, the following specific embodiments are provided for further explanation:

[0131] The process includes the following steps:

[0132] S301: In the initial optimization round, use the base large language model as the reference model (i.e., the first large language model, the old model) and the base large language model after supervised fine-tuning (SFT) as the current policy model to be optimized (i.e., the second large language model, the new model), and obtain the N problems of the current optimization round (batch).

[0133] S302, for each question in the current batch, use the reference model and the strategy model respectively to generate K answers.

[0134] S303: For each question, manually label one Chosen answer and one Rejected answer among the K answers corresponding to that question, and obtain the probability and PPL of the Chosen answer under the reference model, denoted as $\pi_{ref}(y_w \mid x)$ and $PPL_{ref}(y_w)$, and the probability and PPL of the Rejected answer under the reference model, denoted as $\pi_{ref}(y_l \mid x)$ and $PPL_{ref}(y_l)$;

[0135] Obtain the probability and PPL of the Chosen answer under the policy model, denoted as $\pi_{\theta}(y_w \mid x)$ and $PPL_{\theta}(y_w)$, and the probability and PPL of the Rejected answer under the policy model, denoted as $\pi_{\theta}(y_l \mid x)$ and $PPL_{\theta}(y_l)$.

[0136] S304: For each problem, determine the Chosen reinforcement constraint parameter corresponding to the problem based on the PPL of the Chosen answer under the reference model ($PPL_{ref}(y_w)$) and the PPL of the Chosen answer under the policy model ($PPL_{\theta}(y_w)$); determine the Rejected weakening constraint parameter corresponding to the problem based on the PPL of the Rejected answer under the reference model ($PPL_{ref}(y_l)$) and the PPL of the Rejected answer under the policy model ($PPL_{\theta}(y_l)$).

[0137] Using Formulas 1 and 2 above, the Chosen strengthening constraint parameters and the Rejected weakening constraint parameters for this problem are calculated.

[0138] S305. For each problem, determine the DPO loss value and hyperparameter γ for that problem, and perform a specified operation on the hyperparameter, the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to that problem, and the DPO loss value for that problem. The result of the specified operation is the total loss value for that problem.

[0139] The total loss value of this problem can be calculated using Formula 4 above.

[0140] S306: Optimize the parameters of the strategy model using the total loss value of N problems in the current optimization round.

[0141] Specifically, obtain the N total loss values corresponding to the N problems, and take their sum or average as the total loss value of the current batch. The parameters of the policy model are then optimized using the total loss value of the current batch.
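
The batch-level reduction can be sketched as below. The function name `batch_loss` and the `reduction` argument are hypothetical; the two supported modes correspond to the sum and the average mentioned in the text.

```python
def batch_loss(per_question_losses, reduction="mean"):
    """Reduce the N per-question total loss values to one batch loss."""
    if not per_question_losses:
        raise ValueError("the batch must contain at least one question")
    total = sum(per_question_losses)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(per_question_losses)
    raise ValueError("reduction must be 'sum' or 'mean'")

losses = [0.2, 0.4, 0.6]  # hypothetical per-question total loss values
mean_loss = batch_loss(losses)
sum_loss = batch_loss(losses, reduction="sum")
```

Averaging keeps the loss scale independent of the batch size N, whereas summing makes the effective step size grow with N; which of the two is preferable depends on the training setup.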

[0142] S307: Determine whether the optimization termination condition is met.

[0143] The specific implementation of step S307 is detailed in the above embodiments and will not be repeated here.

[0144] If the judgment result of S307 is yes, the policy model optimized in the current batch is used as the target large language model. If the judgment result of S307 is no, the policy model of the current batch before the optimization in step S306 is used as the reference model for the next batch, and the policy model of the current batch after the optimization in step S306 is used as the policy model for the next batch; the N problems of the next batch are then obtained, and the process returns to step S302.
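
The way the two model roles shift from one batch to the next can be sketched with a small driver loop. Everything here is a stand-in: `optimize_round` and `meets_termination` are hypothetical callables representing steps S302 to S306 and step S307, models are represented by plain integers, and returning the latest policy model when the batches run out is an assumption not stated in the text.

```python
def run_optimization(base_model, sft_model, batches, optimize_round, meets_termination):
    """Multi-round loop: the pre-optimization policy becomes the next reference."""
    reference, policy = base_model, sft_model
    for batch in batches:
        optimized = optimize_round(reference, policy, batch)
        if meets_termination(optimized):
            return optimized  # target large language model
        # Next batch: old policy is the reference, optimized policy is the policy.
        reference, policy = policy, optimized
    return policy  # assumption: fall back to the latest policy model

# Toy stand-ins: a "model" is a version number and each round increments it.
history = []

def toy_round(reference, policy, batch):
    history.append((reference, policy))
    return policy + 1

target = run_optimization(0, 1, batches=range(5),
                          optimize_round=toy_round,
                          meets_termination=lambda model: model >= 3)
```

In the toy run, the first round uses models (0, 1) and the second uses (1, 2), which shows the reference model always trailing the policy model by exactly one round.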

[0145] The methods provided in the embodiments of this application have been described above. The apparatus provided in the embodiments of this application is described below:

[0146] See Figure 4, which is a structural diagram of the device provided in an embodiment of this application. As shown in Figure 4, the device is applied to a network access device and includes: a determination module 401, an optimization module 402, and an iteration module 403.

[0147] The determination module 401 is used to perform the following operations for each question in the current optimization round:

[0148] Obtain the first Chosen answer information output by the first large language model based on the question, and obtain the second Chosen answer information output by the second large language model based on the question; in the initial optimization round, the first large language model is the specified large language model, and the second large language model is the first large language model after fine-tuning; in each subsequent optimization round, the first large language model is the second large language model before optimization in the previous optimization round, and the second large language model is the second large language model after optimization in the previous optimization round; based on the first Chosen answer information and the second Chosen answer information, determine the Chosen reinforcement constraint parameter corresponding to the question; the Chosen reinforcement constraint parameter is used to maintain or improve the quality of the Chosen answer information;

[0149] Obtain the first Rejected answer information output by the first large language model based on the question, and obtain the second Rejected answer information output by the second large language model based on the question; based on the first Rejected answer information and the second Rejected answer information, determine the Rejected weakening constraint parameter corresponding to the question; the Rejected weakening constraint parameter is used to suppress the generation of Rejected answer information;

[0150] The optimization module 402 is used to optimize the second large language model based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each question in the current optimization round.

[0151] The iteration module 403 is used to take the optimized second large language model as the target large language model if the termination optimization condition is met, and otherwise to enter the next optimization round.

[0152] As an example, the first Chosen response information includes: the first Chosen response and the first perplexity level PPL corresponding to the first Chosen response;

[0153] The second Chosen response information includes: the second Chosen response and the second PPL corresponding to the second Chosen response;

[0154] If the second PPL is greater than the first PPL, then the Chosen reinforcement constraint parameter is the difference between the first PPL and the second PPL.

[0155] If the second PPL is less than or equal to the first PPL, then the Chosen reinforcement constraint parameter is the set first value.

[0156] As an example, the first rejected answer information includes: the first rejected answer and the third PPL corresponding to the first rejected answer;

[0157] The second rejected answer information includes: the second rejected answer and the fourth PPL corresponding to that rejected answer;

[0158] If the third PPL is greater than the fourth PPL, then the rejected weakening constraint parameter is the difference between the third PPL and the fourth PPL.

[0159] If the third PPL is less than or equal to the fourth PPL, then the rejected weakening constraint parameter is set to the second value.

[0160] As an example, the iteration module determines whether the termination optimization condition is met through the following steps:

[0161] The optimized second language model is used to generate corresponding answers for each test question in the evaluation set, and anthropomorphic evaluation indicators for each answer are obtained.

[0162] Based on the anthropomorphic evaluation indicators of each answer, the anthropomorphic comprehensive evaluation indicators of the optimized second language model are determined.

[0163] If the anthropomorphic comprehensive evaluation index reaches the preset threshold, the optimization termination condition is determined to be met; if the anthropomorphic comprehensive evaluation index does not reach the preset threshold, the optimization termination condition is determined to be unmet.

[0164] As an example, when performing the step of optimizing the second large language model based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each problem in the current optimization round, the optimization module is further used to:

[0165] For each problem, the DPO loss value is determined based on the direct preference optimization DPO objective loss function, and the hyperparameters related to the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters corresponding to the problem are determined based on the DPO loss value.

[0166] Based on the hyperparameters, the Chosen reinforcement constraint parameters and the Rejected weakening constraint parameters corresponding to the problem, and the DPO loss value of the problem, the total loss value of the problem is determined;

[0167] The second language model is optimized based on the total loss value of each problem.

[0168] As an example, when performing the step of determining the hyperparameters related to the Chosen strengthening constraint parameters and the Rejected weakening constraint parameters corresponding to the problem based on the DPO loss value, the optimization module is further configured to:

[0169] If the DPO loss value is greater than the set threshold, the hyperparameter is set to the specified first parameter value; the first parameter value is used to indicate that the current optimization round focuses on optimizing the DPO loss value.

[0170] If the DPO loss value is less than or equal to the threshold, the hyperparameter is set to the specified second parameter value; the second parameter value is used to indicate that the current optimization round takes into account the optimization of the DPO loss value, the Chosen reinforcement constraint parameter, and the Rejected weakening constraint parameter.

[0171] This concludes the structural description of the device shown in Figure 4.

[0172] See Figure 5, which is a structural diagram of an electronic device provided in an embodiment of this application. As shown in Figure 5, the hardware structure may include: a processor and a machine-readable storage medium, the machine-readable storage medium storing machine-executable instructions that can be executed by the processor; the processor is used to execute the machine-executable instructions to implement the method disclosed in the above examples of this application.

[0173] Based on the same application concept as the above method, this application embodiment also provides a machine-readable storage medium storing a plurality of computer instructions, which, when executed by a processor, can implement the method disclosed in the above examples of this application.

[0174] For example, the aforementioned machine-readable storage medium can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, etc. For instance, machine-readable storage media can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.

[0175] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A question-answering method based on a large language model, characterized in that the method includes: for each question in the current optimization round, performing the following operations: obtaining the first Chosen answer information output by the first large language model based on the question, and obtaining the second Chosen answer information output by the second large language model based on the question; the first Chosen answer information includes: the first Chosen answer and the first perplexity (PPL) corresponding to the first Chosen answer; the second Chosen answer information includes: the second Chosen answer and the second PPL corresponding to the second Chosen answer; in the initial optimization round, the first large language model is the specified large language model, and the second large language model is the first large language model after fine-tuning; in each subsequent optimization round, the first large language model is the second large language model before optimization in the previous optimization round, and the second large language model is the second large language model after optimization in the previous optimization round; based on the first Chosen answer information and the second Chosen answer information, determining the Chosen reinforcement constraint parameter corresponding to the question; the Chosen reinforcement constraint parameter is used to maintain or improve the quality of the Chosen answer information; if the second PPL is greater than the first PPL, the Chosen reinforcement constraint parameter is the difference between the first PPL and the second PPL; if the second PPL is less than or equal to the first PPL, the Chosen reinforcement constraint parameter is a set first value; obtaining the first Rejected answer information output by the first large language model based on the question, and obtaining the second Rejected answer information output by the second large language model based on the question; based on the first Rejected answer information and the second Rejected answer information, determining the Rejected weakening constraint parameter corresponding to the question; the Rejected weakening constraint parameter is used to suppress the generation of Rejected answer information; based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each question in the current optimization round, optimizing the second large language model; if the termination optimization condition is met, taking the optimized second large language model as the target large language model; if the termination optimization condition is not met, proceeding to the next optimization round.

2. The method according to claim 1, characterized in that, The first rejected answer information includes: the first rejected answer and the third PPL corresponding to the first rejected answer; The second rejected answer information includes: the second rejected answer and the fourth PPL corresponding to the rejected answer; If the third PPL is greater than the fourth PPL, then the rejected weakening constraint parameter is the difference between the third PPL and the fourth PPL. If the third PPL is less than or equal to the fourth PPL, then the rejected weakening constraint parameter is the set second value.

3. The method according to claim 1, characterized in that, The method further includes determining whether the termination optimization condition is met through the following steps: The optimized second language model is used to generate corresponding answers for each test question in the evaluation set, and anthropomorphic evaluation indicators for each answer are obtained. Based on the anthropomorphic evaluation index of each answer, the anthropomorphic comprehensive evaluation index of the optimized second language model is determined; If the anthropomorphic comprehensive evaluation index reaches the preset threshold, then the termination optimization condition is determined to be met; if the anthropomorphic comprehensive evaluation index does not reach the preset threshold, then the termination optimization condition is determined not to be met.

4. The method according to claim 1, characterized in that optimizing the second large language model based on the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to each problem in the current optimization round includes: for each problem, determining the DPO loss value of the problem based on the direct preference optimization (DPO) objective loss function, and determining, based on the DPO loss value, the hyperparameter related to the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the problem; determining the total loss value of the problem based on the hyperparameter, the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the problem, and the DPO loss value of the problem; and optimizing the second large language model based on the total loss value of each problem.

5. The method according to claim 4, characterized in that, The hyperparameters related to the Chosen reinforcement constraint parameters and Rejected weakening constraint parameters corresponding to this problem, determined based on the DPO loss value, include: If the DPO loss value is greater than the set threshold, then the hyperparameter is set to a specified first parameter value; the first parameter value is used to indicate that the current optimization round focuses on optimizing the DPO loss value. If the DPO loss value is less than or equal to the threshold, then the hyperparameter is set to a specified second parameter value; the second parameter value is used to indicate that the current optimization round takes into account the optimization of the DPO loss value, the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter.

6. A question-answering device based on a large language model, characterized in that the device includes:
a determination module, configured to perform the following operations for each question in the current optimization round:
obtain first Chosen answer information output by a first large language model based on the question, and obtain second Chosen answer information output by a second large language model based on the question; the first Chosen answer information includes a first Chosen answer and a first perplexity (PPL) corresponding to the first Chosen answer; the second Chosen answer information includes a second Chosen answer and a second PPL corresponding to the second Chosen answer; in the initial optimization round, the first large language model is a specified large language model, and the second large language model is the first large language model after fine-tuning; in each subsequent optimization round, the first large language model is the second large language model before optimization in the previous optimization round, and the second large language model is the second large language model after optimization in the previous optimization round;
determine, based on the first Chosen answer information and the second Chosen answer information, the Chosen reinforcement constraint parameter corresponding to the question, where the Chosen reinforcement constraint parameter is used to maintain or improve the quality of the Chosen answer information; if the second PPL is greater than the first PPL, the Chosen reinforcement constraint parameter is the difference between the first PPL and the second PPL; if the second PPL is less than or equal to the first PPL, the Chosen reinforcement constraint parameter is a set first value;
obtain first Rejected answer information output by the first large language model based on the question, and obtain second Rejected answer information output by the second large language model based on the question; and determine, based on the first Rejected answer information and the second Rejected answer information, the Rejected weakening constraint parameter corresponding to the question, where the Rejected weakening constraint parameter is used to suppress the generation of Rejected answer information;
an optimization module, configured to optimize the second large language model based on the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to each question in the current optimization round; and
an iteration module, configured to take the optimized second large language model as the target large language model if the termination optimization condition is met, and otherwise proceed to the next optimization round.
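The Chosen reinforcement constraint used by the determination module can be sketched as follows. This is a hedged illustration only: the helper names `perplexity` and `chosen_constraint` are hypothetical, perplexity is computed with the standard definition (exponential of the negative mean token log-probability), the sign of the difference (second PPL minus first PPL, giving a positive penalty) is an assumption because the claim does not fix it, and the default first value of 0.0 is also an assumption.

```python
import math


def perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def chosen_constraint(first_ppl, second_ppl, first_value=0.0):
    """Chosen reinforcement constraint parameter for one question.

    first_ppl:  PPL of the Chosen answer under the first (reference) model.
    second_ppl: PPL of the Chosen answer under the second (optimized) model.
    """
    if second_ppl > first_ppl:
        # The second model has become less confident in the Chosen answer.
        # Assumption: the difference is taken as a positive penalty.
        return second_ppl - first_ppl
    # Quality maintained or improved: fall back to the set first value
    # (0.0 here is an illustrative assumption).
    return first_value
```

For example, under these assumptions `chosen_constraint(10.0, 12.5)` yields a penalty of 2.5, while `chosen_constraint(10.0, 9.0)` yields the set first value.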

7. The device according to claim 6, characterized in that the first Rejected answer information includes a first Rejected answer and a third PPL corresponding to the first Rejected answer; the second Rejected answer information includes a second Rejected answer and a fourth PPL corresponding to the second Rejected answer; if the third PPL is greater than the fourth PPL, the Rejected weakening constraint parameter is the difference between the third PPL and the fourth PPL; if the third PPL is less than or equal to the fourth PPL, the Rejected weakening constraint parameter is a set second value;
and/or,
the iteration module determines whether the termination optimization condition is met through the following steps: generating, with the optimized second large language model, a corresponding answer for each test question in an evaluation set, and obtaining an anthropomorphic evaluation index for each answer; determining, based on the anthropomorphic evaluation index of each answer, an anthropomorphic comprehensive evaluation index of the optimized second large language model; if the anthropomorphic comprehensive evaluation index reaches a preset threshold, determining that the termination optimization condition is met; and if the anthropomorphic comprehensive evaluation index does not reach the preset threshold, determining that the termination optimization condition is not met;
and/or,
when optimizing the second large language model based on the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to each question in the current optimization round, the optimization module is further configured to: for each question, determine the DPO loss value based on the direct preference optimization (DPO) objective loss function, and determine, based on the DPO loss value, the hyperparameter related to the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the question; determine the total loss value of the question based on the hyperparameter, the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the question, and the DPO loss value of the question; and optimize the second large language model based on the total loss value of each question;
and/or,
when determining, based on the DPO loss value, the hyperparameter related to the Chosen reinforcement constraint parameter and the Rejected weakening constraint parameter corresponding to the question, the optimization module is further configured to: if the DPO loss value is greater than a set threshold, set the hyperparameter to a specified first parameter value, where the first parameter value indicates that the current optimization round focuses on optimizing the DPO loss value; and if the DPO loss value is less than or equal to the threshold, set the hyperparameter to a specified second parameter value, where the second parameter value indicates that the current optimization round jointly optimizes the DPO loss value, the Chosen reinforcement constraint parameter, and the Rejected weakening constraint parameter.
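The Rejected weakening constraint, the total loss, and the termination check described in claim 7 can be sketched together. This is only an illustration under stated assumptions: all function names are hypothetical; the claim does not give the formula for the total loss, so the additive form below (DPO loss plus the hyperparameter times the sum of the two constraint parameters) is an assumption; averaging the per-answer indices into the comprehensive index is likewise an assumption; and the default second value of 0.0 is illustrative.

```python
def rejected_constraint(third_ppl, fourth_ppl, second_value=0.0):
    """Rejected weakening constraint parameter for one question.

    third_ppl:  PPL of the Rejected answer under the first (reference) model.
    fourth_ppl: PPL of the Rejected answer under the second (optimized) model.
    """
    if third_ppl > fourth_ppl:
        # The second model finds the Rejected answer more likely than before.
        return third_ppl - fourth_ppl
    # Otherwise use the set second value (0.0 is an illustrative assumption).
    return second_value


def total_loss(dpo_loss, chosen_c, rejected_c, hyperparameter):
    """Assumed additive combination; the claim does not specify the formula."""
    return dpo_loss + hyperparameter * (chosen_c + rejected_c)


def termination_met(answer_scores, preset_threshold):
    """Assumption: the comprehensive index is the mean of per-answer indices."""
    comprehensive = sum(answer_scores) / len(answer_scores)
    return comprehensive >= preset_threshold
```

Under this assumed form, a hyperparameter of 0 reduces the total loss to the plain DPO loss, which matches a round that focuses only on the DPO loss value.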

8. An electronic device, characterized in that the electronic device includes: a processor; and a machine-readable storage medium storing machine-executable instructions that, when executed by the processor, cause the processor to perform the steps of the method according to any one of claims 1 to 5.

9. A machine-readable storage medium, characterized in that the machine-readable storage medium stores machine-executable instructions that, when executed by a processor, cause the processor to perform the steps of the method according to any one of claims 1 to 5.