Method for training language model and related device

By introducing adjustment coefficients for reward scores and confidence index values into language model training and calculating calibrated advantage values, the problem of mismatch between external quality evaluation and confidence scores when large language models generate responses is solved, thereby improving the reliability and stability of the model.

CN122242716A · Pending · Publication Date: 2026-06-19 · ALIBABA HEALTH TECH (CHINA) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ALIBABA HEALTH TECH (CHINA) CO LTD
Filing Date
2026-01-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, when large language models generate responses, there is a mismatch between external quality assessments and model confidence, which affects the reliability and stability of the model output.

Method used

The reward scores and confidence index values of multiple responses generated by the language model are acquired, and an adjustment coefficient is calculated from them. Based on this adjustment coefficient, the reward scores, and the confidence index values, a calibrated advantage value is calculated and used in place of the original advantage value in reinforcement learning. This updates the language model parameters and guides the model to prioritize learning high-quality responses that match the confidence level.

Benefits of technology

This improves the reliability and stability of the language model in generating responses, reduces the probability of incorrect responses, and ensures that the model demonstrates a reasonable level of confidence in generating high-quality responses.


Abstract

This application provides a method and related apparatus for training a language model. The method includes: acquiring multiple responses generated by the language model for the same prompt instruction; acquiring a reward score and confidence index value corresponding to each response; calculating an adjustment coefficient based on the reward score and confidence index value of the multiple responses; wherein the adjustment coefficient is derived based on the correlation between the reward score and confidence index value of the multiple responses; calculating a calibrated advantage value corresponding to each response based on the reward score, the confidence index value, and the adjustment coefficient; and replacing the original advantage value used in reinforcement learning to update the parameters of the language model with the calibrated advantage value to update the parameters of the language model. This application can improve the reliability and stability of language model responses to a certain extent.

Description

Technical Field

[0001] One or more embodiments of this application relate to the field of artificial intelligence-based natural language processing technology, and in particular to a method and related apparatus for training a language model.

Background Technology

[0002] In the field of Large Language Model (LLM) optimization, reinforcement learning techniques are used to improve the quality of the responses generated by the model. The training objective for LLMs is to achieve high external quality evaluations for the generated responses, while also ensuring the model has high certainty regarding its generated content.

[0003] However, a mismatch still exists between external quality assessments and model confidence in related technologies.

Summary of the Invention

[0004] To address the aforementioned technical problems, this application provides one or more embodiments of a language model training method and related apparatus.

[0005] In a first aspect, one or more embodiments of this application propose a method for training a language model, comprising: acquiring multiple responses generated by the language model for the same prompt instruction; acquiring a reward score and a confidence index value corresponding to each response; wherein the reward score represents the quality assessment value of the corresponding response; the confidence index value is used to represent the degree of confidence of the language model in the generated response; calculating an adjustment coefficient based on the reward scores and confidence index values of the multiple responses; wherein the adjustment coefficient is derived based on the correlation between the reward scores and confidence index values of the multiple responses; calculating a calibrated advantage value corresponding to each response based on the reward score, the confidence index value, and the adjustment coefficient; and replacing the original advantage value used in reinforcement learning to update the parameters of the language model with the calibrated advantage value to update the parameters of the language model.

[0006] In a second aspect, one or more embodiments of this application propose a language model training apparatus, comprising: a first acquisition module, configured to acquire multiple responses generated by the language model for the same prompt instruction; a second acquisition module, configured to acquire a reward score and a confidence index value corresponding to each response; wherein the reward score represents the quality assessment value of the corresponding response; the confidence index value is used to represent the degree of confidence of the language model in the generated response; a first calculation module, configured to calculate an adjustment coefficient based on the reward scores and confidence index values of the multiple responses; wherein the adjustment coefficient is derived based on the correlation between the reward scores and confidence index values of the multiple responses; a second calculation module, configured to calculate a calibrated advantage value corresponding to each response based on the reward score, the confidence index value, and the adjustment coefficient; and an update module, configured to replace the original advantage value used in reinforcement learning to update the parameters of the language model with the calibrated advantage value, thereby updating the parameters of the language model.

[0007] In a third aspect, one or more embodiments of this application provide a computer device including a memory and a processor, wherein the memory stores at least one computer program, which is loaded and executed by the processor to implement the method as described above.

[0008] In a fourth aspect, one or more embodiments of this application provide a computer program product including computer instructions that, when executed by a processor, implement the method as described above.

[0009] In a fifth aspect, one or more embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to implement the method as described above.

[0010] As can be seen from the above embodiments, multiple embodiments in this application obtain multiple responses under the same prompt instruction and introduce both reward scores and confidence index values. Based on the correlation between the two, an adjustment coefficient is calculated, and calibrated advantage values are obtained to replace the original advantage values. In this way, the reinforcement learning update constrains response quality and model confidence to be consistent with each other, thereby guiding the language model to prioritize learning high-quality responses that match the confidence level. This reduces the probability of incorrect responses and improves the reliability and stability of the language model's answers.

Attached Figure Description

[0011] Figure 1 is a flowchart illustrating a language model training method provided in one embodiment of this application.

[0012] Figure 2 is a schematic diagram of a language model training device provided in one embodiment of this application.

[0013] Figure 3 is a schematic diagram of a computer device provided in one embodiment of this application.

Detailed Implementation

[0014] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments.

[0015] In the description of the embodiments of this application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, features defined with "first" and "second" may explicitly or implicitly include one or more of the stated features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0016] The user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0017] In related technologies, the training and optimization of large language models typically employs reinforcement learning to introduce external reward signals, guiding the language model to generate higher-quality responses. Specifically, during training, reward scores are often used to evaluate the quality of the model's generated responses, and the model parameters are updated accordingly to gradually improve its generation capabilities. However, in practical applications (such as online shopping guides and intelligent question answering), in addition to the response quality itself, there is a general focus on the certainty of the language model's generated content—that is, whether the model demonstrates a high degree of confidence in producing "high-quality responses."

[0018] During training, a mismatch may still exist between the reward signal and the confidence level of the language model's output. For example, the language model may show high confidence for low-quality responses or lack sufficient confidence for high-quality responses, thus affecting the reliability and stability of the model's overall output.

[0019] In summary, the relevant technologies still suffer from the problem of difficulty in effectively aligning reward signals with the inherent confidence of the language model during reinforcement learning training, which requires further improvement.

[0020] In multiple embodiments provided by this application, a training method for a language model can be applied to an electronic device with certain computing power and network access capabilities. The electronic device can be a desktop computer, a laptop computer, a tablet computer, a smart phone, or a server. Specifically, the electronic device includes a processor, a memory, and a network access module for network communication. A server can be an electronic device with strong data processing capabilities. Of course, a server can also refer to a server cluster formed by multiple electronic devices, or a quantum server built using a quantum computer.

[0021] An application scenario example of a language model training device is provided in one embodiment of this application. The training device can be deployed on the server of an online shopping guide platform for performing reinforcement learning training on a conversational language model for end users, enabling it to generate high-quality responses in scenarios such as drug shopping guides and health product recommendations, and also showing a matching level of confidence in the high-quality responses, reducing the risk of "wrong but confident" responses in actual scenarios.

[0022] In this scenario example, take an online drug shopping guide as an example. The user inputs a question through the dialogue window on the shopping guide page, for example: "I have had a dry cough for a week, occasionally with a little white phlegm, no fever, and cough more severely at night. Are there any cough medicines that are not too drowsy that can be recommended?" The training device encapsulates the user's question and context information as the prompt instruction x, and controls the target language model, acting as a policy model under the current parameters θ, to decode this prompt instruction and generate multiple candidate responses, for example G = 3 responses $y_1$, $y_2$, $y_3$. Each response $y_i$ is composed of $T_i$ tokens, denoted $a_{i,1}, a_{i,2}, \ldots, a_{i,T_i}$.

[0023] During the decoding process, the language model, as the policy model $\pi_\theta$, outputs the conditional probability distribution $\pi_\theta(a_{i,t} \mid x, a_{i,<t})$ of the next token at each time step t. While generating each response, the training device records the conditional probability of each sampled token and calculates the confidence index value $C_i$ of the response from these probabilities. Specifically, for the i-th response $y_i$, the confidence index value can be calculated according to the following formula:
$$C_i = \frac{1}{T_i} \sum_{t=1}^{T_i} \log \pi_\theta(a_{i,t} \mid x, a_{i,<t})$$
where $T_i$ represents the number of tokens of the i-th response; $a_{i,t}$ denotes the token generated at the t-th position of the i-th response; $a_{i,<t}$ represents the sequence of tokens generated before the t-th position; $\pi_\theta(a_{i,t} \mid x, a_{i,<t})$ represents the conditional probability of generating token $a_{i,t}$ under the policy model with parameter θ, given the prompt instruction x and the historical prefix $a_{i,<t}$; and log represents the logarithmic operation. In this way, the training device obtains a confidence index value $C_i$ for each response $y_i$ generated under the same prompt instruction, which quantifies the overall confidence of the language model in that response.
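The confidence index described above (the mean log-probability of the sampled tokens) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `confidence_index` and the example probabilities are hypothetical, and the per-token probabilities are assumed to have been recorded during decoding.

```python
import math


def confidence_index(token_probs):
    """Mean log-probability of the sampled tokens of one response.

    token_probs: probabilities the policy assigned to each sampled token
    (assumed to be recorded at generation time; values in (0, 1]).
    """
    if not token_probs:
        raise ValueError("a response must contain at least one token")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities for two responses of different length.
confident = [0.9, 0.8, 0.95]
hesitant = [0.4, 0.3, 0.5, 0.2]
print(confidence_index(confident))  # closer to 0: higher confidence
print(confidence_index(hesitant))   # more negative: lower confidence
```

Because the sum is divided by the token count, responses of different lengths are compared on the same scale.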

[0024] When obtaining the confidence index value, the training device can also obtain a reward score $R_i$ for each response $y_i$. The reward score $R_i$ is given by a pre-constructed reward model or by manual feedback, and represents the quality evaluation value of the corresponding response. For example, factors such as content correctness, medical safety, strict avoidance of diagnostic conclusions, clear indication that "specific medication should follow doctor's advice", sufficient inquiry of necessary medical history, and polite and clear expression are considered. For the G responses generated under the same prompt instruction x, the training device can arrange their reward scores in the order of response index to form a reward score sequence $\{R_i\}$, and arrange their confidence index values to form a confidence sequence $\{C_i\}$.

[0025] To facilitate comparison, the training device can standardize the confidence index values of all responses under the current prompt instruction to obtain a standardized confidence value $C_i'$. For example, first calculate the mean $\mu_c$ and standard deviation $\sigma_c$ of all $C_i$, and then standardize each $C_i$:
$$C_i' = \frac{C_i - \mu_c}{\sigma_c}$$
where $\mu_c$ represents the average of the confidence index values of all responses under the current prompt instruction, $\sigma_c$ represents the corresponding standard deviation, and $C_i'$ represents the standardized confidence value corresponding to the i-th response, which reflects how far the confidence of this response deviates from the average level within the group.

[0026] On this basis, the training device first calculates the original advantage value $A_i$ for each response based only on the reward score sequence $\{R_i\}$. This value characterizes the relative quality of the response within the group. Within the Group Relative Policy Optimization (GRPO) framework, the training device can normalize the reward as follows:
$$A_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$$
where $\mathrm{mean}(\{R_j\}_{j=1}^{G})$ represents the average reward score of the G responses under the same prompt instruction, $\mathrm{std}(\{R_j\}_{j=1}^{G})$ represents the standard deviation of the reward scores, and $A_i$ is the original advantage value of the i-th response, reflecting the degree to which the reward level of that response deviates from the group average.
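A minimal sketch of this group normalization follows. The text does not say whether the standard deviation is the sample or the population version, so the `sample_std` switch, the helper name `group_normalize`, and the small `eps` guard against a zero deviation are all assumptions made for illustration. The same helper could equally be applied to the confidence index values to obtain standardized confidences.

```python
import statistics


def group_normalize(values, sample_std=True, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation.

    sample_std and eps are illustrative assumptions: the source text does
    not specify the estimator or how a zero deviation is handled.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if sample_std else statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


rewards = [0.9, 0.6, 0.3]
sample_adv = group_normalize(rewards, sample_std=True)
population_adv = group_normalize(rewards, sample_std=False)
print([round(a, 3) for a in sample_adv])
print([round(a, 3) for a in population_adv])
```

For the rewards 0.9, 0.6 and 0.3, the sample standard deviation yields advantages of approximately 1.0, 0.0 and -1.0, which are the values used in the numerical example in this description. The population version yields roughly ±1.22 instead.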

[0027] The training device uses the rank correlation between the reward score sequence $\{R_i\}$ and the confidence index value sequence $\{C_i\}$ to assess whether "high-quality responses tend to have high confidence". To this end, the training device can calculate the correlation metric ρ based on the Spearman rank correlation coefficient. Specifically, $\{R_i\}$ and $\{C_i\}$ can each be sorted in ascending or descending order (using the same direction for both) to obtain the rank $\mathrm{rank}_R(i)$ of each response on the reward dimension and its rank $\mathrm{rank}_C(i)$ on the confidence dimension, and the rank difference $d_i = \mathrm{rank}_R(i) - \mathrm{rank}_C(i)$ is calculated. For G responses, the Spearman rank correlation coefficient can be calculated using the following formula:
$$\rho = 1 - \frac{6 \sum_{i=1}^{G} d_i^2}{G\,(G^2 - 1)}$$
where $d_i$ represents the difference between the reward rank and the confidence rank of the i-th response, $\sum_{i=1}^{G} d_i^2$ is the sum of the squared rank differences, and ρ is the resulting correlation metric. The closer ρ is to 1, the more consistent the reward scores and the confidence index values are in their ranking trend; the closer ρ is to -1, the more their rankings are reversed.
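The rank-difference computation can be sketched as below. The helper names `ranks` and `spearman_rho` are hypothetical, and assigning tied values their average rank is an assumption, since the text does not address ties. The simplified formula is exact only when there are no ties.

```python
def ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman_rho(rewards, confidences):
    """Spearman coefficient via the squared rank-difference formula."""
    g = len(rewards)
    if g < 2 or g != len(confidences):
        raise ValueError("need two equally sized groups of at least 2 responses")
    d2 = sum((r - c) ** 2 for r, c in zip(ranks(rewards), ranks(confidences)))
    return 1 - 6 * d2 / (g * (g * g - 1))


# Hypothetical group of three responses: rewards and mean log-probabilities.
print(spearman_rho([0.9, 0.6, 0.3], [-0.2, -0.5, -0.9]))  # same ordering
print(spearman_rho([0.9, 0.6, 0.3], [-0.9, -0.5, -0.2]))  # reversed ordering
```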

[0028] Based on the aforementioned correlation metric ρ, the training device can calculate an adjustment coefficient λ using a preset function to control the weight of the "reward-confidence calibration" in this round of training. For example, the preset function can be set as follows: (formula missing from the source text) where $\lambda_{\max}$ is the preset maximum adjustment coefficient, and ρ is the correlation metric obtained from the reward scores and confidence index values under the current prompt instruction. Thus, when the reward scores and confidence index values are highly consistent (ρ close to 1), λ is close to 0, indicating that the reward signal and model confidence are already relatively aligned and no further calibration is needed; when they are inconsistent (ρ is small or even negative), λ is close to $\lambda_{\max}$, indicating that the confidence signal needs to be used more extensively to correct the advantage.
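The exact preset function is not reproduced in the text above, so the sketch below is only one hypothetical mapping that has the described behavior: λ is near 0 when ρ is near 1, and λ saturates at the maximum when ρ is zero or negative. The name `adjustment_coefficient`, the clipping rule, and the default `lam_max=0.5` are assumptions, not the function used in the application.

```python
def adjustment_coefficient(rho, lam_max=0.5):
    """Hypothetical preset function: lam_max * clip(1 - rho, 0, 1).

    This is an illustrative assumption; the source only states that lambda
    approaches 0 as rho approaches 1 and approaches lam_max when rho is
    small or negative.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return lam_max * min(1.0, max(0.0, 1.0 - rho))


for rho in (1.0, 0.5, 0.0, -1.0):
    print(rho, adjustment_coefficient(rho))
```

Other monotonically decreasing mappings, for example one that scales linearly over the whole interval [-1, 1], would also satisfy the stated endpoints. The choice affects how quickly calibration strength grows as the correlation weakens.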

[0029] After obtaining the standardized confidence value $C_i'$ and the original advantage value $A_i$, the training device can calculate an initial calibration value $\Delta_i$ for each response, which characterizes the deviation between "reward" and "confidence". For example, the training device can use the following form:
$$\Delta_i = A_i - C_i'$$
where $\Delta_i$ is the initial calibration value for the i-th response. When $A_i$ is larger and $C_i'$ is smaller, the external quality assessment of the response is better but the model confidence is lower, and $\Delta_i$ tends to take positive values; conversely, when $A_i$ is smaller and $C_i'$ is larger, the response is of poor quality but the model is overconfident, and $\Delta_i$ tends to take negative values.

[0030] Subsequently, the training device can use the adjustment coefficient λ to superimpose the calibration value onto the original advantage, forming a calibrated advantage value for updating the policy. All time steps t of the same response can share the same advantage value. In GRPO, the training device can define the calibrated advantage as:
$$\tilde{A}_{i,t} = A_i + \lambda \cdot \Delta_i$$
where $\tilde{A}_{i,t}$ is the calibrated advantage value used for the i-th response at time step t; $A_i$ is the original advantage value based solely on the reward score; $\Delta_i$ is the initial calibration value calculated from the deviation between reward and confidence; and λ is the adjustment coefficient obtained from the correlation metric ρ through a preset function, which determines how strongly the calibration value affects the advantage. With this design, when the overall correlation between reward and confidence is low, λ is larger, and the model more strongly penalizes responses with "low reward and high confidence" and encourages responses with "high reward but low confidence"; when the two are highly consistent, λ is smaller, and the calibration has a weaker impact on the advantage.
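A minimal sketch of this step follows. It assumes the calibration value is the original advantage minus the standardized confidence, which is the form used in the numerical example of this description. The function names and the per-token broadcast helper are hypothetical.

```python
def calibrated_advantages(advantages, std_confidences, lam):
    """Add the lambda-scaled calibration value to each original advantage."""
    if len(advantages) != len(std_confidences):
        raise ValueError("one standardized confidence is needed per response")
    calibrated = []
    for adv, conf in zip(advantages, std_confidences):
        calibration = adv - conf  # positive: good response, low confidence
        calibrated.append(adv + lam * calibration)
    return calibrated


def broadcast_to_tokens(calibrated, token_counts):
    """Share one calibrated advantage across all time steps of a response."""
    return [[adv] * n for adv, n in zip(calibrated, token_counts)]


# Hypothetical group: one under-confident good reply, one over-confident bad reply.
group = calibrated_advantages([1.0, -1.0], [-1.0, 1.0], lam=0.5)
print(group)
print(broadcast_to_tokens(group, token_counts=[2, 3]))
```

With `lam=0` the function returns the original advantages unchanged, which matches the case where reward and confidence are already aligned.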

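[0031] During the policy optimization phase, the training device can adopt the objective function $J_{\mathrm{GRPO}}(\theta)$ of Group Relative Policy Optimization (GRPO) to update the language model parameters θ. In the notation of this scheme, the objective function of GRPO can be written as:
$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min\left( r_{i,t}(\theta)\, \tilde{A}_{i,t},\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \tilde{A}_{i,t} \right) \right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$$
where $x \sim \mathcal{D}$ indicates the distribution of questions; $\{y_i\}_{i=1}^{G}$ indicates the within-group samples; the policy ratio $r_{i,t}(\theta)$ is the ratio of the probability that the current policy $\pi_\theta$ assigns to the token of the i-th response at time step t to the probability assigned by the reference policy; $\mathrm{clip}(r_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon)$ indicates that the policy ratio is truncated to the interval [1-ε, 1+ε]; ε is the truncation range hyperparameter; β is the weight coefficient of the KL regularization term; $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ is the KL divergence between the current policy $\pi_\theta$ and the reference policy $\pi_{\mathrm{ref}}$; and $\tilde{A}_{i,t}$ is the aforementioned calibrated advantage value. The training device maximizes $J_{\mathrm{GRPO}}(\theta)$, using $\tilde{A}_{i,t}$ in place of the original advantage in the policy gradient update, so that the language model considers both external reward signals and confidence-related calibration information in each parameter update.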
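The clipped term of such an objective can be illustrated numerically as follows. This is a simplified sketch under stated assumptions: it works on plain lists of log-probabilities rather than a real model, it omits the KL regularization term, and the function names, the `eps=0.2` default, and the per-response token averaging are illustrative choices rather than details confirmed by the text.

```python
import math


def clipped_term(logp_current, logp_ref, advantage, eps=0.2):
    """Clipped surrogate for one token: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_current - logp_ref)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_objective(logps_current, logps_ref, advantages, eps=0.2):
    """Mean over responses of the per-token mean clipped term (KL term omitted)."""
    total = 0.0
    for current, ref, adv in zip(logps_current, logps_ref, advantages):
        terms = [clipped_term(c, r, adv, eps) for c, r in zip(current, ref)]
        total += sum(terms) / len(terms)
    return total / len(advantages)


# Identical policies give ratio 1, so the objective is the mean advantage.
logps = [[-0.1, -0.2], [-0.5, -0.4, -0.3]]
print(surrogate_objective(logps, logps, advantages=[1.4, -1.45]))
```

Clipping bounds how much a single update can gain from moving the ratio away from 1: with a positive advantage, a ratio above 1+ε earns no additional objective value.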

[0032] In a specific numerical example, assume that for a given prompt instruction x, the reward scores for the three responses are $R_1 = 0.9$, $R_2 = 0.6$, and $R_3 = 0.3$, and the calculated original advantage values are $A_1 = 1.0$, $A_2 = 0.0$, and $A_3 = -1.0$. After standardization, the confidence values are $C_1' = 0.2$, $C_2' = -0.1$, and $C_3' = -0.1$. The corresponding initial calibration values are therefore $\Delta_1 = 1.0 - 0.2 = 0.8$, $\Delta_2 = 0.0 - (-0.1) = 0.1$, and $\Delta_3 = -1.0 - (-0.1) = -0.9$. If the correlation metric ρ obtained from the Spearman rank correlation coefficient is low in this round of training, and the λ calculated by the preset function is close to $\lambda_{\max}$, for example λ = 0.5, then the calibrated advantage values are: $\tilde{A}_1 = 1.0 + 0.5 \times 0.8 = 1.4$, $\tilde{A}_2 = 0.0 + 0.5 \times 0.1 = 0.05$, and $\tilde{A}_3 = -1.0 + 0.5 \times (-0.9) = -1.45$. Therefore, when updating the policy, the training device more strongly reinforces the first, high-quality response, whose confidence is low relative to its reward, and more strongly weakens the third, lower-quality response, whose confidence is high relative to its reward.
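The arithmetic of this example can be checked with a few lines. The values are taken directly from the paragraph above; only the variable names are introduced here.

```python
advantages = [1.0, 0.0, -1.0]        # original advantage values from the example
std_confidences = [0.2, -0.1, -0.1]  # standardized confidence values from the example
lam = 0.5                            # adjustment coefficient assumed in the example

# Calibration value: original advantage minus standardized confidence.
calibration = [a - c for a, c in zip(advantages, std_confidences)]
# Calibrated advantage: original advantage plus lambda times the calibration value.
calibrated = [a + lam * d for a, d in zip(advantages, calibration)]

# Gap between the best and worst response before and after calibration.
spread_before = advantages[0] - advantages[2]
spread_after = calibrated[0] - calibrated[2]

print([round(v, 2) for v in calibration])
print([round(v, 2) for v in calibrated])
print(round(spread_before, 2), round(spread_after, 2))
```

The gap between the first and third responses widens from 2.0 to 2.85, which is how the calibration strengthens the update signal in this example.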

[0033] By repeatedly executing the above training process in online shopping scenarios, the training device can use the reward scores $R_i$, the confidence index values $C_i$, the correlation metric ρ, and the resulting adjustment coefficient λ to construct the calibrated advantage value, which replaces the original advantage value in the GRPO objective function when updating the parameters of the language model. This allows the language model to gradually learn to maintain high response quality while making its confidence level more consistent with the actual quality, thereby improving the reliability and stability of the response.

[0034] Referring to Figure 1, one embodiment of this application provides a method for training a language model. The language model training method can be applied to a training device, which can be the aforementioned electronic device possessing certain computing power and network access capabilities. Of course, in some embodiments, the training device can also be software running on the electronic device. The language model training method may include the following steps.

[0035] Step S110: For the same prompt instruction, obtain multiple responses generated by the language model.

[0036] Step S120: Obtain the reward score and confidence index value corresponding to each response; wherein, the reward score represents the quality assessment value of the corresponding response; the confidence index value is used to represent the degree of confidence of the language model in the generated response.

[0037] Step S130: Calculate the adjustment coefficient based on the reward scores and confidence index values of the multiple responses; wherein the adjustment coefficient is derived based on the correlation between the reward scores and confidence index values of the multiple responses.

[0038] Step S140: Based on the reward score, the confidence index value, and the adjustment coefficient, calculate the calibrated advantage value corresponding to each response.

[0039] Step S150: Replace the original advantage value used in reinforcement learning to update the parameters of the language model with the calibrated advantage value to update the parameters of the language model.

[0040] In this embodiment, the training device can be used to perform reinforcement learning training on the target language model to improve the reliability and stability of the language model's generated responses in scenarios such as online shopping guides and intelligent question answering. The target language model can be a large language model that has already undergone pre-training on a large corpus, such as an autoregressive language model based on a Transformer structure, which internally consists of multi-layer neural network parameters. These parameters are used to represent the mapping relationship between input prompts and output response sequences. In some embodiments, the target language model can be deployed on the same electronic device as the training device, or it can be deployed as a service on a remote server, invoked by the training device through a network interface. During the reinforcement learning training process, the parameters of the target language model are continuously updated, further aligning it with response preferences for specific transaction scenarios based on existing pre-training capabilities.

[0041] In this embodiment, the training device can control the language model to generate multiple responses to the same prompt instruction. The prompt instruction represents the input content processed by the language model. For example, the input content includes information such as the user's inquiry in an online shopping scenario, constraints, and expected response style. Multiple responses can refer to multiple candidate answers generated by the language model under the same prompt instruction constraints through sampling or other methods, used to compare and optimize the language model's generation strategy within the same training batch.

[0042] In this embodiment, the training device can obtain a corresponding reward score and confidence index value for each of the multiple responses. The reward score can be used to characterize the quality assessment value of the corresponding response. In some embodiments, the training device can comprehensively score the response in dimensions such as content correctness, completeness, politeness, and usability based on a pre-built reward model, manually labeled feedback, or transaction indicators, thereby forming a reward signal that can reflect external quality evaluation. The confidence index value can be used to represent the language model's confidence level in the generated response. The training device can measure this based on the internal probability information output by the language model during the generation of the response. For example, the training device can calculate a value to characterize the overall confidence level based on the generation probability or log probability of each token in the response. By simultaneously obtaining the reward score and confidence index value, the training device can characterize the performance of each response from two dimensions: external quality evaluation and model intrinsic determinism.

[0043] In this embodiment, the training device can calculate an adjustment coefficient representing the degree of matching between the reward scores and confidence index values of the multiple responses. Specifically, the training device can organize the reward scores of the multiple responses into a reward score sequence and the confidence index values of the multiple responses into a confidence index value sequence, and determine the adjustment coefficient based on the correlation measurement results between these two sequences. The correlation reflects the consistency between the response quality assessment and the model confidence level in terms of ranking trends or numerical changes under the same prompt instruction. When the reward scores and confidence index values are highly consistent across multiple responses, the adjustment coefficient can be determined as a parameter representing "the reward signal and confidence level are basically aligned"; when there is a significant inconsistency, the adjustment coefficient can be determined as a parameter representing "the degree of deviation between the reward signal and confidence level". By setting the adjustment coefficient, the subsequent calculation of the advantage value can adaptively consider the matching relationship between the reward and confidence level.

[0044] In this embodiment, the training device can calculate the calibrated advantage value for each response based on the reward score, the confidence index value, and the adjustment coefficient. Specifically, for each response, the training device can construct a calibration quantity reflecting the "quality-confidence matching degree" of the response by combining its reward score and confidence index value, and scale the influence of the calibration quantity using the adjustment coefficient to obtain a calibrated advantage value that comprehensively characterizes the relationship between the external quality and internal confidence of the response. The calibrated advantage value can be understood as a new advantage signal with confidence constraints introduced on the original advantage value. It is used to distinguish multiple responses under the same prompt instruction, so that the calibrated advantage value tends to be larger for responses with higher reward scores and matching confidence levels, while the calibrated advantage value tends to be suppressed for responses with lower reward scores but higher confidence levels.

[0045] In this embodiment, the training device can replace the original advantage value used to update the language model parameters with the calibrated advantage value during reinforcement learning training, thereby updating the language model parameters. Specifically, the original advantage value can be calculated based on the difference between the reward score and the baseline value using an existing reinforcement learning algorithm (e.g., group relative policy optimization algorithm), and is used to characterize the relative merit of a response compared to other responses under the same prompt instruction. In this embodiment, when constructing the policy gradient update objective, the training device replaces the original advantage value with the calibrated advantage value obtained based on the reward score, confidence index value, and adjustment coefficient. This ensures that the language model considers not only external quality evaluation during parameter updates but also the consistency between the reward signal and the confidence level. Through this method, the training device can guide the language model to prioritize learning towards responses with "high rewards and reasonably matched confidence levels," reducing the preference for "low-quality but high-confidence" responses, thereby improving the reliability and stability of the responses generated by the language model after multiple training iterations.

[0046] In some implementations, the confidence index value is the average of the log probabilities of the tokens in the response, as assigned by the language model when generating that response.

[0047] In this embodiment, the training device can further determine the confidence index value based on the internal probability output of the language model during the response generation process, using a unified measurement method. Specifically, when controlling the target language model to generate a certain response, the training device can record the log probability corresponding to each token in the response during generation. Here, a token can represent the smallest text unit after segmentation by a word segmenter, such as a character, word, or sub-word unit. For the same response, the training device can summarize the log probabilities of each token in the response and calculate the average of these log probabilities, using this average as the confidence index value corresponding to the response. Therefore, the confidence index value can directly reflect the overall certainty level of the language model's output for each token during the response generation process. When response lengths are inconsistent, using the average log probability as the measurement result is beneficial for comparable confidence assessments between different responses.

[0048] In this embodiment, the training device can perform the above calculation process separately for multiple responses generated under the same prompt instruction. That is, for each response, the log probability of all tokens contained therein is extracted, and the average log probability of the response is calculated, thereby forming a sequence of confidence index values ​​corresponding to multiple responses. Subsequently, these confidence index values ​​can be used together with the corresponding reward scores to construct the aforementioned confidence index value sequence and reward score sequence. This allows the training device to accurately characterize the differences between different responses from the deterministic dimension within the model, based on a confidence metric under a unified scale, when calculating adjustment coefficients and calibrated advantage values ​​in subsequent calculations.
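
The averaging step described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation; the function names and the sample log probabilities are hypothetical.

```python
import math


def confidence_index(token_logprobs):
    """Return the mean per-token log probability of one response."""
    if not token_logprobs:
        raise ValueError("a response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def confidence_sequence(group_logprobs):
    """Return one confidence index value per response, in input order."""
    return [confidence_index(lp) for lp in group_logprobs]


# Hypothetical group of two responses of different lengths.
group = [
    [math.log(0.9), math.log(0.8)],                                # short, confident
    [math.log(0.5), math.log(0.5), math.log(0.5), math.log(0.5)],  # longer, less confident
]
confidences = confidence_sequence(group)
```

Averaging rather than summing means the four-token response is not penalized merely for being longer; its value here is simply log(0.5).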

[0049] In some implementations, the logarithmic probability of each token in the response is obtained using the language model as the policy model, without calling an independent policy model to calculate the logarithmic probability.

[0050] In this embodiment, when performing reinforcement learning training, the training device can use the aforementioned target language model itself as the policy model. That is, the language model both generates the responses and outputs the policy probability information, so no independent policy model network needs to be deployed or called to calculate the probability or log probability of each token. Specifically, when the training device controls the language model to generate multiple responses under the same prompt instruction, the language model outputs, at each decoding step, a conditional probability distribution over all candidate tokens. The training device can select the target token from this distribution as the output of the current step and, at the same time, read the probability of the target token from the distribution. From this probability it calculates the token's log probability, and it associates the log probabilities obtained at successive steps, in order, with the corresponding response. In this way, the log probability of each token in the response comes directly from forward inference performed with the target language model acting as the policy model.

[0051] In this embodiment, the training device does not call any policy model independent of the language model when calculating the log probabilities or constructing the confidence index value from them; the relevant calculations are based entirely on the forward inference results of the language model being trained. On the one hand, this avoids the computational and storage overhead of maintaining, loading, or running inference on an independent policy network to obtain policy probability information, which improves the overall resource efficiency of reinforcement learning training. On the other hand, it ensures that the log probabilities underlying the confidence index value are consistent with the language model's actual policy behavior when generating responses. The adjustment coefficient and calibrated advantage values subsequently calculated from the reward scores and confidence index values therefore reflect the intrinsic confidence of the language model under its current policy, which improves the reliability of the advantage value calibration.
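
One way to picture the per-step bookkeeping is the sketch below. It assumes the model exposes raw scores (logits) over the vocabulary at each decoding step, which the source does not specify; the function names and the toy inputs are hypothetical.

```python
import math


def log_softmax(logits):
    """Convert one step's logits into log probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_logprobs(step_logits, chosen_ids):
    """Read the log probability of the token actually emitted at each step.

    step_logits: one list of vocabulary logits per decoding step (assumed input).
    chosen_ids:  index of the token selected at each step.
    """
    if len(step_logits) != len(chosen_ids):
        raise ValueError("need exactly one chosen token per decoding step")
    return [log_softmax(logits)[tok] for logits, tok in zip(step_logits, chosen_ids)]


# Toy example: a 2-token vocabulary at step 0 and a 4-token vocabulary at step 1,
# both uniform, so the chosen tokens have probabilities 1/2 and 1/4.
logps = token_logprobs([[0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], [0, 3])
```

Because the values are read from the same forward pass that produced the response, no second model is consulted.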

[0052] In some implementations, the training device can calculate the Spearman rank correlation coefficient between the reward score sequence and the confidence index value sequence of the multiple responses, wherein the reward score sequence includes the reward scores of the multiple responses arranged in a specified order, and the confidence index value sequence includes the confidence index values of the multiple responses arranged in the same specified order; the adjustment coefficient is then calculated from the Spearman rank correlation coefficient using a preset function.

[0053] In this embodiment, the training device can construct a statistic that measures how well reward scores and confidence index values match across the multiple responses generated under the same prompt instruction, thereby providing a basis for determining the adjustment coefficient. Specifically, the training device can arrange the reward scores of the multiple responses in a specified order to form a reward score sequence, and arrange the confidence index values of the multiple responses in the same specified order to form a confidence index value sequence. Here, the specified order can be the index order or time order in which the training device recorded the responses. This ensures that the elements at each position of the two sequences correspond one-to-one, so that each position holds the external quality evaluation and the internal confidence level of the same response.

[0054] In this embodiment, the training device can calculate the Spearman rank correlation coefficient between the reward score sequence and the confidence index value sequence to characterize the degree of matching between the two. The Spearman rank correlation coefficient is a rank-based correlation measure: each sequence is converted to ranks, and the ranks are compared, so the coefficient reflects how consistently the ordering of reward scores across the responses agrees with the ordering of confidence index values. The closer the two orderings are to monotonically consistent, the closer the coefficient is to its upper bound of 1; when the orderings are opposite, the coefficient approaches -1, and when there is no obvious relationship, it is close to 0. The training device can input the calculated Spearman rank correlation coefficient into a preset function, which outputs an adjustment coefficient characterizing the overall matching relationship between the reward scores and the confidence index values. The adjustment coefficient thus serves as a quantitative measure of the "match between the reward signal and the confidence level." When the calibrated advantage value is calculated, it is used to dynamically adjust how strongly the calibration quantity constructed from the reward score and the confidence index value influences the advantage signal. This allows the advantage value calibration to adapt to how consistent quality assessment and model confidence are across the multiple responses under the current prompt instruction.
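
The rank-based calculation described above can be sketched in plain Python. Ties receive average ranks, and the coefficient is the Pearson correlation of the two rank sequences. Returning 0.0 when either sequence is constant is an assumption made here for illustration; the source does not say how that case is handled.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(rewards, confidences):
    """Spearman rank correlation between two equally ordered sequences."""
    n = len(rewards)
    if n != len(confidences) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(rewards), average_ranks(confidences)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention: undefined correlation treated as "no relationship"
    return cov / math.sqrt(vx * vy)


# Hypothetical group: confidence rises with reward, so the orderings agree.
rho = spearman_rho([0.1, 0.5, 0.9], [-2.0, -1.0, -0.5])
```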

[0055] In some implementations, the output value of the preset function is negatively correlated with the Spearman rank correlation coefficient.

[0056] In this embodiment, after the training device calculates the Spearman rank correlation coefficient ρ from the reward score sequence and confidence index value sequence of the multiple responses, it can use a preset function to map ρ to the adjustment coefficient used in the subsequent advantage value calibration. The preset function can be configured as a monotonically decreasing function of ρ; that is, as ρ increases, the output of the preset function decreases, making the adjustment coefficient negatively correlated with the Spearman rank correlation coefficient. When the correlation between the reward scores and the confidence index values is high and ρ is close to 1, the language model already has a reasonable level of confidence in its high-quality responses, and heavy calibration of the advantage value is unnecessary. In this case, the preset function can compress the adjustment coefficient to a small value, reducing the magnitude of the change made to the original advantage value during calibration.

[0057] In this implementation, when the Spearman rank correlation coefficient ρ is at a moderate or low level, the preset function can output a relatively large adjustment coefficient to make the advantage value calibration more sensitive to the mismatch between reward scores and confidence levels. When ρ is negative, the orderings of reward scores and confidence index values under the same prompt instruction diverge significantly, meaning that the language model may be assigning higher confidence to lower-quality responses or insufficient confidence to higher-quality responses. In this case, the preset function can output a larger adjustment coefficient still, giving the calibration term a higher weight in the advantage value calculation and thereby strengthening the correction of the "reward-confidence inconsistency" during policy updates. By making the preset function negatively correlated with the Spearman rank correlation coefficient, the training device can adapt the size of the adjustment coefficient to the level of correlation. This maintains training stability when reward scores and confidence index values are highly consistent and increases calibration intensity when they deviate significantly, so that the calibrated advantage value steers the language model parameter updates more accurately.
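
The source does not disclose a concrete preset function. Purely as an illustration, the sketch below uses a linear map that is monotonically decreasing in ρ; the functional form, the function name, and the `alpha_max` parameter are all assumptions.

```python
def adjustment_coefficient(rho, alpha_max=1.0):
    """Map a Spearman coefficient to an adjustment coefficient.

    Hypothetical choice: alpha_max * (1 - rho) / 2, which yields 0 at rho = 1
    (rewards and confidence fully consistent) and alpha_max at rho = -1
    (orderings fully opposed).
    """
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    return alpha_max * (1.0 - rho) / 2.0


# Coefficients for a consistent, an unrelated, and an opposed group.
coefficients = [adjustment_coefficient(r) for r in (1.0, 0.0, -1.0)]
```

Any other monotonically decreasing function, for example an exponential or a clipped linear ramp, would satisfy the same stated property.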

[0058] In some implementations, the training device can calculate an initial calibration value for each response based on the reward score and confidence index value; multiply the initial calibration value of each response by the adjustment coefficient to obtain the advantage calibration value for each response; and add the advantage calibration value of each response to the original advantage value to obtain the calibrated advantage value; wherein the original advantage value is calculated based on the reward score of the response.

[0059] In this embodiment, the training device can construct the calibrated advantage value that drives reinforcement learning updates from the reward score, confidence index value, and adjustment coefficient of each response under the corresponding prompt instruction. Specifically, the training device can first calculate the initial calibration value for each response based on the reward score and confidence index value. The initial calibration value characterizes how well the response matches along two dimensions: "external quality assessment" and "internal confidence level." For example, when a response has a high reward score but a low confidence index value, the initial calibration value can be a positive calibration amount that encourages the response; when a response has a low reward score but a high confidence index value, the initial calibration value can be a negative calibration amount that suppresses the response. The initial calibration value thus provides the direction for the subsequent correction of the advantage signal.

[0060] In this embodiment, the training device can multiply the initial calibration value corresponding to each response by the aforementioned adjustment coefficient to obtain the advantage calibration value for each response. The advantage calibration value can be understood as the calibration amount obtained by combining the "quality-confidence matching degree" of a single response with the overall "reward-confidence correlation" under the current prompt instruction. On the one hand, the initial calibration value expresses the degree of deviation between a single response's reward score and its confidence index value; on the other hand, the adjustment coefficient reflects how consistent the reward scores and confidence index values of the multiple responses are under the prompt instruction. With a preset function that is negatively correlated with the Spearman rank correlation coefficient, as described above, multiplying the two lets the advantage calibration value play a more significant calibration role when the overall reward-confidence consistency is weak or reversed, and weakens its influence on the advantage signal when the overall consistency is already strong. This achieves an adaptive calibration mechanism that combines local deviation with global correlation.

[0061] In this embodiment, the training device can add the advantage calibration value of each response to the original advantage value corresponding to that response to obtain the calibrated advantage value used for updating the policy. The original advantage value can be calculated from the difference between the reward score and a baseline value using an existing reinforcement learning algorithm (such as the group relative policy optimization algorithm), and characterizes how much better or worse a response is than other responses under the same prompt instruction. By superimposing the advantage calibration value on the original advantage value, the calibrated advantage value retains the conventional relative advantage information based on the reward score and also introduces calibration information based on the confidence index value and the adjustment coefficient, so that the advantage signal reflects the match between the quality assessment of the response and the model's confidence level. When subsequently constructing the policy gradient update objective, the training device can replace the original advantage value with the calibrated advantage value to update the parameters of the target language model. This explicitly strengthens learning from responses with "high reward and reasonable confidence level" during reinforcement learning training, reduces the tendency to learn from responses with "low quality but high confidence," and ultimately improves the reliability and stability of the responses generated by the language model.
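
The three steps above (scale, add, substitute) can be sketched as follows. The group-mean baseline is an assumption chosen for illustration; the source only states that the original advantage value is derived from the difference between the reward score and a baseline value. All names and sample numbers are hypothetical.

```python
def original_advantages(rewards):
    """Reward minus a baseline; the group mean is used as an assumed baseline."""
    if not rewards:
        raise ValueError("need at least one response")
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def calibrated_advantages(rewards, initial_calibration, adjustment_coefficient):
    """Original advantage plus (adjustment coefficient x initial calibration value)."""
    if len(rewards) != len(initial_calibration):
        raise ValueError("need one initial calibration value per response")
    base = original_advantages(rewards)
    return [a + adjustment_coefficient * c for a, c in zip(base, initial_calibration)]


# Hypothetical group of two responses:
# response 0 has a high reward and a positive initial calibration value,
# response 1 has a low reward and a negative initial calibration value.
calibrated = calibrated_advantages([1.0, 0.0], [2.0, -1.0], 0.5)
```

With an adjustment coefficient of 0 the function returns the original advantage values unchanged, which corresponds to the case where no calibration is applied.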

[0062] In some implementations, the training device can standardize the confidence index values of the multiple responses to obtain standardized confidence values, and determine the initial calibration value corresponding to each response based on the difference between the reward score of each response and the standardized confidence value.

[0063] In this embodiment, before calculating the initial calibration value, the training device can standardize the confidence index values corresponding to the multiple responses under the same prompt instruction, so as to map the confidence index values of different responses onto a uniform scale. Specifically, the training device can process the confidence index values of all responses under the same prompt instruction using a preset standardization method, such as normalization or a zero-mean, unit-variance transformation, to convert the original confidence index values into standardized confidence values. Standardization removes the influence of differences in numerical scale or distribution among the responses within the group, making the confidence levels of different responses under the same prompt instruction more comparable.

[0064] In this embodiment, the training device can determine the initial calibration value for each response based on the difference between the reward score and the standardized confidence value. Specifically, for each response under the same prompt instruction, the training device can perform a difference operation between its reward score and the corresponding standardized confidence value, for example, by subtracting the standardized confidence value from the reward score. The difference result can be used as the initial calibration value for the response to characterize the degree of deviation between the external quality assessment and the internal confidence level of the response: when the reward score of a response is relatively high and the corresponding standardized confidence value is relatively low, its initial calibration value tends to be positive, indicating that the advantages of the response need to be positively calibrated, guiding the language model to pay more attention to such "high-quality but low-confidence" responses in subsequent training; when the reward score of a response is relatively low and the corresponding standardized confidence value is relatively high, its initial calibration value tends to be negative, indicating that the advantages of the response need to be negatively calibrated, so as to suppress the influence of "low-quality but high-confidence" responses in parameter updates. In this way, the training device can generate an initial calibration value for each response corresponding to the difference between its "reward score - standardized confidence value", providing a basis for subsequent calculation of the advantage calibration value and the calibrated advantage value by combining the adjustment coefficient.
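
A minimal sketch of this step is given below, using the zero-mean, unit-variance option mentioned above. The function names are hypothetical, and mapping a constant group to all zeros is an assumption made here to avoid division by zero.

```python
import statistics


def standardize(values):
    """Zero-mean, unit-variance transform within one group of responses."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)  # assumed convention for a constant group
    return [(v - mean) / std for v in values]


def initial_calibration(rewards, confidences):
    """Reward score minus standardized confidence value, per response."""
    if len(rewards) != len(confidences):
        raise ValueError("need one confidence index value per reward score")
    z = standardize(confidences)
    return [r - zc for r, zc in zip(rewards, z)]


# Hypothetical group: response 0 is high-reward but low-confidence,
# response 1 is low-reward but high-confidence.
init = initial_calibration([1.0, 0.0], [-2.0, -1.0])
```

In this toy group the first response receives a positive initial calibration value and the second a negative one, matching the directions described in the paragraph above.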

[0065] Referring to Figure 2, one or more embodiments of this application also provide a language model training apparatus, including: a first acquisition module, a second acquisition module, a first calculation module, a second calculation module, and an update module.

[0066] The first acquisition module is used to acquire multiple responses generated by the language model for the same prompt instruction.

[0067] The second acquisition module is used to acquire the reward score and confidence index value corresponding to each response; wherein, the reward score represents the quality assessment value of the corresponding response; and the confidence index value is used to represent the degree of confidence of the language model in the generated response.

[0068] The first calculation module is used to calculate an adjustment coefficient based on the reward scores and confidence index values of the multiple responses; wherein the adjustment coefficient is derived based on the correlation between the reward scores and confidence index values of the multiple responses.

[0069] The second calculation module is used to calculate the calibrated advantage value corresponding to each response based on the reward score, the confidence index value, and the adjustment coefficient.

[0070] An update module is used to replace the original advantage value used in reinforcement learning to update the parameters of the language model with the calibrated advantage value, so as to update the parameters of the language model.

[0071] In this embodiment, the functions and effects of the language model training device can be explained in comparison with the aforementioned embodiments, and will not be repeated here.

[0072] Referring to Figure 3, this application also provides a computer device comprising a memory and a processor, wherein the memory stores at least one computer program, and the at least one computer program is loaded and executed by the processor to implement the method described above.

[0073] The memory, processor, and communication interface in the computer device can communicate with each other via the system bus and network communication.

[0074] In this embodiment, the functions and effects implemented by the computer device can be explained by referring to the foregoing embodiments, and will not be repeated here.

[0075] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to implement the method as described above.

[0076] The functions and effects achieved in this embodiment can be explained by referring to other embodiments, and will not be repeated here.

[0077] This application also provides a computer program product containing instructions, including a computer program/instructions that, when executed by a processor, implement the method as described above.

[0078] The functions and effects achieved in this embodiment can be explained by referring to other embodiments, and will not be repeated here.

[0079] It is understood that the specific examples in this document are only intended to help those skilled in the art better understand the embodiments of this application, and are not intended to limit the scope of the invention.

[0080] It is understood that in the various embodiments of this application, the sequence number of each process does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0081] It is understood that the various implementation methods described in this application can be implemented individually or in combination, and the implementation methods in this application are not limited in this respect.

[0082] Unless otherwise stated, all technical and scientific terms used in the embodiments of this application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the scope of this application. The term "and/or" as used in this application includes any and all combinations of one or more of the associated listed items. The singular forms "a," "an," and "the" as used in the embodiments of this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.

[0083] It is understood that the processor in the embodiments of this application can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method embodiments can be completed by the integrated logic circuits in the processor's hardware or by instructions in software form. The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory; the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method.

[0084] It is understood that the memory in the embodiments of this application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. Specifically, non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may be random access memory (RAM). It should be noted that the memory in the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.

[0085] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0086] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the aforementioned method implementations, and will not be repeated here.

[0087] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0088] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0089] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0090] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0091] The above description is merely a specific embodiment of this application, but the scope of protection of this invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this invention should be determined by the scope of the claims.

Claims

1. A method for training a language model, characterized in that the method comprises:
obtaining multiple responses generated by the language model for the same prompt instruction;
obtaining a reward score and a confidence index value corresponding to each response, wherein the reward score represents a quality assessment value of the corresponding response, and the confidence index value represents the degree of confidence of the language model in the generated response;
calculating an adjustment coefficient based on the reward scores and confidence index values of the multiple responses, wherein the adjustment coefficient is derived based on the correlation between the reward scores and the confidence index values of the multiple responses;
calculating a calibrated advantage value corresponding to each response based on the reward score, the confidence index value, and the adjustment coefficient; and
replacing, with the calibrated advantage value, the original advantage value used in reinforcement learning to update the parameters of the language model, so as to update the parameters of the language model.

2. The training method according to claim 1, characterized in that the confidence index value is the average of the logarithmic probabilities of the tokens in the response when the language model generates the corresponding response.

3. The training method according to claim 2, characterized in that the logarithmic probability of each token in the response is obtained using the language model as the policy model, without calling an independent policy model to calculate the logarithmic probability.

4. The training method according to claim 1, characterized in that calculating the adjustment coefficient based on the reward scores and confidence index values of the multiple responses comprises:
calculating the Spearman rank correlation coefficient between the reward score sequence and the confidence index value sequence of the multiple responses, wherein the reward score sequence includes the reward scores of the multiple responses arranged in a specified order, and the confidence index value sequence includes the confidence index values of the multiple responses arranged in the specified order; and
calculating the adjustment coefficient from the Spearman rank correlation coefficient using a preset function.

5. The training method according to claim 4, characterized in that the output value of the preset function is negatively correlated with the Spearman rank correlation coefficient.

6. The training method according to claim 1, characterized in that calculating the calibrated advantage value corresponding to each response based on the reward score, the confidence index value, and the adjustment coefficient comprises:
calculating an initial calibration value for each response based on the reward score and the confidence index value;
multiplying the initial calibration value of each response by the adjustment coefficient to obtain an advantage calibration value for each response; and
adding the advantage calibration value of each response to the original advantage value to obtain the calibrated advantage value, wherein the original advantage value is calculated based on the reward score of the response.

7. The training method according to claim 6, characterized in that calculating the initial calibration value for each response based on the reward score and the confidence index value comprises:
standardizing the confidence index values of the multiple responses to obtain standardized confidence values; and
determining the initial calibration value for each response based on the difference between the reward score of the response and the standardized confidence value.

8. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, causes the processor to implement the method as described in any one of claims 1 to 7.

9. A computer device, characterized in that the computer device includes a memory and a processor, the memory storing at least one computer program, the at least one computer program being loaded and executed by the processor to implement the method as described in any one of claims 1 to 7.

10. A computer program product, characterized in that it includes computer instructions that, when executed by a processor, implement the method as described in any one of claims 1 to 7.