Training method and system of large language model
By employing selective positive pseudo-labels and entropy-gated negative pseudo-labels, the method addresses weak consensus and noise amplification in test-time reinforcement learning, enabling robust training and high-quality generation in complex tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing test-time reinforcement learning techniques are prone to label noise amplification under weak consensus, lack negative supervision mechanisms, and use reward mechanisms that cannot adapt dynamically under unlabeled conditions, resulting in unstable model training in complex reasoning tasks.
A selective positive pseudo-label and entropy-gated negative pseudo-label mechanism is adopted. Candidate responses are generated through multiple samplings, the answer distribution is constructed, high-confidence positive pseudo-labels are identified and low-support, high-uncertainty negative pseudo-labels are marked, and the model parameters are updated in combination with a dynamic reward reshaping mechanism.
It effectively suppresses label noise amplification under weak consensus, accurately identifies erroneous paths, improves the training stability and generalization ability of the model in complex reasoning tasks, and enhances the generation quality.
Figure CN122242574A_ABST
Description
Technical Field
[0001] This disclosure belongs to the field of artificial intelligence and natural language processing technology, specifically relating to a training method for large language models based on test-time reinforcement learning. More specifically, this invention relates to a test-time reinforcement learning method and system that, in unlabeled test data streams, uses selective positive pseudo-labels and entropy-gated negative pseudo-labels to suppress noise amplification and improve the model's complex reasoning ability.
Background Technology
[0002] In recent years, large language models (LLMs) have made significant progress in verifiable domains such as mathematical reasoning and code generation. To further improve the reasoning ability of models on complex tasks, reinforcement learning with verifiable rewards (RLVR) has become a mainstream approach. However, this method usually relies on a large amount of manually labeled data or ground-truth labels, which become difficult to obtain as task complexity increases.
[0003] To address the reliance on labeled data, Test-Time Reinforcement Learning (TTRL) emerged. TTRL allows models to improve themselves on unlabeled test data streams. Typically, the model generates multiple reasoning paths for the same problem and uses the consensus formed by "majority voting" as a pseudo-reward signal to update the policy.
[0004] However, existing TTRL solutions are mainly based on positive labeling strategies, which have the following core shortcomings when dealing with highly difficult reasoning tasks: 1) Label noise amplification under weak consensus: Existing methods assume that the "majority" answer is the correct answer. However, when dealing with complex problems or when sampling budgets are limited, the model's answer distribution is often highly dispersed, making it difficult to form a strong consensus. In this "weak consensus" scenario, traditional majority voting mechanisms are prone to mislabeling slightly more frequent but actually incorrect answers as positive samples. This erroneous supervision signal is further amplified during reinforcement learning (for example, with the GRPO algorithm), causing the model to converge prematurely to the wrong solution.
[0005] 2) Lack of negative supervision mechanisms: Existing technologies only focus on positive pseudo-labels, ignoring the value of negative signals. In fact, when the model is in a state of high uncertainty, although it is difficult to identify the correct answer with certainty, it is relatively easy and reliable to identify "obviously wrong" answers. Existing TTRL frameworks lack the ability to use negative labels to prune the search space, resulting in the inability to effectively utilize this complementary information to assist model optimization.
[0006] 3) The reward mechanism lacks dynamic adaptability: Existing reward allocation is usually static and does not take into account the strength of consensus. Giving rewards indiscriminately when consensus is weak will exacerbate the instability of training and lack a refined reward and punishment mechanism for answers with different confidence levels.
[0007] Therefore, there is an urgent need for a new test-time reinforcement learning framework that can accurately filter unreliable positive consensus in the absence of real labels and effectively use negative signals to eliminate erroneous paths, thereby achieving robust model self-evolution in complex reasoning tasks.
Summary of the Invention
[0008] This disclosure relates to the fields of artificial intelligence and natural language processing technology, specifically to a test-time reinforcement learning method and system for large language models, which can accurately filter unreliable positive consensus in unlabeled test data streams and identify negative trajectories using an entropy gating mechanism, thereby achieving a robust improvement in the model's reasoning ability.
[0009] One or more embodiments of this disclosure provide a training method for a large language model. The training method includes: sampling an unlabeled query multiple times using the large language model to be trained to generate multiple candidate responses, extracting multiple answers from the multiple candidate responses, and aggregating semantically identical answers among the multiple answers to construct an answer distribution, where the answer distribution represents the proportion of each different answer among the multiple candidate responses; identifying, in the answer distribution, the preferred answer with the highest occurrence proportion and the secondary answer with the second-highest occurrence proportion, and, when the occurrence proportion of the preferred answer satisfies a first predetermined condition and the difference between the occurrence proportion of the preferred answer and that of the secondary answer satisfies a second predetermined condition, determining the preferred answer as the positive pseudo-label for the unlabeled query; determining the prediction dispersion of each candidate response based on the word-level uncertainty of each prediction step of that candidate response, determining the answer generation uncertainty based on the mean of the prediction dispersion of the candidate responses corresponding to each answer in the answer distribution, and determining the average generation uncertainty based on the mean of all prediction dispersions of the multiple candidate responses; adding at least one answer in the answer distribution whose occurrence proportion is lower than a preset support threshold and whose answer generation uncertainty is higher than the average generation uncertainty to the negative pseudo-label set for the unlabeled query; and calculating the reward value for each candidate response based on the positive pseudo-label and the negative pseudo-label set, and updating the parameters of the large language model using a reinforcement learning algorithm based on the reward value.
[0010] According to one or more embodiments of this disclosure, the step of sampling the unlabeled query multiple times using a large language model to be trained includes: using the large language model to be trained to independently sample the unlabeled query multiple times based on a preset sampling temperature, thereby generating multiple candidate responses.
[0011] According to one or more embodiments of this disclosure, the first predetermined condition is that the occurrence ratio of the preferred answer is greater than or equal to a preset consensus threshold; the second predetermined condition is that the difference is greater than a preset boundary threshold; and the training method further includes: setting the positive pseudo-label to an empty set when the occurrence ratio of the preferred answer is less than the consensus threshold or the difference is less than or equal to the boundary threshold.
[0012] According to one or more embodiments of this disclosure, the word-level uncertainty is obtained using the Shannon entropy formula and characterizes the dispersion of the large language model's prediction of the next word at each prediction step of each candidate response; the prediction dispersion is obtained by averaging the word-level uncertainty over all generation steps of the candidate response; and the answer generation uncertainty is obtained by averaging the prediction dispersion of all candidate responses belonging to the same answer.
[0013] According to one or more embodiments of this disclosure, calculating the reward value for each candidate response includes: determining the reward value based on a positive reward item, a negative penalty item, and an uncertainty penalty item, wherein, when the answer of the candidate response belongs to a positive pseudo-label, the positive reward item is determined as a positive value that is positively correlated with the proportion of the answer in the answer distribution; when the answer of the candidate response belongs to a set of negative pseudo-labels, the negative penalty item is determined as a negative value based on the difference between the proportion of the answer and a preset support threshold; and the uncertainty penalty item is determined as the difference between the uncertainty of the candidate response's answer generation and the average generation uncertainty multiplied by a preset penalty coefficient.
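The three reward items described above can be illustrated with a minimal Python sketch. The paragraph fixes only the sign and dependence of each item, so the concrete choices here are assumptions made for illustration: the positive item is taken to be the answer's proportion itself, the negative item is taken to be the proportion minus the support threshold, the items are combined by simple addition with the uncertainty penalty subtracted, and the function name `reward_value` and the default values of `support_threshold` and `penalty_coeff` are hypothetical.

```python
def reward_value(answer, share, answer_uncertainty, mean_uncertainty,
                 positive_label, negative_set,
                 support_threshold=0.1, penalty_coeff=0.1):
    """Illustrative reward for one candidate response (all forms are assumptions)."""
    # Positive reward item: a positive value that grows with the answer's proportion.
    positive_item = share if (positive_label is not None and answer == positive_label) else 0.0
    # Negative penalty item: negative because answers in the negative set
    # have a proportion below the support threshold.
    negative_item = (share - support_threshold) if answer in negative_set else 0.0
    # Uncertainty penalty item: gap to the average generation uncertainty,
    # scaled by the penalty coefficient.
    uncertainty_penalty = penalty_coeff * (answer_uncertainty - mean_uncertainty)
    # Assumed combination: add the first two items and subtract the penalty.
    return positive_item + negative_item - uncertainty_penalty
```

Under these assumptions, a response whose answer is the positive pseudo-label is rewarded in proportion to the consensus strength, while a response whose answer is rare and uncertain is pushed below zero by both the negative item and the uncertainty penalty.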
[0014] According to one or more embodiments of this disclosure, the reinforcement learning algorithm is a group relative policy optimization algorithm.
[0015] One or more embodiments of this disclosure provide a training system for a large language model, including: a pseudo-label estimation unit and a policy optimization unit. The pseudo-label estimation unit includes: a sampling module configured to sample an unlabeled query multiple times using the large language model to be trained to generate multiple candidate responses, extract multiple answers from the multiple candidate responses, and aggregate semantically identical answers among the multiple answers to construct an answer distribution, where the answer distribution represents the proportion of each different answer among the multiple candidate responses; a selective positive pseudo-label generation module configured to identify, in the answer distribution, the preferred answer with the highest occurrence proportion and the secondary answer with the second-highest occurrence proportion, and, when the occurrence proportion of the preferred answer satisfies a first predetermined condition and the difference between the occurrence proportion of the preferred answer and that of the secondary answer satisfies a second predetermined condition, determine the preferred answer as the positive pseudo-label for the unlabeled query; and an entropy-gated negative pseudo-label generation module configured to determine the prediction dispersion of each candidate response based on the word-level uncertainty of each prediction step of that candidate response, determine the answer generation uncertainty based on the mean of the prediction dispersion of the candidate responses corresponding to each answer in the answer distribution, determine the average generation uncertainty based on the mean of all prediction dispersions of the multiple candidate responses, and add at least one answer in the answer distribution whose occurrence proportion is lower than the preset support threshold and whose answer generation uncertainty is higher than the average generation uncertainty to the negative pseudo-label set for the unlabeled query. The policy optimization unit is configured to calculate the reward value of each candidate response based on the positive pseudo-label and the negative pseudo-label set, and update the parameters of the large language model based on the reward value using a reinforcement learning algorithm.
[0016] One or more embodiments of this disclosure provide an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program; wherein, when the processor executes the computer program, it implements a training method for a large language model as described in any of the above embodiments.
[0017] The test-time reinforcement learning method and system for large language models provided in this disclosure combines two mechanisms: selective positive supervision and entropy-gated negative supervision. Compared with a single majority voting method, it can more effectively suppress the label noise amplification problem under weak consensus. By introducing an uncertainty measure, it can accurately identify and prune the erroneous search space in the absence of real labels, only penalizing answers that are both rare and highly uncertain, thus avoiding the accidental deletion of potentially correct rare solutions. Through a dynamic reward reshaping mechanism, the reward magnitude is adaptively adjusted according to the consensus strength and generation uncertainty, which significantly improves the training stability and generalization ability of the model under complex inference tasks and limited sampling budgets, and also improves the generation quality of the model.
Attached Figure Description
[0018] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.
[0019] Figure 1 is a flowchart of the training method for a large language model according to the present disclosure.
[0020] Figure 2A and Figure 2B show a comparison between a pseudo-labeling strategy under weak consensus based on the prior art and a pseudo-labeling strategy under weak consensus based on an embodiment of this disclosure.
[0021] Figure 3 is a schematic diagram illustrating test-time reinforcement learning based on the selective complementary strategy of this disclosure.
[0022] Figure 4 is a statistical diagram illustrating positive and negative pseudo-label estimation in test-time reinforcement learning based on the selective complementarity strategy of this disclosure.
[0023] Figure 5 is a block diagram of a training system for a large language model based on this disclosure.
[0024] Figure 6 is a structural block diagram of an electronic device according to the present disclosure.
Detailed Implementation
[0025] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.
[0026] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such order can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following examples do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0027] It should be noted that the phrase "at least one of several items" in this disclosure refers to three parallel cases: "any one of the several items", "a combination of any number of the several items", and "all of the several items". For example, "including at least one of A and B" includes the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Another example is "performing at least one of step one and step two", which means the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing both step one and step two.
[0028] Various embodiments of this disclosure will now be described in detail. Numerous specific details are set forth in the following description in order to provide a thorough understanding of this disclosure. However, it will be apparent to those skilled in the art that this disclosure can be practiced without some of these specific details.
[0029] Large Language Models (LLMs) are deep learning models trained on massive amounts of text data, capable of understanding, generating, and reasoning about natural language. In the training of LLMs, Test-Time Reinforcement Learning (TTRL) is an unsupervised learning paradigm performed during the model's inference phase. In existing technologies, multiple inference samples are typically generated on unlabeled test data streams, and pseudo-reward signals are generated using consensus among the samples (such as majority voting), thereby updating the model policy online to improve inference performance. Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that calculates relative advantages and updates the policy by comparing multiple outputs (a group) generated from the same input. It is commonly used in existing technologies for inference optimization of large language models. This invention uses this algorithm as the basic optimizer. Weak consensus refers to a state where, when dealing with highly difficult problems or with a limited number of samples, the multiple answers generated by a large language model are relatively dispersed, lacking a single, dominant answer. In this state, traditional majority voting methods are prone to selecting incorrect answers.
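The group-relative advantage mentioned in the GRPO definition above can be illustrated with a short Python sketch. It covers only the advantage step, in the form GRPO is commonly described (each reward standardized by the mean and standard deviation of its own group); the clipped policy objective and any regularization are omitted, and the function name `group_relative_advantages` and the stabilizer `eps` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize the rewards of one group of responses to the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumption) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages in this sketch are zero, so that group contributes no preference among its responses.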
[0030] This application relates to the fields of artificial intelligence and natural language processing, and can be widely applied to computer technology scenarios requiring highly reliable logical reasoning, such as intelligent assisted programming and automated modeling of complex industrial equipment. As a preferred example of a complex task application, the method described in this application can be used for "automated code logic generation and adaptive debugging for complex engineering systems." In practical applications, when large language models face highly challenging programming tasks, the generated candidate codes often exhibit extremely high dispersion. Conventional majority voting mechanisms are prone to mistaking majority code containing potential logical errors as the correct answer. Applying the technical solution of this application, the system first performs a strict consistency check on the generated candidate code stream. Only when the generation ratio of a certain type of code scheme and its ratio difference with the suboptimal scheme both meet a preset threshold are they used as positive learning signals. When strong consensus is lacking, the system calculates the Shannon entropy of each code trajectory generated by the model to quantify uncertainty, and marks code schemes with low generation frequency and entropy values higher than the average level as negative pseudo-labels for search space pruning. Finally, the system constructs a dynamic reward function based on the above consensus strength and uncertainty penalty term to adjust the internal weights of the large language model. This solution dynamically avoids unreliable programming logic during the unlabeled testing phase by using objective data distribution and entropy calculation, effectively improving the accuracy of complex automated code generation and the stability of system operation.
[0031] Figure 1 is a flowchart of the training method for a large language model according to the present disclosure. Figure 2A and Figure 2B show a comparison between a pseudo-labeling strategy under weak consensus based on the prior art and a pseudo-labeling strategy under weak consensus based on an embodiment of this disclosure.
[0032] As shown in Figure 1, this disclosure aims to address the problems of weak consensus and label noise amplification caused by the dispersed distribution of answers in existing test-time reinforcement learning techniques for large language models. The purpose of this disclosure is to provide a test-time reinforcement learning method and system for large language models based on a selective complementarity strategy. This method and system can accurately filter unreliable positive consensus in the unlabeled test stream and identify negative trajectories using an entropy gating mechanism, thereby achieving a robust improvement in the model's reasoning ability.
[0033] In step S100, the unlabeled query is sampled multiple times using the large language model to be trained, generating multiple candidate responses. Multiple answers are extracted from the multiple candidate responses, and the semantically identical answers are aggregated to construct an answer distribution. The answer distribution represents the proportion of each different answer in the multiple candidate responses.
[0034] In this embodiment, the unlabeled query input by the user is subjected to inference sampling and distribution construction. That is, a set of candidate responses is generated using the current policy model, and the answer distribution is constructed based on the extracted answer statistics to obtain the inference state of the current model for the query.
[0035] In the embodiments of this disclosure, a candidate response refers to the output text generated by the large language model in response to an input query (e.g., an unlabeled query), containing a complete reasoning process or chain of thought, while the answer is the final conclusion or result extracted from the candidate response. For example, when the unlabeled query is a mathematical problem "calculate the value of 2+3", the first candidate response generated by the model might be "First, add 2 and 3 together, 2 plus 3 equals 5, so the final result is 5"; another generated candidate response might be "According to the addition principle, the result of 2+3 is 5". From these two different candidate responses, although their derivation processes (i.e., contextual descriptions) are not exactly the same, the final conclusion extracted from them is the answer "5". This relationship shows that the candidate response contains derivation steps and a final conclusion, while the answer can correspond to the core output result of the candidate response; multiple different candidate responses may correspond to the same semantically equivalent answer, and thus an answer distribution can be constructed by aggregating these answers.
[0036] According to one embodiment of this disclosure, the unlabeled query is sampled multiple times using a large language model to be trained, including: using the large language model to be trained to independently sample the unlabeled query multiple times based on a preset sampling temperature, thereby generating multiple candidate responses.
[0037] In the embodiments of this disclosure, setting an appropriate sampling temperature is a prerequisite for achieving effective multiple independent sampling. When generating text, the large language model predicts the probability distribution of the next word based on the given context. If the sampling temperature is set too low (e.g., close to 0), the model's output will tend towards a deterministic greedy search. In this case, no matter how many independent samplings are performed, the model will repeatedly output the exact same candidate response, resulting in an inability to cover a sufficiently broad inference path. Conversely, when an effective sampling temperature greater than 0 is set (e.g., 0.7 or 1.0), the probability distribution of words is smoothed, allowing the model to select a suboptimal but reasonable word with a certain probability in each independent sampling process, thus introducing necessary randomness. Due to the randomness mechanism triggered by the sampling temperature, multiple independent samplings can generate multiple candidate responses with different content structures and inference approaches for the same query, avoiding homogenization of the generated results. This provides a data foundation for subsequently constructing a statistically significant answer distribution and accurately estimating model uncertainty.
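The effect of the sampling temperature described above can be illustrated with a minimal Python sketch of temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are assumptions made for illustration; the disclosure does not specify how the model's sampler is implemented.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn next-word logits into probabilities at a given sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 corresponds to greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next words.
sharp = softmax_with_temperature([2.0, 1.0, 0.0], 0.1)   # nearly deterministic
smooth = softmax_with_temperature([2.0, 1.0, 0.0], 1.0)  # leaves mass on alternatives
```

At a temperature close to 0 almost all probability falls on the top-ranked word, so repeated samples coincide; at a higher temperature the suboptimal words keep a non-negligible probability, which is what makes independent samples differ.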
[0038] As an example of an embodiment of this disclosure, inference sampling and distribution construction can be implemented through a processing framework that includes a candidate response generation unit and an answer distribution statistics unit.
[0039] The candidate response generation unit is configured to receive an unlabeled query and, using the current policy model at a set sampling temperature that ensures diversity in the generated results, to sample the query independently multiple times, generating a set of candidate responses. This process explores the solution space and provides a sufficient data foundation for subsequent consensus evaluation. The sampling temperature is set to balance the diversity and quality of responses; a higher sampling temperature allows the model to explore a wider range of solutions.
[0040] The answer distribution statistics unit is configured to analyze each candidate response in the set, extract its final answer, and construct the answer distribution. This unit further counts the number of occurrences of each different answer and calculates its proportion among all candidate responses, which quantifies the support level of each answer.
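The counting and proportion calculation described above can be illustrated with a minimal Python sketch. The helper `normalize_answer` is a hypothetical stand-in for the aggregation of semantically identical answers, which the disclosure does not specify in detail; here it only trims whitespace, drops a trailing period, and lowercases the text.

```python
from collections import Counter


def normalize_answer(answer):
    """Hypothetical stand-in for semantic aggregation of equivalent answers."""
    return answer.strip().rstrip(".").strip().lower()


def build_answer_distribution(answers):
    """Map each distinct answer to its proportion among all extracted answers."""
    counts = Counter(normalize_answer(a) for a in answers)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


# Four extracted answers, three of which are equivalent after normalization.
distribution = build_answer_distribution(["5", " 5 ", "5.", "6"])
```

In this example the answer "5" receives a proportion of 0.75 and the answer "6" a proportion of 0.25.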
[0041] Referring to Figure 1, Figure 2A, and Figure 2B, in step S200, the preferred answer with the highest occurrence rate and the secondary answer with the second-highest occurrence rate are identified in the answer distribution; if the occurrence rate of the preferred answer meets a first predetermined condition and the difference between the occurrence rate of the preferred answer and the occurrence rate of the secondary answer meets a second predetermined condition, the preferred answer is determined as a positive pseudo-label for the unlabeled query.
[0042] According to one embodiment of this disclosure, the first predetermined condition is that the occurrence ratio of the preferred answer is greater than or equal to a preset consensus threshold; the second predetermined condition is that the difference is greater than a preset boundary threshold; and the training method further includes: setting the positive pseudo-label to an empty set when the occurrence ratio of the preferred answer is less than the consensus threshold or the difference is less than or equal to the boundary threshold.
[0043] In this embodiment, the answer distribution is processed by selective positive pseudo-labeling, which means that by evaluating the concentration and separation of the answer distribution, the preferred answer with high confidence is selected as a positive supervision signal, or an abstention operation is performed when there is insufficient consensus.
[0044] Existing technologies typically employ simple majority voting strategies. As shown in Figure 2A, in a weak consensus state where the answer distribution is highly dispersed, simple majority voting can easily misclassify incorrect answers with only a slight advantage as positive labels, leading to noise amplification. To address this issue, this step employs the selective strategy shown in Figure 2B, which is implemented through a screening framework that includes a consensus strength assessment mechanism and a dual threshold discrimination mechanism (i.e., the consensus verification shown in Figure 1).
[0045] As an example of an embodiment of this disclosure, the consensus strength assessment mechanism is configured to analyze the answer distribution, identifying the most frequently occurring preferred answer and its corresponding first ratio, that is, the largest occurrence ratio in the distribution, and simultaneously identifying the second most frequent answer and its corresponding second ratio, that is, the second-largest occurrence ratio. These two ratios serve as the basic indicators for measuring consensus concentration and answer separation.
[0046] The dual threshold discrimination mechanism is configured to perform strict conditional judgments: first, it judges whether the first ratio is greater than or equal to the preset consensus threshold, which ensures that the preferred answer has sufficient absolute support; second, it judges whether the difference between the first ratio and the second ratio is greater than the preset boundary threshold, which ensures a significant separation between the preferred answer and the competing answers.
[0047] The selective positive pseudo-label generation unit (i.e., the Selective Positive Pseudo-Labeling (SPP) shown in Figure 1) is configured to output the final label based on the result of the dual threshold discrimination mechanism. If both of the above conditions are met simultaneously, the system determines that the current consensus has high confidence and marks the preferred answer as the positive pseudo-label. If either condition is not met, that is, the proportion of the preferred answer is less than the consensus threshold or the difference is less than or equal to the boundary threshold, the system determines that it is currently in a weak consensus state, performs an abstention operation, and sets the positive pseudo-label to an empty set. This avoids mistaking unreliable majority answers as training targets and prevents the model from prematurely converging to incorrect solutions in complex reasoning tasks.
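The dual threshold judgment and the abstention operation can be illustrated with a minimal Python sketch. The function name `select_positive_pseudo_label`, the use of `None` to represent the empty set, and the default threshold values are assumptions made for illustration; the disclosure does not state concrete threshold values in this passage.

```python
def select_positive_pseudo_label(distribution,
                                 consensus_threshold=0.5,
                                 boundary_threshold=0.1):
    """Return the preferred answer as the positive pseudo-label, or None to abstain."""
    if not distribution:
        return None
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    preferred_answer, first_ratio = ranked[0]
    # With a single distinct answer there is no competitor, so its ratio is 0.
    second_ratio = ranked[1][1] if len(ranked) > 1 else 0.0
    has_support = first_ratio >= consensus_threshold                # first condition
    is_separated = (first_ratio - second_ratio) > boundary_threshold  # second condition
    return preferred_answer if (has_support and is_separated) else None
```

For instance, a distribution of 0.75 and 0.25 passes both conditions, whereas a distribution of 0.50, 0.45, and 0.05 passes the consensus threshold but fails the boundary threshold, so the sketch abstains rather than labeling the narrow majority answer.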
[0048] In step S300, based on the word-level uncertainty of each prediction step of each candidate response among multiple candidate responses, the prediction dispersion of each candidate response is determined, the answer generation uncertainty is determined based on the mean of the prediction dispersion of the candidate responses corresponding to each answer in the answer distribution, and the average generation uncertainty is determined based on the mean of all prediction dispersions of multiple candidate responses; at least one answer in the answer distribution whose occurrence ratio is lower than the preset support threshold and whose answer generation uncertainty is higher than the average generation uncertainty is added to the negative pseudo-label set of the unlabeled query.
[0049] According to one embodiment of this disclosure, the word-level uncertainty is obtained using the Shannon entropy formula and characterizes the dispersion of the large language model's prediction of the next word at each prediction step of each candidate response; the prediction dispersion is obtained by averaging the word-level uncertainty over all generation steps of the candidate response; and the answer generation uncertainty is obtained by averaging the prediction dispersion of all candidate responses belonging to the same answer.
[0050] In this embodiment, entropy-gated negative pseudo-labeling is performed on candidate responses. This involves identifying and marking erroneous reasoning paths by combining the frequency of answer occurrence with generation uncertainty, and constructing a set of negative pseudo-labels to prune the search space in subsequent steps.
[0051] As an example of an embodiment of this disclosure, entropy-gated negative pseudo-label processing (i.e., the Entropy-Gated Negative Pseudo-Labeling shown in Figure 1) can be implemented through an analytical framework that includes an uncertainty aggregation unit and a negative set building unit.
[0052] The uncertainty aggregation unit is configured to first use the Shannon entropy formula to calculate the token-level uncertainty of each candidate response at each generation step; this value reflects the dispersion of the model's prediction of the next token in the current context. Then, for each candidate response, the entropy values of all generation steps are averaged to obtain the prediction dispersion (also known as trajectory-level uncertainty), which eliminates the impact of differences in response length. Finally, the prediction dispersions of all candidate responses belonging to the same answer are aggregated and averaged to obtain the answer generation uncertainty (also known as answer-level uncertainty), which quantifies the overall confidence level of the model when generating that particular answer.
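As a non-limiting illustration, the three aggregation levels described above (per step, per candidate response, per answer) might be sketched as follows. The function names and the representation of each prediction step as a list of next-token probabilities are assumptions made only for this example; entropy is computed in nats.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_uncertainty(step_distributions: Sequence[Sequence[float]]) -> float:
    """Prediction dispersion: mean token entropy over all generation steps."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


def aggregate_uncertainty(
    answers: Sequence[Hashable],
    dispersions: Sequence[float],
) -> Tuple[Dict[Hashable, float], float]:
    """Return (answer-level uncertainty per answer, query-level average).

    answers[i] is the extracted answer of candidate response i and
    dispersions[i] is that response's prediction dispersion.
    """
    grouped = defaultdict(list)
    for answer, dispersion in zip(answers, dispersions):
        grouped[answer].append(dispersion)
    per_answer = {a: sum(v) / len(v) for a, v in grouped.items()}
    query_average = sum(dispersions) / len(dispersions)
    return per_answer, query_average
```

Averaging over steps rather than summing keeps long and short responses comparable, which matches the stated purpose of eliminating the impact of response length.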
[0053] The negative set building unit is configured to filter the answer distribution based on a preset low support threshold (also known as the support threshold) and the query-level average uncertainty (also known as the average generation uncertainty). The specific filtering logic is to identify those answers whose occurrence ratio is less than the low support threshold and whose answer generation uncertainty is greater than or equal to the query-level average uncertainty. Answers meeting both conditions are considered incorrect answers that both lack consensus support and exhibit high generation uncertainty, and are added to the negative pseudo-label set. This configuration ensures that the system prunes only unreliable error paths while retaining potentially correct answers that occur infrequently but are generated with high confidence, thus achieving complementary construction of supervision signals.
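As a non-limiting illustration, the filtering logic of the negative set building unit might be sketched as follows. The function name, the dictionary-based inputs, and the default support threshold are assumptions made only for this example. The sketch uses the greater-than-or-equal comparison stated in this paragraph, whereas the description of step S300 states a strict comparison.

```python
from typing import Dict, Hashable, Set


def build_negative_set(
    answer_ratio: Dict[Hashable, float],
    answer_uncertainty: Dict[Hashable, float],
    query_avg_uncertainty: float,
    support_threshold: float = 0.1,  # hypothetical value
) -> Set[Hashable]:
    """Collect answers that are both rare and generated with high uncertainty."""
    return {
        answer
        for answer, ratio in answer_ratio.items()
        if ratio < support_threshold
        and answer_uncertainty[answer] >= query_avg_uncertainty
    }
```

An answer that is rare but generated with low entropy fails the second condition and is therefore retained rather than pruned.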
[0054] In step S400, based on the sets of positive and negative pseudo-labels, the reward value of each candidate response is calculated, and the parameters of the large language model are updated using a reinforcement learning algorithm based on the reward value.
[0055] According to one embodiment of this disclosure, calculating the reward value for each candidate response includes determining the reward value based on a positive reward term, a negative penalty term, and an uncertainty penalty term. The positive reward term indicates that, when the answer of the candidate response belongs to the positive pseudo-label, the reward value of the candidate response is determined to be a positive value that is positively correlated with the proportion of the answer in the answer distribution. The negative penalty term indicates that, when the answer of the candidate response belongs to the negative pseudo-label set, the reward value of the candidate response is determined to be a negative value based on the difference between the proportion of the answer and the preset support threshold. The uncertainty penalty term is a preset penalty coefficient multiplied by the difference between the answer generation uncertainty of the candidate response and the average generation uncertainty.
[0056] According to one embodiment of this disclosure, the reinforcement learning algorithm used to update the parameters of the large language model is a group relative policy optimization algorithm.
[0057] In this embodiment, dynamic reward shaping and policy optimization are performed. Specifically, a dynamic reward signal with multiple constraints is constructed based on the positive and negative pseudo-labels generated in the preceding steps, and the parameters of the large language model are updated using a group relative policy optimization algorithm to achieve self-evolution of the model's reasoning ability.
[0058] As an example of an embodiment of this disclosure, dynamic reward reshaping and policy optimization are implemented through an optimization framework that includes a dynamic reward calculation unit and a parameter iterative update unit.
[0059] The dynamic reward calculation unit is configured to calculate a comprehensive reward value for each candidate response. The calculation logic for this comprehensive reward value includes three core components: a positive reward term, a negative penalty term, and an uncertainty penalty term.
[0060] The positive reward term is configured to apply when the candidate response's answer belongs to the positive pseudo-label. In that case, the positive reward term is determined to be a positive value proportional to the proportion of the answer in the answer distribution, so that the incentive magnitude is dynamically adjusted according to the strength of consensus.
[0061] The negative penalty term is configured to apply when the candidate response's answer belongs to the negative pseudo-label set. In that case, the negative penalty term is determined to be a negative value based on the difference between the occurrence ratio of the answer and the low support threshold, in order to penalize erroneous paths that both lack support and exhibit high generation uncertainty.
[0062] The uncertainty penalty term is configured to calculate the difference between the generation uncertainty of the response and the query-level average uncertainty and to multiply that difference by a preset penalty coefficient. This imposes a soft constraint on highly uncertain generation behavior, guiding the model to explore more certain regions.
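As a non-limiting illustration, one possible way of combining the three terms into a comprehensive reward value is sketched below. The exact functional forms are assumptions made only for this example: the positive term is taken to equal the occurrence ratio, the negative term is taken to equal the occurrence ratio minus the support threshold, answers in neither label set receive a base of zero, the uncertainty penalty is subtracted from the base, and the uncertainty used is the answer generation uncertainty. The function name, parameter names, and default values are likewise hypothetical.

```python
from typing import Dict, Hashable, Optional, Set


def shaped_reward(
    answer: Hashable,
    answer_ratio: Dict[Hashable, float],
    answer_uncertainty: Dict[Hashable, float],
    query_avg_uncertainty: float,
    positive_label: Optional[Hashable],
    negative_set: Set[Hashable],
    support_threshold: float = 0.1,  # hypothetical value
    penalty_coeff: float = 0.1,      # hypothetical value
) -> float:
    """Comprehensive reward for one candidate response (assumed additive form)."""
    ratio = answer_ratio[answer]
    if positive_label is not None and answer == positive_label:
        base = ratio  # assumed: reward grows with consensus strength
    elif answer in negative_set:
        base = ratio - support_threshold  # negative because ratio < threshold
    else:
        base = 0.0  # assumed: no label-based signal for unlabeled answers
    # Soft constraint: above-average uncertainty lowers the reward,
    # below-average uncertainty raises it slightly.
    penalty = penalty_coeff * (answer_uncertainty[answer] - query_avg_uncertainty)
    return base - penalty
```

Under these assumptions, a positive-labeled answer with below-average uncertainty receives slightly more than its occurrence ratio, while a negative-labeled answer is pushed further below zero by its above-average uncertainty.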
[0063] The parameter iterative update unit is configured to train the model with the Group Relative Policy Optimization (GRPO) algorithm using the comprehensive reward value. The specific operation of the GRPO algorithm is as follows. First, the mean and standard deviation of the reward values within a group of candidate responses are calculated. Then, the group-relative advantage of each candidate response is calculated. Next, an objective function is constructed based on this relative advantage, which includes the importance sampling ratio and a clipping mechanism. Finally, by maximizing the objective function, the parameters of the large language model are iteratively updated using gradient-based optimization. This process allows the model to gradually improve its ability to handle complex inference tasks while maintaining training stability, avoiding training collapse caused by excessive variance in the reward signal.
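As a non-limiting illustration, the group-relative advantage and the clipped per-sample objective described above might be sketched as follows. This follows the commonly used form of GRPO and is not necessarily the exact formulation of this disclosure; the function names, the stabilizing constant `eps`, and the clipping range are assumptions made only for this example, and the gradient update itself is omitted.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of candidate responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps (hypothetical) avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample clipped objective; ratio is the importance sampling ratio."""
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum keeps the objective pessimistic, which bounds
    # how far a single update can move the policy.
    return min(ratio * advantage, clipped_ratio * advantage)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason a reward with several graded terms can be useful in this setting.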
[0064] Figure 3 is a schematic diagram illustrating the results of test-time reinforcement learning based on the selective complementary strategy of this disclosure. Figure 4 is a statistical diagram illustrating the positive and negative pseudo-label estimation of test-time reinforcement learning based on the selective complementary strategy of this disclosure.
[0065] To verify the effectiveness of the embodiments disclosed herein, tests were conducted on publicly available mathematical reasoning datasets such as AIME25, MATH-500, AMC, and Minerva, using various mainstream large language models such as Qwen2.5-3B and Qwen2.5-Math-7B as base models. The test results are shown in Figures 3 and 4. The experimental results demonstrate that the Selective-Complementary Reinforcement Learning (SCRL) method for large language models proposed in this disclosure, which is based on a selective complementary strategy, achieves significant performance improvements at test time.
[0066] Specifically, as shown in Figure 3, taking the Qwen2.5-3B model on the highly challenging AIME25 dataset as an example, this disclosure significantly improves the accuracy (Pass@1) from 2.6% to 8.4% under a limited sampling budget compared to the TTRL method.
[0067] Part (a) of Figure 4 presents the statistics for positive pseudo-labels. Here, Average Label Accuracy (TTRL) refers to the average accuracy of the positive pseudo-labels generated by the baseline method TTRL. Label Accuracy (this disclosure) is the accuracy of the positive pseudo-labels generated by the SCRL method as a function of the number of training steps. Label Ratio (this disclosure) is the proportion of samples for which the SCRL method successfully generates a positive pseudo-label. It can be seen that the SCRL method of this disclosure consistently maintains a higher label accuracy than TTRL during training, demonstrating that the selective mechanism effectively filters noise under weak consensus.
[0068] Part (b) of Figure 4 presents the statistics for negative pseudo-labels. Label Accuracy (this disclosure) is the accuracy of the negative pseudo-labels generated by the SCRL method as a function of the number of training steps. Average Label Count refers to the average number of negative pseudo-labels generated by the SCRL method. The results show that the accuracy of the entropy-gated negative labels is close to 100%, verifying the high reliability of identifying erroneous paths using the combination of low frequency and high entropy.
[0069] Figure 5 is a block diagram of a training system for a large language model according to this disclosure.
[0070] Referring to Figure 5, the system includes an input interface 5100, a processing core 5300, and an output interface 5500. The processing core 5300 is composed of a pseudo-label estimation unit 5310 and a policy optimization unit 5320 connected to each other. Descriptions of the input and output interfaces are omitted here.
[0071] The pseudo-label estimation unit 5310 is configured to execute all the processing logic of steps S100 to S300 described above. In an embodiment, the pseudo-label estimation unit 5310 includes: a sampling module 5311, a selective positive pseudo-label generation module 5313, and an entropy-gated negative pseudo-label generation module 5315.
[0072] The sampling module 5311 is configured to sample the unlabeled query multiple times using the large language model to be trained, generating multiple candidate responses for each sample. Based on these candidate responses, multiple answers are extracted, and semantically similar answers are aggregated to construct an answer distribution, which represents the proportion of different answers appearing in the multiple candidate responses. The sampling module 5311 is configured to perform the operation described in step S100 above; redundant descriptions are omitted here.
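As a non-limiting illustration, constructing the answer distribution from the extracted answers might be sketched as follows. The `normalize_answer` helper is a hypothetical and deliberately simple stand-in for the aggregation of semantically equivalent answers (it only trims whitespace, lowercases, and canonicalizes plain numbers); this disclosure does not specify how semantic equivalence is determined, and sampling and answer extraction are assumed to have been performed already.

```python
from collections import Counter
from typing import Dict, Sequence


def normalize_answer(raw: str) -> str:
    """Hypothetical canonical form so that equivalent answers share one key."""
    text = raw.strip().lower().replace(" ", "")
    try:
        # "42", "42.0" and "4.2e1" all map to the same key.
        return format(float(text), "g")
    except ValueError:
        return text  # non-numeric answers are compared as cleaned strings


def build_answer_distribution(extracted_answers: Sequence[str]) -> Dict[str, float]:
    """Map each distinct answer to its occurrence ratio among all responses."""
    keys = [normalize_answer(a) for a in extracted_answers]
    total = len(keys)
    return {answer: count / total for answer, count in Counter(keys).items()}
```

The resulting dictionary of occurrence ratios is the form of input that the positive and negative pseudo-label modules operate on.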
[0073] The selective positive pseudo-label generation module 5313 is configured to identify, in the answer distribution, the preferred answer with the highest occurrence rate and the secondary answer with the second-highest occurrence rate; if the occurrence rate of the preferred answer meets a first predetermined condition, and the difference between the occurrence rate of the preferred answer and the occurrence rate of the secondary answer meets a second predetermined condition, the preferred answer is determined as the positive pseudo-label of the unlabeled query. The selective positive pseudo-label generation module 5313 is configured to perform the operation of step S200 as described above, and redundant descriptions are omitted here.
[0074] The entropy-gated negative pseudo-label generation module 5315 is configured to determine the prediction dispersion of each candidate response based on the token-level uncertainty of each prediction step of each candidate response among multiple candidate responses; determine the answer generation uncertainty based on the mean of the prediction dispersion of the candidate responses corresponding to each answer in the answer distribution; and determine the average generation uncertainty based on the mean of all prediction dispersions of multiple candidate responses. At least one answer in the answer distribution whose occurrence rate is lower than a preset support threshold and whose answer generation uncertainty is higher than the average generation uncertainty is added to the negative pseudo-label set of the unlabeled query. The entropy-gated negative pseudo-label generation module 5315 is configured to perform the operation of step S300 as described above; redundant descriptions are omitted here.
[0075] The policy optimization unit 5320 is configured to calculate the reward value for each candidate response based on the set of positive pseudo-labels and the set of negative pseudo-labels, and to update the parameters of the large language model based on the reward value using a reinforcement learning algorithm. The policy optimization unit 5320 is configured to perform the operation of step S400 described above; redundant descriptions are omitted here.
[0076] One or more embodiments of this disclosure combine selective positive supervision and entropy-gated negative supervision mechanisms, which, compared to a method based on majority voting alone, can more effectively suppress the label noise amplification problem under weak consensus. By introducing an uncertainty metric, the erroneous search space can be accurately identified and pruned in the absence of true labels, penalizing only answers that are both rare and highly uncertain and thus avoiding the accidental removal of potentially correct rare solutions. In addition, through a dynamic reward reshaping mechanism, the reward magnitude is adaptively adjusted according to the consensus strength and generation uncertainty, significantly improving the training stability and generalization ability of the model under complex inference tasks and limited sampling budgets, and enhancing the generation quality of the model.
[0077] Figure 6 is a block diagram of an electronic device 600 according to an embodiment of the present disclosure.
[0078] Referring to Figure 6, an electronic device 600 according to embodiments of the present disclosure may include a processor 610 and a memory 620. The processor 610 may include (but is not limited to) a central processing unit (CPU), a digital signal processor (DSP), a microcomputer, a field-programmable gate array (FPGA), a system-on-a-chip (SoC), a microprocessor, an application-specific integrated circuit (ASIC), etc. The memory 620 may store computer programs to be executed by the processor 610. The memory 620 includes high-speed random access memory and/or non-volatile computer-readable storage media. When the processor 610 executes the computer program stored in the memory 620, the training method for a large language model as described above can be implemented.
[0079] Examples of computer-readable storage media include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card storage (such as multimedia cards, secure digital (SD) cards, or extreme digital (xD) cards), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, and any other device configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer programs. In one example, the computer programs and any associated data, data files, and data structures are distributed across a networked computer system, such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner through one or more processors or computers.
[0080] While some embodiments of this disclosure have been shown and described, those skilled in the art will understand that modifications may be made to these embodiments without departing from the principles and spirit of this disclosure, which are defined by the claims and their equivalents.
Claims
1. A training method for a large language model, characterized in that the training method includes: using a large language model to be trained, sampling an unlabeled query multiple times to generate multiple candidate responses, extracting multiple answers from the multiple candidate responses, and aggregating answers with the same semantics to construct an answer distribution, which represents the proportion of different answers appearing in the multiple candidate responses; identifying, in the answer distribution, the preferred answer with the highest occurrence rate and the secondary answer with the second-highest occurrence rate, and, if the occurrence rate of the preferred answer meets a first predetermined condition and the difference between the occurrence rate of the preferred answer and the occurrence rate of the secondary answer meets a second predetermined condition, determining the preferred answer as the positive pseudo-label of the unlabeled query; determining the prediction dispersion of each candidate response based on the token-level uncertainty of each prediction step of each of the multiple candidate responses, determining the answer generation uncertainty based on the mean of the prediction dispersion of the candidate responses corresponding to each answer in the answer distribution, and determining the average generation uncertainty based on the mean of all prediction dispersions of the multiple candidate responses; adding at least one answer in the answer distribution whose occurrence rate is lower than a preset support threshold and whose answer generation uncertainty is higher than the average generation uncertainty to the negative pseudo-label set of the unlabeled query; and calculating a reward value for each candidate response based on the positive pseudo-label and the negative pseudo-label set, and updating the parameters of the large language model using a reinforcement learning algorithm based on the reward value.
2. The training method according to claim 1, characterized in that the step of sampling the unlabeled query multiple times using the large language model to be trained includes: using the large language model to be trained to independently sample the unlabeled query multiple times based on a preset sampling temperature, thereby generating the multiple candidate responses.
3. The training method according to claim 1, characterized in that: the first predetermined condition is that the proportion of the preferred answer is greater than or equal to a preset consensus threshold; the second predetermined condition is that the difference is greater than a preset boundary threshold; and the training method further includes setting the positive pseudo-label to an empty set when the occurrence ratio of the preferred answer is less than the consensus threshold or the difference is less than or equal to the boundary threshold.
4. The training method according to claim 1, characterized in that: the token-level uncertainty is obtained using the Shannon entropy formula and characterizes the degree of dispersion of the large language model's prediction of the next token at each prediction step of each candidate response; the prediction dispersion is obtained by averaging the token-level uncertainties of all generation steps of the candidate response; and the answer generation uncertainty is obtained by averaging the prediction dispersion of all candidate responses belonging to the same answer.
5. The training method according to claim 1, characterized in that calculating the reward value for each candidate response includes: determining the reward value based on a positive reward term, a negative penalty term, and an uncertainty penalty term; wherein, when the answer of the candidate response belongs to the positive pseudo-label, the positive reward term is determined to be a positive value that is positively correlated with the proportion of the answer appearing in the answer distribution; when the answer of the candidate response belongs to the negative pseudo-label set, the negative penalty term is determined to be a negative value based on the difference between the occurrence ratio of the answer and the preset support threshold; and the uncertainty penalty term is determined by multiplying the difference between the answer generation uncertainty of the candidate response and the average generation uncertainty by a preset penalty coefficient.
6. The training method according to claim 1, characterized in that the reinforcement learning algorithm is a group relative policy optimization algorithm.
7. A training system for a large language model, characterized in that it includes a pseudo-label estimation unit and a policy optimization unit, wherein the pseudo-label estimation unit includes: a sampling module configured to sample an unlabeled query multiple times using the large language model to be trained, generate multiple candidate responses, extract multiple answers from the multiple candidate responses, and aggregate answers with the same semantics to construct an answer distribution, which represents the proportion of different answers appearing in the multiple candidate responses; a selective positive pseudo-label generation module configured to identify, in the answer distribution, the preferred answer with the highest occurrence ratio and the secondary answer with the second-highest occurrence ratio, and, when the occurrence ratio of the preferred answer meets a first predetermined condition and the difference between the occurrence ratio of the preferred answer and the occurrence ratio of the secondary answer meets a second predetermined condition, determine the preferred answer as the positive pseudo-label of the unlabeled query; and
an entropy-gated negative pseudo-label generation module configured to determine the prediction dispersion of each candidate response based on the token-level uncertainty of each prediction step of each candidate response among the plurality of candidate responses, determine the answer generation uncertainty based on the mean of the prediction dispersion of the candidate responses corresponding to each answer in the answer distribution, determine the average generation uncertainty based on the mean of all prediction dispersions of the plurality of candidate responses, and add at least one answer in the answer distribution whose occurrence rate is lower than a preset support threshold and whose answer generation uncertainty is higher than the average generation uncertainty to the negative pseudo-label set of the unlabeled query; and wherein the policy optimization unit is configured to calculate the reward value of each candidate response based on the positive pseudo-label and the negative pseudo-label set, and update the parameters of the large language model based on the reward value using a reinforcement learning algorithm.
8. An electronic device, characterized in that it includes: a memory for storing a computer program; and a processor for executing the computer program; wherein, when the processor executes the computer program, it implements the training method for a large language model as described in any one of claims 1-6.