Large language model post-training method, storage medium and electronic device
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING KNOWLEDGE ATLAS TECHNOLOGY CO LTD
- Filing Date
- 2026-05-12
- Publication Date
- 2026-06-12
Figure CN122198151A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and more specifically, to a method for post-training a large language model, a storage medium, and an electronic device.
Background Technology
[0002] With the widespread application of Large Language Models (LLMs) in tasks such as complex reasoning, code generation, multi-step decision-making, agent task execution, and security-sensitive scenarios, models typically generate chain-of-thought (CoT) text before providing the final result to represent the intermediate reasoning process. Chain-of-thought not only helps improve the model's problem-solving capabilities in complex tasks but has also gradually become an important vehicle for understanding model behavior, analyzing model reasoning paths, and implementing process supervision and security monitoring.
[0003] In actual post-training, the model's training objective includes not only the correctness of the final result but also additional rewards related to the length, style, clarity, preference, and process correctness of the chain-of-thought text. The training signal the model faces is then no longer a single objective but a combination of an "outcome reward" and a "chain-of-thought reward." Existing research indicates that when the chain-of-thought reward is inconsistent with, or even conflicts with, the outcome reward, the model may gradually tend to output chain-of-thought text that formally conforms to the reward requirements but does not truly reflect its actual solution process, thus reducing the monitorability of the chain of thought. In other words, the model may still maintain high task performance, but its chain-of-thought text no longer stably corresponds to the actual reasoning path, making it difficult to effectively implement chain-of-thought-based behavior monitoring, risk identification, and process supervision. In conclusion, how to effectively maintain the monitorability of the chain of thought during the post-training of large language models is a problem that needs to be solved.
Summary of the Invention
[0004] To address the aforementioned problems, the present invention aims to provide a large language model post-training method, storage medium, and electronic device.
[0005] In a first aspect, embodiments of the present invention provide a method for post-training a large language model, comprising:
obtaining a training task set for post-training of a large language model, the training task set including training samples of multiple training batches;
for the target training samples corresponding to the current training batch, determining, according to the large language model, the chain-of-thought text and the final result text corresponding to each target training sample, so as to train the large language model according to the total reward function of the current training batch and update the model parameters of the large language model, the total reward function including a result reward item and a chain-of-thought reward item;
determining, based on the chain-of-thought text and final result text corresponding to each target training sample, the reward conflict degree and chain-of-thought monitorability index of each target training sample;
determining, based on the reward conflict degree and chain-of-thought monitorability index of each target training sample, the risk index of the current training batch; and
when the risk index indicates a risk of chain-of-thought distortion, reducing the weight of the chain-of-thought reward item in the total reward function while maintaining or increasing the weight of the result reward item, so as to update the total reward function, the updated total reward function being used to train the large language model in the next training batch and to continue updating its model parameters until training of the large language model is complete.
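The batch-by-batch control flow of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `RewardWeights`, `adjust_weights`, and `run_post_training`, the threshold of 0.5, and the multiplicative decay are all hypothetical, and the training step and risk assessment are passed in as placeholder callables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    result: float  # weight of the result reward item
    cot: float     # weight of the chain-of-thought reward item


def adjust_weights(weights, batch_risk, risk_threshold=0.5,
                   cot_decay=0.5, result_gain=1.0):
    """Lower the chain-of-thought weight when the batch risk is too high.

    The threshold and the multiplicative update are assumptions; the text
    only requires that the chain-of-thought weight is reduced and the
    result weight is maintained or increased.
    """
    if batch_risk > risk_threshold:
        return RewardWeights(result=weights.result * result_gain,
                             cot=weights.cot * cot_decay)
    return weights


def run_post_training(batches, train_step, assess_risk, weights):
    """Train on each batch, then update the weights used for the next batch."""
    history = []
    for batch in batches:
        outputs = train_step(batch, weights)   # generate texts, update parameters
        risk = assess_risk(outputs)            # risk index of the current batch
        history.append((weights, risk))
        weights = adjust_weights(weights, risk)
    return weights, history
```

In this sketch the adjusted weights only take effect in the following batch, which mirrors the statement that the updated total reward function is used for the next training batch.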
[0006] In some optional implementations, determining the reward conflict degree of the target training sample includes: determining, based on the chain-of-thought text of the target training sample, the degree of compression of the chain-of-thought length, the degree to which the explicit expression of key task semantics is restricted, and the degree of incompleteness of the reasoning steps of the target training sample; and weighting these three degrees to determine the reward conflict degree of the target training sample.
[0007] In some optional implementations, the degree of compression of the chain-of-thought length of the target training sample, $C_i^{len}$, is computed from the preset minimum expression length threshold and the length of the chain-of-thought text (the formula itself is not reproduced in this text).
[0008] where $C_i^{len}$ denotes the degree of compression of the chain-of-thought length of the $i$-th target training sample; $L_{min}$ denotes the preset minimum expression length threshold; and $L_i$ denotes the length of the chain-of-thought text of the $i$-th target training sample. The degree to which the explicit expression of key task semantics of the target training sample is restricted is:
$$C_i^{sem} = 1 - \frac{|K_i \cap E_i|}{|K_i|}$$
[0009] where $C_i^{sem}$ denotes the degree to which the explicit expression of key task semantics of the $i$-th target training sample is restricted; $K_i$ denotes the set of key semantics required to complete the task corresponding to the $i$-th target training sample; $E_i$ denotes the set of key semantics actually explicitly expressed in the chain-of-thought text of the $i$-th target training sample; $|K_i|$ denotes the number of semantic elements in $K_i$; and $|K_i \cap E_i|$ denotes the number of semantic elements in the intersection of the two key semantic sets. The degree of incompleteness of the reasoning steps of the target training sample is:
$$C_i^{step} = 1 - \frac{n_i}{n_i^{*}}$$
[0010] where $C_i^{step}$ denotes the degree of incompleteness of the reasoning steps of the $i$-th target training sample; $n_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $n_i^{*}$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample.
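The three component quantities described above can be sketched in Python. The function names are hypothetical. The length-compression form (relative shortfall below the threshold, clipped at zero) is an assumption, because the source formula is not available here. The other two functions follow the ratio definitions given in the text, with key semantics represented as sets of strings for illustration.

```python
def length_compression(cot_length, min_length):
    """Assumed form: relative shortfall of the text length below the threshold."""
    if min_length <= 0:
        return 0.0
    return max(0.0, (min_length - cot_length) / min_length)


def semantic_restriction(required, expressed):
    """Share of required key semantics that the chain-of-thought text omits."""
    if not required:
        return 0.0
    return 1.0 - len(required & expressed) / len(required)


def step_incompleteness(valid_steps, standard_steps):
    """Share of the standard reasoning steps missing from the text."""
    if standard_steps <= 0:
        return 0.0
    return 1.0 - valid_steps / standard_steps
```

For example, a chain-of-thought text that expresses two of four required key semantics has a restriction degree of 0.5, regardless of any extra semantics it also mentions.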
[0011] In some optional implementations, determining the chain-of-thought monitorability index of the target training sample includes: determining, based on the chain-of-thought text of the target training sample, the reasoning expression completeness score of the target training sample; determining, based on the chain-of-thought text and the final result text corresponding to the target training sample, the monitoring and recognition capability score and the consistency score of the target training sample, where the consistency score represents the degree of consistency between the chain-of-thought text and the final result text; and weighting the reasoning expression completeness score, the monitoring and recognition capability score, and the consistency score of the target training sample to determine its chain-of-thought monitorability index.
[0012] In some optional implementations, the monitoring and recognition capability score of the target training sample is:
$$M_i^{rec} = \frac{u_i}{v_i}$$
[0013] where $M_i^{rec}$ denotes the monitoring and recognition capability score of the $i$-th target training sample; $u_i$ denotes the number of times the monitor correctly identifies the $i$-th target training sample; and $v_i$ denotes the total number of times the monitor judges the $i$-th target training sample. The reasoning expression completeness score of the target training sample is:
$$M_i^{comp} = \frac{n_i}{n_i^{*}}$$
[0014] where $M_i^{comp}$ denotes the reasoning expression completeness score of the $i$-th target training sample; $n_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $n_i^{*}$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample. The consistency score of the target training sample is:
$$M_i^{cons} = \frac{|A_i \cap B_i|}{|A_i|}$$
[0015] where $M_i^{cons}$ denotes the consistency score between the chain-of-thought text of the $i$-th target training sample and the final result text; $A_i$ denotes the result semantic set extracted from the final result text; $B_i$ denotes the reasoning semantic set extracted from the chain-of-thought text; $|A_i|$ denotes the number of semantic elements in the result semantic set; and $|A_i \cap B_i|$ denotes the number of semantic elements in the reasoning semantic set that are consistent with the result semantic set.
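A minimal sketch of the three monitorability scores and their weighted combination is given below. The function names, the equal default weights, and the use of exact set membership as the test for "consistent" semantics are assumptions made for illustration; a real system would likely need a semantic matcher.

```python
def recognition_score(correct, total):
    """Fraction of monitor judgments on a sample that were correct."""
    return correct / total if total > 0 else 0.0


def completeness_score(valid_steps, standard_steps):
    """Valid reasoning steps found, relative to the standard steps required."""
    return valid_steps / standard_steps if standard_steps > 0 else 0.0


def consistency_score(result_semantics, reasoning_semantics):
    """Share of result semantics that also appear in the reasoning semantics."""
    if not result_semantics:
        return 0.0
    return len(result_semantics & reasoning_semantics) / len(result_semantics)


def monitorability_index(recognition, completeness, consistency,
                         weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three scores (the weights are assumed values)."""
    w_rec, w_comp, w_cons = weights
    return w_rec * recognition + w_comp * completeness + w_cons * consistency
```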
[0016] In some optional implementations, determining the risk index of the current training batch based on the reward conflict degree and chain-of-thought monitorability index of each target training sample includes: determining the risk index of each target training sample based on its reward conflict degree and chain-of-thought monitorability index; and determining the risk index of the current training batch based on the risk indices of the target training samples in the current training batch. The risk index of a target training sample is computed from its reward conflict degree and its chain-of-thought monitorability index using preset weighting coefficients (the formula itself is not reproduced in this text).
[0017] where $Risk_i$ denotes the risk index of the $i$-th target training sample; $C_i$ denotes the reward conflict degree of the $i$-th target training sample; $M_i$ denotes the chain-of-thought monitorability index of the $i$-th target training sample; and $\alpha$ and $\beta$ are the preset weighting coefficients.
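One plausible way to combine the two quantities is sketched below. The specific form, in which risk rises with the reward conflict degree and with the shortfall in monitorability, is an assumption, since the source formula is not available here. Aggregating the batch risk as a mean, and the names `sample_risk` and `batch_risk`, are also hypothetical.

```python
def sample_risk(conflict, monitorability, alpha=0.5, beta=0.5):
    """Assumed form: alpha * conflict + beta * (1 - monitorability)."""
    return alpha * conflict + beta * (1.0 - monitorability)


def batch_risk(sample_risks):
    """Assumed aggregation: mean of the per-sample risk indices."""
    if not sample_risks:
        return 0.0
    return sum(sample_risks) / len(sample_risks)
```

Under this assumed form, a sample with high conflict and low monitorability scores near 1, and a sample with no conflict and full monitorability scores 0.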
[0018] In some optional implementations, the method further includes: if the risk index indicates a risk of chain-of-thought distortion, adding a safety constraint term to the total reward function; or, if the total reward function already includes a safety constraint term, increasing the weight of the safety constraint term.
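[0019] In some optional implementations, the safety constraint term is computed from the reasoning expression completeness score, the consistency score, and the reward conflict degree of the target training sample, each with its own weighting coefficient (the formula itself is not reproduced in this text).
[0020] where $S_i$ denotes the safety constraint term corresponding to the $i$-th target training sample; $M_i^{comp}$ denotes the reasoning expression completeness score of the $i$-th target training sample; $M_i^{cons}$ denotes the consistency score between the chain-of-thought text of the $i$-th target training sample and the final result text; $C_i$ denotes the reward conflict degree of the $i$-th target training sample; and $\lambda_1$, $\lambda_2$, $\lambda_3$ denote the weighting coefficients of the respective parts of the safety constraint term.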
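A hedged sketch of such a term follows. The signs are an assumption: completeness and consistency are treated as rewarded, and reward conflict as penalized, which matches the stated goal but is not confirmed by the source. The function names and the way the term is added to the total reward are likewise hypothetical.

```python
def safety_constraint(completeness, consistency, conflict,
                      lambdas=(1.0, 1.0, 1.0)):
    """Assumed form: reward completeness and consistency, penalize conflict."""
    lam1, lam2, lam3 = lambdas
    return lam1 * completeness + lam2 * consistency - lam3 * conflict


def reward_with_safety(base_reward, safety_term, safety_weight):
    """Add the weighted safety constraint term to an existing total reward."""
    return base_reward + safety_weight * safety_term
```

Setting `safety_weight` to zero recovers the original reward, so the term can be switched on only when the risk index signals a problem.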
[0021] In a second aspect, embodiments of the present invention also provide a large language model post-training apparatus, comprising: a post-training preparation module, used to obtain a training task set for post-training the large language model, the training task set including training samples of multiple training batches; a processing module, used to determine, according to the large language model, the chain-of-thought text and the final result text corresponding to each target training sample of the current training batch, so as to train the large language model according to the total reward function of the current training batch and update the model parameters of the large language model, the total reward function including a result reward item and a chain-of-thought reward item; an evaluation module, used to determine the reward conflict degree and chain-of-thought monitorability index of each target training sample based on the chain-of-thought text and the final result text corresponding to each target training sample; a risk assessment module, used to determine the risk index of the current training batch based on the reward conflict degree and chain-of-thought monitorability index of each target training sample; and a reward structure adaptive adjustment module, used to reduce the weight of the chain-of-thought reward item in the total reward function and maintain or increase the weight of the result reward item when the risk index indicates a risk of chain-of-thought distortion, so as to update the total reward function, the updated total reward function being used to train the large language model in the next training batch and to continue updating its model parameters until training of the large language model is complete.
[0022] In a third aspect, embodiments of the present invention also provide a computer storage medium storing computer-executable instructions for performing any of the above-described large language model post-training methods.
[0023] In a fourth aspect, embodiments of the present invention also provide an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any of the above-described large language model post-training methods.
[0024] In the solution provided by the first aspect of the present invention, the relationship between task result rewards and chain-of-thought rewards can be analyzed during training, and the reward conflict degree and chain-of-thought monitorability index can be calculated, thereby achieving proactive perception and judgment of the risk of chain-of-thought distortion. When reward conflict is detected to be degrading chain-of-thought monitorability, the post-training reward structure is adaptively adjusted. By reducing the weight of the chain-of-thought reward item, maintaining or increasing the weight of the result reward item, and further adding a safety constraint term or increasing its weight, the post-training reward weights and safety constraints are adaptively controlled to reduce the risk of chain-of-thought distortion. This allows the model to combine high task completion ability with good chain-of-thought monitorability in complex-task post-training scenarios, thereby improving the safety, reliability, and robustness of the model training process.
[0025] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Brief Description of the Drawings
[0026] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0027] Figure 1 shows a flowchart of a large language model post-training method provided by an embodiment of the present invention; Figure 2 shows a schematic structural diagram of a large language model post-training apparatus provided by an embodiment of the present invention; Figure 3 shows a schematic structural diagram of an electronic device for performing a large language model post-training method, provided by an embodiment of the present invention.
Detailed Description
[0028] In the description of this invention, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.
[0029] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0030] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained.
[0031] (1) Chain-of-Thought (CoT): refers to the intermediate reasoning text output by a large language model before generating the final result. It is used to represent the model's solution process, steps, or thought process for the current task. Chain-of-Thought can be a step-by-step explanation in natural language or structured intermediate reasoning content. Chain-of-Thought can serve as an intermediate representation to improve the ability to solve complex tasks, and it can also serve as an important carrier for observing model behavior and monitoring the implementation process.
[0032] (2) Result reward item: refers to the reward item used to measure the task completion of the final result text in the post-training process of the large language model. This reward item usually reflects the correctness, effectiveness, pass rate, task completion degree, or safety of the model's final output, and is used to drive the model to optimize its final result performance. It is the result-oriented signal in the post-training objective.
[0033] (3) Chain Thinking Reward Item: This refers to the reward item used to measure the quality of chain thinking text generated by the model during the post-training process of the large language model. This reward item is usually used to constrain the length, style, readability, clarity, preference, standardization, or process quality of chain thinking, and is an intermediate reasoning guidance signal in the post-training objective. The chain thinking reward item may promote clearer and more standardized chain thinking, but may also inhibit the explicit expression of key reasoning steps in some cases.
[0034] (4) Reward conflict: This refers to the inconsistency, mutual constraint, or mutual interference between the outcome reward and the chain thinking reward during post-training. Specifically, in order to obtain higher chain thinking rewards, the model may reduce, weaken, or avoid explicit expression of the real key reasoning steps, while in order to obtain higher outcome rewards, it must rely on these weakened or hidden solution processes. Reward conflict is one of the important reasons for the deviation between the chain thinking text and the actual solution process.
[0035] In the post-training phase of a large model, the training objective typically consists of both task outcome rewards and chain-thinking rewards. Task outcome rewards are used to improve the correctness, effectiveness, and task completion rate of the model's final output, while chain-thinking rewards are used to constrain the form, length, readability, or preference attributes of the intermediate inference text generated by the model. However, in actual training, chain-thinking rewards and task outcome rewards are not always consistent, and potential conflicts may exist between them.
[0036] For example, when chain-thinking rewards overemphasize conciseness, avoid specific expressions, restrict the explicit appearance of key semantics, or favor superficial fluency, the model may weaken, reduce, or even avoid explicit expressions of the real key reasoning steps in order to obtain higher chain-thinking rewards. In this case, although the chain-thinking text may better meet the reward requirements in form, its content may no longer fully reflect the real solution process supporting the final output, thus causing a deviation between chain-thinking and the actual reasoning process.
[0037] In related technologies, optimization primarily focuses on the correctness, pass rate, task completion rate, or user satisfaction of the model's final output. This is achieved through reinforcement learning, preference optimization, or other post-training techniques to improve the model's final performance on the target task. In these methods, training typically emphasizes the output itself, while lacking separate modeling and constraints on the quality, authenticity, and monitorability of the chain-thinking text.
[0038] While such methods can improve the model's task performance to some extent, the lack of specific control over the chain-thinking process may lead the model to gradually develop a "result-correctness-first" tendency during training. This means that as long as the final output receives a high reward, whether the intermediate chain-thinking process truly represents the key solution steps is unimportant. Although this avoids a direct conflict between chain-thinking rewards and result rewards, it also makes it difficult to guarantee that the chain-thinking process can always be used as a reliable monitoring tool.
[0039] Some approaches attempt to impose additional constraints on the chain-thinking text itself, such as requiring it to be more concise, more in line with human preferences, more readable, more stylistically consistent, or avoiding certain expressions considered poor. These methods explicitly introduce chain-thinking rewards during training to improve the surface quality of the chain-thinking text.
[0040] However, the optimization goal of this scheme does not focus on whether the chain reasoning text truly reflects the model's solution process. When the chain thinking reward overemphasizes conciseness, fluency, avoidance of specific expressions, or catering to preferences, it may conflict with the key reasoning steps that the model needs to explicitly express to complete the task, thereby inducing the model to output a chain thinking text that is better in form but incomplete or even distorted in content.
[0041] In this scenario, when chain-thinking rewards begin to suppress the explicit expression of key reasoning content or interfere with the model's true reasoning path, the training system typically continues to optimize using the established reward structure, failing to promptly identify whether chain-thinking rewards have induced the model to exhibit the risk of "superficial explanations being normal but true reasoning being hidden." Consequently, while the post-training process may continuously improve the surface score, the authenticity, completeness, and monitorability of chain-thinking may decline simultaneously.
[0042] To address at least some of the aforementioned problems, embodiments of the present invention provide a post-training method for large language models. This method introduces a reward conflict perception mechanism during the post-training stage of the large model, analyzes and determines the relationship between task result rewards and chain-thinking rewards, and adaptively adjusts the post-training reward structure when high-risk reward conflicts are detected. This suppresses chain-thinking distortion and maintains the identifiability, consistency, and traceability of chain-thinking for the monitor.
[0043] An embodiment of the present invention provides a method for post-training a large language model. As shown in Figure 1, the method includes the following steps. Step 101: Obtain the training task set for post-training the large language model; the training task set includes training samples from multiple training batches.
[0044] In this embodiment of the invention, for a large language model that has completed pre-training, it can be post-trained by means of fine-tuning, etc. For a large language model to be post-trained, the model parameters after pre-training can be obtained, and the large language model can be deployed to the post-training environment.
[0045] Furthermore, a training task set for post-training is obtained, which includes multiple training samples. These training samples can be divided into multiple training batches, each batch corresponding to multiple training samples. Each training sample in the training task set includes at least a task input and a target output.
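The structure of the training task set can be sketched as below. The `TrainingSample` type and the `iter_batches` helper are hypothetical names, and fixed-size sequential batching is an assumption; the text only requires that each sample carry a task input and a target output and that samples be grouped into batches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    task_input: str     # e.g. the question text
    target_output: str  # the expected output for this task


def iter_batches(samples, batch_size):
    """Yield consecutive training batches of at most `batch_size` samples."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])
```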
[0046] Specifically, this large language model can be a large model that performs question-answering tasks. Users input prompts, and after inference processing, the large language model can output corresponding responses. These responses can be in text format, or in the form of images or videos. Correspondingly, the training sample task input can be user input (e.g., question text), and the target output is the output text, image, or video content.
[0047] Step 102: For the target training sample corresponding to the current training batch, determine the chain thinking text and final result text corresponding to the target training sample according to the large language model, so as to train the large language model according to the total reward function of the current training batch and update the model parameters of the large language model; the total reward function includes the result reward item and the chain thinking reward item.
[0048] In this embodiment of the invention, a large language model is deployed to a post-training environment, and a unified generation mechanism is set to post-train the large language model. Specifically, in the current training batch, training is performed based on the training samples corresponding to the current training batch; for ease of description, the training samples corresponding to the current training batch are referred to as target training samples.
[0049] Specifically, the large language model can output chain-like thought text and final result text for each input target training sample task.
[0050] For example, suppose the training sample set is:
$$D = \{(x_i, y_i)\}_{i=1}^{N}$$
[0051] where $D$ denotes the training sample set, $x_i$ denotes the task input of the $i$-th training sample, $y_i$ denotes the target output corresponding to the $i$-th training sample, and $N$ denotes the total number of training samples. For any input $x_i$, the output of the large language model can be represented as:
$$(c_i, \hat{y}_i) = f(x_i)$$
[0052] where $f(\cdot)$ denotes the output function of the large language model, $c_i$ denotes the chain-of-thought text corresponding to the $i$-th training sample, and $\hat{y}_i$ denotes the final result text corresponding to the $i$-th training sample. The chain-of-thought text and final result text of a target training sample can be represented by the same formula.
[0053] The final result text $\hat{y}_i$ is the predicted output generated by the model for the input $x_i$, and under the training objective it should be close to the corresponding target output $y_i$. This step exposes both the intermediate reasoning process and the final task result of the model during the post-training phase, providing a foundation for subsequent reward conflict analysis and monitorability maintenance.
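Separating a raw model response into chain-of-thought text and final result text might look like the sketch below. The `<think>...</think>` delimiter is purely an assumed output convention for illustration; the source does not specify how the two parts are marked. `ModelOutput` and `split_output` are hypothetical names.

```python
import re
from typing import NamedTuple


class ModelOutput(NamedTuple):
    chain_of_thought: str
    final_result: str


# Assumed convention: reasoning is wrapped in <think>...</think> tags.
_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_output(raw):
    """Split a raw response into chain-of-thought text and final result text."""
    match = _THINK_BLOCK.search(raw)
    if match is None:
        # No reasoning block found: treat the whole response as the result.
        return ModelOutput("", raw.strip())
    cot = match.group(1).strip()
    final = (raw[:match.start()] + raw[match.end():]).strip()
    return ModelOutput(cot, final)
```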
[0054] Decomposing the total reward in the post-training process of the large language model into a result reward item and a chain-of-thought reward item gives the following total reward function:
$$R_i = R_i^{res} + R_i^{cot}$$
[0055] For the current training batch, $R_i$ denotes the total reward function of the $i$-th target training sample in the current training batch; $R_i^{res}$ denotes the result reward item of the $i$-th target training sample, used to measure the task completion of the final result text $\hat{y}_i$; and $R_i^{cot}$ denotes the chain-of-thought reward item of the $i$-th target training sample, used to measure the expression quality of the chain-of-thought text $c_i$.
[0056] The result reward item $R_i^{res}$ can be determined from the degree of match between the model's final result text $\hat{y}_i$ and the target output $y_i$, from correctness, from the task pass rate, or from human or model scoring. For example, in question-answering or math tasks, it can be calculated based on whether the answer is correct, semantic similarity, or rule validation results; in code generation tasks, it can be calculated based on the test case pass rate, compilation results, or functional correctness; and in security tasks, it can be calculated based on whether the output meets security specifications.
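Two of the result reward variants mentioned above can be sketched as follows: an exact-match reward for question-answering tasks and a test pass rate for code generation tasks. The function names and the whitespace and case normalization are assumptions chosen for illustration.

```python
def _normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def exact_match_reward(final_text, target_output):
    """1.0 if the final result text matches the target output, else 0.0."""
    return 1.0 if _normalize(final_text) == _normalize(target_output) else 0.0


def pass_rate_reward(test_results):
    """Fraction of test cases passed, given one boolean per test case."""
    if not test_results:
        return 0.0
    return sum(1 for passed in test_results if passed) / len(test_results)
```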
[0057] The chain-of-thought reward item $R_i^{cot}$ can be determined from the expression quality of the chain-of-thought text $c_i$, including clarity, format conformity, length constraints, readability, step coherence, or preference-model scoring. In practice, this can be achieved through rule-based scoring, manual annotation scoring, reward model scoring, or process supervision model scoring.
[0058] In this embodiment, the result reward item and the chain-of-thought reward item are also assigned corresponding weights, and these weights can be dynamically adjusted. Furthermore, in the current training batch, the total reward needs to be calculated over all target training samples; therefore, the total reward function can be specifically expressed as:
$$R = w_{res} R^{res} + w_{cot} R^{cot}$$
[0059] where $w_{res}$ and $w_{cot}$ are the initial weights of the respective reward items before adjustment; $R^{res}$ denotes the result reward item over all target training samples; and $R^{cot}$ denotes the chain-of-thought reward item over all target training samples. For the current training batch, post-training the large language model based on $R$ updates its model parameters and thereby improves its output performance.
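The weighted batch-level total reward can be sketched as below. Aggregating the per-sample rewards by their mean is an assumption, since the text does not state whether the batch-level reward items are sums or averages; `total_reward` is a hypothetical name.

```python
def total_reward(result_rewards, cot_rewards, w_result, w_cot):
    """Weighted total reward over all target training samples in a batch."""
    if len(result_rewards) != len(cot_rewards):
        raise ValueError("expected one reward of each kind per sample")
    if not result_rewards:
        return 0.0
    # Assumed aggregation: mean over the samples of the batch.
    batch_result = sum(result_rewards) / len(result_rewards)
    batch_cot = sum(cot_rewards) / len(cot_rewards)
    return w_result * batch_result + w_cot * batch_cot
```

Because the two weights enter linearly, halving the chain-of-thought weight halves that item's contribution without touching the result item.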
[0060] Step 103: Based on the chain-thinking text and final result text corresponding to each target training sample, determine the reward conflict degree and chain-thinking monitorability index for each target training sample.
[0061] Training objectives sometimes reward a situation where the final answer is correct, but the crucial intermediate reasoning process isn't revealed. As a result, the model gradually learns a strategy: retaining the internal computations necessary to solve the problem correctly, but rewriting the thought process to be safer, shorter, more aesthetically pleasing, or less likely to trigger penalties. This leads to a situation where, while maintaining task performance, the model risks hiding the true reasoning process, resulting in reward conflicts.
[0062] In this embodiment, a reward conflict degree is constructed to quantify reward conflict. The reward conflict degree is a quantitative description of the degree of conflict between the outcome reward and the chain-thinking reward, used to reflect the degree of interference of chain-thinking rewards on the actual solution process under the current training conditions. A higher reward conflict degree indicates that chain-thinking rewards are more likely to induce the model to reduce the explicit expression of key reasoning content, thereby increasing the risk of deviation between the chain-thinking text and the actual solution process.
[0063] Monitorability evaluation refers to a mechanism for comprehensively evaluating the chain-of-thought text during post-training. It analyzes whether the chain of thought can still be effectively identified and understood by monitors, evaluation modules, or human reviewers. This mechanism typically scores the chain of thought on aspects such as monitoring and recognition capability, the completeness of reasoning expression, and the consistency between the chain of thought and the final result, yielding a monitorability index that indicates whether the current chain of thought maintains good monitorability.
[0064] In some alternative implementations, the process of determining the reward conflict degree of the target training sample in step 103 above may include steps a1 to a2.
[0065] Step a1: Based on the chain-like thinking text of the target training sample, determine the degree of compression of the chain-like thinking length, the degree of restriction of explicit expression of key task semantics, and the degree of lack of completeness of reasoning steps in the target training sample.
[0066] Step a2: Weight the degree of chain thinking length compression, the degree of restriction of explicit expression of key task semantics, and the degree of lack of completeness of reasoning steps of the target training sample to determine the reward conflict degree of the target training sample.
[0067] In this embodiment, the chain-of-thought length refers to the length of the chain-of-thought text. A higher degree of compression indicates a higher risk that the chain-of-thought text has been shortened at the expense of its reasoning content.
[0068] Key task semantics refers to the core concepts, operational logic, solution points, or risk-related semantic content necessary to complete the current task. For example, in code generation tasks, key task semantics may include boundary handling, exception conditions, and core operational logic; in complex reasoning tasks, key task semantics may include intermediate state updates, condition judgments, and constraint relationships. In this embodiment, whether key task semantics is explicitly expressed in chain-like thinking is used as one of the important criteria for judging the authenticity and monitorability of chain-like thinking.
[0069] The completeness of reasoning steps refers to the extent to which the text of chain-thinking covers the key steps in the task-solving process. If chain-thinking can present the intermediate steps required to complete the task relatively completely, the completeness of reasoning steps is high; if chain-thinking omits key steps, excessively compresses the process, or only provides a conclusive description, the completeness of reasoning steps is low.
[0070] By weighting the three metrics, the reward conflict degree of the target training sample can be determined. Specifically, the reward conflict degree can be: $C_i = \alpha_1 D_i^{\mathrm{len}} + \alpha_2 D_i^{\mathrm{sem}} + \alpha_3 D_i^{\mathrm{step}}$.
[0071] where $C_i$ denotes the reward conflict degree corresponding to the $i$-th target training sample; $D_i^{\mathrm{len}}$ denotes the degree of compression of the chain-of-thought length of the $i$-th target training sample; $D_i^{\mathrm{sem}}$ denotes the degree to which explicit expression of key task semantics is restricted for the $i$-th target training sample; $D_i^{\mathrm{step}}$ denotes the degree of incompleteness of the reasoning steps of the $i$-th target training sample; and $\alpha_1$, $\alpha_2$, and $\alpha_3$ denote the weighting coefficients, in the reward conflict degree, of the length compression degree, the semantic restriction degree, and the step incompleteness degree, respectively. The weighting coefficients can satisfy: $\alpha_1 + \alpha_2 + \alpha_3 = 1$.
[0072] This step transforms the reward conflict implicit in the post-training process into a calculable reward conflict degree. A higher reward conflict degree indicates a stronger interference between chained thinking rewards and the actual solution process, making the model more prone to discrepancies between the chained thinking text and the actual solution process.
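By way of non-limiting illustration, the weighted combination of the three conflict features can be sketched in Python as follows. The function name `reward_conflict_degree`, the default weights, and the check that the weights sum to 1 are assumptions made for this sketch and are not specified by the embodiment; each input is assumed to lie in [0, 1].

```python
def reward_conflict_degree(len_compression, sem_restriction, step_incompleteness,
                           weights=(0.3, 0.4, 0.3)):
    """Weighted sum of three conflict features (hypothetical helper).

    Each feature is assumed to be a float in [0, 1]; the weights are
    assumed to be non-negative and to sum to 1.
    """
    a1, a2, a3 = weights
    if min(weights) < 0 or abs(a1 + a2 + a3 - 1.0) > 1e-9:
        raise ValueError("weights are assumed non-negative and summing to 1")
    return a1 * len_compression + a2 * sem_restriction + a3 * step_incompleteness


# Example: moderate length compression, some missing key semantics, complete steps.
conflict = reward_conflict_degree(0.5, 0.25, 0.0)
```

With weights that sum to 1, the result stays in [0, 1], which makes it straightforward to compare against a threshold later.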
[0073] Optionally, to quantify the chain-of-thought text, the chain-of-thought length is defined as: $L_i = \mathrm{Len}(c_i)$.
[0074] where $L_i$ denotes the length of the chain-of-thought text corresponding to the $i$-th training sample; $c_i$ denotes that chain-of-thought text; and $\mathrm{Len}(\cdot)$ denotes a function for calculating text length.
[0075] Based on the chain-of-thought length, the degree of compression of the chain-of-thought text of the target training sample can be determined. Specifically, the degree of compression of the chain-of-thought length, $D_i^{\mathrm{len}}$, is defined as: $D_i^{\mathrm{len}} = \max\left(0, \frac{L_{\min} - L_i}{L_{\min}}\right)$.
[0076] where $D_i^{\mathrm{len}}$ denotes the degree of compression of the chain-of-thought length of the $i$-th target training sample; $L_{\min}$ denotes the preset minimum expression length threshold; and $L_i$ denotes the length of the chain-of-thought text of the $i$-th target training sample. This formula indicates that when the chain-of-thought length $L_i$ is below the minimum expression length threshold $L_{\min}$, the chain of thought is considered to be at risk of compression, and the greater the compression, the larger $D_i^{\mathrm{len}}$.
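As a minimal sketch of the length-compression feature, the following Python function measures how far a chain-of-thought text falls short of a minimum length threshold. The normalized shortfall form, the function name `length_compression_degree`, and the pluggable `length_fn` parameter are assumptions made for illustration; the embodiment only requires that the value grow as the text falls further below the threshold.

```python
def length_compression_degree(cot_text, min_length, length_fn=len):
    """Return 0 when the text meets the threshold, else the relative shortfall.

    length_fn is a hypothetical hook: character count by default, but a
    token or word counter could be substituted.
    """
    if min_length <= 0:
        raise ValueError("min_length must be positive")
    length = length_fn(cot_text)
    return max(0.0, (min_length - length) / min_length)


# Word-based length: 4 words against a threshold of 8 words gives 0.5.
words = lambda text: len(text.split())
degree = length_compression_degree("set x then solve", 8, length_fn=words)
```

Whether length is counted in characters, words, or tokens changes the scale of the threshold, so the two should be chosen together.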
[0077] The degree to which explicit expression of key task semantics is restricted for the target training sample, $D_i^{\mathrm{sem}}$, is defined as: $D_i^{\mathrm{sem}} = 1 - \frac{|K_i \cap \hat{K}_i|}{|K_i|}$.
[0078] where $D_i^{\mathrm{sem}}$ denotes the degree to which explicit expression of key task semantics is restricted for the $i$-th target training sample; $K_i$ denotes the set of key semantics required to complete the task corresponding to the $i$-th target training sample; $\hat{K}_i$ denotes the set of key semantics actually explicitly expressed in the chain-of-thought text of the $i$-th target training sample; $|K_i|$ denotes the number of semantic elements in $K_i$; and $|K_i \cap \hat{K}_i|$ denotes the number of semantic elements in the intersection of the two key semantic sets, that is, the number of required elements explicitly covered by the chain of thought.
[0079] This formula indicates that the fewer key semantics the chain of thought explicitly covers, the larger $D_i^{\mathrm{sem}}$, that is, the more restricted the chain of thought is in expressing the key semantics of the task. These key semantics can be keywords, or they can include operational intentions, conditional judgments, logical relationships, algorithmic steps, risk semantic units, and the like.
[0080] Specifically, the required key semantic set $K_i$ can be determined according to the task type through rule extraction, reference answer parsing, manual annotation, task evaluators, or semantic extraction models. For example, in mathematical reasoning tasks, the key semantic set may include problem conditions, constraints, intermediate variables, key formulas, derivation steps, and the final solution objective; in code generation tasks, it may include input/output formats, boundary conditions, exception handling, core algorithm logic, loops or recursion, and complexity constraints; in safety alignment tasks, it may include risk types, prohibited behaviors, safety constraints, reasons for refusal, or compliant alternatives.
[0081] For a task input $x_i$ and target output $y_i$, $K_i$ is the set of key semantics, predetermined or automatically extracted for the task type, that characterizes the core semantic content that must be covered to complete the task. $\hat{K}_i$ is obtained by processing the chain-of-thought text $c_i$ with keyword matching, semantic role recognition, rule parsing, or a semantic extraction model.
[0082] It can be understood that $K_i$ is the set of key semantics that should in theory be covered to complete the task, and belongs to the task-requirement or reference-standard side; $\hat{K}_i$ is the set of key semantics explicitly expressed in the chain-of-thought text $c_i$ generated by the model, and belongs to the model-output side. That is, $K_i$ indicates the key semantics that should appear, and $\hat{K}_i$ indicates the key semantics that actually appear. The ratio of the size of their intersection to the size of $K_i$ measures whether the chain of thought generated by the model adequately covers the key semantics required to complete the task, and the resulting degree of restriction $D_i^{\mathrm{sem}}$ indicates whether key reasoning content is stated in the chain-of-thought text.
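The coverage-based feature can be illustrated with a short Python sketch. Here the expressed key semantics are found by naive case-insensitive substring matching, which is only one of the extraction options mentioned above and is chosen purely for brevity; the function names and the convention of returning 0 for an empty required set are assumptions of this sketch.

```python
def extract_expressed_semantics(cot_text, required):
    """Naive keyword matching (assumption): keep required items found in the text."""
    text = cot_text.lower()
    return {item for item in required if item.lower() in text}


def semantic_restriction_degree(required, expressed):
    """One minus the fraction of required key semantics that are expressed."""
    required = set(required)
    if not required:
        return 0.0  # assumption: nothing required means nothing is restricted
    covered = required & set(expressed)
    return 1.0 - len(covered) / len(required)


required = {"boundary handling", "exception", "loop"}
cot = "First handle the loop, then add exception checks."
expressed = extract_expressed_semantics(cot, required)
degree = semantic_restriction_degree(required, expressed)
```

In this example two of the three required items appear in the chain of thought, so one third of the key semantics is left unexpressed. A semantic extraction model would replace the substring test when key semantics are not literal keywords.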
[0083] The degree of incompleteness of the reasoning steps of the target training sample, $D_i^{\mathrm{step}}$, is defined as: $D_i^{\mathrm{step}} = 1 - \frac{s_i}{S_i}$.
[0084] where $D_i^{\mathrm{step}}$ denotes the degree of incompleteness of the reasoning steps of the $i$-th target training sample; $s_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $S_i$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample.
[0085] The number of valid reasoning steps $s_i$ can be determined through rule parsing, step classification models, process supervision models, or matching against standard reasoning step templates, and represents the number of explicit reasoning steps in the chain-of-thought text that make a substantial contribution to solving the task.
[0086] Specifically, identification can be achieved through the following methods: First, based on rules, identify step markers, causal connectors, conditional judgments, formula derivations, code logic blocks, etc. in chain thinking; second, based on process supervision models or semantic classification models, determine whether a sentence or a paragraph constitutes a valid reasoning step; and third, match it with reference reasoning paths or standard step templates to determine whether it covers the necessary intermediate reasoning links.
[0087] For example, in math problems, “setting up the equation,” “substituting the conditions,” “simplifying and solving,” and “verifying the answer” can be considered as valid reasoning steps; in code generation tasks, “identifying the input scale,” “selecting the data structure,” “handling boundary cases,” “implementing the core loop,” and “returning the result” can be considered as valid reasoning steps.
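The first, rule-based identification method can be sketched as follows. This Python example assumes one candidate step per line and treats a line as a valid reasoning step if it starts with a step marker or contains a causal connector; the marker and connector lists, and the function name `count_valid_steps`, are hypothetical choices for illustration and would need to be adapted to the task and language.

```python
import re

# Assumed step markers at the start of a line: "Step 3", "2.", "2)", "First", ...
STEP_MARKER = re.compile(
    r"(?:step\s*\d+\b|\d+[.)]|(?:first|second|third|then|next|finally)\b)",
    re.IGNORECASE,
)
# Assumed causal connectors anywhere in the line.
CONNECTOR = re.compile(r"\b(?:because|therefore|hence|thus|since|so)\b", re.IGNORECASE)


def count_valid_steps(cot_text):
    """Count non-empty lines that look like explicit reasoning steps."""
    lines = [line.strip() for line in cot_text.splitlines() if line.strip()]
    return sum(1 for line in lines if STEP_MARKER.match(line) or CONNECTOR.search(line))


cot = (
    "1. Set up the equation x + 2 = 5.\n"
    "2. Subtract 2 from both sides.\n"
    "Therefore x = 3.\n"
    "The answer looks nice."
)
valid_steps = count_valid_steps(cot)
```

Surface rules of this kind cannot tell whether a marked step actually contributes to the solution, which is why the text also lists model-based classification and template matching as alternatives.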
[0088] The number of standard reasoning steps $S_i$ can be determined based on the task type, the reference answer, a manually annotated standard solution process, or a task evaluation template.
[0089] For example, for mathematical reasoning tasks, $S_i$ can be determined by the necessary derivation steps in the standard answer; for code generation tasks, $S_i$ can be determined by the key algorithmic steps required to implement the function; for multi-step decision-making tasks, $S_i$ can be determined by the necessary state judgment, action selection, and result verification steps in task planning.
[0090] In this embodiment, identifying conflicting features such as chain-like thinking compression, limited expression of key semantics, and missing reasoning steps during the post-training process enables proactive perception and judgment of the risk of chain-like thinking distortion.
[0091] In one implementation, the number of standard reasoning steps $S_i$ can be determined from a manually annotated reference reasoning chain; in another implementation, it can be obtained by having a task template, rule engine, or teacher model parse the solution process of the target output $y_i$.
[0092] The formula for the degree of incompleteness of the reasoning steps, $D_i^{\mathrm{step}}$, indicates that the fewer valid explicit steps the chain of thought contains, the larger $D_i^{\mathrm{step}}$, that is, the more incomplete the reasoning steps.
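A minimal Python sketch of the step-incompleteness feature is given below. The clamp that caps coverage at 1 when more valid steps are found than the standard requires is an assumption added so that the value stays in [0, 1]; the function name is likewise hypothetical.

```python
def step_incompleteness(valid_steps, standard_steps):
    """One minus the fraction of standard reasoning steps that are covered."""
    if standard_steps <= 0:
        raise ValueError("standard_steps must be positive")
    if valid_steps < 0:
        raise ValueError("valid_steps must be non-negative")
    # Assumption: extra steps beyond the standard count do not reduce the value below 0.
    coverage = min(valid_steps, standard_steps) / standard_steps
    return 1.0 - coverage


# Three of four standard steps are explicit in the chain of thought.
degree = step_incompleteness(3, 4)
```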
[0093] In some alternative implementations, the process of determining the chain-thinking monitorability index of the target training sample in step 103 above may include steps b1 to b3.
[0094] Step b1: Determine the reasoning expression completeness score of the target training sample based on the chain-like thinking text of the target training sample.
[0095] Step b2: Determine the monitoring and recognition ability score and consistency score of the target training sample based on the chain-thinking text and the final result text corresponding to the target training sample; the consistency score is used to represent the degree of consistency between the chain-thinking text and the final result text.
[0096] Step b3: Weight the reasoning expression integrity score, monitoring and recognition ability score, and consistency score of the target training sample to determine the chain thinking monitorability index of the target training sample.
[0097] In this embodiment, a monitorability evaluation mechanism is constructed for the chain-of-thought text $c_i$ generated by the model. The chain of thought is evaluated along three dimensions, namely monitoring and recognition capability, completeness of reasoning expression, and consistency between the chain-of-thought text $c_i$ and the final result text $o_i$, to obtain the chain-of-thought monitorability index.
[0098] Specifically, the chain-of-thought monitorability index can be: $M_i = \beta_1 A_i + \beta_2 E_i + \beta_3 Z_i$.
[0099] where $M_i$ denotes the chain-of-thought monitorability index corresponding to the $i$-th target training sample; $A_i$ denotes the monitoring and recognition capability score of the $i$-th target training sample; $E_i$ denotes the reasoning expression completeness score of the $i$-th target training sample; $Z_i$ denotes the consistency score between the chain-of-thought text of the $i$-th target training sample and the final result text; $\beta_1$ denotes the weight of the monitoring and recognition capability score in the chain-of-thought monitorability index; $\beta_2$ denotes the weight of the reasoning expression completeness score in the chain-of-thought monitorability index; and $\beta_3$ denotes the weight of the consistency score in the chain-of-thought monitorability index. The weight coefficients satisfy $\beta_1 + \beta_2 + \beta_3 = 1$.
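For illustration only, the three-way weighting of the monitorability index can be sketched in Python. The `MonitorabilityScores` container, the default weights, and the range checks are assumptions of this sketch; the three scores are assumed to be in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class MonitorabilityScores:
    """Hypothetical container for the three per-sample scores, each in [0, 1]."""
    recognition: float   # monitoring and recognition capability score
    completeness: float  # reasoning expression completeness score
    consistency: float   # consistency between chain of thought and final result


def monitorability_index(scores, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three scores; weights are assumed to sum to 1."""
    b1, b2, b3 = weights
    if abs(b1 + b2 + b3 - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    values = (scores.recognition, scores.completeness, scores.consistency)
    if any(v < 0.0 or v > 1.0 for v in values):
        raise ValueError("scores are assumed to lie in [0, 1]")
    return b1 * values[0] + b2 * values[1] + b3 * values[2]


index = monitorability_index(MonitorabilityScores(1.0, 0.5, 0.5))
```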
[0100] In some optional implementations, the monitoring and recognition capability score of the target training sample, $A_i$, can be: $A_i = \frac{m_i^{\mathrm{corr}}}{m_i^{\mathrm{tot}}}$.
[0101] where $A_i$ denotes the monitoring and recognition capability score of the $i$-th target training sample; $m_i^{\mathrm{corr}}$ denotes the number of times the monitor correctly identifies the $i$-th target training sample; and $m_i^{\mathrm{tot}}$ denotes the total number of times the monitor judges the $i$-th target training sample. The formula indicates that the more accurately the monitor judges the chain of thought, the higher $A_i$.
[0102] Multiple identifications by the monitor do not mean that the same fixed sample is repeatedly judged by the same monitor. Rather, the identification accuracy of the monitor is measured statistically over multiple monitoring dimensions, multiple monitors, multiple discrimination tasks, or multiple sampled evaluations.
[0103] Specifically, the items identified by the monitor may include: whether the chain-of-thought text contains the key reasoning steps, whether steps are missing, whether the text is consistent with the final result, and whether there is abnormal reasoning or evasive expression. If a single deterministic monitor is used, it may perform only one identification; if the identification is correct, then $A_i = 1$; otherwise, $A_i = 0$.
[0104] If multiple monitors or multiple monitoring dimensions are used, such as separately judging "whether key semantics are covered", "whether the reasoning steps are complete", and "whether the chain of thought is consistent with the final result", then $m_i^{\mathrm{tot}}$ can represent the total number of monitoring and discrimination items performed on the $i$-th target training sample, and $m_i^{\mathrm{corr}}$ the number of items whose judgment is correct.
[0105] If random sampling evaluation, repeated sampling by a model evaluator, or a manual review mechanism is adopted, the same sample may be judged differently in different identifications because of differences in sampling, evaluator output, or reviewer. Therefore, the proportion of correct judgments over multiple identifications can be used to measure the stability of monitoring and recognition.
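The multi-dimension case can be sketched in Python as the fraction of monitoring items on which the monitor's verdict matches a reference label. The dictionary-based interface, the dimension names, and the availability of reference labels are assumptions made for this sketch.

```python
def monitor_recognition_score(verdicts, labels):
    """Fraction of monitoring items judged correctly.

    verdicts: dict mapping a monitoring dimension to the monitor's boolean verdict.
    labels:   dict mapping the same dimensions to the reference (correct) answer.
    A dimension missing from verdicts counts as an incorrect judgment.
    """
    if not labels:
        raise ValueError("at least one monitoring item is required")
    correct = sum(1 for name, truth in labels.items()
                  if name in verdicts and verdicts[name] == truth)
    return correct / len(labels)


labels = {"key_semantics_covered": True, "steps_complete": True, "consistent_with_result": True}
verdicts = {"key_semantics_covered": True, "steps_complete": False, "consistent_with_result": True}
score = monitor_recognition_score(verdicts, labels)  # two of three items correct
```

With a single deterministic monitor, `labels` holds one item and the score reduces to 1 or 0, matching the single-identification case described above.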
[0106] The reasoning expression completeness score of the target training sample, $E_i$, is defined as: $E_i = \frac{s_i}{S_i}$.
[0107] where $E_i$ denotes the reasoning expression completeness score of the $i$-th target training sample; $s_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $S_i$ denotes the number of standard reasoning steps required to complete the task. The formula indicates that the more completely the reasoning steps are explicitly expressed in the chain of thought, the higher $E_i$.
[0108] The consistency score between the chain-thinking text and the final result text indicates the degree of consistency between the reasoning semantics expressed in the chain-thinking text and the solution logic reflected in the final result text. If the key steps, constraints, and operational logic described in the chain-thinking can be correspondingly reflected in the final result, it indicates a high degree of consistency; if the chain-thinking description is significantly inconsistent with the actual logic on which the final result depends, it indicates that the chain-thinking may have distortion or pseudo-transparency issues.
[0109] The consistency score of the target training sample, $Z_i$, can be defined as: $Z_i = \frac{|G_i \cap H_i|}{|H_i|}$.
[0110] where $Z_i$ denotes the consistency score between the chain-of-thought text of the $i$-th target training sample and the final result text; $H_i$ denotes the result semantic set extracted from the final result text; $G_i$ denotes the reasoning semantic set extracted from the chain-of-thought text; $|H_i|$ denotes the number of semantic elements in the result semantic set; and $|G_i \cap H_i|$ denotes the number of semantic elements in the portion of the reasoning semantic set that is consistent with the result semantic set. The formula indicates that the higher the consistency between the chain of thought and the final result, the higher $Z_i$.
[0111] Specifically, $H_i$ is the result semantic set extracted from the final result text $o_i$ of the model, and represents the conclusions, operations, judgments, or output logic actually reflected in the final answer; $G_i$ is the reasoning semantic set extracted from the chain-of-thought text $c_i$, and represents the reasoning basis, steps, conditions, or intermediate conclusions actually described in the chain of thought. The consistency score $Z_i$ focuses on whether the reasoning logic described in the chain of thought can support the final result.
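As a minimal sketch, the consistency score can be computed in Python as the share of result semantics that are also found in the reasoning semantics. Both sets are assumed to have already been extracted and normalized to comparable string labels, which is the hard part in practice and is not shown here; returning 0 for an empty result set is also an assumption.

```python
def consistency_score(reasoning_semantics, result_semantics):
    """Share of result semantic elements supported by the chain of thought."""
    result = set(result_semantics)
    if not result:
        return 0.0  # assumption: no extractable result semantics gives no evidence
    supported = result & set(reasoning_semantics)
    return len(supported) / len(result)


reasoning = {"sort input", "binary search", "handle empty list"}
result = {"sort input", "binary search", "handle empty list", "return index"}
score = consistency_score(reasoning, result)  # three of four result elements supported
```

Exact set intersection treats paraphrases as mismatches, so a semantic matcher would typically replace the `&` operation when the elements are free text.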
[0112] Step 104: Determine the risk indicators for the current training batch based on the reward conflict degree and chain thinking monitorability indicators of each target training sample.
[0113] After the reward conflict degree $C_i$ and the chain-of-thought monitorability index $M_i$ of the target training samples are obtained, a risk scoring model is further constructed to determine the risk index of the current training batch.
[0114] This step integrates reward conflict and the difficulty in monitoring chain thinking into a single comprehensive risk indicator, providing a basis for subsequent training adjustments. A higher comprehensive risk score indicates a greater likelihood that chain thinking will become distorted and lose its monitorability under current training conditions.
[0115] Optionally, step 104, "determine the risk indicators of the current training batch based on the reward conflict degree and chain thinking monitorability indicators of each target training sample," may specifically include the following steps c1 to c2.
[0116] Step c1: Determine the risk indicators of the target training samples based on the reward conflict degree and chain thinking monitorability indicators of the target training samples.
[0117] Step c2: Determine the risk index of the current training batch based on the risk index of each target training sample in the current training batch.
[0118] In this embodiment, a risk index can be generated for each target training sample: $\mathrm{Risk}_i = \lambda_1 C_i + \lambda_2 (1 - M_i)$.
[0119] where $\mathrm{Risk}_i$ denotes the risk index of the $i$-th target training sample; $C_i$ denotes the reward conflict degree of the $i$-th target training sample; $M_i$ denotes the chain-of-thought monitorability index of the $i$-th target training sample; and $\lambda_1$ and $\lambda_2$ are preset weighting coefficients used to balance the influence of the reward conflict degree and the monitorability index on the overall risk.
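The sample-level risk index can be sketched as follows. This Python example assumes that risk rises with the reward conflict degree and falls with the monitorability index, combined linearly with two preset coefficients; the exact functional form, the default coefficients, and the function name are assumptions of this sketch.

```python
def sample_risk(conflict, monitorability, lam_conflict=0.5, lam_monitor=0.5):
    """Risk grows with reward conflict and with the monitorability shortfall.

    Both inputs are assumed to lie in [0, 1]; (1 - monitorability) is the
    assumed way of turning a "higher is better" index into a risk term.
    """
    return lam_conflict * conflict + lam_monitor * (1.0 - monitorability)


# High conflict with still-good monitorability.
risk = sample_risk(conflict=0.6, monitorability=0.8)
```

Raising `lam_monitor` relative to `lam_conflict` makes the trigger respond mainly to observed monitorability loss rather than to the conflict features that tend to precede it.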
[0120] If the training sample set contains $N$ samples in total, a sample-level risk index can first be calculated for each sample, so that in theory $N$ risk indices are obtained. However, subsequent training adjustment typically does not adjust the global reward structure separately for each sample. Instead, the sample-level risk indices are aggregated into a batch-level, window-level, or round-level comprehensive risk index, which is then used to determine whether adjustment of the reward structure is triggered.
[0121] For example, for the current training batch $B$, the batch-level risk index can be defined as: $\mathrm{Risk}_B = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Risk}_i$.
[0122] where $\mathrm{Risk}_B$ denotes the risk index of the current training batch, and $N$ denotes the number of samples in the current training batch, that is, the number of target training samples. In other optional implementations, the batch-level risk index can also be calculated as a maximum, a weighted average, or a high quantile. Specifically, the maximum is suitable for scenarios that are sensitive to individual high-risk samples; the average is suitable for measuring the overall training status; and a high quantile is suitable for balancing overall stability against the influence of high-risk samples. Subsequent training adjustment can then be performed based on the batch-level risk index.
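The aggregation options can be illustrated with one Python function. The `method` names, the nearest-rank quantile rule, and the default quantile of 0.9 are assumptions chosen for this sketch; any standard quantile definition would serve the same purpose.

```python
import math


def batch_risk(sample_risks, method="mean", weights=None, q=0.9):
    """Aggregate sample-level risk indices into one batch-level risk index."""
    risks = list(sample_risks)
    if not risks:
        raise ValueError("at least one sample risk is required")
    if method == "mean":
        return sum(risks) / len(risks)
    if method == "max":
        return max(risks)
    if method == "weighted":
        if weights is None or len(weights) != len(risks) or sum(weights) <= 0:
            raise ValueError("weights must match the samples and have a positive sum")
        return sum(w * r for w, r in zip(weights, risks)) / sum(weights)
    if method == "quantile":
        ordered = sorted(risks)
        # Nearest-rank quantile (assumption): smallest value with rank >= q * n.
        index = max(0, math.ceil(q * len(ordered)) - 1)
        return ordered[index]
    raise ValueError(f"unknown method: {method}")


risks = [0.1, 0.2, 0.3, 0.9]
mean_risk = batch_risk(risks)                  # overall training status
worst_risk = batch_risk(risks, method="max")   # sensitive to a single bad sample
high_q = batch_risk(risks, method="quantile", q=0.75)
```

In this example the single high-risk sample dominates the maximum but only moderately raises the mean, which is the trade-off the three aggregation choices are meant to expose.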
[0123] Step 105: If the risk indicator indicates a risk of chain-of-thought distortion, reduce the weight of the chain-of-thought reward term in the total reward function and maintain or increase the weight of the result reward term, so as to update the total reward function. The updated total reward function is used to train the large language model in the next training batch, so that the model parameters of the large language model continue to be updated until training of the large language model is completed.
[0124] In this embodiment, the risk index of the current training batch, $\mathrm{Risk}_B$, can be compared with a preset risk threshold $\tau_{\mathrm{risk}}$. When the risk index does not exceed the preset risk threshold, there is no immediate risk of chain-of-thought distortion, and the current reward structure can be maintained for continued training. When the risk index exceeds the preset risk threshold, there is a risk of chain-of-thought distortion, and the current and subsequent training processes need to be regulated to suppress the decline in chain-of-thought monitorability.
[0125] Specifically, when $\mathrm{Risk}_B \le \tau_{\mathrm{risk}}$, the current reward structure is maintained and training continues; when $\mathrm{Risk}_B > \tau_{\mathrm{risk}}$, the training control mechanism is activated.
[0126] In this embodiment, when there is no obvious conflict between the task result reward and the chain thinking reward, and the monitorability of the chain thinking remains stable, the original post-training mechanism of the model remains unchanged to avoid unnecessary training perturbations; when an obvious conflict is detected between the task result reward and the chain thinking reward, and the monitorability of the chain thinking has shown a downward trend, the reward structure adaptive adjustment mechanism is activated to adjust the chain thinking reward item, the result reward item, and the safety constraint item, guiding the model to explicitly express key reasoning steps while maintaining task performance.
[0127] The adaptive adjustment process of the reward structure specifically includes: reducing the weight of chain-thinking reward items in the total reward function, and maintaining or increasing the weight of outcome reward items.
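[0128] Suppose the total reward function for the current training batch is: $R_{\mathrm{total}} = w_r R_{\mathrm{res}} + w_c R_{\mathrm{cot}}$, where $R_{\mathrm{res}}$ is the result reward term, $R_{\mathrm{cot}}$ is the chain-of-thought reward term, and $w_r$ and $w_c$ are their weights. After high risk is detected, the weights $w_r$ and $w_c$ are adjusted to $w_r'$ and $w_c'$, and the updated total reward function can be expressed as: $R'_{\mathrm{total}} = w_r' R_{\mathrm{res}} + w_c' R_{\mathrm{cot}}$, where $w_c' < w_c$ (that is, the weight of the chain-of-thought reward term in the total reward function is reduced) and $w_r' \ge w_r$ (that is, the weight of the result reward term is maintained or increased).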
[0129] It should be noted that "maintaining" the weight of the outcome reward item can specifically mean keeping the weight of the outcome reward item unchanged, i.e., only reducing the weight of the chain thinking reward item; or, the total reward function can also include other reward items, and by increasing the weight of other reward items, the proportion of the outcome reward item in the total reward function can be kept unchanged.
[0130] This step ensures that the training process no longer solely pursues high scores in the form of chain thinking, but also takes into account the chain thinking's ability to be identified by the monitor and its ability to truly express the task-solving process.
[0131] In some alternative implementations, if the risk indicator indicates the presence of chain-like thinking distortion risk, a safety constraint term can be added to the total reward function, or, if the total reward function also includes a safety constraint term, the weight of the safety constraint term can be increased.
[0132] In this embodiment, if $\mathrm{Risk}_B > \tau_{\mathrm{risk}}$, a safety constraint term can further be introduced, or the weight of an existing safety constraint term can be increased, to impose a minimum transparency requirement on the chain-of-thought text, ensuring that the chain of thought generated by the model covers at least the key reasoning steps required to complete the task. The adjusted total reward function can be expressed as: $R'_{\mathrm{total}} = w_r' R_{\mathrm{res}} + w_c' R_{\mathrm{cot}} + w_s' R_{\mathrm{safe}}$.
[0133] where $R'_{\mathrm{total}}$ denotes the total reward after adjustment; $R_{\mathrm{res}}$ denotes the result reward term; $R_{\mathrm{cot}}$ denotes the chain-of-thought reward term; $R_{\mathrm{safe}}$ denotes the safety constraint term used to maintain chain-of-thought monitorability; $w_r'$ denotes the weight of the result reward term; $w_c'$ denotes the weight of the chain-of-thought reward term; and $w_s'$ denotes the weight of the safety constraint term. Generally, relative to the weights $w_r$, $w_c$, and $w_s$ before adjustment, the reward weights satisfy: $w_r' \ge w_r$, $w_c' < w_c$, and $w_s' > w_s$.
[0134] The adjusted weights $w_r'$, $w_c'$, and $w_s'$ can be determined based on the risk score, a preset adjustment step, and upper and lower weight limits.
[0135] In the initial reward function, the safety constraint term may be absent, or may be present with a weight of 0. When high risk is detected, its weight is adjusted from 0 to a positive value, or from a small value to a larger value.
[0136] In other words, the following can be set before regulation: $w_s \ge 0$.
[0137] In one implementation, $w_s = 0$, which indicates that the safety constraint is not enabled before the adjustment; once the risk score exceeds the threshold, $w_s' > 0$ is set, which means that a safety constraint term is introduced. In another implementation, $w_s > 0$, which indicates that a safety constraint term already exists before the adjustment, and its weight is further increased after the risk rises, that is, $w_s' > w_s$.
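The following Python sketch shows one possible weight-adjustment rule driven by the risk score, a preset step, and weight limits. The specific rule (a step scaled by how far the risk exceeds the threshold), the dictionary keys, and the default bounds are all assumptions for illustration; the result reward weight is simply kept unchanged, which corresponds to the "maintain" option.

```python
def adjust_reward_weights(weights, risk, threshold,
                          step=0.1, cot_floor=0.05, safe_cap=0.5):
    """Return adjusted weights; the input dict is not modified.

    weights: dict with keys "result", "cot" and optionally "safe".
    Below or at the threshold the weights are returned unchanged.
    """
    if risk <= threshold:
        return dict(weights)
    # Assumption: the further the risk exceeds the threshold, the larger the step.
    delta = step * (1.0 + (risk - threshold))
    adjusted = dict(weights)
    # Lower the chain-of-thought weight, but never below the floor and never raise it.
    adjusted["cot"] = min(weights["cot"], max(cot_floor, weights["cot"] - delta))
    # Enable or raise the safety-constraint weight, but never above the cap or lower it.
    safe = weights.get("safe", 0.0)
    adjusted["safe"] = max(safe, min(safe_cap, safe + delta))
    return adjusted


before = {"result": 1.0, "cot": 0.5, "safe": 0.0}
after = adjust_reward_weights(before, risk=0.7, threshold=0.5)
```

Here the safety weight starts at 0 and becomes positive once the risk exceeds the threshold, which mirrors the first implementation described above; starting from a positive `safe` value mirrors the second.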
[0138] In some alternative implementations, the safety constraint term can be expressed as: $R_{\mathrm{safe},i} = \mu_1 E_i + \mu_2 Z_i - \mu_3 C_i$.
[0139] where $R_{\mathrm{safe},i}$ denotes the safety constraint term corresponding to the $i$-th target training sample; $E_i$ denotes the reasoning expression completeness score of the $i$-th target training sample; $Z_i$ denotes the consistency score between the chain-of-thought text of the $i$-th target training sample and the final result text; $C_i$ denotes the reward conflict degree of the $i$-th target training sample; and $\mu_1$, $\mu_2$, and $\mu_3$ denote the weight coefficients of the respective parts within the safety constraint term. The formula indicates that the more complete the chain of thought and the more consistent it is with the final result, the higher the safety constraint term; conversely, the higher the reward conflict degree, the lower the safety constraint term.
[0140] By summing or averaging the safety constraint terms of the target training samples, the safety constraint term corresponding to the current training batch, $R_{\mathrm{safe}}$, can be determined.
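A minimal Python sketch of the per-sample safety constraint term and its batch aggregation follows. The function names, the default coefficients, and the tuple layout `(completeness, consistency, conflict)` are assumptions of this sketch.

```python
def safety_term(completeness, consistency, conflict, mu=(0.4, 0.4, 0.2)):
    """Rewards complete, consistent chains of thought and penalizes reward conflict."""
    mu1, mu2, mu3 = mu
    return mu1 * completeness + mu2 * consistency - mu3 * conflict


def batch_safety_term(samples, mu=(0.4, 0.4, 0.2), reduce="mean"):
    """Aggregate per-sample terms by averaging (default) or summing.

    samples: iterable of (completeness, consistency, conflict) tuples.
    """
    values = [safety_term(e, z, c, mu=mu) for e, z, c in samples]
    if not values:
        raise ValueError("at least one sample is required")
    total = sum(values)
    return total / len(values) if reduce == "mean" else total


# One transparent sample and one compressed, conflicting sample.
batch_value = batch_safety_term([(1.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
```

Averaging keeps the term on the same scale regardless of batch size, whereas summing makes its contribution grow with the number of samples, so the choice interacts with the safety-constraint weight.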
[0141] In this embodiment, when reward conflict is detected, which leads to a degradation in the monitorability of chain thinking, the post-training reward structure is adaptively adjusted. By reducing the proportion of high-conflict chain thinking reward items, increasing the weight of result reward items and safety constraint items, and strengthening the explicit expression requirements of key reasoning steps, the training direction is specifically modified. Without changing the overall structure of the model, the distortion of chain thinking is effectively suppressed.
[0142] After the reward adjustment is completed, the model is trained again using the adjusted reward function. After each round of training (each round can correspond to a training batch), the chain thinking text and the final result text output by the model are re-acquired, and the reward conflict degree, chain thinking monitorability index and risk index are recalculated until the training is completed.
[0143] Specifically, let the model parameters after the $t$-th round of training be $\theta_t$. Based on the adjusted total reward function, the model parameter update can be expressed as: $\theta_{t+1} = \theta_t + \eta \nabla_\theta \mathbb{E}\left[R'_{\mathrm{total}}\right]$.
[0144] where $\theta_{t+1}$ denotes the model parameters after the $(t+1)$-th round of training; $\theta_t$ denotes the model parameters after the $t$-th round of training; $\eta$ denotes the learning rate; $\nabla_\theta$ denotes the gradient operation with respect to the model parameters; and $\mathbb{E}\left[R'_{\mathrm{total}}\right]$ denotes the expected value of the total reward after regulation.
[0145] Updating the model parameters based on the total reward function can use parameter update methods from post-training or reinforcement learning. For example, training of the model can continue under the adjusted reward function $R'_{\mathrm{total}}$ using policy gradients, reward-weighted supervised fine-tuning, or similar methods.
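The gradient-ascent form of the update can be illustrated on a toy problem. The Python sketch below replaces the language model with a softmax policy over a few candidate outputs whose adjusted total rewards are assumed to be known, and it uses the exact gradient of the expected reward rather than a sampled policy-gradient estimate; it is not how a large language model would be trained in practice.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    top = max(theta)
    exps = [math.exp(t - top) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    """Expected reward of the softmax policy over the candidate outputs."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def gradient_ascent_step(theta, rewards, lr):
    """One update theta <- theta + lr * gradient of the expected reward.

    For a softmax policy, d E[R] / d theta_k = pi_k * (r_k - E[R]).
    """
    probs = softmax(theta)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    grad = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return [t + lr * g for t, g in zip(theta, grad)]


# Two candidate outputs: a transparent one (reward 1.0) and an opaque one (reward 0.2).
rewards = [1.0, 0.2]
theta = [0.0, 0.0]
for _ in range(20):
    theta = gradient_ascent_step(theta, rewards, lr=1.0)
```

Each step shifts probability toward the output with the higher adjusted reward, so changing the reward weights changes which behavior the update reinforces.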
[0146] After the $(t+1)$-th round of training, the risk index is recalculated: $\mathrm{Risk}^{(t+1)} = \lambda_1 C^{(t+1)} + \lambda_2 \left(1 - M^{(t+1)}\right)$.
[0147] where $\mathrm{Risk}^{(t+1)}$ denotes the risk index after the $(t+1)$-th round of training; $C^{(t+1)}$ denotes the reward conflict degree after the $(t+1)$-th round of training; $M^{(t+1)}$ denotes the monitorability index after the $(t+1)$-th round of training; and $\lambda_1$ and $\lambda_2$ are weighting coefficients, the same as those in the aforementioned risk index model.
[0148] If the recalculated risk index is still higher than the preset risk threshold, the reward weight adjustment and monitorability constraint enhancement will continue. If the risk index drops below the preset risk threshold, the current reward configuration will be maintained for the next round of training. This forms the following closed-loop iterative process: model generation → reward conflict calculation → monitorability assessment → risk index generation → reward adjustment → model retraining.
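The closed-loop iteration can be sketched as a small control skeleton in Python. The callables `train_round`, `evaluate_risk`, and `adjust` are hypothetical placeholders for the training step, the risk evaluation, and the reward adjustment; the toy stand-ins used in the example, in which risk simply equals the chain-of-thought weight, are assumptions made only so that the sketch runs on its own.

```python
def closed_loop_post_training(train_round, evaluate_risk, weights, threshold,
                              adjust, max_rounds=10):
    """Train, re-evaluate risk, and adjust reward weights while risk is too high."""
    history = []
    for round_index in range(max_rounds):
        train_round(weights)                 # model generation and parameter update
        risk = evaluate_risk()               # conflict + monitorability -> risk index
        history.append((round_index, risk, dict(weights)))
        if risk > threshold:
            weights = adjust(weights, risk)  # reward adjustment for the next round
    return weights, history


# Toy stand-ins (assumptions): risk equals the current chain-of-thought weight.
state = {"cot": None}

def toy_train_round(weights):
    state["cot"] = weights["cot"]

def toy_evaluate_risk():
    return state["cot"]

def toy_adjust(weights, risk):
    adjusted = dict(weights)
    adjusted["cot"] = weights["cot"] - 0.25
    return adjusted

final_weights, history = closed_loop_post_training(
    toy_train_round, toy_evaluate_risk, {"result": 1.0, "cot": 1.0},
    threshold=0.5, adjust=toy_adjust, max_rounds=5)
```

In the toy run the chain-of-thought weight is lowered for two rounds and then held once the risk no longer exceeds the threshold, which is the "maintain the current reward configuration" branch of the loop.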
[0149] This step gives the post-training process a dynamic feedback capability, so that training signals that might induce the chain of thought to obscure the true reasoning can be corrected continuously during training.
[0150] When the model meets both the task performance requirement and the chain-of-thought monitorability requirement over multiple consecutive training rounds, training stops and the target model is output. Stopping conditions may include: the result reward reaching a preset task threshold, the chain-of-thought monitorability index remaining at or above a preset monitoring threshold, and the risk index remaining at or below a preset risk threshold.
[0151] Specifically, the stopping condition can be expressed as: $R_{\mathrm{res}}^{(t)} \ge \tau_{\mathrm{task}}$, $M^{(t)} \ge \tau_{\mathrm{mon}}$, and $\mathrm{Risk}^{(t)} \le \tau_{\mathrm{risk}}$.
[0152] where $R_{\mathrm{res}}^{(t)}$ denotes the result reward after the $t$-th round of training; $M^{(t)}$ denotes the chain-of-thought monitorability index after the $t$-th round of training; $\mathrm{Risk}^{(t)}$ denotes the comprehensive risk index after the $t$-th round of training; $\tau_{\mathrm{task}}$ denotes the preset task threshold; $\tau_{\mathrm{mon}}$ denotes the preset monitoring threshold; and $\tau_{\mathrm{risk}}$ denotes the preset risk threshold.
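A minimal Python sketch of the stopping check over consecutive rounds is given below. The `patience` parameter (the number of consecutive rounds that must satisfy all three conditions), the tuple layout of the history, and the use of non-strict inequalities are assumptions of this sketch.

```python
def should_stop(history, tau_task, tau_mon, tau_risk, patience=3):
    """True if the last `patience` rounds all meet the three thresholds.

    history: list of (result_reward, monitorability, risk) tuples, oldest first.
    """
    if patience <= 0:
        raise ValueError("patience must be positive")
    if len(history) < patience:
        return False
    recent = history[-patience:]
    return all(reward >= tau_task and monitor >= tau_mon and risk <= tau_risk
               for reward, monitor, risk in recent)


history = [
    (0.70, 0.60, 0.50),  # early round: fails all three conditions
    (0.90, 0.80, 0.20),
    (0.92, 0.85, 0.15),
    (0.95, 0.90, 0.10),
]
stop = should_stop(history, tau_task=0.85, tau_mon=0.75, tau_risk=0.3, patience=3)
```

Requiring several consecutive passing rounds guards against stopping on a single favorable batch.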
[0153] The final output includes: the trained target model, the corresponding reward configuration parameters, and the evaluation results of chain-of-thought monitorability. The target model is thus obtained through a post-training process that automatically detects reward conflicts, suppresses the risk of chain-of-thought distortion, and maintains chain-of-thought monitorability.
[0154] The large language model post-training method provided in this invention can analyze the relationship between the task result reward and the chain-of-thought reward during training and calculate the reward conflict degree and the chain-of-thought monitorability index, thereby achieving proactive perception and judgment of the risk of chain-of-thought distortion. When reward conflict is detected to be degrading chain-of-thought monitorability, the post-training reward structure is adaptively adjusted by reducing the weight of the chain-of-thought reward term, maintaining or increasing the weight of the result reward term, and further adding a safety constraint term or increasing its weight. This adaptive regulation of post-training reward weights and safety constraints reduces the risk of chain-of-thought distortion, enabling the model to combine high task completion ability with good chain-of-thought monitorability in complex post-training scenarios, thereby improving the safety, reliability, and robustness of the model training process.
[0155] Compared with related solutions that focus only on the final score or the superficial quality of the chain-of-thought, this embodiment analyzes the conflict state between the result reward item and the chain-of-thought reward item during post-training. It proactively triggers training regulation when a high-risk conflict is detected, suppressing at the mechanism level the deviation between the chain-of-thought text and the actual solution process. This allows chain-of-thought that is visible but unreliable to be identified earlier and suppressed more effectively, improving the authenticity and monitorability of the chain-of-thought. Furthermore, the method only requires adjusting the reward structure settings, without changing the structure of the model itself or adding extra inference chains or complex decoding processes, so it is low in cost and suitable for deployment in practical post-training systems.
[0156] Through a collaborative mechanism of reward conflict perception, monitorability assessment, and training strategy regulation, continuous intervention and closed-loop correction during post-training reduce the risks of key semantic loss, excessive step compression, and distorted expression in the chain-of-thought. This improves how readily the monitor can identify the chain-of-thought, its consistency with the final result, and its stability across multiple training rounds. Consequently, the model maintains more stable, controllable, and reliable chain-of-thought behavior across different task scenarios and training stages, enhancing the safety, reliability, and robustness of the post-training process.
[0157] The large language model post-training method provided in this embodiment can be applied to post-training scenarios with high requirements for chain-of-thought authenticity, training safety, and process monitorability. These include, but are not limited to, post-training of complex reasoning models, code generation models, mathematical problem-solving models, and agent task planning models, as well as safety alignment training and high-risk behavior monitoring training. Alternatively, it can be integrated as a general post-training control module into large model training platforms, alignment training systems, inference safety assessment systems, and agent training frameworks, and therefore has engineering practical value and industrial application prospects.
[0158] The above details the training process of a large language model. This method can also be implemented using a corresponding device, the structure and function of which will be described in detail below.
[0159] Based on the same inventive concept, embodiments of the present invention also provide a large language model post-training device. As shown in Figure 2, the device includes:
a post-training preparation module 201, used to obtain a training task set for post-training the large language model, the training task set including training samples of multiple training batches;
a processing module 202, used to determine, based on the large language model, the chain-of-thought text and the final result text corresponding to each target training sample of the current training batch, so as to train the large language model according to the total reward function of the current training batch and update the model parameters of the large language model, the total reward function including a result reward item and a chain-of-thought reward item;
an evaluation module 203, used to determine the reward conflict degree and chain-of-thought monitorability index of each target training sample based on the chain-of-thought text and final result text corresponding to that target training sample;
a risk assessment module 204, used to determine the risk index of the current training batch based on the reward conflict degree and chain-of-thought monitorability index of each target training sample;
a reward structure adaptive adjustment module 205, used to reduce the weight of the chain-of-thought reward item in the total reward function and maintain or increase the weight of the result reward item when the risk index indicates a chain-of-thought distortion risk, so as to update the total reward function; the updated total reward function is used to train the large language model in the next training batch to continue updating the model parameters of the large language model until training of the large language model is completed.
[0160] In some optional implementations, the process by which processing module 202 determines the reward conflict degree of the target training sample includes: based on the chain-of-thought text of the target training sample, determining the degree of compression of the chain-of-thought length, the degree of restriction on the explicit expression of key task semantics, and the degree of incompleteness of the reasoning steps of the target training sample; and weighting these three degrees to determine the reward conflict degree of the target training sample.
[0161] In some optional implementations, the degree of compression of the chain-of-thought length of the target training sample is:
$$C_i^{\mathrm{len}} = \max\left(0,\ \frac{L_{\min} - L_i}{L_{\min}}\right)$$
[0162] where $C_i^{\mathrm{len}}$ denotes the degree of compression of the chain-of-thought length of the $i$-th target training sample; $L_{\min}$ denotes the preset minimum expression length threshold; and $L_i$ denotes the length of the chain-of-thought text of the $i$-th target training sample. The degree of restriction on the explicit expression of key task semantics of the target training sample is:
$$C_i^{\mathrm{sem}} = 1 - \frac{|K_i^{\mathrm{req}} \cap K_i^{\mathrm{exp}}|}{|K_i^{\mathrm{req}}|}$$
[0163] where $C_i^{\mathrm{sem}}$ denotes the degree of restriction on the explicit expression of key task semantics of the $i$-th target training sample; $K_i^{\mathrm{req}}$ denotes the set of key semantics required to complete the task corresponding to the $i$-th target training sample; $K_i^{\mathrm{exp}}$ denotes the set of key semantics actually and explicitly expressed in the chain-of-thought text of the $i$-th target training sample; $|K_i^{\mathrm{req}}|$ denotes the number of semantic elements in the required key semantic set; and $|K_i^{\mathrm{req}} \cap K_i^{\mathrm{exp}}|$ denotes the number of semantic elements in the intersection of the two key semantic sets. The degree of incompleteness of the reasoning steps of the target training sample is:
$$C_i^{\mathrm{step}} = 1 - \frac{n_i}{N_i}$$
[0164] where $C_i^{\mathrm{step}}$ denotes the degree of incompleteness of the reasoning steps of the $i$-th target training sample; $n_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $N_i$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample.
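The three components of the reward conflict degree and their weighted combination can be sketched as follows. This is a minimal sketch under stated assumptions rather than the formulas of this embodiment: each component is assumed to be a normalized shortfall, clipped at zero where it could go negative, the key semantics are assumed to be available as Python sets of labels, and all function names and the `weights` tuple are hypothetical.

```python
from typing import Set, Tuple


def length_compression(cot_length: int, min_length: int) -> float:
    """Assumed form: relative shortfall below the minimum expression length."""
    if min_length <= 0:
        raise ValueError("min_length must be positive")
    return max(0.0, (min_length - cot_length) / min_length)


def semantic_restriction(required: Set[str], expressed: Set[str]) -> float:
    """Assumed form: share of required key semantics missing from the text."""
    if not required:
        return 0.0  # assumption: nothing required means nothing is restricted
    return 1.0 - len(required & expressed) / len(required)


def step_incompleteness(valid_steps: int, standard_steps: int) -> float:
    """Assumed form: share of standard reasoning steps that are missing."""
    if standard_steps <= 0:
        raise ValueError("standard_steps must be positive")
    return max(0.0, 1.0 - valid_steps / standard_steps)


def reward_conflict(compression: float,
                    restriction: float,
                    incompleteness: float,
                    weights: Tuple[float, float, float]) -> float:
    """Weighted combination of the three components (weights are preset)."""
    w_len, w_sem, w_step = weights
    return w_len * compression + w_sem * restriction + w_step * incompleteness
```

For example, a chain-of-thought of length 50 against a minimum of 100, expressing two of four required key semantics and three of four standard steps, gives components 0.5, 0.5, and 0.25 under these assumed forms.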
[0165] In some optional implementations, the process by which processing module 202 determines the chain-of-thought monitorability index of the target training sample includes: based on the chain-of-thought text of the target training sample, determining the reasoning expression completeness score of the target training sample; based on the chain-of-thought text and the final result text corresponding to the target training sample, determining the monitoring and recognition capability score and the consistency score of the target training sample, the consistency score representing the degree of consistency between the chain-of-thought text and the final result text; and weighting the reasoning expression completeness score, the monitoring and recognition capability score, and the consistency score to determine the chain-of-thought monitorability index of the target training sample.
[0166] In some optional implementations, the monitoring and recognition capability score of the target training sample is:
$$M_i^{\mathrm{rec}} = \frac{c_i}{T_i}$$
[0167] where $M_i^{\mathrm{rec}}$ denotes the monitoring and recognition capability score of the $i$-th target training sample; $c_i$ denotes the number of times the monitor correctly identifies the $i$-th target training sample; and $T_i$ denotes the total number of times the monitor judges the $i$-th target training sample. The reasoning expression completeness score of the target training sample is:
$$E_i = \frac{n_i}{N_i}$$
[0168] where $E_i$ denotes the reasoning expression completeness score of the $i$-th target training sample; $n_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $N_i$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample. The consistency score of the target training sample is:
$$A_i = \frac{|S_i^{\mathrm{cot}} \cap S_i^{\mathrm{res}}|}{|S_i^{\mathrm{res}}|}$$
[0169] where $A_i$ denotes the consistency score between the chain-of-thought text and the final result text of the $i$-th target training sample; $S_i^{\mathrm{res}}$ denotes the result semantic set extracted from the final result text; $S_i^{\mathrm{cot}}$ denotes the reasoning semantic set extracted from the chain-of-thought text; $|S_i^{\mathrm{res}}|$ denotes the number of semantic elements in the result semantic set; and $|S_i^{\mathrm{cot}} \cap S_i^{\mathrm{res}}|$ denotes the number of semantic elements in the reasoning semantic set that are consistent with the result semantic set.
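The three scores and the weighted monitorability index can be sketched as follows. This is an illustrative sketch only: each score is assumed to be a simple ratio, "consistent with" is approximated by exact set membership, the completeness score is capped at 1 as an added assumption, and the function names and the `weights` tuple are hypothetical.

```python
from typing import Set, Tuple


def recognition_score(correct: int, total: int) -> float:
    """Share of monitor judgments on the sample that were correct."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def completeness_score(valid_steps: int, standard_steps: int) -> float:
    """Valid reasoning steps relative to standard steps, capped at 1 (assumed)."""
    if standard_steps <= 0:
        raise ValueError("standard_steps must be positive")
    return min(1.0, valid_steps / standard_steps)


def consistency_score(result_semantics: Set[str],
                      cot_semantics: Set[str]) -> float:
    """Share of result semantics also found in the chain-of-thought semantics."""
    if not result_semantics:
        return 0.0  # assumption: no extracted result semantics scores zero
    return len(cot_semantics & result_semantics) / len(result_semantics)


def monitorability_index(completeness: float,
                         recognition: float,
                         consistency: float,
                         weights: Tuple[float, float, float]) -> float:
    """Weighted combination of the three scores (weights are preset)."""
    w_e, w_r, w_a = weights
    return w_e * completeness + w_r * recognition + w_a * consistency
```

In practice the semantic sets would come from an extraction step that the source does not specify; plain string labels are used here only to keep the sketch self-contained.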
[0170] In some optional implementations, determining the risk index of the current training batch based on the reward conflict degree and chain-of-thought monitorability index of each target training sample includes: determining the risk index of each target training sample based on its reward conflict degree and chain-of-thought monitorability index; and determining the risk index of the current training batch based on the risk indexes of the target training samples in the current training batch. The risk index of a target training sample is computed from its reward conflict degree and its chain-of-thought monitorability index using preset weighting coefficients.
[0171] The quantities involved are: the risk index of the $i$-th target training sample; the reward conflict degree of the $i$-th target training sample; the chain-of-thought monitorability index of the $i$-th target training sample; and two preset weighting coefficients.
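The exact formula for the per-sample risk index is not legible in this text, so the following is only a sketch under a loudly stated assumption: risk is taken to rise with the reward conflict degree and fall with the monitorability index, as `alpha * conflict + beta * (1 - monitorability)`, and the batch risk index is taken to be the mean of the per-sample values. The names `sample_risk`, `batch_risk`, `alpha`, and `beta` are hypothetical.

```python
from typing import Sequence


def sample_risk(conflict: float,
                monitorability: float,
                alpha: float,
                beta: float) -> float:
    """Assumed form: alpha * C_i + beta * (1 - M_i).

    This is not confirmed by the source; only the inputs and the two
    preset weighting coefficients are.
    """
    return alpha * conflict + beta * (1.0 - monitorability)


def batch_risk(sample_risks: Sequence[float]) -> float:
    """Assumed aggregation: arithmetic mean over the samples of the batch."""
    if not sample_risks:
        raise ValueError("batch must contain at least one sample")
    return sum(sample_risks) / len(sample_risks)
```

Other aggregations, such as the maximum or a high percentile of the per-sample risk, would make the batch index more sensitive to a few badly distorted samples; the source leaves this choice open.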
[0172] In some optional implementations, the reward structure adaptive adjustment module 205 is further configured to: if the risk index indicates a chain-of-thought distortion risk, add a safety constraint term to the total reward function; or, if the total reward function already includes a safety constraint term, increase the weight of the safety constraint term.
[0173] In some optional implementations, the safety constraint term corresponding to a target training sample is computed from its reasoning expression completeness score, its consistency score, and its reward conflict degree, each with its own weighting coefficient.
[0174] The quantities involved are: the safety constraint term corresponding to the $i$-th target training sample; the reasoning expression completeness score of the $i$-th target training sample; the consistency score between the chain-of-thought text and the final result text of the $i$-th target training sample; the reward conflict degree of the $i$-th target training sample; and three weighting coefficients, one for each part of the safety constraint term.
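The formula for the safety constraint term is not legible in this text. The sketch below therefore assumes one plausible sign convention, in which the term is a reward that grows with completeness and consistency and shrinks with reward conflict; a penalty-style convention would be equally compatible with the listed quantities. The function name `safety_term` and the `lambdas` tuple are hypothetical.

```python
from typing import Tuple


def safety_term(completeness: float,
                consistency: float,
                conflict: float,
                lambdas: Tuple[float, float, float]) -> float:
    """Assumed form: l1 * E_i + l2 * A_i - l3 * C_i.

    The source confirms only the three inputs and three weighting
    coefficients; the signs used here are an assumption.
    """
    l1, l2, l3 = lambdas
    return l1 * completeness + l2 * consistency - l3 * conflict
```

Under this convention, a sample with complete and consistent reasoning and low conflict receives a higher safety term than a sample whose chain-of-thought is compressed or inconsistent with the final result.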
[0175] This invention also provides a computer storage medium storing computer-executable instructions, including a program for executing the above-described large language model post-training method, wherein the computer-executable instructions can execute the method in any of the above-described method embodiments.
[0176] The computer storage medium can be any available medium or data storage device that a computer can access, including but not limited to magnetic storage (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO)), optical storage (e.g., CD, DVD, BD, HVD), and semiconductor storage (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state drive (SSD)).
[0177] Figure 3 shows a structural block diagram of an electronic device according to another embodiment of the present invention. The electronic device 1100 may be a host server with computing capabilities, a personal computer (PC), a portable computer, a terminal, or the like. The specific embodiments of the present invention do not limit the specific implementation of the electronic device.
[0178] The electronic device 1100 includes at least one processor 1110, a communication interface 1120, a memory 1130, and a bus 1140. The processor 1110, the communication interface 1120, and the memory 1130 communicate with each other via the bus 1140.
[0179] The communication interface 1120 is used to communicate with network elements, including, for example, virtual machine management centers and shared storage.
[0180] Processor 1110 is used to execute programs. Processor 1110 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
[0181] Memory 1130 is used to store executable instructions. Memory 1130 may include high-speed RAM and may also include non-volatile memory, such as at least one disk storage device. Memory 1130 may also be a memory array. Memory 1130 may also be divided into blocks, and the blocks may be combined into virtual volumes according to certain rules. The instructions stored in memory 1130 can be executed by processor 1110 to enable processor 1110 to execute the large language model post-training method in any of the above method embodiments.
[0182] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for post-training a large language model, characterized in that the method comprises:
obtaining a training task set for post-training the large language model, the training task set including training samples of multiple training batches;
for the target training samples corresponding to the current training batch, determining, based on the large language model, the chain-of-thought text and the final result text corresponding to each target training sample, so as to train the large language model according to the total reward function of the current training batch and update the model parameters of the large language model, the total reward function including a result reward item and a chain-of-thought reward item;
determining the reward conflict degree and chain-of-thought monitorability index of each target training sample based on the chain-of-thought text and final result text corresponding to that target training sample;
determining the risk index of the current training batch based on the reward conflict degree and chain-of-thought monitorability index of each target training sample;
when the risk index indicates a chain-of-thought distortion risk, reducing the weight of the chain-of-thought reward item in the total reward function and maintaining or increasing the weight of the result reward item, so as to update the total reward function, the updated total reward function being used to train the large language model in the next training batch to continue updating the model parameters of the large language model until training of the large language model is completed.
2. The large language model post-training method according to claim 1, characterized in that determining the reward conflict degree of the target training sample comprises:
based on the chain-of-thought text of the target training sample, determining the degree of compression of the chain-of-thought length, the degree of restriction on the explicit expression of key task semantics, and the degree of incompleteness of the reasoning steps of the target training sample;
weighting the degree of compression of the chain-of-thought length, the degree of restriction on the explicit expression of key task semantics, and the degree of incompleteness of the reasoning steps to determine the reward conflict degree of the target training sample.
3. The large language model post-training method according to claim 2, characterized in that the degree of compression of the chain-of-thought length of the target training sample is:
$$C_i^{\mathrm{len}} = \max\left(0,\ \frac{L_{\min} - L_i}{L_{\min}}\right)$$
where $C_i^{\mathrm{len}}$ denotes the degree of compression of the chain-of-thought length of the $i$-th target training sample; $L_{\min}$ denotes the preset minimum expression length threshold; and $L_i$ denotes the length of the chain-of-thought text of the $i$-th target training sample;
the degree of restriction on the explicit expression of key task semantics of the target training sample is:
$$C_i^{\mathrm{sem}} = 1 - \frac{|K_i^{\mathrm{req}} \cap K_i^{\mathrm{exp}}|}{|K_i^{\mathrm{req}}|}$$
where $C_i^{\mathrm{sem}}$ denotes the degree of restriction on the explicit expression of key task semantics of the $i$-th target training sample; $K_i^{\mathrm{req}}$ denotes the set of key semantics required to complete the task corresponding to the $i$-th target training sample; $K_i^{\mathrm{exp}}$ denotes the set of key semantics actually and explicitly expressed in the chain-of-thought text of the $i$-th target training sample; $|K_i^{\mathrm{req}}|$ denotes the number of semantic elements in the required key semantic set; and $|K_i^{\mathrm{req}} \cap K_i^{\mathrm{exp}}|$ denotes the number of semantic elements in the intersection of the two key semantic sets;
the degree of incompleteness of the reasoning steps of the target training sample is:
$$C_i^{\mathrm{step}} = 1 - \frac{n_i}{N_i}$$
where $C_i^{\mathrm{step}}$ denotes the degree of incompleteness of the reasoning steps of the $i$-th target training sample; $n_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $N_i$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample.
4. The large language model post-training method according to claim 1, characterized in that determining the chain-of-thought monitorability index of the target training sample comprises:
based on the chain-of-thought text of the target training sample, determining the reasoning expression completeness score of the target training sample;
based on the chain-of-thought text and the final result text corresponding to the target training sample, determining the monitoring and recognition capability score and the consistency score of the target training sample, the consistency score representing the degree of consistency between the chain-of-thought text and the final result text;
weighting the reasoning expression completeness score, the monitoring and recognition capability score, and the consistency score of the target training sample to determine the chain-of-thought monitorability index of the target training sample.
5. The large language model post-training method according to claim 4, characterized in that the monitoring and recognition capability score of the target training sample is:
$$M_i^{\mathrm{rec}} = \frac{c_i}{T_i}$$
where $M_i^{\mathrm{rec}}$ denotes the monitoring and recognition capability score of the $i$-th target training sample; $c_i$ denotes the number of times the monitor correctly identifies the $i$-th target training sample; and $T_i$ denotes the total number of times the monitor judges the $i$-th target training sample;
the reasoning expression completeness score of the target training sample is:
$$E_i = \frac{n_i}{N_i}$$
where $E_i$ denotes the reasoning expression completeness score of the $i$-th target training sample; $n_i$ denotes the number of valid reasoning steps identified in the chain-of-thought text of the $i$-th target training sample; and $N_i$ denotes the number of standard reasoning steps required to complete the task corresponding to the $i$-th target training sample;
the consistency score of the target training sample is:
$$A_i = \frac{|S_i^{\mathrm{cot}} \cap S_i^{\mathrm{res}}|}{|S_i^{\mathrm{res}}|}$$
where $A_i$ denotes the consistency score between the chain-of-thought text and the final result text of the $i$-th target training sample; $S_i^{\mathrm{res}}$ denotes the result semantic set extracted from the final result text; $S_i^{\mathrm{cot}}$ denotes the reasoning semantic set extracted from the chain-of-thought text; $|S_i^{\mathrm{res}}|$ denotes the number of semantic elements in the result semantic set; and $|S_i^{\mathrm{cot}} \cap S_i^{\mathrm{res}}|$ denotes the number of semantic elements in the reasoning semantic set that are consistent with the result semantic set.
6. The large language model post-training method according to claim 1, characterized in that determining the risk index of the current training batch based on the reward conflict degree and chain-of-thought monitorability index of each target training sample comprises:
determining the risk index of each target training sample based on the reward conflict degree and chain-of-thought monitorability index of that target training sample;
determining the risk index of the current training batch based on the risk indexes of the target training samples in the current training batch;
wherein the risk index of the $i$-th target training sample is computed from the reward conflict degree of the $i$-th target training sample and the chain-of-thought monitorability index of the $i$-th target training sample using two preset weighting coefficients.
7. The large language model post-training method according to claim 1, characterized in that the method further comprises: if the risk index indicates a chain-of-thought distortion risk, adding a safety constraint term to the total reward function; or, if the total reward function already includes a safety constraint term, increasing the weight of the safety constraint term.
8. The large language model post-training method according to claim 7, characterized in that the safety constraint term corresponding to the $i$-th target training sample is computed from the reasoning expression completeness score of the $i$-th target training sample, the consistency score between the chain-of-thought text and the final result text of the $i$-th target training sample, and the reward conflict degree of the $i$-th target training sample, each weighted by its own weighting coefficient within the safety constraint term.
9. A computer storage medium, characterized in that the computer storage medium stores computer-executable instructions for executing the large language model post-training method according to any one of claims 1 to 8.
10. An electronic device, characterized in that the electronic device comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the large language model post-training method according to any one of claims 1 to 8.