Model training method, task processing method, computer program product, and device
By concatenating a no-inference prefix to the inputs of a large language reasoning model and adjusting the model parameters using multiple inference operation modes and an accuracy-gated inference length penalty, the overthinking problem is addressed, achieving higher accuracy and adaptability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING SANKUAI CLOUD COMPUTING TECH CO LTD
- Filing Date
- 2025-08-22
- Publication Date
- 2026-06-16
AI Technical Summary
Existing large language reasoning models suffer from overthinking in reasoning tasks, and existing remedies lower reasoning accuracy, while prompt control generalizes poorly.
Sample input information is acquired and concatenated with a no-inference prefix to obtain intermediate input information, and training samples are generated by analyzing it under multiple inference operation modes. When the accuracy threshold is met, an inference length penalty is triggered, and the model parameters are adjusted to obtain a trained inference model that can adapt to different task difficulties.
The method improves the accuracy and completeness of the reasoning model, reduces redundant reasoning, and enhances the model's versatility and generalization ability, making it suitable for various reasoning scenarios.
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and more specifically, to a model training method, a task processing method, a computer program product, and an electronic device.
Background Technology
[0002] To address the overthinking problem in large language reasoning models during reasoning tasks, related technologies typically employ methods such as supervised learning optimization, reinforcement learning optimization, or prompt control.
[0003] However, the above methods reduce inference accuracy, cause premature shortening of inference, and affect the integrity of model inference; in addition, prompt control needs to be designed specifically for each task and scenario, and generalizes poorly.
Summary of the Invention
[0004] The purpose of this disclosure is to provide a model training method, a task processing method, and computer program products and electronic devices, thereby overcoming, at least to some extent, the problem of low inference accuracy caused by the limitations and defects of related technologies.
[0005] Other features and advantages of this disclosure will become apparent from the following detailed description, or may be learned in part from practice of this disclosure.
[0006] According to one aspect of this disclosure, a model training method is provided, comprising: acquiring sample input information of an inference task; concatenating the sample input information with specified information represented by a no-inference prefix to obtain intermediate input information; the specified information being used to reduce redundant inference by the inference model during the execution of the inference task; analyzing the intermediate input information through multiple inference operation modes of the inference model to determine training samples; grouping the training samples, and triggering an inference length penalty of the inference model to determine a reward function when the inference accuracy of each group of training samples meets an accuracy threshold; adjusting the model parameters of the inference model based on the reward function to obtain a trained inference model; wherein the trained inference model is used to analyze and process text information and / or image information of the task to be processed through an inference operation mode matching the task to be processed, and generate processing results.
[0007] In one exemplary embodiment of this disclosure, the step of analyzing the intermediate input information through multiple inference operation modes of the inference model to determine training samples includes: analyzing and processing the intermediate input information according to the multiple inference operation modes of the inference model to generate multiple answer information corresponding to the multiple inference operation modes, so as to determine training samples; wherein, the inference operation modes include a zero-thinking sampling mode and a self-recovering inference sampling mode, and the inference chain lengths of the multiple inference operation modes are different.
[0008] In one exemplary embodiment of this disclosure, the step of triggering the inference model's inference length penalty to determine the reward function includes: for each training sample, obtaining the longest and shortest answer information from the multiple answer information included in each training sample; determining the length difference between the longest and shortest answer information, and calculating the overlength ratio corresponding to the training sample based on the ratio of the length difference to a pre-configured penalty window length; determining the penalty coefficient of the training sample, and determining the penalty based on the overlength ratio and the penalty coefficient; and determining the reward function based on the answer correctness indicator function and the penalty.
[0009] In one exemplary embodiment of this disclosure, determining the penalty coefficient of the training sample includes: determining the penalty coefficient based on the inference accuracy, a fixed parameter, and the accuracy threshold when the inference accuracy is greater than or equal to an accuracy threshold.
[0010] In one exemplary embodiment of this disclosure, adjusting the model parameters of the inference model based on the reward function to obtain a trained inference model includes: determining a reward value through the reward function; and performing reinforcement learning on the inference model based on the reward value to determine the trained inference model.
[0011] In one exemplary embodiment of this disclosure, the step of performing reinforcement learning on the inference model based on the reward value to determine the trained inference model includes: determining an advantage based on the reward value, and determining a policy gradient based on the advantage; adjusting policy parameters according to the gradient direction of the policy gradient to obtain the trained inference model.
[0012] According to one aspect of this disclosure, a task processing method is provided, comprising: acquiring text information and / or image information of a task to be processed; inputting the text information and / or image information into a trained inference model; performing model analysis on the text information and / or image information through an inference operation mode matched with the task to be processed; and obtaining a processing result corresponding to the input information; wherein the trained inference model is trained according to any one of the model training methods described above.
[0013] In one exemplary embodiment of this disclosure, the step of performing model analysis on the text information and / or image information by means of an inference operation mode matching the task to be processed to obtain a processing result corresponding to the input information includes: determining the task difficulty of the task to be processed; determining an inference operation mode matching the task to be processed based on the task difficulty; performing model analysis on the text information and / or image information of the task to be processed based on the inference chain corresponding to the inference operation mode; and determining the processing result corresponding to the text information and / or image information.
[0014] According to one aspect of this disclosure, a computer program product is provided, which, when executed by a processor, implements the model training method or the task processing method described in any one of the preceding claims.
[0015] According to one aspect of this disclosure, an electronic device is provided, comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the model training method or the task processing method described in any one of the preceding claims by executing the executable instructions.
[0016] The technical solution provided in this disclosure, on the one hand, concatenates specified information into the sample input information and, during model training, triggers the inference length penalty only when the inference accuracy meets the accuracy threshold. This avoids the problem in related technologies of shortening the inference chain for all tasks, enables adaptive compression and recovery of the inference chain of the inference model, avoids redundant inference, and improves the accuracy of model training. On the other hand, because the length penalty is applied to the reward only after the inference accuracy reaches the accuracy threshold, the inference model is prevented from prematurely shortening the inference chain before the accuracy reaches the threshold. Moreover, the trained inference model can select the inference operation mode and adjust the allocation of inference resources according to the task to be processed, improving inference integrity and inference accuracy. Furthermore, the solution avoids the need for manual modification of prompts, can be applied to any inference scenario and model, increases versatility, scalability, and flexibility, and improves generalization ability.
[0017] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Description of the Drawings
[0018] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure. It is obvious that the drawings described below are merely some embodiments of this disclosure, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0019] Figure 1 shows a schematic diagram of a system architecture to which the model training method or task processing method of the embodiments of the present disclosure can be applied.
[0020] Figure 2 schematically shows a flowchart of a model training method according to an embodiment of the present disclosure.
[0021] Figure 3 schematically shows the process of determining the reward function in an embodiment of this disclosure.
[0022] Figure 4 schematically shows the process of adaptive self-recovery inference training in an embodiment of this disclosure.
[0023] Figure 5 schematically shows a flowchart of a task processing method according to an embodiment of the present disclosure.
[0024] Figure 6 shows a simple instruction distribution and a difficult instruction distribution generated by the inference model of an embodiment of this disclosure.
Detailed Description
[0025] Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this disclosure more comprehensive and complete, and to fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of embodiments of this disclosure. However, those skilled in the art will recognize that the technical solutions of this disclosure can be practiced with one or more of the specific details omitted, or other methods, components, apparatus, steps, etc., can be employed. In other instances, well-known technical solutions are not shown or described in detail to avoid obscuring various aspects of this disclosure.
[0026] Furthermore, the accompanying drawings are merely illustrative of this disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and therefore repeated descriptions of them will be omitted. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0027] In related technologies, the "overthinking" problem of Large Language Reasoning Models (LRMs) during reasoning tasks is generally addressed through supervised learning optimization, reinforcement learning optimization, or prompt control. Supervised learning optimization uses supervised fine-tuning and direct preference optimization, combined with a specially designed short-chain reasoning dataset, to achieve fine-grained suppression of the model's output reasoning length. However, it indiscriminately reduces the model's response length across different instruction distributions, resulting in decreased accuracy for instructions requiring deep thinking and poor adaptability. Supervised fine-tuning also reduces the diversity of responses during sampling, making it more difficult to sample better solutions. Reinforcement learning optimization adds rewards based on reasoning length to reinforcement learning training guided by accuracy rewards, encouraging the model to generate concise reasoning chains while maintaining a certain level of accuracy, in an attempt to balance performance and efficiency. However, length rewards in reinforcement learning often lead the model to shorten its reasoning prematurely, before it has achieved the required accuracy, which affects the completeness of reasoning for complex problems. Prompt control uses different prompt designs to guide the model to generate more concise reasoning chains and reduce unnecessary reasoning steps. However, prompt control relies on manually designed control mechanisms or prompt engineering, making it difficult to achieve adaptive allocation of inference resources across task types and difficulty levels, which limits its generalization and practicality.
[0028] In order to solve the technical problems existing in the related technologies, some embodiments of this disclosure provide a model training method that can be applied to train the inference model used in any inference task to optimize redundant inference.
[0029] Figure 1 shows a schematic diagram of a system architecture to which the model training methods, task processing methods, and apparatus of some embodiments of this disclosure can be applied.
[0030] In some embodiments, the system architecture 100 upon which model training and task processing depend may include a terminal 110 and a server 120. The terminal 110 may be of any type. The terminal 110 and the server 120 can transmit information via a network 130. During the model training phase, the server 120 performs model analysis on the sample input information uploaded by the terminal 110 and outputs answer information. Based on the sample input information and answer information, training samples are obtained, a reward function is determined according to the training samples, and the training process of the inference model is implemented according to the reward function. During the task processing phase, textual and / or image information of the task to be processed can be analyzed, so that the trained inference model can invoke a target inference operation mode related to the task difficulty to analyze and process the textual and / or image information, obtaining the corresponding processing results.
[0031] It should be noted that in the above model training and task processing methods, server 120 can be any type of server or server cluster. Terminal 110 can be any type of device. The above model training and task processing methods can be executed entirely by the server; or they can be partially executed by the server and partially by the terminal, without specific limitations here.
[0032] Next, each step of the model training method of the present disclosure is described in detail with reference to Figure 2.
[0033] In step S210, sample input information of the inference task is obtained, and the sample input information is concatenated with the specified information represented by the no-inference prefix to obtain intermediate input information; the specified information is used to reduce redundant inference in the inference model during the execution of the inference task.
[0034] In some embodiments, the inference task can be any type of inference task, that is, a task of extracting features from input information and generating a predicted output based on those features. Inference tasks can be classification tasks, detection tasks, generation tasks, multimodal inference tasks, etc. The sample input information can be the input data corresponding to the inference task. The input data is determined by the inference task itself; for example, it can be sample text information, sample image information, or multimodal information composed of sample text information and sample image information. There can be one or more inference tasks, and they can be tasks that have already been executed, used for training the model. In this embodiment, the input of each inference task can be structured to obtain sample input information. The sample input information of each inference task can be represented as $x_i$. In addition, specified information can be injected into the sample input information to obtain intermediate input information. The specified information is used to reduce redundant reasoning by the inference model during the inference task. The specified information can be a No-Thinking prefix, which can be represented as $p_{\text{term}}$, for example. Redundant reasoning refers to unnecessary or repetitive reasoning in the reasoning process of the inference model.
[0035] For the sample input information of each inference task, the sample input information can be concatenated with the specified information represented by the no-inference prefix to obtain the intermediate input information. The no-inference prefix can be attached at any position in the sample input information; the position is not specifically restricted. For instance, for each inference task input $x_i$, the No-Thinking prefix $p_{\text{term}}$ is concatenated to form a new input, which serves as the intermediate input information. The intermediate input information can include, for example, the text "Okay, I have finished thinking."
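The concatenation in step S210 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `build_intermediate_input`, the newline separator, and the choice to place the prefix after the sample input are assumptions, since the description leaves the position open. The prefix text is the example quoted above.

```python
# Example prefix text taken from the description; any other wording is possible.
NO_THINKING_PREFIX = "Okay, I have finished thinking."


def build_intermediate_input(sample_input: str,
                             prefix: str = NO_THINKING_PREFIX,
                             sep: str = "\n") -> str:
    """Concatenate the no-inference prefix with the sample input x_i.

    Placing the prefix after the input, separated by a newline, is an
    assumption of this sketch; the description allows any position.
    """
    return f"{sample_input}{sep}{prefix}"


intermediate = build_intermediate_input("What is 17 + 25?")
print(intermediate)
```

The resulting string is what would be passed to the inference model in step S220.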
[0036] In step S220, the intermediate input information is analyzed through multiple inference operation modes of the inference model to determine the training samples.
[0037] In this embodiment, the intermediate input information, obtained by concatenating the sample input information and the specified information, can be input into the inference model. The inference model then analyzes and processes the intermediate input information using multiple inference operation modes to generate training samples. The inference chain lengths of the multiple inference operation modes are different. The inference model can perform model analysis on the intermediate input information multiple times to obtain multiple pieces of answer information, and the training samples are determined based on them. The inference operation modes used in the multiple model analyses can be the same or different.
[0038] The inference model includes multiple inference operation modes, which can include zero-thinking sampling mode and self-recovering inference sampling mode. Self-recovering inference sampling mode can also be understood as a self-recovering supplementary thinking mode. Zero-thinking sampling mode refers to using the "No-Thinking" prefix to prompt the inference model to directly output the answer, minimizing redundant inference steps and generating a zero-thinking answer output. Self-recovering inference sampling mode refers to implicitly activating the internal inference path when generating answer information, automatically supplementing the inference steps required to achieve the inference task, and generating a self-recovering supplementary thinking answer output.
[0039] For reasoning tasks, inference prediction can be performed according to two inference operation modes of the inference model, predicting multiple answer information corresponding to the sample input information. The inference task can be executed multiple times, and after multiple inferences, each inference operation mode can generate at least one answer information. The answer information generated in each inference can be the same or different.
[0040] The input information can be a sample question, and the generated answer information can be the predicted answer information for that sample question. Therefore, training samples can include both the input information and the answer information; that is, training samples can include a sample question and multiple corresponding answer information. The number of words in the multiple answer information can vary, and the length of the answer information can also vary.
[0041] In this embodiment of the disclosure, by utilizing the inherent capabilities of the No-Thinking prefix and the inference model, the inference chain of the inference model is adaptively compressed and restored, thereby improving the inference model's perception and adaptability to task difficulty.
[0042] Next, in step S230, the training samples are grouped, and if the inference accuracy of each group of training samples meets the accuracy threshold, the inference length penalty of the inference model is triggered to determine the reward function.
[0043] In this embodiment, after determining the training samples, they can be grouped according to the same sample input information, grouping the answer information corresponding to the same sample question into one group. Further, the answer information obtained from different inference operation modes for the same sample input information in the training samples can be compared with the actual answer information of the sample input information. If the comparison result is consistent or similar, the answer information is determined to be correct. The inference accuracy rate of each group of training samples is determined based on the proportion of correct answer information, and the inference accuracy rate of each group of training samples is statistically analyzed in real time.
[0044] Inference accuracy meeting the accuracy threshold means that the inference accuracy is greater than or equal to the accuracy threshold. When the inference accuracy of each training sample is greater than or equal to the accuracy threshold, the inference length penalty of the inference model can be activated. This prevents premature compression of the inference chain when the inference accuracy does not meet the threshold, thus improving accuracy. The accuracy threshold can be a value set according to actual needs, such as 0.6 or 0.8, etc., without specific limitations here.
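The grouping and accuracy gating of step S230 can be illustrated as follows. This is a minimal sketch under stated assumptions: the names `group_accuracy` and `penalty_enabled` are hypothetical, correctness is checked by exact string match after trimming whitespace (the description also allows "similar" answers), and the default threshold of 0.8 is one of the example values mentioned above.

```python
from collections import defaultdict


def group_accuracy(samples):
    """Group answers by sample question and return {question: accuracy}.

    `samples` is an iterable of (question, answer, reference) tuples.
    Exact match after stripping whitespace is an assumption of this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # question -> [correct, total]
    for question, answer, reference in samples:
        counts[question][0] += int(answer.strip() == reference.strip())
        counts[question][1] += 1
    return {q: correct / total for q, (correct, total) in counts.items()}


def penalty_enabled(accuracy: float, threshold: float = 0.8) -> bool:
    """The length penalty is triggered only when accuracy >= threshold."""
    return accuracy >= threshold


samples = [
    ("q1", "42", "42"),
    ("q1", "41", "42"),
    ("q2", "7", "7"),
    ("q2", " 7 ", "7"),
]
accuracy = group_accuracy(samples)
gates = {q: penalty_enabled(a) for q, a in accuracy.items()}
```

In this toy data, the group for `q1` stays below the threshold, so its answers would be trained on accuracy alone, while the group for `q2` would have the length penalty switched on.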
[0045] In some embodiments, by activating the inference length penalty, a reward function can be determined, and then the model parameters of the inference model can be adjusted according to the reward function to obtain a trained inference model.
[0046] Figure 3 schematically shows a flowchart for determining the reward function. As shown in Figure 3, the main steps include:
[0047] Step S310: For each training sample, calculate the overlength ratio corresponding to the training sample;
[0048] Step S320: Determine the penalty coefficient of the training sample, and determine the penalty based on the overlength ratio and the penalty coefficient;
[0049] Step S330: Determine the reward function based on the answer correctness indicator function and the penalty.
[0050] In this embodiment of the disclosure, the overlength ratio can be calculated for each training sample. The overlength ratio expresses the length difference between the longest and shortest answer information in the training sample as a proportion of a penalty window length. For example, the longest and shortest answer information corresponding to the sample input information can be determined from the multiple pieces of answer information contained in the training sample, and their lengths can be determined from the number of words they contain. It should be noted that the longest and shortest answer information here refer to correct answer information. Further, the length difference between the longest and shortest answer information can be calculated, and the overlength ratio can be determined as the ratio of the length difference to the pre-configured penalty window length. The pre-configured penalty window length can be set according to actual needs; for example, it can be 100 or another value. Based on this, the overlength ratio can be calculated as follows:
[0051]
$$O_i = \frac{L_i - L_{\text{correct\_shortest}}}{L_{\text{window}}} \tag{1}$$
[0052] where $O_i$ is the overlength ratio; $L_{\text{correct\_shortest}}$ is the length of the shortest answer information in this group of training samples; $L_i$ is the length of the longest answer information in this group of training samples; and $L_{\text{window}}$ is the fixed penalty window length.
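The overlength ratio described above can be computed as in the following sketch. The helper names are hypothetical, word counts are obtained by whitespace splitting (an assumption that suits space-delimited text only), and returning 0.0 when a group has no correct answer is a choice made here, not something the description specifies.

```python
def answer_length(text: str) -> int:
    """Length of an answer in words (whitespace splitting is an assumption)."""
    return len(text.split())


def overlength_ratio(correct_answers, window: int = 100) -> float:
    """Length gap between the longest and shortest correct answers,
    divided by the pre-configured penalty window length.

    Returning 0.0 for an empty group is an assumption of this sketch.
    """
    lengths = [answer_length(a) for a in correct_answers]
    if not lengths:
        return 0.0
    return (max(lengths) - min(lengths)) / window


short_answer = "the answer is"   # 3 words
long_answer = "step " * 53       # 53 words
ratio = overlength_ratio([short_answer, long_answer], window=100)
print(ratio)  # (53 - 3) / 100
```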
[0053] Furthermore, the penalty coefficient for the training samples can be determined. The penalty coefficient can be used to represent the magnitude of the adjustment of the error by the regularization process, or it can be used to represent the intensity of the penalty. Its main function is to limit the model's overfitting to the training data by controlling the weight of the regularization term, thereby improving the model's generalization ability.
[0054] In this embodiment of the disclosure, the penalty coefficient can be determined to be 0 based on the comparison result between the inference accuracy and the accuracy threshold, or the penalty coefficient can be determined based on the inference accuracy, fixed parameters, and the accuracy threshold. For example, if the comparison result is that the inference accuracy is less than the accuracy threshold, the penalty coefficient is 0. When the inference accuracy is greater than or equal to the accuracy threshold, the penalty coefficient is determined based on the inference accuracy, fixed parameters, and the accuracy threshold. For example, the fixed parameters can include a first parameter and a second parameter. The first parameter can be a scaling factor, and the second parameter can be a constant. Based on this, the difference between the inference accuracy and the accuracy threshold can be added to the second parameter to obtain a first addition result. The first parameter is multiplied by the first addition result to obtain a multiplication result, and the ratio between the multiplication result and the second addition result is further calculated to determine the penalty coefficient. The second addition result can be obtained by adding the difference between a preset value and the accuracy threshold to the second parameter. The preset value can be 1. The penalty coefficient can be positively correlated with the inference accuracy. For example, as the inference accuracy increases, the penalty coefficient increases. The penalty coefficient can be specifically calculated according to formula (2):
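[0055]
$$\alpha = \begin{cases} 0, & \text{Acc} < \tau \\ \beta \cdot \dfrac{\text{Acc} - \tau + \epsilon}{1 - \tau + \epsilon}, & \text{Acc} \ge \tau \end{cases} \tag{2}$$
where $\alpha$ is the penalty coefficient, $\text{Acc}$ is the inference accuracy, $\tau$ is the accuracy threshold, $\beta$ is the first parameter (the scaling factor), and $\epsilon$ is the second parameter (the constant).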
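The accuracy-gated penalty coefficient described in paragraph [0054] can be sketched as a small function. The parameter names `beta` (first parameter, scaling factor) and `eps` (second parameter, constant) and all default values are assumptions chosen for illustration; the description only fixes the preset value at 1.

```python
def penalty_coefficient(accuracy: float,
                        threshold: float = 0.8,
                        beta: float = 1.0,
                        eps: float = 0.01) -> float:
    """Penalty coefficient gated on the group's inference accuracy.

    Below the threshold the coefficient is 0, so no length penalty applies.
    At or above it, the coefficient grows with accuracy:
    beta * (accuracy - threshold + eps) / (1 - threshold + eps).
    """
    if accuracy < threshold:
        return 0.0
    return beta * (accuracy - threshold + eps) / (1.0 - threshold + eps)


for acc in (0.5, 0.8, 0.9, 1.0):
    print(acc, penalty_coefficient(acc))
```

With these defaults the coefficient is small but non-zero exactly at the threshold and reaches `beta` when the accuracy is 1, which matches the positive correlation with accuracy stated in the description.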
[0056] After the overlength ratio and the penalty coefficient are determined, the penalty can be determined from them. A penalty typically refers to a regularization term, which suppresses excessively large values of model parameters by adding an additional cost term, thereby preventing the model from overfitting the training data. In this embodiment, the penalty can be determined as the product of the overlength ratio and the penalty coefficient.
[0057] Furthermore, the reward function can be determined based on the answer correctness indicator function and the calculated penalty. Specifically, the reward function can be obtained by subtracting the penalty from the answer correctness indicator function. The answer correctness indicator function is a function used to indicate whether a certain answer is correct. The output of the answer correctness indicator function can be 1 (indicating correct) or 0 (indicating incorrect). The reward function can be calculated in the following form:
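[0058]
$$R_i = \mathbb{I}(y_i = y^*) - \alpha \cdot O_i \tag{3}$$
[0059] where $\mathbb{I}(y_i = y^*)$ is the answer correctness indicator function, equal to 1 when the answer $y_i$ matches the actual answer $y^*$ and 0 otherwise; $\alpha$ is the dynamically adjusted penalty coefficient; and $\alpha \cdot O_i$, the product of the penalty coefficient and the overlength ratio $O_i$, is the penalty.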
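As a worked illustration of the reward described in paragraph [0057], the sketch below subtracts the penalty (penalty coefficient times overlength ratio) from the correctness indicator. The function name `reward` and the example numbers are hypothetical.

```python
def reward(is_correct: bool, alpha: float, overlength: float) -> float:
    """Reward = correctness indicator (1 or 0) minus the length penalty.

    `alpha` is the penalty coefficient (0 while the accuracy gate is closed)
    and `overlength` is the overlength ratio.
    """
    return float(is_correct) - alpha * overlength


# Gate closed (alpha = 0): a correct answer keeps the full reward,
# however long it is.
r_gate_closed = reward(True, alpha=0.0, overlength=0.9)

# Gate open: a correct but verbose answer is penalized.
r_gate_open = reward(True, alpha=0.5, overlength=0.5)

# An incorrect answer earns no correctness reward.
r_incorrect = reward(False, alpha=0.0, overlength=0.5)
```

The first two cases show the intended behavior of the gate: length only costs reward once the group is already accurate enough.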
[0060] In step S240, the model parameters of the inference model are adjusted based on the reward function to obtain a trained inference model. The trained inference model is used to analyze and process the text information and / or image information of the task to be processed through an inference operation mode that matches the task to be processed, and generate processing results.
[0061] In this embodiment, the reward function defines reward values for different states and actions and thereby guides the inference model to make optimal decisions in complex environments. The reward function directly determines the objective of the inference model acting as the agent, which is to maximize the cumulative reward by learning a policy. The reward is the feedback the inference model receives immediately after performing an action, and it is used to adjust the model's policy so that its decisions improve.
[0062] Once the reward function is determined, the reward value can be determined based on the reward function. Furthermore, the inference model can be trained using reinforcement learning algorithms based on the reward value, thereby obtaining a trained inference model.
[0063] In some embodiments, the process of performing reinforcement learning on the inference model based on the reward value mainly includes: determining the advantage based on the reward value, and determining the policy gradient based on the advantage; adjusting the model parameters of the inference model according to the gradient direction of the policy gradient to obtain a trained inference model. For example, the inference model can obtain the current state from the environment, select an action according to the current policy, return the action to the environment, and substitute it into the reward function to obtain the reward value. A target value is calculated based on the reward value; the target value can be the advantage, which measures the quality of the current action relative to the average level. The advantage can be, for example, a generalized advantage estimate, which can be determined based on the reward value, a discount factor, and a state value function. The policy gradient is determined based on the target value; the policy gradient measures the impact of changes in policy parameters on the reward value. Policy parameters refer to the learnable model parameters of the inference model, such as the weights and biases of the inference model.
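The generalized advantage estimate mentioned above, which depends on the reward value, a discount factor, and a state value function, can be sketched with the standard backward recursion. This is an illustration rather than the disclosed training procedure: the smoothing parameter `lam` and all default values are assumptions, and the value estimates are passed in as a plain list instead of coming from a learned value function.

```python
def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """Compute advantages with the standard GAE recursion.

    `values` must hold len(rewards) + 1 entries: one state value per step
    plus a final bootstrap value (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two-step episode with a reward only at the end and zero value estimates.
adv = generalized_advantage_estimate([0.0, 1.0], [0.0, 0.0, 0.0],
                                     gamma=0.5, lam=1.0)
print(adv)
```

A positive advantage means the sampled action did better than the value estimate predicted, so the policy gradient step increases its probability.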
[0064] The policy parameters are adjusted according to the gradient direction until the change in the mean reward or the policy gradient is less than a threshold. After multiple iterations, the policy parameters are optimized to a state that maximizes long-term rewards, at which point a well-trained inference model is obtained. By using the reward function to adjust the model parameters of the inference model, redundant inference can be compressed to the maximum extent while ensuring inference accuracy, reducing computational load and improving inference efficiency.
[0065] In this embodiment, accuracy-rewarded training and accuracy-gated dynamic inference length penalty are employed. By adjusting the grouped accuracy gating and dynamic penalty coefficients, a dynamic optimal balance between inference efficiency and accuracy is achieved, preventing the phenomena of "short but wrong" or "accurate but verbose" inference models. Through explicit inference suppression and implicit self-recovery mechanisms, combined with accuracy-gated dynamic inference length penalty, adaptive allocation of inference resources for different task difficulties is achieved. Experimental results show that in multiple benchmark tests, the ASRR framework significantly reduces inference overhead (up to 32.5%) with minimal accuracy loss (less than 1.2%).
[0066] It should be noted that, because the input training samples contain answer information for different reasoning operation modes, the trained reasoning model can also automatically identify the reasoning operation mode and select the one that matches the task to be processed. It then performs model analysis and processing on the text information and / or image information of the task to be processed, and generates processing results that match that information and conform to the selected reasoning operation mode, thereby accomplishing the task to be processed.
[0067] Figure 4 illustrates adaptive automatic recovery inference training. As shown in Figure 4, the process mainly includes explicit suppression prefix injection, model generation, reinforcement learning training, and model inference. In the explicit suppression prefix injection stage, the input of the inference task is structured and injected with a No-Thinking prefix to generate intermediate input information. The specified information represented by the No-Thinking prefix explicitly suppresses redundant inference by the model.
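The prefix injection stage can be illustrated with a short sketch. The prefix text, the function name, and the choice to append the prefix after the question with a newline are all assumptions made for illustration; this passage does not give the prefix wording or the concatenation order.

```python
# Hypothetical placeholder; the actual No-Thinking prefix text is not given here.
NO_THINKING_PREFIX = "[NO-THINKING PREFIX]"


def build_intermediate_input(sample_input: str, prefix: str = NO_THINKING_PREFIX) -> str:
    """Concatenate the sample input with the specified (no-inference) prefix.

    ASSUMPTION: the prefix is appended after the input, separated by a newline.
    """
    return f"{sample_input}\n{prefix}"


if __name__ == "__main__":
    print(build_intermediate_input("What is 2 + 3?"))
```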
[0068] During the model generation phase, the intermediate input information obtained by concatenating the specified information with the sample input information is input into the inference model. The inference model may include multiple inference operation modes, such as a zero-thinking sampling mode and a self-recovering inference sampling mode. If the sample input information of the inference task is a sample question, the model can output at least one answer corresponding to the zero-thinking sampling mode and at least one answer corresponding to the self-recovering inference sampling mode. During the reinforcement learning training phase, the inference accuracy of each training sample, consisting of the sample question and its answer outputs, can be determined. If the inference accuracy of each training sample is greater than or equal to the accuracy threshold, the dynamic inference length penalty is enabled. For example, for each training sample, the overlength ratio is determined as the ratio of the length difference between the longest and shortest answer information among the multiple answer information of the sample question to the pre-configured penalty window length. Further, if the inference accuracy is greater than or equal to the accuracy threshold, the penalty coefficient is determined based on the result of logical processing of the inference accuracy, a fixed parameter, and the accuracy threshold. The penalty can then be determined based on the product of the overlength ratio and the penalty coefficient. Once the penalty is determined, a reward function is determined based on the answer correctness indicator function and the penalty, and the reward value is determined based on the reward function. The advantage is then calculated from the reward value, and the policy gradient is determined from the advantage so as to maximize the reward. The policy parameters are adjusted according to the gradient direction. The process stops when the change in the mean reward or the policy gradient falls below a threshold, thus determining the model parameters of the inference model and obtaining a trained inference model.
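The accuracy-gated length penalty in this paragraph can be sketched as below. The overlength ratio and the product form of the penalty follow the wording above, read literally. The linear ramp used for the penalty coefficient, the `beta` and `eps` parameters, and the subtraction of the penalty from the correctness indicator are assumptions: the text only says the coefficient depends on the accuracy, a fixed parameter, and the threshold, and that the reward is based on the indicator and the penalty.

```python
from typing import List, Sequence


def overlength_ratio(answer_lengths: Sequence[int], penalty_window: int) -> float:
    """(longest - shortest answer length) / pre-configured penalty window length."""
    if penalty_window <= 0:
        raise ValueError("penalty_window must be positive")
    return (max(answer_lengths) - min(answer_lengths)) / penalty_window


def penalty_coefficient(
    accuracy: float, threshold: float, beta: float = 1.0, eps: float = 1e-8
) -> float:
    """Accuracy gate: zero below the threshold, otherwise a ramp up to beta.

    ASSUMPTION: the linear ramp, beta, and eps are illustrative choices.
    """
    if accuracy < threshold:
        return 0.0
    return beta * (accuracy - threshold + eps) / (1.0 - threshold + eps)


def group_rewards(
    correct: Sequence[bool],
    answer_lengths: Sequence[int],
    threshold: float,
    penalty_window: int,
    beta: float = 1.0,
) -> List[float]:
    """Reward per answer = correctness indicator - penalty (subtraction assumed)."""
    accuracy = sum(correct) / len(correct)  # inference accuracy of the group
    penalty = penalty_coefficient(accuracy, threshold, beta) * overlength_ratio(
        answer_lengths, penalty_window
    )
    return [(1.0 if ok else 0.0) - penalty for ok in correct]
```

With a group accuracy below the threshold the coefficient is zero, so the rewards reduce to the plain correctness indicator and the answers are not pushed to shorten.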
[0069] After obtaining the trained inference model, during the model inference stage, the text information and / or image information corresponding to the task to be processed can be input into the trained inference model to obtain the processing results corresponding to the text information and / or image information.
[0070] The technical solution in this disclosure significantly reduces the inference overhead for simple tasks through accuracy-gated rewards, while ensuring the integrity and accuracy of inference for complex tasks, achieving a balance between inference efficiency and accuracy. It avoids the diversity loss caused by supervised fine-tuning, performs better in security tests, reduces the potential risks of redundant inference, increases diversity, and improves security. It can be seamlessly integrated into the training process of mainstream large language models, is suitable for various inference scenarios and model architectures, and has good versatility and scalability. Without relying on manually designed prompts or control structures, the inference model can adaptively allocate inference resources matching the task type and difficulty, improving generalization ability and practical application effects, demonstrating strong adaptability.
[0071] This disclosure also provides a task processing method which, as shown in Figure 5, may include the following steps:
[0072] Step S510: Obtain text information and / or image information of the task to be processed;
[0073] Step S520: Input the text information and / or image information into the trained inference model, and perform model analysis on the text information and / or image information through an inference operation mode that matches the task to be processed, so as to obtain the processing result corresponding to the text information and / or image information.
[0074] In this embodiment of the disclosure, the task to be processed can be any type of inference task to be executed. The task to be processed can be a task instruction. The text information and / or image information of the task to be processed can be text information or image information, or it can be multimodal information composed of text information and image information.
[0075] After acquiring text and / or image information, the trained inference model can automatically identify the difficulty of the task to be processed. Then, based on the task difficulty, it automatically selects and switches between multiple inference operation modes corresponding to the trained inference model to match the difficulty of the task. Simple tasks output directly, while complex tasks automatically supplement the inference chain. The inference operation mode matching the task can be either zero-thinking sampling mode or self-recovering inference sampling mode. After determining the inference operation mode matching the task, the model can perform model analysis on the text and / or image information based on the inference chain corresponding to the matching inference operation mode to obtain the processing result.
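The mode selection described above can be pictured with an explicit dispatcher. In the disclosure the trained inference model makes this choice itself; the numeric difficulty score, the threshold, and all names below are hypothetical and serve only to make the control flow visible.

```python
from enum import Enum


class InferenceMode(Enum):
    ZERO_THINKING = "zero_thinking"    # simple tasks: output directly
    SELF_RECOVERING = "self_recovering"  # complex tasks: supplement the inference chain


def select_mode(difficulty: float, threshold: float = 0.5) -> InferenceMode:
    """Pick the inference operation mode that matches the task difficulty.

    ASSUMPTION: difficulty is a score in [0, 1] and 0.5 is an arbitrary cut-off.
    """
    if difficulty >= threshold:
        return InferenceMode.SELF_RECOVERING
    return InferenceMode.ZERO_THINKING
```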
[0076] In this embodiment of the disclosure, the inference model can dynamically adjust the allocation of inference resources according to the difficulty of the task, suppress redundant inference in simple tasks and ensure the integrity of inference in complex tasks, thereby effectively reducing computational overhead while maintaining or even improving the inference accuracy and security of the model.
[0077] For example, the text information and / or image information of the task to be processed are input into the trained inference model. The trained inference model selects the zero-thinking sampling mode to obtain the simple instruction distribution as the processing result based on the task difficulty represented by the text information and / or image information, or uses the self-recovering inference sampling mode to obtain the difficult instruction distribution as the processing result.
[0078] When the inference operation mode matching the task to be processed is the self-recovering inference sampling mode, the text and / or image information can be segmented to obtain text features. These text features can then be encoded to obtain encoded features. An attention weight matrix is calculated on the encoded features, and after weighted summation and multi-head parallel computation, a multi-head attention output is generated. Subsequently, the multi-head attention output is processed by a feedforward neural network, combined with residual connections and layer normalization, to achieve feature extraction. Finally, the extracted features are iteratively decoded by a decoder, and the answer information corresponding to the input information is generated as the processing result based on feature prediction. The simple instruction distribution generated by the zero-thinking sampling mode and the difficult instruction distribution generated by the self-recovering inference sampling mode can be as shown in Figure 6.
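The attention weight matrix and weighted summation mentioned above can be illustrated for a single head. This sketch assumes standard scaled dot-product attention, which the text does not name explicitly, and omits the learned projections, the other heads, the feedforward network, residual connections, and layer normalization.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(
    queries: Matrix, keys: Matrix, values: Matrix
) -> Tuple[Matrix, Matrix]:
    """Return (attention weight matrix, weighted sum of the values)."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    d_v = len(values[0])
    output = [
        [sum(w * v[j] for w, v in zip(w_row, values)) for j in range(d_v)]
        for w_row in weights
    ]
    return weights, output
```

A multi-head layer would run several such heads in parallel on different learned projections of the encoded features and concatenate their outputs.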
[0079] In this embodiment, the adaptive automatic recovery mechanism leverages the self-recovery capability of the inference model when generating answers. It allows the model to automatically supplement the necessary implicit inference for complex tasks without explicit full inference, further improving its performance on complex problems. Through training that makes the model aware of problem difficulty, reinforced by implicit automatic recovery, the inference chain length is dynamically adjusted. This enables the model to proactively reduce inference steps and improve efficiency on simple tasks, while preserving a sufficient inference process and ensuring accuracy on complex tasks. This avoids the problem, seen with supervised learning and prompt control, of compressing the inference length for all tasks alike. In reinforcement learning training, the inference length penalty is applied to the reward only after the model reaches the preset accuracy threshold within the current problem group. This prevents the model from prematurely shortening the inference chain before the accuracy threshold is reached, balancing inference efficiency with problem-solving ability on complex tasks. Without relying on manually designed prompts or control structures, the model can adaptively allocate inference resources according to different task types and difficulties, improving generalization ability and practical application effects. Through accuracy-gated rewards, the inference overhead for simple tasks is significantly reduced, while the integrity and accuracy of inference for complex tasks are ensured. The approach avoids the diversity loss associated with supervised fine-tuning, performs better in security tests, and reduces the potential risks of redundant inference. This framework can be seamlessly integrated into the training process of mainstream large language models, is suitable for various inference scenarios and model architectures, and exhibits good versatility and scalability.
[0080] In some embodiments of this disclosure, a model training apparatus is provided, comprising: an input concatenation module, a training sample determination module, a reward function determination module, and a parameter adjustment module, wherein:
[0081] The input concatenation module is used to obtain sample input information for the inference task and concatenate the sample input information with specified information represented by the no-inference prefix to obtain intermediate input information; the specified information is used to reduce redundant inference in the inference model during the execution of the inference task.
[0082] The training sample determination module is used to analyze the intermediate input information through multiple inference operation modes of the inference model to determine the training samples;
[0083] The reward function determination module is used to group training samples. If the inference accuracy of each group of training samples meets the accuracy threshold, the inference length penalty of the inference model is triggered to determine the reward function.
[0084] The parameter adjustment module is used to adjust the model parameters of the inference model based on the reward function to obtain a trained inference model; wherein, the trained inference model is used to analyze and process the text information and / or image information of the task to be processed through an inference operation mode that matches the task to be processed, and generate processing results.
[0085] In one exemplary embodiment of this disclosure, the step of analyzing the intermediate input information through multiple inference operation modes of the inference model to determine training samples includes: analyzing and processing the intermediate input information according to the multiple inference operation modes of the inference model to generate multiple answer information corresponding to the multiple inference operation modes, so as to determine training samples; wherein, the inference operation modes include a zero-thinking sampling mode and a self-recovering inference sampling mode, and the inference chain lengths of the multiple inference operation modes are different.
[0086] In one exemplary embodiment of this disclosure, the step of triggering the inference model's inference length penalty to determine the reward function includes: for each training sample, obtaining the longest and shortest answer information from the multiple answer information included in each training sample; determining the length difference between the longest and shortest answer information, and calculating the overlength ratio corresponding to the training sample based on the ratio of the length difference to a pre-configured penalty window length; determining the penalty coefficient of the training sample, and determining the penalty based on the overlength ratio and the penalty coefficient; and determining the reward function based on the answer correctness indicator function and the penalty.
[0087] In one exemplary embodiment of this disclosure, determining the penalty coefficient of the training sample includes: determining the penalty coefficient based on the inference accuracy, a fixed parameter, and the accuracy threshold when the inference accuracy is greater than or equal to an accuracy threshold.
[0088] In one exemplary embodiment of this disclosure, adjusting the model parameters of the inference model based on the reward function to obtain a trained inference model includes: determining a reward value through the reward function; and performing reinforcement learning on the inference model based on the reward value to determine the trained inference model.
[0089] In one exemplary embodiment of this disclosure, the step of performing reinforcement learning on the inference model based on the reward value to determine the trained inference model includes: determining an advantage based on the reward value, and determining a policy gradient based on the advantage; adjusting policy parameters according to the gradient direction of the policy gradient to obtain the trained inference model.
[0090] According to one aspect of this disclosure, a task processing apparatus is provided, comprising: an input information acquisition module and a model inference module, wherein: the input information acquisition module is used to acquire text information and / or image information of a task to be processed; the model inference module is used to input the text information and / or image information into a trained inference model, and perform model analysis on the text information and / or image information through an inference operation mode matched with the task to be processed, to obtain a processing result corresponding to the text information and / or image information; wherein the trained inference model is trained according to any one of the model training methods described above.
[0091] In one exemplary embodiment of this disclosure, the step of performing model analysis on the text information and / or image information by using an inference operation mode matched with the task to be processed to obtain a processing result corresponding to the text information and / or image information includes: determining the task difficulty of the task to be processed; determining an inference operation mode matched with the task to be processed based on the task difficulty; performing model analysis on the text information and / or image information of the task to be processed based on the inference chain corresponding to the inference operation mode; and determining a processing result corresponding to the text information and / or image information.
[0092] It should be noted that the specific details of each part of the above-mentioned model training apparatus and task processing apparatus have been described in detail in the corresponding method embodiments. For details not disclosed here, refer to the method embodiments; they are not repeated here.
[0093] Exemplary embodiments of this disclosure also provide an electronic device. This electronic device may be the aforementioned terminal device or server. Generally, the electronic device may include a processor and a memory, the memory for storing executable instructions of the processor, and the processor configured to perform the aforementioned task processing method by executing the executable instructions. Furthermore, the electronic device may also include a display for displaying an operating interface.
[0094] The electronic device is described below as an example in the form of a general-purpose computing device. This electronic device is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0095] The components of an electronic device may include, but are not limited to: at least one processing unit, at least one storage unit, a bus connecting different system components (including storage units and processing units), and a display unit.
[0096] The storage unit stores program code that can be executed by the processing unit, causing the processing unit to perform the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of this disclosure. For example, the processing unit can perform the steps shown in Figure 2.
[0097] The storage unit may include readable media in the form of volatile storage units, such as random access memory (RAM) and / or cache storage units, and may further include read-only memory (ROM).
[0098] The storage unit may also include a program / utility having a set (at least one) of program modules, including but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of these examples may include an implementation of a network environment.
[0099] The bus can represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus that uses any of a variety of bus structures.
[0100] The electronic device can also communicate with one or more external devices (e.g., keyboards, pointing devices, Bluetooth devices, etc.), one or more devices that enable a user to interact with the electronic device, and / or any device that enables the electronic device to communicate with one or more other computing devices (e.g., routers, modems, etc.). This communication can be achieved through input / output (I / O) interfaces. Furthermore, the electronic device can communicate with one or more networks (e.g., local area networks (LANs), wide area networks (WANs), and / or public networks, such as the Internet) via a network adapter. As shown in the figure, the network adapter communicates with other modules of the electronic device via a bus. It should be understood that, although not shown in the figure, other hardware and / or software modules can be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
[0101] It should be noted that some embodiments of this disclosure also provide a computer program product, which includes a computer program that implements the above-described method when executed by a processor.
[0102] In one embodiment, the computer program product can be a tangible product containing a computer program, such as a computer-readable storage medium storing the computer program. The readable storage medium can be a storage medium based on electrical, magnetic, optical, electromagnetic, infrared, or other signals, including but not limited to: random access memory (RAM), read-only memory (ROM), magnetic tape, floppy disk, flash memory, hard disk drive (HDD), solid-state drive (SSD), etc. For example, the computer program product can be implemented as a non-volatile storage medium storing the computer program, such as read-only memory, NAND flash memory, etc.
[0103] In one implementation, the computer program product can be an intangible product containing a computer program. For example, the computer program product can be implemented as a virtual digital product, such as an executable file, installation package, or other digital file storing the computer program.
[0104] Computer program code can be written in one or more programming languages. Examples of programming languages include C, Java, and C++. Program code can execute entirely on the user's computing device, partially on the user's computing device, or as a standalone software package. It can also execute partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, such as a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via an internet connection provided by a mobile network operator).
[0105] Computer programs can be carried or transmitted via signals such as electrical, magnetic, optical, electromagnetic, and infrared rays. Electronic devices can convert signals carrying computer programs into digital signals, thereby running the computer programs. When a computer program runs on an electronic device, its code is used to cause the electronic device to execute (more specifically, to be executed by the processor of the electronic device) the method steps of various exemplary embodiments of this disclosure.
[0106] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, terminal device, or network device, etc.) to execute the methods according to the embodiments of this disclosure.
[0107] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the claims.
[0108] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A model training method, characterized in that it comprises: obtaining sample input information for an inference task, and concatenating the sample input information with specified information represented by a no-inference prefix to obtain intermediate input information, wherein the specified information is used to reduce redundant inference by the inference model during execution of the inference task; analyzing the intermediate input information through multiple inference operation modes of the inference model to determine training samples; grouping the training samples and, if the inference accuracy of each group of training samples meets an accuracy threshold, triggering an inference length penalty of the inference model to determine a reward function; and adjusting model parameters of the inference model based on the reward function to obtain a trained inference model, wherein the trained inference model is used to analyze and process text information and / or image information of a task to be processed through an inference operation mode that matches the task to be processed, and to generate processing results;
wherein analyzing the intermediate input information through multiple inference operation modes of the inference model to determine training samples comprises: analyzing and processing the intermediate input information according to the multiple inference operation modes of the inference model to generate multiple answer information corresponding to the multiple inference operation modes, so as to determine the training samples, wherein the inference operation modes include a zero-thinking sampling mode and a self-recovering inference sampling mode, and the inference chain lengths of the multiple inference operation modes are different; and wherein triggering the inference length penalty of the inference model to determine the reward function comprises: for each training sample, obtaining the longest and shortest answer information from the multiple answer information included in the training sample; determining the length difference between the longest and shortest answer information, and calculating the overlength ratio corresponding to the training sample based on the ratio of the length difference to a pre-configured penalty window length; determining a penalty coefficient for the training sample, and determining the penalty based on the overlength ratio and the penalty coefficient; and determining the reward function based on the answer correctness indicator function and the penalty.
2. The model training method according to claim 1, characterized in that determining the penalty coefficient for the training sample comprises: if the inference accuracy is greater than or equal to the accuracy threshold, determining the penalty coefficient based on the inference accuracy, a fixed parameter, and the accuracy threshold.
3. The model training method according to claim 1, characterized in that adjusting the model parameters of the inference model based on the reward function to obtain a trained inference model comprises: determining a reward value through the reward function; and performing reinforcement learning on the inference model based on the reward value to determine the trained inference model.
4. The model training method according to claim 3, characterized in that performing reinforcement learning on the inference model based on the reward value to determine the trained inference model comprises: determining an advantage based on the reward value, and determining a policy gradient based on the advantage; and adjusting the model parameters of the inference model according to the gradient direction of the policy gradient to obtain the trained inference model.
5. A task processing method, characterized in that it comprises: obtaining text information and / or image information of a task to be processed; and inputting the text information and / or image information into a trained inference model, and performing model analysis on the text information and / or image information through an inference operation mode that matches the task to be processed, so as to obtain a processing result corresponding to the text information and / or image information; wherein the trained inference model is trained by the model training method according to any one of claims 1-4.
6. The task processing method according to claim 5, characterized in that performing model analysis on the text information and / or image information through an inference operation mode that matches the task to be processed, so as to obtain the processing result corresponding to the text information and / or image information, comprises: determining the task difficulty of the task to be processed; determining, based on the task difficulty, an inference operation mode that matches the task to be processed; and performing model analysis on the text information and / or image information of the task to be processed based on the inference chain corresponding to the inference operation mode, to determine the processing result corresponding to the text information and / or image information.
7. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the model training method according to any one of claims 1-4 or the task processing method according to any one of claims 5-6.
8. An electronic device, characterized in that it comprises: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the model training method according to any one of claims 1-4 or the task processing method according to any one of claims 5-6 by executing the executable instructions.