A text model reinforcement learning training method and a reinforcement learning training device
By hierarchically and heterogeneously allocating and dynamically adjusting the precision of the reinforcement learning model, the problem of balancing training cost and inference performance is solved, achieving precision alignment between training and inference and improving the overall performance of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI XIYU JIZHI TECH CO LTD
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, reinforcement learning training cannot achieve low training cost and high inference performance at the same time. The probability distribution of output tokens differs significantly between the training and inference phases, so inference performance falls short of the optimum.
Each layer of the model to be trained is initially quantized, and the quantization precision is then adjusted dynamically by comparing quantization parameters and changes in a probability correlation parameter against preset thresholds. This yields a hierarchical, heterogeneous precision allocation and aligns precision between training and inference.
The method reduces training cost while significantly improving model inference performance, aligns precision between training and inference, and improves training efficiency and effectiveness.
Figure CN122242629A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of model training technology, and in particular to a text model reinforcement learning training method and a reinforcement learning training device.
Background Technology
[0002] In reinforcement learning training, quantization techniques are typically used to optimize the utilization of GPU memory and bandwidth resources, increase data processing throughput, and thus reduce training costs and improve training efficiency. This technique effectively reduces GPU memory usage and communication overhead by representing model parameters and activation values with low-precision data types, thereby significantly improving training efficiency. However, high-precision data types are still required during the model deployment and inference phase to ensure generation quality.
[0003] However, although this differentiated precision configuration strikes a certain balance between training cost and inference performance, it also limits how far inference performance can be optimized, so the gains from reinforcement learning training are not fully realized in the inference phase. For the same training data and model parameters, the probability distribution of the tokens generated in the training phase differs significantly from that in the inference phase. This makes it difficult to align the parameter optimization objective of reinforcement learning with the content actually generated at inference, so inference performance fails to reach the optimal level achieved during training. How to improve model inference performance while reducing training cost has therefore become a technical problem that urgently needs to be solved.
Summary of the Invention
[0004] In view of this, the purpose of this application is to provide a text model reinforcement learning training method and reinforcement learning training device, which not only ensures training efficiency and reduces training costs with low-precision training, but also achieves precision alignment between training and inference by accurately adjusting the precision of key levels, significantly improving the final effect of reinforcement learning training, and effectively solving the technical problem of not being able to balance training cost and inference performance in the prior art.
[0005] In a first aspect, embodiments of this application provide a text model reinforcement learning training method, comprising: training the model to be trained with training text data and, for at least one level in the model to be trained, quantizing the model parameters and / or activation values corresponding to that level according to an initial quantization precision; determining the quantization parameter corresponding to each model parameter and / or activation value, comparing the quantization parameter with a preset quantization threshold, and adjusting the quantization precision corresponding to the model parameters and / or activation values at that level based on the comparison result; determining the change, from before to after the quantization precision adjustment, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with the quantization precision and with full precision respectively; and, based on the comparison between the change in the probability correlation parameter value and a preset change threshold, determining the target quantization precision corresponding to the model parameters and / or activation values of at least one level, and training the model to be trained with the target quantization precision to obtain the trained target model.
[0006] Furthermore, the quantization parameters include a scaling parameter, wherein the scaling parameter is the ratio of the quantized numerical range to the unquantized numerical range; the step of comparing the quantization parameters with a preset quantization threshold, and adjusting the quantization precision corresponding to the model parameters and / or activation values at this level based on the comparison result, includes: when the scaling parameter corresponding to the model parameters and / or activation values of this level is less than or equal to the preset quantization threshold, and this occurs at least a first preset number of times or in at least a first preset proportion of cases, raising the quantization precision level corresponding to the model parameters and / or activation values of this level and / or expanding the quantization range.
[0007] Furthermore, the quantization parameters include a precision loss parameter, wherein the precision loss parameter is the maximum number of values that differ before quantization but become identical after quantization, or the ratio of that maximum number to the number of quantized parameters of that layer; the step of comparing the quantization parameters with a preset quantization threshold and adjusting the quantization precision corresponding to the model parameters and / or activation values of that layer based on the comparison result includes: when the precision loss parameter corresponding to the model parameters and / or activation values at this level is greater than the preset quantization threshold, and this occurs at least a second preset number of times or in at least a second preset proportion of cases, raising the quantization precision level corresponding to the model parameters and / or activation values at this level and / or expanding the precision range.
[0008] Furthermore, the preset quantization threshold and / or the preset change threshold are dynamic thresholds, and the text model reinforcement learning training method further includes: determining, based on the current training stage and / or the current training task type, the preset quantization threshold and / or preset change threshold corresponding to the current training stage and / or the current training task type; and determining, based on that preset quantization threshold and / or preset change threshold, the quantization precision corresponding to the model parameters and / or activation values of each level of the model to be trained.
[0009] Furthermore, the text model reinforcement learning training method also includes: when the training process of the model to be trained reaches a calibration trigger condition, inputting validation text data into the model to be trained and having it perform full-precision inference and quantized-precision inference respectively, to obtain the correlation between the probability distributions of its output lexical units in the two precision modes; and determining the distribution deviation between model training and model inference based on that correlation, and dynamically adjusting the preset quantization threshold and / or the preset change threshold based on the distribution deviation.
[0010] Furthermore, the text model reinforcement learning training method also includes: upgrading or downgrading, in turn, the quantization precision corresponding to the model parameters and / or activation values of at least one level, and determining the test change, from before to after the upgrade or downgrade, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with the quantization precision and with full precision; and, based on the comparison between the test change in the probability correlation parameter value and a preset test change threshold, determining the test quantization precision corresponding to the model parameters and / or activation values of at least one level, and training the model to be trained with the test quantization precision to obtain the trained target model.
[0011] Furthermore, the text model reinforcement learning training method also includes: upgrading or downgrading, in turn, the quantization precision corresponding to the model parameters and / or activation values of at least one level, and determining the test change, from before to after the upgrade or downgrade, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with the quantization precision and with full precision; constructing, based on the test changes in the probability correlation parameter values, a weight distribution of the impact of the quantization precision of each level on model performance; and determining, based on the weight distribution, the preset quantization threshold and / or preset change threshold corresponding to the model parameters and / or activation values of each level respectively.
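The weight distribution in [0011] can be pictured with a minimal sketch. The text does not say how the weights are computed or how they map to thresholds, so both rules below are assumptions made purely for illustration: weights are normalised absolute test changes, and a more sensitive level receives a lower change threshold. The function names `sensitivity_weights` and `per_layer_thresholds` are hypothetical.

```python
def sensitivity_weights(test_changes):
    """Normalise per-level absolute correlation changes into weights summing to 1.

    test_changes maps a level name to the measured test change in the
    probability correlation parameter (assumed normalisation rule).
    """
    magnitudes = {level: abs(delta) for level, delta in test_changes.items()}
    total = sum(magnitudes.values())
    if total == 0.0:
        # No level reacted to the precision change: fall back to uniform weights.
        return {level: 1.0 / len(magnitudes) for level in magnitudes}
    return {level: m / total for level, m in magnitudes.items()}


def per_layer_thresholds(weights, base_threshold):
    """Assumed rule: the more sensitive a level, the lower its change threshold."""
    n = len(weights)
    return {level: base_threshold / (1.0 + n * w) for level, w in weights.items()}


weights = sensitivity_weights({"layer0": 0.04, "layer1": 0.01, "layer2": 0.0})
thresholds = per_layer_thresholds(weights, base_threshold=0.02)
```

Under this assumed rule, a level whose precision strongly affects the training/inference correlation clears its threshold more easily and therefore keeps a higher precision more often.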
[0012] Furthermore, the text model reinforcement learning training method also includes: Every preset number of training steps, or when the model gradient at the current quantization precision converges below a preset value, a search method is executed to alternately upgrade or downgrade the quantization precision corresponding to at least one level of model parameters and / or activation values.
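The two trigger conditions in [0012] amount to a simple predicate. The sketch below assumes the gradient is summarised by a single norm value; the function name and the default values of `every_n_steps` and `grad_floor` are hypothetical and not taken from the application.

```python
def should_run_precision_search(step, grad_norm, every_n_steps=500, grad_floor=1e-3):
    """Return True when either trigger condition described in the text is met.

    step          -- current training step (0-based or 1-based, caller's choice)
    grad_norm     -- gradient norm measured at the current quantization precision
    every_n_steps -- assumed period of the scheduled search
    grad_floor    -- assumed value below which the gradient counts as converged
    """
    periodic = step > 0 and step % every_n_steps == 0
    converged = grad_norm < grad_floor
    return periodic or converged
```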
[0013] Furthermore, the text model reinforcement learning training method also includes: Obtain the teacher model corresponding to the model to be trained; wherein the teacher model is a model with full-precision data format; In each training step of the model to be trained, a portion of the training text is randomly selected from the training text data with a preset sampling probability and input into the teacher model to obtain the first word probability distribution of the portion of the training text under full precision calculation, and at the same time, the second word probability distribution of the model to be trained on the portion of the training text under a determined quantization precision is obtained. The distillation loss is calculated based on the difference between the first word probability distribution and the second word probability distribution; The distillation loss is introduced as a regularization term into the reinforcement learning objective function to form a joint optimization objective function, and the model parameters of the model to be trained are updated based on the joint optimization objective function.
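The joint objective in [0013] can be sketched as follows. The text only says that the distillation loss is based on the difference between the two word probability distributions; using the KL divergence for that difference, and a fixed weight `distill_weight` for the regularization term, are assumptions made here for illustration. All function names are hypothetical.

```python
import math
import random


def select_for_distillation(texts, sample_prob, rng):
    """Pick each training text independently with the preset sampling probability."""
    return [t for t in texts if rng.random() < sample_prob]


def kl_divergence(p_teacher, q_student, eps=1e-12):
    """KL(p || q) between two probability distributions over the same vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(p_teacher, q_student)
        if p > 0.0
    )


def joint_objective(rl_loss, teacher_dists, student_dists, distill_weight=0.1):
    """RL loss plus the distillation term averaged over the sampled positions."""
    if not teacher_dists:
        return rl_loss  # nothing was sampled in this step
    distill = sum(
        kl_divergence(p, q) for p, q in zip(teacher_dists, student_dists)
    ) / len(teacher_dists)
    return rl_loss + distill_weight * distill


rng = random.Random(0)
sampled = select_for_distillation(["text a", "text b", "text c"], 0.5, rng)
loss = joint_objective(1.5, [[0.5, 0.5]], [[0.9, 0.1]])
```

Here the teacher distributions would come from the full-precision model and the student distributions from the model under training at its current quantization precision.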
[0014] In a second aspect, embodiments of this application also provide a text model reinforcement learning training device, comprising: a quantization processing module, used to train the model to be trained with training text data and, for at least one level in the model to be trained, to quantize the model parameters and / or activation values corresponding to that level according to an initial quantization precision; a precision adjustment module, used to determine the quantization parameter corresponding to each model parameter and / or activation value, compare the quantization parameter with a preset quantization threshold, and adjust the quantization precision corresponding to the model parameters and / or activation values at that level according to the comparison result; a change determination module, used to determine the change, from before to after the quantization precision adjustment, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with the quantization precision and with full precision respectively; and a model training module, used to determine the target quantization precision corresponding to the model parameters and / or activation values of at least one level based on the comparison between the change in the probability correlation parameter value and the preset change threshold, and to train the model to be trained with the target quantization precision to obtain the trained target model.
[0015] This application provides a text model reinforcement learning training method and device. First, training text data is used to train the model to be trained, and for at least one level in the model to be trained, the model parameters and / or activation values corresponding to that level are quantized according to an initial quantization precision. Then, the quantization parameter corresponding to each model parameter and / or activation value is determined and compared with a preset quantization threshold, and the quantization precision corresponding to the model parameters and / or activation values of that level is adjusted based on the comparison result. Next, the change, from before to after the quantization precision adjustment, in the probability correlation parameter value of each output word obtained under inference with the quantization precision and with full precision is determined. Finally, based on the comparison between the change in the probability correlation parameter value and the preset change threshold, the target quantization precision corresponding to the model parameters and / or activation values of at least one level is determined, and the model to be trained is trained with the target quantization precision to obtain the trained target model.
[0016] This application achieves hierarchical heterogeneous precision allocation by initially quantizing model parameters and / or activation values at each level of the model, dynamically adjusting the quantization precision based on the quantization parameters, comparing the changes in probability correlation before and after precision adjustment, and determining the target quantization precision. This approach ensures training efficiency and reduces training costs through low-precision training, while precisely adjusting the precision of key levels to align training and inference precision, significantly improving the final effect of reinforcement learning training. It effectively solves the technical challenge of balancing training cost and inference performance in existing technologies.
[0017] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a flowchart illustrating a text model reinforcement learning training method provided in an embodiment of this application; Figure 2 is a schematic diagram of the structure of a text model reinforcement learning training device provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of this application. Based on the embodiments of this application, every other embodiment obtained by those skilled in the art without inventive effort falls within the scope of protection of this application.
[0021] First, the applicable scenarios for this application will be introduced. This application can be applied to the field of model training technology.
[0022] In reinforcement learning training, quantization techniques are typically used to optimize the utilization of GPU memory and bandwidth resources, increase data processing throughput, and thus reduce training costs and improve training efficiency. This technique effectively reduces GPU memory usage and communication overhead by representing model parameters and activation values with low-precision data types, thereby significantly improving training efficiency. However, high-precision data types are still required during the model deployment and inference phase to ensure generation quality.
[0023] Research has revealed that while this differentiated precision configuration achieves a certain balance between training costs and inference performance, it also limits the optimization space for model inference performance, preventing the training effects of reinforcement learning from being fully realized in the inference phase. For the same training data and model parameters, the probability distribution of word terms generated in the training and inference phases differs significantly. This makes it difficult to align the parameter optimization objectives of reinforcement learning with the actual content generated in the inference phase, resulting in inference performance failing to reach the optimal level achieved during training. Therefore, how to improve model inference performance while reducing training costs has become a critical technical challenge that urgently needs to be addressed.
[0024] Based on this, the embodiments of this application provide a text model reinforcement learning training method that ensures training efficiency and reduces training costs with low-precision training, and achieves precision alignment between training and inference by accurately adjusting the precision of key levels, significantly improving the final effect of reinforcement learning training, and effectively solving the technical problem in the prior art that training costs and inference performance cannot be balanced.
[0025] Please refer to Figure 1, which is a flowchart illustrating a text model reinforcement learning training method provided in an embodiment of this application. As shown in Figure 1, the text model reinforcement learning training method provided in the embodiments of this application includes: S101, using training text data to train the model to be trained and, for at least one level in the model to be trained, quantizing the model parameters and / or activation values corresponding to that level according to the initial quantization precision.
[0026] Here, the initial quantization precision uses low-precision data types, including but not limited to FP16, BF16, and even FP8. The core purpose is to save GPU memory and bandwidth resources, improve data throughput, and reduce training costs. Here, model parameters refer to the KQV matrix values of each attention head in each layer of the model, and activation values refer to the input and output values of each attention head in each layer of the model, i.e., the latent vector corresponding to each token in each layer. During training, the main body of the model maintains low quantization precision, or the parameters and activation values of each layer of the model are quantized, stored, and calculated. Simultaneously, the scaling ratio and value distribution range of the parameters and activation values of each layer are recorded to provide data support for subsequent precision adjustments.
[0027] Regarding step S101 above, in specific implementation, the training text data is used to train the model to be trained. For at least one level in the model to be trained, the model parameters and / or activation values corresponding to that level are quantized according to the initial quantization precision. The parameters of each attention head of each layer and the activation values of each token of each layer can be quantized separately to maximize the precision retained after quantization.
[0028] Please refer to Table 1 below. Table 1 is a feature difference table of different precision data types provided in the embodiments of this application. As shown in Table 1 below, FP32 has a wide range and high precision, FP16 has a narrow range and relatively high precision, and BF16 has a wide range (the same as FP32) and low precision (16 fewer binary bits of precision than FP32, and 3 fewer binary bits of precision than FP16).
[0029] Table 1. Characteristic differences of data types with different precisions
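The range and precision properties that Table 1 compares follow from each format's standard bit layout. The following is a minimal sketch rather than a reproduction of Table 1: the class and field names are illustrative, and the bit widths are the standard ones for IEEE 754 single precision (FP32), IEEE 754 half precision (FP16), and bfloat16 (BF16).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exponent_bits: int
    mantissa_bits: int  # explicitly stored fraction bits

    @property
    def max_finite(self) -> float:
        """Largest finite value of an IEEE-style format with this layout."""
        bias = 2 ** (self.exponent_bits - 1) - 1
        return (2.0 - 2.0 ** -self.mantissa_bits) * 2.0 ** bias

    @property
    def epsilon(self) -> float:
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.mantissa_bits


FORMATS = [
    FloatFormat("FP32", exponent_bits=8, mantissa_bits=23),
    FloatFormat("FP16", exponent_bits=5, mantissa_bits=10),
    FloatFormat("BF16", exponent_bits=8, mantissa_bits=7),
]

for fmt in FORMATS:
    print(f"{fmt.name}: max ~ {fmt.max_finite:.4e}, epsilon = {fmt.epsilon:.3e}")
```

The mantissa widths reproduce the differences stated in [0028]: BF16 stores 16 fewer fraction bits than FP32 and 3 fewer than FP16, while sharing the 8-bit exponent, and hence the range, of FP32.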
[0030] S102, determine the quantization parameter corresponding to each model parameter and / or activation value, compare the quantization parameter with a preset quantization threshold, and adjust the quantization precision corresponding to the model parameter and / or activation value at that level according to the comparison result.
[0031] Regarding step S102 above, in specific implementation, the quantization parameter corresponding to each model parameter and / or activation value is determined, and the quantization parameter is compared with a preset quantization threshold. Based on the comparison result, the quantization accuracy corresponding to the model parameter and / or activation value at that level is adjusted.
[0032] Here, adjustments include raising and lowering precision, for example moving the quantization precision level step by step among FP32, FP16, and FP8. Adjustments can also trade numerical range against precision. For example, adjusting from BF16 to FP16 reduces the range and increases precision, which suits cases where parameter values are close together and several similar parameters are quantized to the same value. Adjusting from FP16 to BF16 expands the range and decreases precision, which suits cases where parameter values differ greatly. Comparing the quantization parameters with the preset quantization thresholds makes it possible to determine accurately whether the current quantization precision of the model parameters and / or activation values at a level suits the parameter characteristics of that level, and thus to adjust the quantization precision of each level precisely and dynamically.
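The two kinds of adjustment described in [0032] can be sketched as a small precision ladder plus a range/precision swap. The ladder order and the helper names are illustrative assumptions; the application does not prescribe a data structure.

```python
# Ordered from lowest to highest precision level, as named in the text.
LADDER = ("FP8", "FP16", "FP32")

# Same bit width, different trade-off: FP16 has more precision, BF16 more range.
RANGE_PRECISION_SWAP = {"BF16": "FP16", "FP16": "BF16"}


def step_precision(current, direction):
    """Move one rung up (+1) or down (-1) the ladder, clamping at both ends."""
    if current not in LADDER:
        raise ValueError(f"{current!r} is not on the precision ladder")
    idx = LADDER.index(current) + direction
    return LADDER[max(0, min(idx, len(LADDER) - 1))]


def swap_range_for_precision(current):
    """Switch between BF16 and FP16; other formats are returned unchanged."""
    return RANGE_PRECISION_SWAP.get(current, current)
```

For example, `step_precision("FP8", +1)` moves a level from FP8 to FP16, and `swap_range_for_precision("BF16")` selects FP16 when closely spaced values need more mantissa bits.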
[0033] As an optional embodiment, the quantization parameter includes a scaling parameter, wherein the scaling parameter is the ratio of the numerical range after quantization to the numerical range before quantization. Specifically, regarding step S102 above, comparing the quantization parameter with a preset quantization threshold and adjusting the quantization precision corresponding to the model parameters and / or activation values at this level based on the comparison result includes: when the scaling parameter corresponding to the model parameters and / or activation values of this level is less than or equal to the preset quantization threshold, and this occurs at least a first preset number of times or in at least a first preset proportion of cases, raising the quantization precision level corresponding to the model parameters and / or activation values of this level and / or expanding the quantization range.
[0034] Here, the scaling parameter is the ratio of the range of values after quantization to the range of values before quantization, and its magnitude directly reflects how strongly parameters or activation values are compressed during quantization. The smaller the scaling parameter, the more severely the range after quantization is compressed relative to the range before quantization. A large number of model parameters and / or activation values may then be severely compressed, and several values that differ before quantization may collapse into the same value after quantization, causing serious precision loss and impairing the precision alignment between training and inference. The first preset number of occurrences can be 1, or it can be set dynamically: it can decrease gradually as training progresses, or decrease in step with the preset quantization threshold. This application does not specifically limit this.
[0035] In specific implementation, when the scaling parameter corresponding to the model parameters and / or activation values at this level is less than or equal to the preset quantization threshold, and the number of occurrences of this non-compliance is greater than or equal to the first preset number of occurrences, or the proportion of occurrences is greater than or equal to the first preset proportion, it is necessary to increase the quantization precision level of the model parameters and / or activation values at this level (e.g., upgrade from FP8 to FP16, FP16 to FP32, etc.), and / or expand its quantization range (e.g., expand from FP16 to BF16, etc.) to reduce the precision loss caused by quantization compression, ensure that the parameters and activation values at this level can accurately reflect the model inference state, and achieve precision alignment between training and inference.
[0036] Here, regarding scaling parameters, taking FP32 quantization to FP16 as an example, although the numerical range and precision of FP32 are much greater than those of FP16, the parameters of each attention head in each layer and the activation values of each token in each layer are not uniformly distributed across the entire range of FP32. For example, a matrix with a dimension of 4 contains 4 parameters, namely [1.22324234, 2.41234232, 3.41234243, 4.32141234]. During quantization, the parameters of each attention head in each layer and the activation values of each token in each layer can be quantized separately to maximize the precision after quantization. Separate quantization means determining the scaling factor based on the maximum and minimum values of the quantization target and the maximum range after quantization. In this example, the minimum parameter value can be quantized to -65500 and the maximum value can be quantized to 65500, thus determining a scaling factor greater than 1. This indicates that the parameter value is amplified during quantization. Although some precision is lost, the parameter is not compressed.
[0037] If the four parameters of a matrix in another layer are [-1.22324234 × 10^30, 2.41234232 × 10^20, 3.41234243 × 10^10, 4.32141234 × 10^30], then the minimum parameter value is again quantized to -65500 and the maximum to 65500, which gives a scaling factor extremely close to 0, below the preset quantization threshold. This indicates that the parameter values are scaled down during quantization, losing precision and being severely compressed. If this occurs at least the first preset number of times or in at least the first preset proportion of steps, for example twice in a row or in more than 50% of 10 training steps, the parameter is prone to extreme values and would be severely compressed if quantized, so it can be restored to full precision, without quantization, in the next training step.
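The two scaling examples in [0036] and [0037] can be checked with a short sketch. It assumes the min/max mapping onto ±65500 described in the text; the threshold of 1.0, the window-based counting (the text also mentions consecutive occurrences), and the function names are illustrative assumptions.

```python
FP16_LIMIT = 65500.0  # symmetric target range used in the text's example


def scaling_parameter(values, limit=FP16_LIMIT):
    """Ratio of the quantized numerical range to the unquantized numerical range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return float("inf")  # degenerate group: there is no range to compress
    return (2.0 * limit) / (hi - lo)


def should_raise_precision(scale_history, threshold, min_count=2, min_ratio=0.5):
    """True when scaling parameters at or below the threshold recur often enough.

    scale_history holds the scaling parameter observed at each recent step.
    """
    if not scale_history:
        return False
    hits = sum(1 for s in scale_history if s <= threshold)
    return hits >= min_count or hits / len(scale_history) >= min_ratio


GROUP_A = [1.22324234, 2.41234232, 3.41234243, 4.32141234]
GROUP_B = [-1.22324234e30, 2.41234232e20, 3.41234243e10, 4.32141234e30]

scale_a = scaling_parameter(GROUP_A)  # greater than 1: values are amplified
scale_b = scaling_parameter(GROUP_B)  # extremely close to 0: severe compression
```

With a threshold of 1.0, a level that produces `scale_b` in two of its recent steps would be flagged for a precision upgrade or a return to full precision, while a level that only produces `scale_a` would stay at low precision.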
[0038] As another optional embodiment, the quantization parameters include a precision loss parameter, wherein the precision loss parameter is the maximum number of values that differ before quantization but become identical after quantization, or the ratio of that maximum number to the number of quantized parameters of that layer. Specifically, regarding step S102 above, comparing the quantization parameters with a preset quantization threshold and adjusting the quantization precision corresponding to the model parameters and / or activation values of that layer based on the comparison result includes: when the precision loss parameter corresponding to the model parameters and / or activation values at this level is greater than the preset quantization threshold, and this occurs at least a second preset number of times or in at least a second preset proportion of cases, raising the quantization precision level corresponding to the model parameters and / or activation values at this level and / or expanding the precision range.
[0039] The second preset number of times here can be 1, and this application does not make a specific limitation on this.
[0040] In the implementation of the above steps, when the precision loss parameter corresponding to the model parameters and / or activation values at this level is greater than the preset quantization threshold, and this occurs at least the second preset number of times or in at least the second preset proportion of cases, the quantization precision level of the model parameters and / or activation values at this level needs to be raised and / or its precision range expanded. This relieves the excessive concentration of the parameter distribution after quantization, prevents the situation in which the gradient calculated from the reinforcement learning reward cannot effectively update the parameters and training saturates, ensures that the parameters at this level can accurately carry training information, and keeps training and inference precision matched.
[0041] Here, as an example of the precision loss parameter, suppose the four parameters are [-1.22324234 × 10^30, 2.41234232 × 10^30, 3.41234243 × 10^30, 4.32141234 × 10^30]. Although the last three parameters differ significantly before scaling, after scaling they may all correspond to a single quantized value because of the smaller representation range, so the distribution becomes more concentrated and the values become indistinguishable. Alternatively, suppose the four parameters are [1.22324234, 4.41234232, 4.41234243, 4.41234214]. The last three differ only slightly before scaling, and after scaling they may likewise all correspond to a single quantized value because of the lower quantization precision. In either case, the number of parameters that correspond to the same value after scaling exceeds the preset value, or their proportion exceeds the preset proportion, and they can no longer be distinguished. The gradient calculated from the reinforcement learning reward then cannot effectively optimize these parameters. If this occurs at least the second preset number of times or in at least the second preset proportion of steps, for example twice in a row or in more than 50% of 10 training steps, the parameters have reached training saturation at the current quantization precision and further training at that precision is ineffective. In the next training step, the quantization precision level can be raised, for example from FP8 to FP16, or restored to full precision without quantization, or the precision range can be expanded, for example from BF16 to FP16.
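The precision loss parameter can be illustrated on the second example group from [0041]. This is a minimal sketch under stated assumptions: values are mapped linearly onto ±65500 as in the scaling example, then rounded to IEEE half precision with Python's `struct` format `e`; the function names are hypothetical and the application does not prescribe this exact procedure.

```python
import struct
from collections import defaultdict


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_group(values, limit=65500.0):
    """Map the group's min/max onto -limit/+limit, then round to FP16."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    scale = (2.0 * limit) / (hi - lo)
    return [to_fp16(-limit + (v - lo) * scale) for v in values]


def precision_loss(values, limit=65500.0):
    """Return (count, ratio) for the largest set of distinct original values
    that share one quantized value."""
    buckets = defaultdict(set)
    for original, quantized in zip(values, quantize_group(values, limit)):
        buckets[quantized].add(original)
    worst = max(len(originals) for originals in buckets.values())
    return worst, worst / len(values)


close_values = [1.22324234, 4.41234232, 4.41234243, 4.41234214]
count, ratio = precision_loss(close_values)  # the last three values collide
```

In this sketch the three nearly equal values land on the same FP16 value, so three of the four parameters become indistinguishable, which is the saturation condition the paragraph describes.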
[0042] S103. Compare, before and after the quantization precision adjustment, the change in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with quantization precision and with full precision, respectively.
[0043] Here, the probability correlation parameter can be the Pearson correlation coefficient, which measures the degree of matching between the probabilities of the model output words during training (quantization precision) and inference (full precision). Each point used to compute the Pearson correlation coefficient corresponds to the ratio of the probabilities of one output word under the two precision conditions, for example quantization precision probability / full precision probability. If the ratio is 1, the point falls on the diagonal; if it is greater than 1, the point falls above the diagonal; if it is less than 1, the point falls below the diagonal. The correlation parameter value is ultimately determined based on the distance of all points from the diagonal.
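As a rough illustration of the probability correlation parameter, the sketch below computes the standard Pearson correlation coefficient between per-word probabilities obtained at quantization precision and at full precision, together with the per-word probability ratios. The probability values are made-up sample data, and the use of the textbook Pearson formula (rather than a distance-from-diagonal statistic) is an assumption.

```python
import math

def pearson(xs, ys):
    """Standard Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def probability_ratios(quant_probs, full_probs):
    """Per-word ratio: 1 is on the diagonal, >1 above it, <1 below it."""
    return [q / f for q, f in zip(quant_probs, full_probs)]

# Hypothetical probabilities of the same output words under both precisions.
full_probs = [0.90, 0.40, 0.75, 0.10, 0.60]
quant_probs = [0.88, 0.43, 0.70, 0.12, 0.61]

r = pearson(quant_probs, full_probs)
ratios = probability_ratios(quant_probs, full_probs)
```

A value of `r` close to 1 indicates that the quantized training-time probabilities track the full-precision inference-time probabilities closely.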
[0044] S104. Based on the comparison between the change in the probability correlation parameter value and the preset change threshold, determine the target quantization accuracy corresponding to at least one level of model parameters and / or activation values, and train the model to be trained with the target quantization accuracy to obtain the trained target model.
[0045] Regarding step S104 above, in specific implementation, inference is performed on the same training sample data before and after the precision change, and the correlation between the word probability output in training mode and the word probability output in inference mode is compared. For example, the Pearson correlation coefficient before the precision adjustment is 0.95, and the Pearson correlation coefficient after the precision adjustment is 0.99. If the increase exceeds the preset change threshold or the corresponding proportion, the computation and storage precision settings after the adjustment are retained; if the preset change threshold is not reached, the improvement in precision does not affect the matching between training and inference and only increases the computation and storage costs, so low-precision computation and storage can continue to be used. After the precision adjustment, the model to be trained is trained with the target precision to obtain the trained target model. By comparing the change in the correlation parameter values before and after the precision adjustment, it can be determined whether the adjustment is effective. If the correlation increases after the adjustment, the adjustment helps to align the word probability distributions of training and inference; if the correlation does not increase beyond the preset value, or decreases, the adjustment is ineffective and would only increase the training cost. In that case there is no need to increase the precision at this level, and low-precision training can be maintained.
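The decision rule in step S104 can be sketched as follows. The function name `select_target_precision` and the threshold value 0.02 are hypothetical, and treating a gain equal to the threshold as sufficient is an assumption, since the text does not specify the boundary case.

```python
def select_target_precision(old_precision, new_precision,
                            corr_before, corr_after, change_threshold):
    """Keep the adjusted precision only if the correlation gain reaches
    the preset change threshold; otherwise stay at the cheaper precision."""
    gain = corr_after - corr_before
    return new_precision if gain >= change_threshold else old_precision

# Example from the text: correlation rises from 0.95 to 0.99.
target = select_target_precision("FP8", "FP16", 0.95, 0.99, change_threshold=0.02)
```

With an assumed threshold of 0.02, the gain of about 0.04 keeps the upgraded FP16 setting; a gain of 0.005 or a drop in correlation would leave the level at FP8.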
[0046] In this way, through the above steps, a hierarchical heterogeneous precision allocation scheme is finally formed, which not only retains the efficiency advantage of low-precision training, but also achieves precision alignment between training and inference through high-precision configuration of key levels. While controlling training costs, it significantly improves the upper limit of model inference performance.
[0047] As an optional embodiment, the preset quantization threshold and / or the preset variation threshold are dynamic thresholds, and the text model reinforcement learning training method further includes: A: Based on the current training stage and / or the current training task type, determine the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type.
[0048] B: Based on the current training stage and / or the preset quantization threshold and / or preset variation threshold corresponding to the current training task type, determine the quantization accuracy corresponding to the model parameters and / or activation values of each level of the model to be trained.
[0049] Here, according to the embodiments provided in this application, the preset quantization threshold and / or preset variation threshold are dynamic thresholds. The dynamic threshold setting can adopt an open-loop scheme, that is, different threshold values are set in advance for different model training steps, generally following a trend of high at the beginning and low at the end. A higher threshold is set in the early stages of training, allowing the model to maintain low-precision reinforcement learning training even when the training and inference outputs deviate, which prioritizes training efficiency. In the later stages of training, the threshold is gradually reduced to decrease the deviation between training and inference, gradually shifting from low-precision training to full-precision training, which prioritizes the alignment of training and inference precision.
[0050] Regarding steps A-B above, in specific implementation, based on the current training stage and / or the current training task type, determine the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type. Then, based on the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type, determine the quantization accuracy corresponding to the model parameters and / or activation values of each level of the model to be trained.
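The open-loop threshold scheme of steps A-B can be sketched as a schedule that starts high and ends low. The linear decay shape, the task-type table `TASK_RANGES`, and all numeric values below are assumptions for illustration; the application only states the high-then-low trend.

```python
# Hypothetical (start, end) threshold ranges per training task type.
TASK_RANGES = {
    "math": (0.04, 0.005),
    "dialogue": (0.08, 0.02),
}

def scheduled_threshold(step, total_steps, start_value, end_value):
    """Linearly move the threshold from start_value to end_value."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_value + (end_value - start_value) * progress

def threshold_for(step, total_steps, task_type):
    """Look up the range for the task type, then apply the stage schedule."""
    start_value, end_value = TASK_RANGES[task_type]
    return scheduled_threshold(step, total_steps, start_value, end_value)

early = threshold_for(0, 10000, "math")
late = threshold_for(10000, 10000, "math")
```

Early in training the looser threshold tolerates more train/inference deviation; by the final step the stricter threshold pushes more levels toward higher precision.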
[0051] As an optional embodiment, the text model reinforcement learning training method further includes: a: When the training process of the model to be trained reaches the calibration trigger condition, the verification text data is input into the model to be trained, and the model to be trained performs full-precision inference and quantized precision inference respectively to obtain the probability distribution correlation of the output lexical units of the model to be trained in different precision modes.
[0052] Here, the calibration trigger condition can be reaching a preset number of training steps, or the model gradient converging to below a preset value at this quantization accuracy. This means that the model performance has reached a certain stage at this quantization accuracy, and the threshold can be lowered to improve accuracy for the next stage of training.
[0053] Regarding step a above, in specific implementation, high-precision calibration can be performed periodically. When the training process of the model to be trained reaches the calibration trigger condition, the verification text data is input into the model to be trained, the model to be trained performs full-precision inference and quantization precision inference respectively, and the probability distribution correlation of the output words under the different precisions is calculated. For example, the probability distribution correlation here can be the Pearson correlation coefficient. The distribution deviation between training and inference is statistically analyzed and used for closed-loop dynamic adjustment of the preset quantization threshold and the preset variation threshold.
[0054] b: Determine the distribution deviation between model training and model inference based on the correlation of the probability distribution, and dynamically adjust the preset quantization threshold and / or the preset variation threshold based on the distribution deviation.
[0055] Regarding step b above, in specific implementation, the distribution deviation between model training and model inference is determined based on the correlation of probability distribution. When the distribution deviation is large, the preset quantization threshold and / or preset variation threshold can be reduced, thereby improving the quantization accuracy of the attention head parameters and activation values of more key layers or modifying them to full precision for training, thus improving the training effect.
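The closed-loop adjustment of steps a-b can be sketched as below. Measuring the distribution deviation as one minus the Pearson correlation, the target correlation of 0.99, and the halving factor are all assumptions; the application only states that thresholds can be reduced when the deviation is large.

```python
def adjust_thresholds(thresholds, correlation,
                      target_correlation=0.99, shrink=0.5, floor=0.0):
    """Lower every threshold when the train/inference deviation is too large.

    thresholds: dict mapping a threshold name to its current value.
    correlation: correlation measured on the verification text data.
    """
    deviation = 1.0 - correlation
    allowed = 1.0 - target_correlation
    if deviation <= allowed:
        return dict(thresholds)      # deviation acceptable, keep thresholds
    return {name: max(floor, value * shrink)
            for name, value in thresholds.items()}

current = {"quantization": 0.4, "variation": 0.02}
tightened = adjust_thresholds(current, correlation=0.90)
unchanged = adjust_thresholds(current, correlation=0.995)
```

Lower thresholds cause more levels to fail the checks and therefore be moved to a higher quantization precision or to full precision at the next adjustment.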
[0056] As an optional embodiment, the text model reinforcement learning training method further includes: I: Alternately upgrade or downgrade the quantization precision corresponding to at least one level of model parameters and / or activation values, and determine the test change in the probability correlation parameter value of each output word corresponding to inference with quantization precision and full precision before and after the upgrade or downgrade.
[0057] II: Based on the comparison between the test variation of the probability correlation parameter value and the preset test variation threshold, determine the test quantization accuracy corresponding to at least one level of model parameters and / or activation values, and train the model to be trained with the test quantization accuracy to obtain the trained target model.
[0058] Regarding steps I-II above, in specific implementation, the quantization precision corresponding to at least one level of model parameters and / or activation values is alternately upgraded or downgraded. The test variation of the probability relevance parameter value for each output word corresponding to inference with quantization precision and full precision before and after the upgrade or downgrade is determined. For example, the probability relevance parameter value here can be the Pearson correlation coefficient. Then, based on the comparison between the test variation of the probability relevance parameter value and a preset test variation threshold, the test quantization precision corresponding to at least one level of model parameters and / or activation values is determined. The model to be trained is then trained using the test quantization precision to obtain the trained target model.
[0059] Thus, according to steps I-II above, in addition to detecting quantization parameters and dynamically adjusting quantization precision, the quantization precision of the model parameters at each layer can be upgraded or downgraded in turn, and the change in the correlation between quantization precision inference and full precision inference before and after the upgrade or downgrade can be observed. If the increase in correlation after the upgrade exceeds the preset test change threshold, the upgraded quantization precision is retained. If the decrease in correlation after the downgrade is smaller than the preset test change threshold, the downgraded quantization precision is retained. This further improves model training performance or saves training memory usage, thereby improving training efficiency.
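A minimal sketch of the search in steps I-II follows. The precision ladder `PRECISIONS`, the layer names, and the stand-in `toy_correlation` function are hypothetical; in practice the correlation would be measured by running quantized and full-precision inference on the same data.

```python
PRECISIONS = ["FP8", "FP16", "FP32"]   # ordered low to high (assumed ladder)

def search_precisions(config, measure_correlation, test_threshold):
    """Try a one-step upgrade, then a one-step downgrade, for each level."""
    config = dict(config)
    for layer in list(config):
        base = measure_correlation(config)
        idx = PRECISIONS.index(config[layer])
        if idx + 1 < len(PRECISIONS):
            trial = {**config, layer: PRECISIONS[idx + 1]}
            # Keep the upgrade only if the correlation gain is large enough.
            if measure_correlation(trial) - base > test_threshold:
                config = trial
                continue
        if idx > 0:
            trial = {**config, layer: PRECISIONS[idx - 1]}
            # Keep the downgrade only if the correlation loss is small enough.
            if base - measure_correlation(trial) < test_threshold:
                config = trial
    return config

def toy_correlation(config):
    """Hypothetical stand-in: 'attn' is precision-sensitive, 'mlp' is not."""
    penalty = {"FP8": 1.0, "FP16": 0.2, "FP32": 0.0}
    sensitivity = {"attn": 0.05, "mlp": 0.001}
    return 1.0 - sum(sensitivity[name] * penalty[p] for name, p in config.items())

result = search_precisions({"attn": "FP8", "mlp": "FP16"},
                           toy_correlation, test_threshold=0.01)
```

In this toy setup the sensitive level is upgraded and the insensitive level is downgraded, which is the heterogeneous allocation the search is meant to find.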
[0060] As an optional embodiment, the text model reinforcement learning training method further includes: i: In turn, upgrade or downgrade the quantization precision corresponding to at least one level of model parameters and / or activation values, and determine the test change of the probability correlation parameter value of each output word corresponding to inference with quantization precision and full precision before and after the upgrade or downgrade.
[0061] ii: Based on the test variation of the probabilistic correlation parameter values, construct the weight distribution of the impact of quantization accuracy on model performance at each level.
[0062] iii: Based on the weight distribution, determine the preset quantization threshold and / or preset variation threshold corresponding to the model parameters and / or activation values of each level.
[0063] For steps i-iii above, in specific implementation, a weight distribution of the impact of quantization precision on model performance is constructed based on the tested changes in the probability relevance parameter values of each output word corresponding to inference with quantized precision and full precision before and after upgrading or downgrading, respectively. For example, the probability relevance parameter value here can be the Pearson correlation coefficient. Then, based on the weight distribution, the preset quantization threshold and / or preset variation threshold corresponding to the model parameters and / or activation values of each level are determined. Here, since the degree of influence of each level's precision on the final inference precision is different, a preset threshold corresponding to each level needs to be set separately. The higher the weight, the lower the preset threshold. By testing the impact of different level precision adjustments on model performance, a weight distribution is constructed. Levels with a greater impact on model performance are assigned higher weights, and the corresponding preset quantization thresholds and preset variation thresholds are set more strictly to prioritize ensuring precision matching at that level. Levels with a smaller impact on model performance are assigned lower weights, and the corresponding thresholds are set more leniently to prioritize training efficiency, thus achieving personalized threshold configuration. The weight distribution can also be dynamically updated according to the preset training stage. For example, the weight distribution can be recalibrated every certain training step to adapt to the needs of different training stages.
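Steps i-iii can be sketched as follows. Normalizing the absolute correlation changes into weights and mapping weights to thresholds by inverse scaling are assumptions; the text only requires that a higher weight yields a lower (stricter) threshold. The level names and values are hypothetical.

```python
def impact_weights(correlation_changes):
    """Normalize per-level absolute correlation changes into weights."""
    total = sum(abs(change) for change in correlation_changes.values())
    if total == 0:
        n = len(correlation_changes)
        return {level: 1.0 / n for level in correlation_changes}
    return {level: abs(change) / total
            for level, change in correlation_changes.items()}

def thresholds_from_weights(weights, base_threshold):
    """Higher weight gives a lower threshold (inverse scaling is assumed)."""
    n = len(weights)
    return {level: base_threshold / (1.0 + n * weight)
            for level, weight in weights.items()}

# Hypothetical test changes in the Pearson coefficient for two levels.
changes = {"layer0": 0.04, "layer1": 0.01}
weights = impact_weights(changes)
thresholds = thresholds_from_weights(weights, base_threshold=0.5)
```

Re-running `impact_weights` on fresh measurements at a later training stage would correspond to the recalibration of the weight distribution mentioned above.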
[0064] As an optional embodiment, the text model reinforcement learning training method further includes: Every preset number of training steps, or when the model gradient at the current quantization precision converges below a preset value, a search method is executed to alternately upgrade or downgrade the quantization precision corresponding to at least one level of model parameters and / or activation values.
[0065] In the implementation of the above steps, the search can be run every preset number of training steps (e.g., every 5000 steps) to adapt to the dynamic changes in model parameters during training. Alternatively, it can be run when the model gradient converges to below a preset value, which indicates that the model training is stabilizing under the current precision configuration. At this point, a search method is executed to upgrade or downgrade the quantization precision corresponding to at least one level of model parameters and / or activation values in turn. That is, the above process of step I-step II or step i-step iii is executed once. This can further optimize the quantization precision configuration, improve the upper limit of model performance, and prevent training from getting stuck in local optima.
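The trigger condition for the search can be expressed as a small predicate. Using a gradient norm as the measure of gradient convergence and the value `1e-3` are assumptions; the 5000-step interval is the example given in the text.

```python
def search_due(step, grad_norm, interval=5000, grad_floor=1e-3):
    """True when a precision search should run at this training step."""
    periodic = step > 0 and step % interval == 0   # every `interval` steps
    converged = grad_norm < grad_floor             # training has stabilized
    return periodic or converged
```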
[0066] As an optional embodiment, the text model reinforcement learning training method further includes: (1) Obtain the teacher model corresponding to the model to be trained.
[0067] Regarding step (1) above, in specific implementation, the teacher model corresponding to the model to be trained is obtained. Here, the teacher model is a model with full-precision data format, and the parameters of the teacher model are the unquantized model parameters corresponding to the model to be trained. Here, as an example, a high-precision (FP32) copy, i.e., the teacher model, is always maintained. It can be the initial SFT model, or an FP32 version that is updated synchronously with training. For example, after each training step, the full-precision parameters of the model to be trained before quantization into low-precision model parameters are passed to the teacher model as the full-precision parameters of the teacher model.
[0068] (2) In each training step of the model to be trained, a portion of the training text is randomly selected from the training text data with a preset sampling probability and input into the teacher model to obtain the first word probability distribution of the portion of the training text under full precision calculation, and at the same time, the second word probability distribution of the model to be trained on the portion of the training text under a determined quantization precision is obtained.
[0069] Here, the preset sampling probability can be set according to training needs, randomly selecting a portion of the training text instead of all of it. This ensures the effectiveness of the guidance while avoiding excessive computational costs, achieving a balance between efficiency and effectiveness. The first word probability distribution is obtained by the teacher model under full-precision computation without quantization noise pollution, reflecting the ideal output of the model during inference; the second word probability distribution is the output of the student model under the current quantization precision, and the difference between the two is the information loss caused by quantization.
[0070] Regarding step (2) above, in specific implementation, in each training step of the model to be trained, a portion of the training text is randomly selected from the training text data with a preset sampling probability and input into the teacher model to obtain the first word probability distribution of the portion of the training text under full precision calculation, and at the same time, the second word probability distribution of the model to be trained on the portion of the training text under a determined quantization precision is obtained.
[0071] (3) Calculate the distillation loss based on the difference between the first word probability distribution and the second word probability distribution.
[0072] (4) The distillation loss is introduced as a regularization term into the reinforcement learning objective function to form a joint optimization objective function, and the model parameters of the model to be trained are updated based on the joint optimization objective function.
[0073] For the randomly selected training samples, in addition to the RL reward signal, this embodiment adds an extra distillation loss. This loss forces the student model's output (logits or token probability distribution) to align with the teacher model's output. Here, the distillation loss acts as a regularization term, constraining the student model's policy updates from deviating too far from the high-precision benchmark.
[0074] For steps (3)-(4) above, in specific implementation, the distillation loss is calculated based on the difference between the probability distribution of the first word and the probability distribution of the second word. Then, the distillation loss is introduced as a regularization term into the reinforcement learning objective function to form a joint optimization objective function, and the model parameters of the model to be trained are updated based on the joint optimization objective function.
[0075] Thus, according to the above steps (1)-(4), stable training can be achieved, preventing degradation. RL exploration is high-risk, and the model may become obsessed with short-term rewards, learning some fragile policies that will collapse in a quantized environment. The teacher model, as a stable, high-precision knowledge source used in actual reasoning, continuously provides guidance to the student model on "what is the correct and robust representation." At the same time, it transmits high-precision information, and the teacher model is not polluted by quantization noise. Through KL divergence loss, the quantized student model is forced not only to imitate actions, but also to imitate the essence of the high-precision model in "thinking" (internal representation). This can effectively compensate for the information loss caused by quantization. Adding distillation loss to the objective function of RL is equivalent to a powerful regularization term. It constrains policy updates from deviating too far from the high-precision benchmark, guiding exploration in a more promising and quantization-robust direction. This directly improves the sample efficiency of training and the upper limit of final performance.
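Steps (1)-(4) can be sketched numerically as below. This is a plain-Python illustration, not a training loop: the direction KL(teacher || student), the weight `distill_weight`, and the per-text independent sampling are assumptions, and a real implementation would compute these quantities inside an automatic differentiation framework.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one output word distribution."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_probs, student_probs) if p > 0)

def joint_loss(rl_loss, teacher_logits, student_logits, distill_weight=0.1):
    """RL objective plus the distillation loss as a regularization term."""
    kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return rl_loss + distill_weight * kl

def sample_for_distillation(batch, sampling_prob, rng):
    """Select each training text independently with the preset probability."""
    return [text for text in batch if rng.random() < sampling_prob]

rng = random.Random(0)
subset = sample_for_distillation(["t1", "t2", "t3", "t4"], 0.5, rng)
aligned = joint_loss(1.0, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
drifted = joint_loss(1.0, [1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

When the quantized student matches the full-precision teacher, the regularization term vanishes and the joint loss reduces to the RL loss; the further the student drifts, the larger the penalty.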
[0076] The text model reinforcement learning training method provided in this application first trains the model to be trained using training text data. For at least one level in the model to be trained, the model parameters and / or activation values corresponding to that level are quantized according to an initial quantization precision. Then, the quantization parameters corresponding to each model parameter and / or activation value are determined, and the quantization parameters are compared with a preset quantization threshold. Based on the comparison result, the quantization precision corresponding to the model parameters and / or activation values of that level is adjusted. The changes in the probability relevance parameter values of each output word corresponding to inference with quantization precision and full precision are compared before and after the quantization precision adjustment. Finally, based on the comparison result of the changes in the probability relevance parameter values and the preset change threshold, the target quantization precision corresponding to the model parameters and / or activation values of at least one level is determined, and the model to be trained is trained with the target quantization precision to obtain the trained target model.
[0077] This application achieves hierarchical heterogeneous precision allocation by initially quantizing model parameters and / or activation values at each level of the model, dynamically adjusting the quantization precision based on the quantization parameters, comparing the changes in probability correlation before and after precision adjustment, and determining the target quantization precision. This approach ensures training efficiency and reduces training costs through low-precision training, while precisely adjusting the precision of key levels to align training and inference precision, significantly improving the final effect of reinforcement learning training. This allows the model to truly reach the performance ceiling optimized during training during inference, effectively solving the technical challenge of balancing training cost and inference performance in existing technologies.
[0078] Please refer to Figure 2, which is a schematic diagram of the structure of a text model reinforcement learning training device provided in an embodiment of this application. As shown in Figure 2, the text model reinforcement learning training device 200 includes: The quantization processing module 201 is used to train the model to be trained using training text data, and to quantize the model parameters and / or activation values corresponding to at least one level in the model to be trained according to the initial quantization precision. The precision adjustment module 202 is used to determine the quantization parameter corresponding to each model parameter and / or activation value, compare the quantization parameter with a preset quantization threshold, and adjust the quantization precision corresponding to the model parameter and / or activation value at that level according to the comparison result. The variable determination module 203 is used to compare, before and after the quantization precision adjustment, the change in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with quantization precision and with full precision, respectively. The model training module 204 is used to determine the target quantization accuracy corresponding to at least one level of model parameters and / or activation values based on the comparison result of the change in the probability correlation parameter value and the preset change threshold, and to train the model to be trained with the target quantization accuracy to obtain the trained target model.
[0079] Furthermore, the quantization parameters include scaling parameters, wherein the scaling parameters are the ratio of the numerical range after quantization to the numerical range before quantization; when the precision adjustment module 202 compares the quantization parameters with a preset quantization threshold and adjusts the quantization precision corresponding to the model parameters and / or activation values at this level based on the comparison result, the precision adjustment module 202 is also used for: When the scaling parameter corresponding to the model parameters and / or activation values of this level is less than or equal to a preset quantization threshold, and the number of occurrences is greater than or equal to a first preset number of occurrences or the occurrence ratio is greater than or equal to a first preset ratio, the precision level of the quantization accuracy corresponding to the model parameters and / or activation values of this level is increased and / or the quantization range is expanded.
[0080] Furthermore, the quantization parameters include a precision loss parameter, wherein the precision loss parameter is the maximum number of values that are different before quantization but identical after quantization, or the ratio of that maximum number to the number of quantization parameters for that layer; when the precision adjustment module 202 compares the quantization parameters with a preset quantization threshold and adjusts the quantization precision corresponding to the model parameters and / or activation values of that layer based on the comparison result, the precision adjustment module 202 is further used for: When the precision loss parameter corresponding to the model parameters and / or activation values at this level is greater than a preset quantization threshold, and the number of occurrences is greater than or equal to a second preset number of occurrences or the occurrence ratio is greater than or equal to a second preset ratio, the precision level of the quantization precision corresponding to the model parameters and / or activation values at this level is increased and / or the precision range is expanded.
[0081] Furthermore, the preset quantization threshold and / or the preset variation threshold are dynamic thresholds, and the text model reinforcement learning training device further includes a quantization accuracy determination module, which is used for: Based on the current training stage and / or the current training task type, determine the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type; Based on the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type, the quantization accuracy corresponding to the model parameters and / or activation values of each level of the model to be trained is determined.
[0082] Furthermore, the text model reinforcement learning training device also includes a threshold adjustment module, which is used for: When the training process of the model to be trained reaches the calibration trigger condition, the verification text data is input into the model to be trained, and the model to be trained performs full-precision inference and quantized precision inference respectively to obtain the probability distribution correlation of the output lexical units of the model to be trained in different precision modes. The distribution deviation between model training and model inference is determined based on the correlation of the probability distribution, and the preset quantization threshold and / or the preset variation threshold are dynamically adjusted based on the distribution deviation.
[0083] Furthermore, the text model reinforcement learning training device also includes a precision search module, which is used for: The quantization precision corresponding to at least one level of model parameters and / or activation values is upgraded or downgraded in turn, and the test change of the probability correlation parameter value of each output word corresponding to the inference of the model under training with quantization precision and full precision before and after the upgrade or downgrade is determined. Based on the comparison between the test variation of the probability correlation parameter value and the preset test variation threshold, the test quantization accuracy corresponding to at least one level of model parameters and / or activation values is determined, and the model to be trained is trained with the test quantization accuracy to obtain the trained target model.
[0084] Furthermore, the threshold adjustment module is also used for: The quantization precision corresponding to at least one level of model parameters and / or activation values is upgraded or downgraded in turn, and the test change of the probability correlation parameter value of each output word corresponding to the inference of the model under training with quantization precision and full precision before and after the upgrade or downgrade is determined. Based on the test variation of the probabilistic correlation parameter values, construct the weight distribution of the impact of quantization accuracy at each level on model performance; Based on the weight distribution, the preset quantization threshold and / or preset variation threshold corresponding to the model parameters and / or activation values of each level are determined respectively.
[0085] Furthermore, every preset number of training steps, or when the model gradient at the current quantization precision converges to below a preset value, the precision search module or the threshold adjustment module executes a search method that alternately upgrades or downgrades the quantization precision corresponding to at least one level of model parameters and / or activation values.
[0086] Furthermore, the text model reinforcement learning training device also includes a model parameter update module, which is used for: Obtain the teacher model corresponding to the model to be trained; wherein the teacher model is a model with full-precision data format; In each training step of the model to be trained, a portion of the training text is randomly selected from the training text data with a preset sampling probability and input into the teacher model to obtain the first word probability distribution of the portion of the training text under full precision calculation, and at the same time, the second word probability distribution of the model to be trained on the portion of the training text under a determined quantization precision is obtained. The distillation loss is calculated based on the difference between the first word probability distribution and the second word probability distribution; The distillation loss is introduced as a regularization term into the reinforcement learning objective function to form a joint optimization objective function, and the model parameters of the model to be trained are updated based on the joint optimization objective function.
[0087] Please refer to Figure 3, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 3, the electronic device 300 includes a processor 310, a memory 320, and a bus 330.
[0088] The memory 320 stores machine-readable instructions executable by the processor 310. When the electronic device 300 is running, the processor 310 and the memory 320 communicate via the bus 330. When the machine-readable instructions are executed by the processor 310, the steps of the text model reinforcement learning training method in the method embodiment shown in Figure 1 can be performed. The specific implementation is described in detail in the method embodiment and will not be repeated here.
[0089] This application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, can perform the steps of the text model reinforcement learning training method in the method embodiment shown in Figure 1. The specific implementation is described in detail in the method embodiment and will not be repeated here.
[0090] Those skilled in the art will understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and will not be repeated here.
[0091] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
[0092] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0093] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0094] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0095] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, intended to illustrate its technical solutions rather than to limit them, and the scope of protection of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in this application, anyone familiar with the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and should all be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be determined by the scope of the claims.
Claims
1. A text model reinforcement learning training method, characterized in that the text model reinforcement learning training method includes: training the model to be trained using training text data and, for at least one level in the model to be trained, quantizing the model parameters and / or activation values corresponding to that level according to an initial quantization precision; determining the quantization parameter corresponding to each model parameter and / or activation value, comparing the quantization parameter with a preset quantization threshold, and adjusting the quantization precision corresponding to the model parameters and / or activation values at that level based on the comparison result; determining the change, before and after the quantization precision adjustment, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with quantization precision and with full precision respectively; and, based on the comparison between the change in the probability correlation parameter value and a preset change threshold, determining the target quantization precision corresponding to the model parameters and / or activation values of the at least one level, and training the model to be trained with the target quantization precision to obtain the trained target model.
2. The text model reinforcement learning training method according to claim 1, characterized in that the quantization parameter includes a scaling parameter, wherein the scaling parameter is the ratio of the quantized numerical range to the unquantized numerical range; the step of comparing the quantization parameter with a preset quantization threshold, and adjusting the quantization precision corresponding to the model parameters and / or activation values at that level based on the comparison result, includes: when the scaling parameter corresponding to the model parameters and / or activation values of that level is less than or equal to the preset quantization threshold, and the number of such occurrences is greater than or equal to a first preset number of occurrences or the proportion of such occurrences is greater than or equal to a first preset ratio, raising the precision level of the quantization precision corresponding to the model parameters and / or activation values of that level and / or expanding the quantization range.
3. The text model reinforcement learning training method according to claim 1, characterized in that the quantization parameter includes a precision loss parameter, wherein the precision loss parameter is the maximum number of values that are identical after quantization but different before quantization, or the ratio of that maximum number to the number of quantization parameters for that level; the step of comparing the quantization parameter with a preset quantization threshold, and adjusting the quantization precision corresponding to the model parameters and / or activation values of that level based on the comparison result, includes: when the precision loss parameter corresponding to the model parameters and / or activation values of that level is greater than the preset quantization threshold, and the number of such occurrences is greater than or equal to a second preset number of occurrences or the proportion of such occurrences is greater than or equal to a second preset ratio, raising the precision level of the quantization precision corresponding to the model parameters and / or activation values of that level and / or expanding the precision range.
4. The text model reinforcement learning training method according to claim 1, characterized in that the preset quantization threshold and / or the preset variation threshold are dynamic thresholds, and the text model reinforcement learning training method further includes: determining, based on the current training stage and / or the current training task type, the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type; and determining, based on the preset quantization threshold and / or preset variation threshold corresponding to the current training stage and / or the current training task type, the quantization precision corresponding to the model parameters and / or activation values of each level of the model to be trained.
5. The text model reinforcement learning training method according to claim 1, characterized in that the text model reinforcement learning training method further includes: when the training process of the model to be trained satisfies a calibration trigger condition, inputting verification text data into the model to be trained and having the model to be trained perform full-precision inference and quantized-precision inference respectively, to obtain the probability distribution correlation of the output lexical units of the model to be trained in the different precision modes; and determining the distribution deviation between model training and model inference based on the probability distribution correlation, and dynamically adjusting the preset quantization threshold and / or the preset variation threshold based on the distribution deviation.
6. The text model reinforcement learning training method according to claim 1, characterized in that the text model reinforcement learning training method further includes: upgrading or downgrading in turn the quantization precision corresponding to the model parameters and / or activation values of at least one level, and determining the test variation, before and after the upgrade or downgrade, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with quantization precision and with full precision; and, based on the comparison between the test variation of the probability correlation parameter value and a preset test variation threshold, determining the test quantization precision corresponding to the model parameters and / or activation values of the at least one level, and training the model to be trained with the test quantization precision to obtain the trained target model.
7. The text model reinforcement learning training method according to claim 1, characterized in that the text model reinforcement learning training method further includes: upgrading or downgrading in turn the quantization precision corresponding to the model parameters and / or activation values of at least one level, and determining the test variation, before and after the upgrade or downgrade, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with quantization precision and with full precision; constructing, based on the test variation of the probability correlation parameter value, the weight distribution of the impact of the quantization precision of each level on model performance; and determining, based on the weight distribution, the preset quantization threshold and / or preset variation threshold corresponding to the model parameters and / or activation values of each level respectively.
8. The text model reinforcement learning training method according to claim 6 or 7, characterized in that the text model reinforcement learning training method further includes: every preset number of training steps, or when the model gradient at the current quantization precision converges below a preset value, executing a search that upgrades or downgrades in turn the quantization precision corresponding to the model parameters and / or activation values of at least one level.
9. The text model reinforcement learning training method according to claim 1, characterized in that the text model reinforcement learning training method further includes: obtaining the teacher model corresponding to the model to be trained, wherein the teacher model is a model in a full-precision data format; in each training step of the model to be trained, randomly selecting a portion of the training text from the training text data with a preset sampling probability and inputting it into the teacher model to obtain a first word probability distribution for that portion of the training text under full-precision calculation, and at the same time obtaining a second word probability distribution of the model to be trained for that portion of the training text under the determined quantization precision; calculating the distillation loss based on the difference between the first word probability distribution and the second word probability distribution; and introducing the distillation loss as a regularization term into the reinforcement learning objective function to form a joint optimization objective function, and updating the model parameters of the model to be trained based on the joint optimization objective function.
10. A text model reinforcement learning training device, characterized in that the text model reinforcement learning training device includes: a quantization processing module, used to train the model to be trained using training text data and, for at least one level in the model to be trained, quantize the model parameters and / or activation values corresponding to that level according to an initial quantization precision; a precision adjustment module, used to determine the quantization parameter corresponding to each model parameter and / or activation value, compare the quantization parameter with a preset quantization threshold, and adjust the quantization precision corresponding to the model parameters and / or activation values at that level according to the comparison result; a variable determination module, used to determine the change, before and after the quantization precision adjustment, in the probability correlation parameter value of each output word obtained when the model to be trained performs inference with quantization precision and with full precision respectively; and a model training module, used to determine the target quantization precision corresponding to the model parameters and / or activation values of the at least one level based on the comparison between the change in the probability correlation parameter value and the preset change threshold, and to train the model to be trained with the target quantization precision to obtain the trained target model.
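For illustration only, the two quantization parameters named in claims 2 and 3 can be sketched in Python. The sketch rests on several assumptions that the claims do not fix: a symmetric uniform quantizer with a hypothetical clipping value `clip`; the precision loss parameter read as the largest number of distinct original values that collapse onto one quantized code; and a simple count of threshold crossings over recent training steps as the trigger. All function names are hypothetical.

```python
from collections import defaultdict


def quantize(values, bits, clip):
    """Symmetric uniform quantizer (assumed); returns codes and dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    step = clip / qmax
    codes = [max(-qmax, min(qmax, round(v / step))) for v in values]
    return codes, [c * step for c in codes]


def scaling_parameter(values, dequantized):
    """Ratio of the quantized numerical range to the unquantized numerical range."""
    orig_range = max(values) - min(values)
    if orig_range == 0:
        return 1.0
    return (max(dequantized) - min(dequantized)) / orig_range


def precision_loss_parameter(values, codes, as_ratio=False):
    """Largest number of distinct original values mapped to the same code."""
    buckets = defaultdict(set)
    for v, c in zip(values, codes):
        buckets[c].add(v)
    worst = max(len(b) for b in buckets.values())
    return worst / len(values) if as_ratio else worst


def should_raise_precision(history, threshold, min_count, trigger_below=True):
    """Decide whether to raise the precision level of a level.

    history: parameter values recorded over recent training steps.
    trigger_below=True matches the scaling parameter (<= threshold);
    trigger_below=False matches the precision loss parameter (> threshold).
    """
    if trigger_below:
        hits = sum(1 for h in history if h <= threshold)
    else:
        hits = sum(1 for h in history if h > threshold)
    return hits >= min_count


# Two outliers force clipping at 4 bits, so the quantized range is much
# narrower than the original range and nearby values share a code.
weights = [-4.0, -0.3, -0.2, -0.1, 0.1, 0.2, 0.3, 4.0]
codes, dequantized = quantize(weights, bits=4, clip=1.0)
ratio = scaling_parameter(weights, dequantized)
loss = precision_loss_parameter(weights, codes)
```

A low scaling parameter indicates that clipping has discarded much of the original value range, while a high precision loss parameter indicates that many distinct values have become indistinguishable; in this reading, either signal persisting over enough steps would prompt a higher precision level for that level.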