Reinforcement learning training method and device of text language model

By introducing a dynamic clipping threshold into the reinforcement learning training of text language models, the method addresses the problem of a fixed threshold inhibiting the learning of long chains of thought, yielding more efficient training and better model performance. The dynamic clipping threshold allows appropriate updates to key tokens, which facilitates the emergence of complex reasoning behaviors.

CN122242628A · Pending · Publication Date: 2026-06-19 · SHANGHAI XIYU JIZHI TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI XIYU JIZHI TECH CO LTD
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing reinforcement learning training methods for text language models suffer from limited room for performance improvement, unstable training, low efficiency, and high cost. In particular, they are inadequate at promoting long-chain reasoning behavior: a fixed clipping threshold suppresses the update contribution of key tokens, hindering the natural emergence of long-chain reasoning.

Method used

A dynamic clipping threshold is used to clip the importance sampling weight of each word. Because dynamic clipping retains the gradient contribution of key behavioral words while still preventing gradient explosion, the model can learn complex reasoning patterns more effectively during reinforcement learning training, which improves training efficiency and model performance.

Benefits of technology

Dynamic clipping thresholds enable more efficient model training on complex inference tasks, avoiding inefficiency and training failures caused by improper fixed threshold settings, thus improving model performance and achieving faster policy convergence and higher computational efficiency.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a reinforcement learning training method and apparatus for a text language model. The method includes: for each training text data, determining a first importance sampling weight for each target word in a first inference result, based on the first inference result determined by the language model to be trained for the training text data and a second inference result determined in the previous training step; clipping the first importance sampling weight with a dynamic clipping threshold to determine a second importance sampling weight for the target word; determining the current objective function value of the model based on the second importance sampling weight and / or the first probability of each target word, and on the reward value corresponding to the first inference result; and updating the model parameters according to the gradient value computed from the objective function value to obtain the target text language model. Clipping the word weights with a dynamic clipping threshold during reinforcement training ensures the stability of model training and improves training efficiency and model performance.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a reinforcement learning training method and apparatus for a text language model.

Background Technology

[0002] With the rapid development of artificial intelligence technology, text language models have demonstrated outstanding performance in numerous fields due to their powerful natural language understanding and generation capabilities. To further enhance the reasoning ability and decision-making quality of text language models, reinforcement learning has been widely introduced into the model training process, optimizing policies to align with human preferences or task objectives. However, existing reinforcement learning training methods, especially when applied to text language models that have not been trained using reinforcement learning, often face significant challenges, resulting in limited performance improvement potential, unstable training processes, low efficiency, and high costs.

[0003] The root cause lies in the shortcomings of existing methods in promoting long-chain reasoning behavior. Long-chain reasoning typically involves complex reflection and verification steps. In the early stages of training, because the model has not yet fully learned such patterns, during the policy update phase of reinforcement learning, importance sampling-based algorithms assign high importance sampling weights to some low-probability but potentially high-exploration-value terms, meaning they require large gradient update magnitudes to facilitate learning. However, to maintain training stability and prevent gradient explosion or training collapse, existing techniques typically employ a fixed pruning threshold to trim these excessively large weight updates. While this fixed pruning mechanism stabilizes gradients to some extent, it also suppresses the update contributions of terms crucial to long-chain reasoning behavior, making it difficult for the model to effectively learn and reinforce reflective patterns, thus hindering the natural emergence of long-chain reasoning behavior.

Summary of the Invention

[0004] In view of this, the purpose of this application is to provide a reinforcement learning training method and apparatus for a text language model, which performs word weight clipping through a dynamic clipping threshold during reinforcement training, thereby ensuring the stability of model training and improving model training efficiency and model performance.

[0005] This application provides a reinforcement learning training method for a text language model, the training method comprising: for each training text data, obtaining the first inference result determined by the language model to be trained in the current training step by inference on the training text data, and, based on the first inference result and the second inference result determined in the previous training step, determining the first importance sampling weight of each target word included in the first inference result; for each target word, clipping the first importance sampling weight of the target word based on the dynamic clipping threshold of the target word to determine the second importance sampling weight of the target word; determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and on the reward value corresponding to the first inference result; and calculating the gradient value based on the objective function value, and updating the parameters of the language model to be trained according to the gradient value to obtain the target text language model.

[0006] Optionally, determining the dynamic clipping threshold for the target word includes: determining the dynamic clipping threshold of the target word based on at least one of the following: the current training step of the language model to be trained, the gradient value corresponding to at least one previous training step, and the model performance; or, for each target word, identifying the degree of semantic association between the target word and each preset word, wherein the preset words are words related to the model's reflective behavior, and determining the dynamic clipping threshold of the target word based on the degree of semantic association between the target word and each preset word and on the association weight of each preset word; or, determining the dynamic clipping threshold of the target word based on the degree of semantic association between the target word and each preset word, the association weight of each preset word, and the training state of the language model currently being trained.

[0007] Optionally, the step of determining the preset words includes: Semantic and statistical analysis is performed on multiple high-quality outputs from at least one reference language model to determine multiple words and phrases related to model reflection and the frequency of each word and phrase. The words and phrases related to model reflection are used as preset words, and the relevance weight of each preset word is determined according to the frequency of each word and phrase.
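The frequency-to-weight step described in [0007] can be illustrated with a minimal sketch. It assumes the reflection-related phrases have already been identified (the `candidate_phrases` list is hypothetical) and that the relevance weight is simply the relative frequency; the application only states that the weight is determined according to the frequency, so this mapping is an assumption.

```python
from collections import Counter


def preset_word_weights(outputs, candidate_phrases):
    """Count each candidate reflective phrase in the reference outputs and
    convert the counts into relevance weights.

    Assumption: weight = relative frequency among all counted phrases.
    """
    counts = Counter()
    for text in outputs:
        lowered = text.lower()
        for phrase in candidate_phrases:
            counts[phrase] += lowered.count(phrase.lower())
    total = sum(counts.values())
    if total == 0:
        return {}
    # Phrases that never occur are dropped rather than kept with zero weight.
    return {p: c / total for p, c in counts.items() if c > 0}


reference_outputs = [
    "Wait, let me reconsider. Actually the sum is 7.",
    "Actually, I should reconsider: actually 8.",
]
weights = preset_word_weights(
    reference_outputs, ["reconsider", "actually", "my mistake"]
)
```

In practice the counts would come from many outputs of one or more reference models, and a different normalization (for example, scaling so that the largest weight equals a chosen constant) would fit the description equally well.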

[0008] Optionally, determining the first importance sampling weight for each target word in the first inference result based on the first inference result and the second inference result determined in the previous training step includes: Determine whether the length of the first inference result is the same as the length of the second inference result; If they are consistent, then based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result, the first importance sampling weight of each target word included in the first inference result is determined. If they are inconsistent, the first importance sampling weight of each target word in the first inference result with the same length as the second inference result is determined based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result. The first probability of each target word in the part of the first inference result that is longer than the second inference result is used as the first importance sampling weight of each target word.

[0009] Optionally, the method further includes: determining the length difference between the first inference result and the second inference result; if the difference is greater than or equal to a preset number of tokens, and / or the ratio of the difference to the length of the first inference result or of the second inference result is greater than or equal to a preset ratio, then applying at least one of the following: adding a penalty term to the objective function, reducing the dynamic clipping threshold, discarding the objective function value corresponding to the first inference result, or applying a weight less than 1 when calculating the gradient value from the objective function value corresponding to the first inference result.
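The length check in [0009] can be sketched as follows. This is an illustration only: the function and parameter names are hypothetical, the two conditions are combined with "or", the ratio is taken against the first inference result, and only one of the listed remedies (a gradient weight below 1) is shown.

```python
def length_gap_exceeded(len_first, len_second, max_tokens=None, max_ratio=None):
    """Return True when the length gap between the two inference results
    reaches the preset token count or the preset ratio."""
    diff = abs(len_first - len_second)
    if max_tokens is not None and diff >= max_tokens:
        return True
    if max_ratio is not None and len_first > 0:
        # Assumption: the ratio is measured against the first result's length;
        # the text also allows the second result's length.
        if diff / len_first >= max_ratio:
            return True
    return False


def sample_weight(len_first, len_second, down_weight=0.5, **limits):
    """One of the listed remedies: scale this sample's gradient by a weight < 1."""
    if length_gap_exceeded(len_first, len_second, **limits):
        return down_weight
    return 1.0


# A 20-token gap against a 16-token limit triggers the down-weighting.
w = sample_weight(120, 100, max_tokens=16)
```

The other remedies (a penalty term, a reduced clipping threshold, or discarding the sample) would hook into the same predicate.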

[0010] Optionally, clipping the first importance sampling weight of each target word based on its dynamic clipping threshold to determine the second importance sampling weight of the target word includes: for each target word, determining the weight threshold range of the target word based on the dynamic clipping threshold of the target word; identifying whether the first importance sampling weight of the target word lies within its weight threshold range; if it does, taking the first importance sampling weight of the target word as its second importance sampling weight; if it does not, taking the boundary of the range closest to the first importance sampling weight as the second importance sampling weight of the target word.
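The rule in [0010] amounts to a clamp with a per-word range. A minimal sketch, assuming the range is symmetric around 1 (that is, [1 - t, 1 + t] for threshold t, which matches the worked example later in the description); separate upper and lower thresholds would work the same way.

```python
def clip_weight(weight, lower, upper):
    """Keep the weight if it lies in [lower, upper]; otherwise return the
    nearest boundary of the range."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return min(max(weight, lower), upper)


def clip_with_thresholds(first_weights, thresholds, center=1.0):
    """Clip each word's first importance sampling weight with its own
    dynamic threshold t, using the range [center - t, center + t].

    Assumption: a symmetric range around 1.
    """
    return [
        clip_weight(w, center - t, center + t)
        for w, t in zip(first_weights, thresholds)
    ]


# The third word has a wide threshold, so its large weight survives clipping.
second_weights = clip_with_thresholds([0.2, 1.1, 3.0], [0.5, 0.5, 5.0])
```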

[0011] Optionally, determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result, includes: determining the function value of each target word by multiplying its second importance sampling weight by the reward value corresponding to the first inference result; or, converting the second importance sampling weight of each target word into a coefficient, and determining the function value of each target word by multiplying that coefficient, the logarithm computed from the first probability of the target word, and the reward value corresponding to the first inference result; and determining the current objective function value of the language model to be trained based on the mean of the function values of all target words in the first inference result.
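The two objective variants in [0011] can be written out numerically as below. This is a sketch on plain floats: it shows only how the value is assembled, not how gradients flow. Treating the clipped weight as a constant coefficient in the second variant is an interpretation of the text, and the function names are hypothetical.

```python
import math


def objective_ratio_form(second_weights, reward):
    """Mean over words of (second importance sampling weight * reward)."""
    values = [w * reward for w in second_weights]
    return sum(values) / len(values)


def objective_log_form(second_weights, first_probs, reward):
    """Mean over words of (coefficient * log(first probability) * reward).

    Assumption: the coefficient is the clipped weight taken as a constant.
    """
    values = [
        w * math.log(p) * reward for w, p in zip(second_weights, first_probs)
    ]
    return sum(values) / len(values)


value_a = objective_ratio_form([0.5, 1.5], reward=2.0)
value_b = objective_log_form([1.0, 1.0], [0.9, 0.8], reward=2.0)
```

In an autodiff framework the first variant would differentiate through the probability ratio, whereas the second would differentiate only through the log-probability term.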

[0012] Optionally, the method further includes: for each training text data, using parallel sampling to obtain multiple first inference results determined by the language model to be trained in the current training step by inference on the training text data; and determining the current objective function value of the language model to be trained based on the mean of the function values of all target words in the multiple first inference results.

[0013] Optionally, the method further includes: determining a word coefficient for the function value of each target word based on at least one of the following: the current training step of the model, the model performance, the reward value of the first inference result, and the dynamic clipping threshold of each target word; and determining the current objective function value of the language model to be trained based on the average computed from the function values of all target words in the first inference result and their corresponding word coefficients.
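The aggregation options in [0012] and [0013] can be sketched as follows. Two readings are assumed here and are not fixed by the text: the mean over multiple sampled results is pooled over all words of all results, and the coefficient variant averages the products of coefficient and function value. The function names are hypothetical.

```python
def pooled_objective(per_sample_values):
    """Mean of the per-word function values over all words of all
    parallel-sampled inference results (assumption: pooled mean)."""
    flat = [v for sample in per_sample_values for v in sample]
    return sum(flat) / len(flat)


def coefficient_objective(word_values, word_coeffs):
    """Mean of (word coefficient * function value) over one inference result
    (assumption: coefficients scale values before averaging)."""
    scaled = [c * v for v, c in zip(word_values, word_coeffs)]
    return sum(scaled) / len(scaled)


# Two sampled results of different lengths, pooled into one objective value.
pooled = pooled_objective([[1.0, 3.0], [2.0]])
# A coefficient of 0 removes a word's contribution; 2 doubles it.
scaled = coefficient_objective([1.0, 3.0], [2.0, 0.0])
```

A per-sample mean followed by a mean over samples would weight short and long results differently from the pooled mean used here.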

[0014] This application embodiment also provides a reinforcement learning training device for a text language model, the training device comprising: The acquisition module is used to acquire, for each training text data, the first inference result determined by the language model to be trained in the current training step in the inference of the training text data, and based on the first inference result and the second inference result determined in the previous training step, determine the first importance sampling weight of each target word included in the first inference result. The cutting module is used to cut the first importance sampling weight of each target word based on the dynamic cutting threshold of the target word, and determine the second importance sampling weight of the target word. The determination module is used to determine the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result. The update module is used to calculate the gradient value based on the objective function value, and update the parameters of the language model to be trained according to the gradient value to obtain the target text language model.

[0015] This application embodiment also provides an electronic device, including: a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the electronic device is running, the processor communicates with the memory via the bus. When the machine-readable instructions are executed by the processor, the steps of the training method described above are performed.

[0016] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the training method described above.

[0017] This application provides a reinforcement learning training method and apparatus for a text language model. The training method includes: for each training text data, obtaining a first inference result determined by the language model to be trained in the current training step through inference on the training text data, and determining a first importance sampling weight for each target word included in the first inference result based on the first inference result and a second inference result determined in the previous training step; for each target word, cutting the first importance sampling weight of the target word based on a dynamic cutting threshold of the target word to determine a second importance sampling weight of the target word; determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and a first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result; calculating a gradient value based on the objective function value, and updating the parameters of the language model to be trained based on the gradient value to obtain a target text language model.

[0018] By dynamically preserving the gradient contributions of key behavioral words, this approach enables the model to learn and explore complex reasoning patterns more effectively during reinforcement learning training. It overcomes the deficiency of existing techniques, which hinder the natural emergence of long chain-of-thought reasoning, and yields higher training efficiency on complex reasoning tasks. Furthermore, the dynamic clipping threshold allows appropriate updates to key behavioral words while still constraining extreme weight values that could cause gradient explosion. This makes training smoother and avoids the inefficiency, training failure, or performance fluctuations caused by an improperly set fixed threshold. Moreover, without the constraint of fixed clipping, the model's policy converges to a better solution more quickly, so that higher-performing text language models can be trained with the same computational resources, which addresses the limited performance improvement potential, low efficiency, and high cost of existing techniques.

[0019] In summary, this application introduces a dynamic clipping threshold to address the problem in existing techniques of a fixed threshold inhibiting the learning of long chains of thought. While keeping training stable, the scheme preserves room for updating high-value but low-probability reasoning words, thereby promoting complex reasoning behaviors and improving model performance and training efficiency.

[0020] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Attached Figure Description

[0021] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a flowchart of a reinforcement learning training method for a text language model provided in an embodiment of this application; Figure 2 is a schematic diagram of the structure of a reinforcement learning training device for a text language model provided in an embodiment of this application; Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of this application. Based on the embodiments of this application, every other embodiment obtained by those skilled in the art without inventive effort falls within the scope of protection of this application.

[0024] With the rapid development of artificial intelligence technology, text language models have demonstrated outstanding performance in numerous fields due to their powerful natural language understanding and generation capabilities. To further enhance the reasoning ability and decision-making quality of large text language models, reinforcement learning has been widely introduced into the model training process, optimizing policies to align with human preferences or task objectives. However, existing reinforcement learning training methods, especially when applied to text language models that have not been trained using reinforcement learning, often face significant challenges, resulting in limited performance improvement potential, unstable training processes, low efficiency, and high costs.

[0025] The root cause lies in the shortcomings of existing methods in promoting long-chain reasoning behavior. Long-chain reasoning typically involves complex reflection and verification steps. In the early stages of training, because the model has not yet fully learned such patterns, during the policy update phase of reinforcement learning, importance sampling-based algorithms assign high importance sampling weights to some low-probability but potentially high-exploration-value terms, meaning they require large gradient update magnitudes to facilitate learning. However, to maintain training stability and prevent gradient explosion or training collapse, existing techniques typically employ a fixed pruning threshold to trim these excessively large weight updates. While this fixed pruning mechanism stabilizes gradients to some extent, it also suppresses the update contributions of terms crucial to long-chain reasoning behavior, making it difficult for the model to effectively learn and reinforce reflective patterns, thus hindering the natural emergence of long-chain reasoning behavior.

[0026] Based on this, embodiments of this application provide a reinforcement learning training method and apparatus for a text language model. During reinforcement training, the weights of words are clipped using a dynamic clipping threshold, thereby ensuring the stability of model training and improving model training efficiency and model performance.

[0027] Please refer to Figure 1, which is a flowchart of a reinforcement learning training method for a text language model provided in an embodiment of this application. As shown in Figure 1, the training method provided in the embodiments of this application includes:

S101. For each training text data, obtain the first inference result determined by the language model to be trained in the current training step by inference on the training text data, and, based on the first inference result and the second inference result determined in the previous training step, determine the first importance sampling weight of each target word included in the first inference result.

S102. For each target word, clip the first importance sampling weight of the target word based on the dynamic clipping threshold of the target word to determine the second importance sampling weight of the target word.

S103. Determine the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and on the reward value corresponding to the first inference result.

[0028] S104. Calculate the gradient value based on the objective function value, and update the parameters of the language model to be trained according to the gradient value to obtain the target text language model.

[0029] The exemplary steps of the embodiments of this application are described below: For step S101, this step specifically includes: acquiring a training dataset; for each training text data in the training dataset, in the current training step, inputting the training text data into a language model to be trained. The model performs inference based on its current network parameters and generates corresponding output text, which is the first inference result. The first inference result consists of a series of tokens.

[0030] At the same time, the second inference result is obtained, that is, the result generated when the same language model to be trained, in the parameter state of the previous training step, performed inference on the same training text data. Based on the first inference result and the second inference result, the first importance sampling weight of each target word in the first inference result is calculated.

[0031] In one embodiment provided in this application, determining the first importance sampling weight of each target word included in the first inference result based on the first inference result and the second inference result determined in the previous training step includes:

S1011. Determine whether the length of the first inference result is consistent with the length of the second inference result.

S1012. If they are consistent, determine the first importance sampling weight of each target word included in the first inference result based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result.

S1013. If they are inconsistent, then for the portion of the first inference result that has the same length as the second inference result, determine the first importance sampling weight of each target word based on the ratio of the first probability of the target word in the first inference result to the second probability of the corresponding target word in the second inference result; and take the first probability of each target word in the portion of the first inference result that is longer than the second inference result as the first importance sampling weight of that target word.

[0032] For step S1011, this step includes: comparing the lengths of the first inference result and the second inference result; if the two lengths are the same, proceed to step S1012; if they are not the same, proceed to step S1013.

[0033] Regarding step S1012, when the lengths of the first inference result and the second inference result are equal, the two results are aligned position by position. In this case, for each target word in the first inference result, the ratio of its generation probability under the current policy of the language model to be trained (the first probability) to the generation probability of the word at the same position in the second inference result under the previous policy (the second probability) is calculated. This ratio is used as the first importance sampling weight for that target word.

[0034] For step S1013, this step specifically includes: when the lengths of the first inference result and the second inference result are inconsistent, the first inference result is divided into two parts: a prefix part with the same length as the second inference result, and a suffix part that exceeds the length of the second inference result.

[0035] For the prefix portion, that is, each target word in the first inference result from the starting position up to the length of the second inference result, the first importance sampling weight of the word is the ratio of the first probability of the word at that position to the second probability of the corresponding word in the second inference result, in the same way as in step S1012.

[0036] For the suffix part, that is, the part of the first inference result that is longer than the second inference result, since there is no corresponding word in the second inference result for comparison, the first probability of each target word in this part is directly used as its first importance sampling weight.

[0037] Furthermore, when the first inference result is shorter than the second inference result, no importance sampling weights need to be calculated for the extra portion of the second inference result. Importance sampling weights are calculated only for the words of the first inference result, that is, for the portion in which the first inference result and the second inference result have the same length.

[0038] In this way, regardless of whether the length of the first inference result changes, the first importance sampling weight can be accurately assigned to each target word in the first inference result, providing a reliable basis for subsequent weight clipping and objective function calculation.
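Steps S1011 to S1013 can be condensed into a short sketch. It assumes each inference result is represented only by the list of per-position generation probabilities (first probabilities under the current parameters, second probabilities under the previous step's parameters); the function name is hypothetical.

```python
def first_importance_weights(first_probs, second_probs):
    """Compute the first importance sampling weight of every word in the
    first inference result.

    Prefix (positions present in both results): first / second probability.
    Suffix (first result longer than second): the first probability itself.
    If the first result is shorter, the extra second-result positions are ignored.
    """
    shared = min(len(first_probs), len(second_probs))
    weights = [first_probs[i] / second_probs[i] for i in range(shared)]
    weights.extend(first_probs[shared:])
    return weights


# First result is one word longer than the second result.
weights = first_importance_weights([0.5, 0.2, 0.9], [0.25, 0.4])
```

When the two lengths are equal the suffix is empty, so the same function covers step S1012 as well.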

[0039] For step S102, this step specifically includes: in order to limit the magnitude of policy updates and prevent the model from becoming unstable due to excessive single update magnitude, a dynamic clipping threshold is determined for each target word included in the first inference result.

[0040] Here, the dynamic clipping threshold can be adaptively determined based on the characteristics of the lexical unit, the current training step, and other conditions. For example, for high-frequency words, the clipping threshold can be set smaller to avoid frequent changes affecting the overall training; for reflection-related low-frequency words, the clipping threshold can be set larger to support the model's exploratory learning and improve training efficiency. Then, the first importance sampling weights calculated in step S101 are clipped using this dynamic clipping threshold to obtain the second importance sampling weights. In this way, on the one hand, the updated strategy is ensured not to deviate too far from the old strategy, guaranteeing training stability; on the other hand, it can better encourage the model's exploratory learning, improving training efficiency and model performance.

[0041] The dynamic clipping threshold may include an upper clipping threshold and / or a lower clipping threshold.

[0042] Furthermore, in one embodiment provided in this application, three methods for determining the dynamic clipping threshold of the target word are provided, specifically as follows:

A. Determine the dynamic clipping threshold of the target word based on at least one of the following: the current training step of the model to be trained, the gradient values corresponding to at least one previous training step, and the model performance.

B. Alternatively, for each target word, identify the degree of semantic association between the target word and each preset word, wherein the preset words are words related to the model's reflective behavior; and determine the dynamic clipping threshold of the target word based on the degree of semantic association between the target word and each preset word and the association weight of each preset word.

C. Alternatively, determine the dynamic clipping threshold of the target word based on the degree of semantic association between the target word and each preset word, the association weight of each preset word, and the training state of the model currently being trained.

[0043] For method A, the specific steps include: adaptively determining the dynamic clipping threshold for each target word based on the current training progress of the model to be trained and / or historical optimization information. The training progress can be characterized by the current training step; generally, the larger the training step, the closer the model is to convergence, and the clipping threshold can be reduced accordingly. The historical optimization information includes gradient values and / or model performance metrics corresponding to at least one previous training step. For example, when the gradient values corresponding to at least one previous training step are unstable, the threshold can be appropriately tightened to stabilize training; when the gradient values are stable, the threshold can be appropriately relaxed to encourage exploration. Model performance metrics include the perplexity or reward value of the model on the validation set; when performance improvement slows down, the threshold can be relaxed to encourage exploration; when performance improvement is rapid, the threshold can be tightened to stabilize training.
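One way to combine the three signals of method A is sketched below. Every constant here (the linear decay to half the base value, the 0.5 variation limit, the 0.8 / 0.9 / 1.1 scaling factors, the 0.01 gain cutoff) is an illustrative assumption; the application describes only the direction of each adjustment, not a formula.

```python
from statistics import mean, pstdev


def method_a_threshold(base, step, total_steps, recent_grad_norms, recent_gain):
    """Adapt the clipping threshold from training progress, gradient
    stability, and recent performance gain (all rules are assumed)."""
    # Training progress: shrink linearly from base to half of base.
    progress = min(max(step / total_steps, 0.0), 1.0)
    threshold = base * (1.0 - 0.5 * progress)

    # Gradient stability: coefficient of variation of recent gradient norms.
    avg = mean(recent_grad_norms)
    unstable = avg > 0 and pstdev(recent_grad_norms) / avg > 0.5
    threshold *= 0.8 if unstable else 1.1  # tighten if unstable, else relax

    # Performance: relax when improvement slows, tighten when it is rapid.
    threshold *= 1.1 if recent_gain < 0.01 else 0.9
    return threshold


early = method_a_threshold(0.5, 0, 100, [1.0, 1.0, 1.0], recent_gain=0.0)
late = method_a_threshold(0.5, 100, 100, [1.0, 1.0, 1.0], recent_gain=0.0)
```

The same value could be used for every word in a step, or combined with a per-word term as in method C.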

[0044] For implementation method B, the dynamic clipping threshold is determined based on the semantic association degree of lexical units.

[0045] Specifically, this involves identifying the semantic association degree between each target word in the first inference result and each preset word. Then, based on the semantic association degree between the target word and each preset word, and the association weight of each preset word, a weighted sum is performed to obtain the dynamic clipping threshold of the target word.

[0046] The preset words are those related to the model's reflective behavior, such as "reconsider," "in other words," "actually," "my mistake," "let's think about it step by step," "let's review the reasoning process above," "re-examine the reasoning process," etc., representing the model's self-correction, restatement, and reflection. The semantic association degree can be quantified by calculating the cosine similarity or Euclidean distance between the vector of the target word and the vector of the preset word, or by using a similarity score produced by a pre-trained semantic model.

[0047] For example, in the prior art the clipping threshold is fixed at 0.5, so the clipping range of the importance sampling weights is 1 ± 0.5: weights between 0.5 and 1.5 are unaffected by clipping, weights below 0.5 are uniformly clipped to 0.5, and weights above 1.5 are uniformly clipped to 1.5. In this scheme, the association weight of each preset word can be uniformly set to 5, and the semantic association degree between each word and a preset word can be defined on a scale of 0 to 1. A degree of 0 means completely unrelated: after multiplying by the association weight of 5, the clipping threshold is 0, so only an importance sampling weight of exactly 1 is unaffected by clipping, which is equivalent to clipping the importance sampling weights of all such words to 1. A degree of 1 means completely related: after multiplying by the association weight of 5, the clipping threshold is 5, so importance sampling weights between -4 and 6 are unaffected by clipping. Compared to the clipping range of 0.5 to 1.5 in the prior art, this application allows a larger update range for words that are identical or similar to the preset words, thereby ensuring training stability while supporting model exploration, accelerating model iteration, and improving training efficiency.
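The weighted-sum computation of method B and the numerical example above can be sketched in Python as follows. The helper names (`cosine`, `semantic_threshold`, `weight_range`), the toy two-dimensional vectors, and the choice to treat negative cosine similarity as "unrelated" are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_threshold(token_vec, preset_vecs, preset_weights):
    """Weighted sum of (association degree in [0, 1]) x (association weight)."""
    total = 0.0
    for vec, weight in zip(preset_vecs, preset_weights):
        # Assumption: negative similarity is treated as "completely unrelated".
        degree = max(0.0, cosine(token_vec, vec))
        total += degree * weight
    return total


def weight_range(eps):
    """Weight threshold range [1 - eps, 1 + eps] for a numeric threshold."""
    return (1.0 - eps, 1.0 + eps)


if __name__ == "__main__":
    preset = [[1.0, 0.0]]   # toy vector standing in for one preset word
    weights = [5.0]         # association weight of 5, as in the example
    related = semantic_threshold([1.0, 0.0], preset, weights)
    unrelated = semantic_threshold([0.0, 1.0], preset, weights)
    print(weight_range(related))    # range for a completely related word
    print(weight_range(unrelated))  # range for a completely unrelated word
```

With a single preset word of weight 5, a completely related word receives the range from -4 to 6 and a completely unrelated word receives the degenerate range from 1 to 1, matching the example in the text.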

[0048] For implementation method C, the dynamic clipping threshold is determined based on a combination of training state and semantic association degree.

[0049] Specifically, the two methods described above are combined, taking into account both the training state of the current model to be trained (such as training steps, gradients, and performance) and the semantic association between the target word and the reflection-related words, to comprehensively determine the dynamic clipping threshold of the target word. For example, the output of implementation method A can be expressed as a threshold training coefficient related to the training state of the current model to be trained (such as training steps, gradients, and performance), with a value ranging from 0 to 1. The final dynamic clipping threshold is then obtained by multiplying the clipping threshold obtained by method B by the threshold training coefficient obtained by method A. Other combinations, such as addition, are also possible and will not be elaborated upon here.
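The combination described for method C can be sketched as below. The function name `combined_threshold` and the `mode` parameter are hypothetical; only the multiplicative and additive combinations mentioned in the text are covered.

```python
def combined_threshold(semantic_eps, training_coef, mode="multiply"):
    """Combine the method-B threshold with the method-A training coefficient.

    semantic_eps  : clipping threshold obtained from semantic association (method B)
    training_coef : threshold training coefficient in [0, 1] (method A)
    mode          : "multiply" or "add" (hypothetical switch for illustration)
    """
    if not 0.0 <= training_coef <= 1.0:
        raise ValueError("training_coef is expected to lie in [0, 1]")
    if mode == "multiply":
        return semantic_eps * training_coef
    if mode == "add":
        return semantic_eps + training_coef
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # A fully related word (threshold 5) halfway through a decaying schedule.
    print(combined_threshold(5.0, 0.5))
```

Under the multiplicative form, a coefficient that shrinks as training progresses narrows the clipping range of every word in proportion, so reflection-related words keep a relatively wider range than other words.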

[0050] Regarding embodiment B, in one implementation provided in this application, the step of determining the preset words includes: S201. Perform semantic and statistical analysis on multiple high-quality outputs from at least one reference language model to determine multiple words and phrases related to model reflection and the frequency of each word and phrase. S202. The words and phrases related to model reflection are used as preset words, and the relevance weight of each preset word is determined according to the frequency of each word and phrase.

[0051] Step S201 specifically includes: acquiring multiple high-quality output texts generated by at least one reference language model under various prompts; performing semantic analysis on these high-quality output texts to identify words and phrases expressing reflective behaviors such as model self-correction, rethinking, error correction, and logical transitions; and simultaneously, performing frequency statistics on each identified word and phrase, recording the number of times it appears in all high-quality outputs.

[0052] Here, high-quality output refers to text that has been manually or automatically screened and is considered to have high accuracy, logic, and fluency, such as responses that score highly in question-and-answer, dialogue, and text generation tasks. Semantic analysis can employ methods such as keyword matching, syntactic analysis, pre-trained language model embedding clustering, and rule-based pattern recognition.

[0053] For step S202, this step may specifically include: taking the words and phrases identified in step S201 that are related to model reflection as words in a preset word library. Then, assigning a relevance weight to each word and phrase based on the frequency of its occurrence.

[0054] Here, the relevance weight reflects the degree of relevance between the word / phrase and the reflective behavior. The higher the frequency of occurrence, the more often the word / phrase is used in high-quality reflective expressions, and its relevance weight can be set accordingly higher.

[0055] This method can automatically extract words related to reflective behavior from a large amount of high-quality text and quantify their importance, providing an objective basis for dynamically adjusting the clipping threshold in subsequent reinforcement learning training.
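Steps S201 and S202 can be sketched in Python using keyword matching, one of the semantic analysis options named above. The function name `build_preset_words`, the candidate phrase list, and the linear scaling of frequency to a maximum weight of 5 are assumptions for illustration; the application only states that higher frequency may correspond to a higher weight.

```python
from collections import Counter


def build_preset_words(texts, candidate_phrases, max_weight=5.0):
    """Count reflection-related phrases and map frequency to association weight.

    texts             : high-quality outputs from a reference language model
    candidate_phrases : phrases to look for (simple keyword matching)
    max_weight        : weight given to the most frequent phrase (assumed scale)
    """
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for phrase in candidate_phrases:
            counts[phrase] += lowered.count(phrase.lower())

    top = max(counts.values(), default=0)
    if top == 0:
        return {}
    # Linear scaling: the most frequent phrase receives max_weight.
    return {p: max_weight * c / top for p, c in counts.items() if c > 0}


if __name__ == "__main__":
    outputs = [
        "Wait, actually the sum is 7. Let me reconsider.",
        "Actually, my mistake: reconsider step 2. Actually fine.",
    ]
    phrases = ["actually", "reconsider", "my mistake", "in other words"]
    print(build_preset_words(outputs, phrases))
```

Phrases that never occur are left out of the preset word library, and the remaining phrases receive weights proportional to how often they appear.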

[0056] Continuing with step S102, in one embodiment provided in this application, clipping the first importance sampling weight of each target word based on the dynamic clipping threshold of that target word to determine its second importance sampling weight includes: S1021. For each target word, determine the weight threshold range of the target word based on the dynamic clipping threshold of the target word. S1022. Identify whether the first importance sampling weight of the target word is within its weight threshold range. S1023. If it is within the range, the first importance sampling weight of the target word is determined as the second importance sampling weight of the target word. S1024. If it is not within the range, the boundary weight closest to the first importance sampling weight is determined as the second importance sampling weight of the target word.

[0057] For step S1021, this step specifically includes: for each target word in the first inference result, obtaining the dynamic clipping threshold determined for that word, and determining the weight threshold range of the target word based on the obtained dynamic clipping threshold and its form.

[0058] Here, the dynamic clipping threshold can be a single value or an interval [min, max]. If the dynamic clipping threshold is given in numerical form, for example, the weight threshold range for the target word can be determined as [1-ε, 1+ε], where ε is the dynamic clipping threshold; if the dynamic clipping threshold is given in interval form, then this interval is directly used as the weight threshold range.

[0059] For step S1022, this step specifically includes: for each target word, comparing its first importance sampling weight calculated in step S101 with the weight threshold range determined in step S1021, and determining whether the weight value falls within the interval (including the endpoint value). If it does, then proceed to step S1023; if it does not, then proceed to step S1024.

[0060] For step S1023, the specific steps are as follows: if it is determined that the first importance sampling weight of the target word is within its weight threshold range, the first importance sampling weight is directly determined as the second importance sampling weight.

[0061] Here, when the first importance sampling weight is within the threshold range, it indicates that the weight value is within an acceptable update range and no clipping is required. Therefore, it is directly used as the second importance sampling weight for subsequent calculations.

[0062] For step S1024, the specific steps are as follows: if it is determined that the first importance sampling weight of the target word is outside its weight threshold range, the boundary weight closest to the first importance sampling weight is determined as the second importance sampling weight.

[0063] Here, when the first importance sampling weight falls outside the threshold range, it indicates that the weight value is too large or too small, and clipping is required. The clipping method is to take the boundary value of the weight threshold range that is closest to the first importance sampling weight as the second importance sampling weight for that target word. Specifically, if the first importance sampling weight is greater than the upper limit of the threshold range, the upper limit is taken as the second importance sampling weight; if the first importance sampling weight is less than the lower limit of the threshold range, the lower limit is taken as the second importance sampling weight.
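Steps S1021 to S1024 can be sketched in Python as follows. The function names `weight_threshold_range` and `clip_weight` are hypothetical; the numeric form maps to [1-ε, 1+ε] and the interval form is used directly, as described above.

```python
def weight_threshold_range(threshold):
    """S1021: derive the weight threshold range from the dynamic clipping threshold.

    A single number eps gives [1 - eps, 1 + eps]; an interval (min, max)
    is used directly as the range.
    """
    if isinstance(threshold, (tuple, list)):
        low, high = threshold
    else:
        low, high = 1.0 - threshold, 1.0 + threshold
    return low, high


def clip_weight(first_weight, threshold):
    """S1022 to S1024: return the second importance sampling weight."""
    low, high = weight_threshold_range(threshold)
    if low <= first_weight <= high:      # S1022 / S1023: inside (endpoints included)
        return first_weight
    # S1024: outside the range, take the nearest boundary.
    return high if first_weight > high else low


if __name__ == "__main__":
    for w in (0.1, 1.2, 2.0):
        print(w, "->", clip_weight(w, 0.5))
    print(3.0, "->", clip_weight(3.0, (0.0, 2.5)))
```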

[0064] Furthermore, the weight after clipping with the dynamic clipping threshold can be determined using the following formula:

$$\hat{w}_t = \mathrm{clip}\left(w_t,\; 1 - \varepsilon_{\mathrm{low}},\; 1 + \varepsilon_{\mathrm{high}}\right) \tag{1}$$

where $w_t$ is the first importance sampling weight before clipping, $\hat{w}_t$ is the second importance sampling weight after clipping, $\mathrm{clip}$ is the clipping function, $\varepsilon_{\mathrm{low}}$ is the lower dynamic clipping threshold determined by the above method, and $\varepsilon_{\mathrm{high}}$ is the upper dynamic clipping threshold determined by the above method. In actual operation, $\varepsilon_{\mathrm{low}}$ is usually set to a larger value so that there is effectively no lower limit on the importance sampling weight.

[0065] Step S103 specifically includes: constructing and calculating the objective function value of the current training step based on at least one of the second importance sampling weight of each target word obtained in step S102 and the first probability of each target word determined by the language model to be trained, combined with the reward value corresponding to the first inference result. For example, the objective function value can be obtained using the following formula:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{sg}\left(\hat{w}_t\right) \cdot R_t \cdot \log \pi_\theta\left(o_t\right) \tag{2}$$

where $\hat{w}_t$ is the second importance sampling weight of the $t$-th target word after clipping, $R_t$ is the reward value corresponding to that target word in the first inference result, $\pi_\theta(o_t)$ is the first probability of the target word $o_t$ determined by the language model to be trained, $T$ is the number of target words in the first inference result, $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation, and $J(\theta)$ is the objective function value obtained by normalizing (averaging) the function values of all target words in the first inference result.

[0066] In one embodiment provided in this application, determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result, includes: S10311. Determine the function value of each target word based on the product of the second importance sampling weight of each target word and the reward value corresponding to the first inference result; or, treat the second importance sampling weight of each target word as a coefficient, and determine the function value of each target word as the product of that coefficient, the logarithm of the first probability of each target word, and the reward value corresponding to the first inference result. For example, sg() in formula (2) denotes "stop gradient". The sg() function acts as a block in the computation graph: during forward propagation (calculating the loss), sg(x) is equal to x and participates in the calculation normally; during backpropagation (calculating the gradient), sg(x) is regarded as a constant, and no gradient is passed through sg() to its input. This is equivalent to treating the second importance sampling weight of each target word as a coefficient in the product described above.

[0067] S10312. Based on the mean of the function values of all target words in the first inference result, determine the current objective function value of the language model to be trained.

[0068] For step S10311, in which the function value of each target word is determined, at least one of the following two methods is used: Method 1: Based on the product of the second importance sampling weight and the reward value, specifically including: for each target word in the first inference result, multiply the second importance sampling weight determined in step S102 by the reward value corresponding to the entire first inference result to obtain the function value of the word.

[0069] Method 2: Based on the product of the coefficient derived from the second importance sampling weight, the logarithm of the word probability, and the reward value. This method involves: first, converting the second importance sampling weight of each target word into a coefficient, for example by using it directly as the coefficient or by applying a transformation (such as taking the logarithm or exponential scaling); then, calculating the logarithm of the first probability of each target word determined by the language model to be trained; finally, multiplying the coefficient, the logarithm, and the reward value to obtain the function value of that word.

[0070] Specifically, step S10312 includes: after calculating the function value of each target word in the first inference result, calculating the arithmetic mean (or weighted average, summation, etc.) of the function values of all words, and using the obtained mean as the objective function value of the language model to be trained for the training text data in the current training step.
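The two ways of computing the per-word function value and the averaging step S10312 can be sketched in plain Python. This sketch only computes the forward value; in an automatic-differentiation framework the coefficient in method 2 would additionally be wrapped in a stop-gradient operation so that no gradient flows through it. The function names and the single shared reward value are assumptions for illustration.

```python
import math


def function_value_method1(clipped_weight, reward):
    """Method 1: second importance sampling weight x reward."""
    return clipped_weight * reward


def function_value_method2(clipped_weight, token_prob, reward):
    """Method 2: coefficient x log(first probability) x reward.

    The clipped weight is used directly as the coefficient; in a real
    training framework it would be treated as a constant (stop gradient).
    """
    coefficient = clipped_weight
    return coefficient * math.log(token_prob) * reward


def objective(clipped_weights, token_probs, reward, method=2):
    """S10312: arithmetic mean of the per-word function values."""
    if method == 1:
        values = [function_value_method1(w, reward) for w in clipped_weights]
    else:
        values = [function_value_method2(w, p, reward)
                  for w, p in zip(clipped_weights, token_probs)]
    return sum(values) / len(values)


if __name__ == "__main__":
    weights = [1.0, 1.5]   # second importance sampling weights
    probs = [0.5, 0.25]    # first probabilities from the model being trained
    print(objective(weights, probs, reward=2.0, method=1))
    print(objective(weights, probs, reward=2.0, method=2))
```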

[0071] Furthermore, in another embodiment provided in this application, the method further includes: S10321. For each training text data, a parallel sampling method is used to obtain multiple first inference results determined by the language model to be trained inferring the training text data in the current training step. S10322. Based on the mean of the function values of all target words in the multiple first inference results, determine the current objective function value of the language model to be trained.

[0072] Step S10321 specifically includes: for each training text data, in the current training step, inputting the training text data into the language model to be trained. A parallel sampling method is used, enabling the model to generate multiple different output texts at once based on its current network parameters; these output texts are all used as the first inference result.

[0073] Step S10322 specifically includes: for each of the K first inference results obtained in step S10321, first calculating the function value of each target word in the inference result according to the aforementioned steps (such as S10311). Then, the function values of all target words in all K inference results are aggregated, and the arithmetic mean (or weighted average, summation, etc.) of these function values is calculated. The obtained mean is used as the objective function value of the language model to be trained for the training text data in the current training step.
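The aggregation in step S10322 can be sketched as follows, assuming the per-word function values of each of the K sampled results have already been computed. The function name `group_objective` is hypothetical.

```python
def group_objective(per_sample_values):
    """Pool the function values of all words in all K samples and average them.

    per_sample_values : list of K lists, one list of per-word function values
                        for each sampled first inference result
    """
    pooled = [value for sample in per_sample_values for value in sample]
    if not pooled:
        raise ValueError("no function values to aggregate")
    return sum(pooled) / len(pooled)


if __name__ == "__main__":
    # Three sampled outputs of different lengths for the same training text.
    samples = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
    print(group_objective(samples))
```

Because the mean is taken over the pooled words rather than per sample, longer outputs contribute more terms to the average than shorter ones.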

[0074] In addition, whether the function value of each target word contributes to the objective function value of each inference result can be controlled by a hyperparameter function set according to the magnitude of the first importance sampling weight. For example, as shown in formulas (3) and (4), if the first importance sampling weight of a word exceeds the clipping threshold, the word does not participate in gradient calculation and parameter update:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} M_t \cdot f_t \tag{3}$$

$$M_t = \begin{cases} 0, & \text{if } w_t \text{ is outside the weight threshold range} \\ \lambda, & \text{otherwise} \end{cases} \tag{4}$$

where $f_t$ is the function value of the $t$-th target word, $w_t$ is its first importance sampling weight, $T$ is the number of target words, and $M_t$ is the hyperparameter function set in the objective function, which determines whether the function value of each target word is included in the calculation of the objective function value based on whether its first importance sampling weight exceeds the clipping threshold; $\lambda$ is the coefficient applied otherwise (for example, 1).
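The exclusion rule described above can be sketched in Python: a word whose first importance sampling weight lies outside the clipping range receives a coefficient of 0 and therefore contributes nothing to the objective. The function names, the `keep` parameter, and the choice to divide by the total number of words (rather than only the retained ones) are assumptions for illustration.

```python
def mask_coefficient(first_weight, low, high, keep=1.0):
    """Return 0 when the weight is outside [low, high], else `keep`."""
    if first_weight < low or first_weight > high:
        return 0.0
    return keep


def masked_objective(first_weights, function_values, low, high, keep=1.0):
    """Average of coefficient x function value over all words.

    Assumption: the sum is divided by the total number of words, including
    the excluded ones.
    """
    total = 0.0
    for weight, value in zip(first_weights, function_values):
        total += mask_coefficient(weight, low, high, keep) * value
    return total / len(function_values)


if __name__ == "__main__":
    weights = [1.0, 2.0, 0.9]      # the second word exceeds the range [0.5, 1.5]
    values = [3.0, 100.0, 6.0]
    print(masked_objective(weights, values, low=0.5, high=1.5))
```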

[0075] Furthermore, in another embodiment provided in this application, the method further includes: S10331. Determine the lexical coefficient of the function value of each target word based on at least one of the following: current model training steps, model performance, reward value of the first inference result, and dynamic clipping threshold of each target word.

[0076] S10332. Based on the average of the function values of all target words, weighted by the corresponding lexical coefficients, in the first inference result, determine the current objective function value of the language model to be trained.

[0077] The lexical coefficient is, for example, the coefficient defined in formulas (3) and (4). Specifically, step S10331 includes: for each target word in the first inference result, determining the lexical coefficient of its function value based on at least one of the following factors. Current model training steps: in the early stage of training, a larger lexical coefficient can be set, for example by setting the coefficient corresponding to "otherwise" in formula (4) to 1 to encourage exploration; in the later stage of training, the lexical coefficient can be reduced, for example by setting the coefficient corresponding to "otherwise" in formula (4) to 0.5 to ensure training stability.

[0078] Model performance: The distribution of lexical coefficients is dynamically adjusted based on the model's perplexity or reward value on the validation set. When performance is good, the lexical coefficients can be reduced to ensure training stability; when performance is poor, the lexical coefficients can be increased to encourage exploration.

[0079] The reward value of the first inference result: For outputs with higher reward values, the coefficient of the word can be appropriately increased so that the model references the better output strategy more. For outputs with lower reward values, the coefficient can be decreased so that the model reduces the reference influence of the worse output strategy.

[0080] Dynamic clipping threshold for each target lexical unit: For example, lexical units with larger clipping thresholds can be assigned higher coefficients to encourage adjustment; lexical units with smaller clipping thresholds can be assigned lower coefficients to maintain stability.

[0081] Specifically, step S10332 may include: after calculating the function value of each target word in the first inference result, using the lexical coefficient determined for each word in step S10331 to perform a weighted average of the function values of all words, thereby obtaining the objective function value of the language model to be trained for the training text data in the current training step. For example, the calculation method shown in formula (3) can be used.
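Steps S10331 and S10332 can be sketched as below, using only the training-stage factor (coefficient 1 early, 0.5 late, as in the example for formula (4)). The function names, the halfway switch point, and dividing by the number of words instead of the sum of coefficients are assumptions for illustration.

```python
def stage_coefficient(step, total_steps, early=1.0, late=0.5, switch=0.5):
    """S10331 (training-step factor only): larger coefficient early in training.

    `switch` is the assumed fraction of training after which the later-stage
    coefficient applies.
    """
    return early if step < switch * total_steps else late


def weighted_objective(function_values, coefficients):
    """S10332: coefficient-weighted average of the per-word function values.

    Assumption: the weighted sum is divided by the number of words.
    """
    if len(function_values) != len(coefficients):
        raise ValueError("one coefficient per function value is required")
    total = sum(c * f for c, f in zip(coefficients, function_values))
    return total / len(function_values)


if __name__ == "__main__":
    values = [2.0, 4.0]
    early = [stage_coefficient(10, 100)] * len(values)
    late = [stage_coefficient(90, 100)] * len(values)
    print(weighted_objective(values, early))
    print(weighted_objective(values, late))
```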

[0082] Furthermore, in another embodiment provided in this application, the method further includes: determining the length difference between the first inference result and the second inference result; if the difference is greater than or equal to a preset number of tokens, and / or the ratio of the difference to the first inference result or the second inference result is greater than or equal to a preset ratio, then at least one of the following is applied to the objective function: applying a penalty term, reducing the dynamic clipping threshold, discarding the objective function value corresponding to the first inference result, and setting a weight less than 1 to calculate the gradient value of the objective function value corresponding to the first inference result.

[0083] Specifically, this embodiment includes: determining the length difference between the first inference result and the second inference result; determining whether the length difference exceeds a preset threshold; and if so, performing an exception handling operation.

[0084] Here, the number of lexical units contained in the first inference result obtained in the current training step and the second inference result obtained in the previous training step are calculated, and the absolute value of the difference between the two is obtained to obtain the length difference.

[0085] The exception handling operation may specifically include: in response to the exception determination, performing at least one of the following operations to suppress unstable updates: 1. Apply a penalty term to the objective function: Add a penalty term to the original objective function, such as a KL divergence penalty based on the length difference or an additional loss term proportional to the length difference ΔL, so that the model tends to suppress policy changes that lead to drastic length changes when updating.

[0086] 2. Reduce the dynamic clipping threshold: Temporarily reduce the dynamic clipping threshold originally used in step S102, for example by multiplying it by a coefficient less than 1, thereby more strictly limiting the range of importance sampling weights and reducing the update step size.

[0087] 3. Discard the objective function value corresponding to the first inference result: Ignore this training sample in the current step and do not use it for gradient calculation, that is, skip this update.

[0088] 4. Set a weight less than 1 for the objective function value corresponding to the first inference result to calculate the gradient value: When calculating the gradient, assign a lower weight to the objective function value to reduce its impact on the model parameter update.

[0089] In this way, when an abnormal change in the length of the generated sequence is detected, constraint measures are actively taken to prevent the model from deviating from the optimization direction due to a single large fluctuation, thereby improving the stability and convergence of the training process.
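The length check and the four handling options can be sketched in Python as follows. The function names, the default limits (256 tokens, ratio 0.5), the choice of the previous result as the ratio reference, and the assumption that the objective is maximized (so the penalty is subtracted) are all illustrative assumptions.

```python
def is_length_anomaly(len_first, len_second, max_tokens=256, max_ratio=0.5):
    """Flag an abnormal length change between the current and previous results."""
    diff = abs(len_first - len_second)
    # Assumption: the ratio is taken against the previous (second) result.
    reference = max(len_second, 1)
    return diff >= max_tokens or diff / reference >= max_ratio


def apply_handling(objective_value, eps, diff, action,
                   penalty_coef=0.01, shrink=0.5, down_weight=0.5):
    """Return (objective_value or None, clipping threshold) after one handling action."""
    if action == "penalty":
        # Objective assumed to be maximized, so the penalty is subtracted.
        return objective_value - penalty_coef * diff, eps
    if action == "shrink_threshold":
        return objective_value, eps * shrink
    if action == "discard":
        return None, eps          # skip this sample in the current step
    if action == "down_weight":
        return objective_value * down_weight, eps
    raise ValueError(f"unknown action: {action}")


if __name__ == "__main__":
    first_len, second_len = 400, 100
    if is_length_anomaly(first_len, second_len):
        diff = abs(first_len - second_len)
        print(apply_handling(2.0, 0.4, diff, "shrink_threshold"))
```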

[0090] Step S104 specifically includes: calculating the gradient value of each network parameter in the language model to be trained using the backpropagation algorithm based on the objective function value calculated in step S103. Then, updating the network parameters of the language model to be trained using the calculated gradient values and a preset optimizer. Steps S101 to S104 are repeated until the model converges or reaches a preset number of training epochs, ultimately resulting in a trained target text language model.

[0091] In this way, the model trained by this method can better meet the reward objectives set in the reinforcement learning stage when generating text.

[0092] Furthermore, in another embodiment provided in this application, after updating the parameters of the language model to be trained according to the gradient value, the training method further includes: determining whether iteratively updating the language model to be trained using the current text data to be trained has reached a preset termination condition; wherein the preset termination condition includes at least one of the following: whether the number of training iterations has reached a threshold, or whether the performance requirements are met; if so, obtaining the next text data to be trained to continue iteratively updating the language model to be trained until the update termination condition is reached; if not, continuing to iteratively update the language model to be trained using the current text data to be trained until the threshold is reached.

[0093] Based on the same inventive concept, this application also provides a training device corresponding to the training method. Since the principle of the device in this application is similar to the training method described above, the implementation of the device can refer to the implementation of the method, and the repeated parts will not be described again.

[0094] Figure 2 is a schematic diagram of the structure of a reinforcement learning training device for a text language model provided in an embodiment of this application. As shown in Figure 2, the training device 300 includes: The acquisition module 310, used to acquire, for each training text data, a first inference result determined by the language model to be trained inferring the training text data in the current training step, and, based on the first inference result and the second inference result determined in the previous training step, determine the first importance sampling weight of each target word included in the first inference result. The cutting module 320, used to clip the first importance sampling weight of each target word based on the dynamic clipping threshold of the target word, and determine the second importance sampling weight of the target word. The determination module 330, used to determine the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result. The update module 340, used to calculate the gradient value based on the objective function value, and update the parameters of the language model to be trained according to the gradient value to obtain the target text language model.

[0095] Optionally, the training device 300 is further configured to determine the dynamic clipping threshold of the target word through the following steps: The dynamic clipping threshold of the target word is determined based on at least one of the following: the training step of the current model to be trained, the gradient value corresponding to at least one previous training step, and the model performance; or, for each target word, the semantic association degree between the target word and each preset word is identified, where the preset words are words related to the model's reflective behavior, and the dynamic clipping threshold of the target word is determined based on the semantic association between the target word and each preset word, as well as the association weight of each preset word; or, the dynamic clipping threshold of the target word is determined based on the semantic association between the target word and each preset word, the association weight of each preset word, and the training state of the current model to be trained.

[0096] Optionally, the training device is further configured to determine the preset words through the following steps: Semantic and statistical analysis is performed on multiple high-quality outputs from at least one reference language model to determine multiple words and phrases related to model reflection and the frequency of each word and phrase. The words and phrases related to model reflection are used as preset words, and the relevance weight of each preset word is determined according to the frequency of each word and phrase.

[0097] Optionally, when determining the first importance sampling weight of each target word included in the first inference result based on the first inference result and the second inference result determined in the previous training step, the acquisition module 310 is used to: Determine whether the length of the first inference result is the same as the length of the second inference result; If they are consistent, then based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result, the first importance sampling weight of each target word included in the first inference result is determined. If they are inconsistent, the first importance sampling weight of each target word in the first inference result with the same length as the second inference result is determined based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result. The first probability of each target word in the part of the first inference result that is longer than the second inference result is used as the first importance sampling weight of each target word.
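The length handling performed by the acquisition module can be sketched in Python as follows: positions covered by both results use the probability ratio, and words in the part of the first result that extends beyond the second result use their first probability directly. The function name `first_importance_weights` is hypothetical, and the sketch assumes position-wise alignment and nonzero second probabilities.

```python
def first_importance_weights(first_probs, second_probs):
    """Compute the first importance sampling weight of each target word.

    first_probs  : first probabilities of the words in the current result
    second_probs : second probabilities of the words in the previous result
                   (assumed nonzero and aligned by position)
    """
    weights = []
    for position, prob in enumerate(first_probs):
        if position < len(second_probs):
            # Overlapping part: ratio of first probability to second probability.
            weights.append(prob / second_probs[position])
        else:
            # Part longer than the second result: use the first probability itself.
            weights.append(prob)
    return weights


if __name__ == "__main__":
    current = [0.5, 0.2, 0.9]   # three words in the first inference result
    previous = [0.25, 0.4]      # two words in the second inference result
    print(first_importance_weights(current, previous))
```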

[0098] Optionally, the training device 300 is further used for: determining the length difference between the first inference result and the second inference result; if the difference is greater than or equal to a preset number of tokens, and / or the ratio of the difference to the length of the first inference result or the second inference result is greater than or equal to a preset ratio, then at least one of the following is applied to the objective function: applying a penalty term, reducing the dynamic clipping threshold, discarding the objective function value corresponding to the first inference result, or setting a weight less than 1 to calculate the gradient value of the objective function value corresponding to the first inference result.

[0099] Optionally, when the cutting module 320 is used to clip the first importance sampling weight of each target word based on the dynamic clipping threshold of the target word to determine the second importance sampling weight of the target word, the cutting module 320 is used to: for each target word, determine the weight threshold range of the target word based on the dynamic clipping threshold of the target word; identify whether the first importance sampling weight of the target word is within its weight threshold range; if it is within the range, determine the first importance sampling weight of the target word as the second importance sampling weight of the target word; if it is not within the range, determine the boundary weight closest to the first importance sampling weight as the second importance sampling weight of the target word.

[0100] Optionally, when determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result, the determination module 330 is used to: determine the function value of each target word by multiplying its second importance sampling weight by the reward value corresponding to the first inference result; or, treat the second importance sampling weight of each target word as a coefficient, and determine the function value of each target word by multiplying that coefficient, the logarithm of the first probability of each target word, and the reward value corresponding to the first inference result; and, based on the mean of the function values of all target words in the first inference result, determine the current objective function value of the language model to be trained.

[0101] Optionally, the training device 300 is further used for: for each training text data, using a parallel sampling method to obtain multiple first inference results determined by the language model to be trained inferring the training text data in the current training step; and, based on the mean of the function values of all target words in the multiple first inference results, determining the current objective function value of the language model to be trained.

[0102] Optionally, the training device 300 is further used to: determine the word coefficient of the function value of each target word based on at least one of the following: the current training step of the model, the model performance, the reward value of the first inference result, and the dynamic clipping threshold of each target word; and determine the current objective function value of the language model to be trained based on the mean of the function values of all target words in the first inference result, each scaled by its corresponding word coefficient.
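The coefficient-scaled mean can be sketched as below. The paragraph does not fix how the coefficients enter the average, so this sketch assumes each function value is multiplied by its word coefficient and the products are averaged with equal weight; the function name is hypothetical, and how the coefficients themselves are derived is left outside the sketch.

```python
def coefficient_weighted_objective(function_values, word_coefficients):
    """Mean of per-word function values, each scaled by its word coefficient."""
    if not function_values or len(function_values) != len(word_coefficients):
        raise ValueError("inputs must be non-empty and of equal length")
    scaled = [c * v for c, v in zip(word_coefficients, function_values)]
    return sum(scaled) / len(scaled)
```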

[0103] Referring to Figure 3, which is a schematic structural diagram of an electronic device provided in an embodiment of this application, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.

[0104] The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 is running, the processor 410 communicates with the memory 420 via the bus 430. When executed by the processor 410, the machine-readable instructions perform the steps of the method embodiment shown in Figure 1. For the specific implementation, refer to the method embodiment; details are not repeated here.

[0105] This application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the method embodiment shown in Figure 1. For the specific implementation, refer to the method embodiment; details are not repeated here.

[0106] Those skilled in the art will understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

[0107] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation there may be other ways of dividing them. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Additionally, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be electrical, mechanical, or of other forms.

[0108] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

[0109] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0110] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part of it that contributes over the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0111] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, used to illustrate its technical solutions and not to limit them, and the scope of protection of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in this application, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A reinforcement learning training method for a text language model, characterized in that the training method includes: for each training text data, obtaining the first inference result determined by the language model to be trained in the current training step when performing inference on the training text data, and determining, based on the first inference result and the second inference result determined in the previous training step, the first importance sampling weight of each target word included in the first inference result; for each target word, clipping the first importance sampling weight of the target word based on the dynamic clipping threshold of the target word to determine the second importance sampling weight of the target word; determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result; and calculating a gradient value based on the objective function value, and updating the parameters of the language model to be trained according to the gradient value to obtain the target text language model.

2. The training method according to claim 1, characterized in that determining the dynamic clipping threshold of the target word includes: determining the dynamic clipping threshold of the target word based on at least one of the following: the training step of the current model to be trained, the gradient value corresponding to at least one previous training step, and the model performance; or, for each target word, identifying the degree of semantic association between the target word and each preset word, wherein the preset words are words related to the model's reflective behavior, and determining the dynamic clipping threshold of the target word based on the degree of semantic association between the target word and each preset word and the association weight of each preset word; or determining the dynamic clipping threshold of the target word based on the degree of semantic association between the target word and each preset word, the association weight of each preset word, and the training state of the current model to be trained.

3. The training method according to claim 2, characterized in that the steps for determining the preset words include: performing semantic and statistical analysis on multiple high-quality outputs from at least one reference language model to determine multiple words and phrases related to model reflection and the frequency of each such word or phrase; and using the words and phrases related to model reflection as the preset words, and determining the association weight of each preset word according to the frequency of the corresponding word or phrase.

4. The training method according to claim 1, characterized in that determining the first importance sampling weight of each target word included in the first inference result, based on the first inference result and the second inference result determined in the previous training step, includes: determining whether the length of the first inference result is the same as the length of the second inference result; if the lengths are the same, determining the first importance sampling weight of each target word included in the first inference result based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result; if the lengths are different, determining the first importance sampling weight of each target word in the portion of the first inference result that has the same length as the second inference result based on the ratio of the first probability of each target word in the first inference result to the second probability of each target word in the second inference result, and using the first probability of each target word in the portion of the first inference result that exceeds the length of the second inference result as the first importance sampling weight of that target word.

5. The training method according to claim 4, characterized in that the method further includes: determining the length difference between the first inference result and the second inference result; and, if the difference is greater than or equal to a preset number of tokens, and/or the ratio of the difference to the length of the first inference result or of the second inference result is greater than or equal to a preset ratio, performing at least one of the following: applying a penalty term to the objective function, reducing the dynamic clipping threshold, discarding the objective function value corresponding to the first inference result, or applying a weight less than 1 when calculating the gradient value from the objective function value corresponding to the first inference result.

6. The training method according to claim 1, characterized in that, for each target word, clipping the first importance sampling weight of the target word based on the dynamic clipping threshold of the target word to determine the second importance sampling weight of the target word includes: for each target word, determining the weight threshold range of the target word based on the dynamic clipping threshold of the target word; identifying whether the first importance sampling weight of the target word is within its weight threshold range; if it is within the range, determining the first importance sampling weight of the target word as the second importance sampling weight of the target word; if it is outside the range, determining the boundary weight closest to the first importance sampling weight as the second importance sampling weight of the target word.

7. The training method according to claim 1, characterized in that determining the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result, includes: determining the function value of each target word by multiplying its second importance sampling weight by the reward value corresponding to the first inference result; or converting the second importance sampling weight of each target word into a coefficient, and determining the function value of each target word by multiplying that coefficient, the logarithmic result determined based on the first probability of the target word, and the reward value corresponding to the first inference result; and determining the current objective function value of the language model to be trained based on the mean of the function values of all target words in the first inference result.

8. The training method according to claim 7, characterized in that the method further includes: for each training text data, using a parallel sampling method to obtain multiple first inference results determined by the language model to be trained in the current training step when performing inference on the training text data; and determining the current objective function value of the language model to be trained based on the mean of the function values of all target words in the multiple first inference results.

9. The training method according to claim 7, characterized in that the method further includes: determining the word coefficient of the function value of each target word based on at least one of the following: the current training step of the model, the model performance, the reward value of the first inference result, and the dynamic clipping threshold of each target word; and determining the current objective function value of the language model to be trained based on the mean of the function values of all target words in the first inference result, each scaled by its corresponding word coefficient.

10. A reinforcement learning training device for a text language model, characterized in that the training device includes: an acquisition module, used to obtain, for each training text data, the first inference result determined by the language model to be trained in the current training step when performing inference on the training text data, and to determine, based on the first inference result and the second inference result determined in the previous training step, the first importance sampling weight of each target word included in the first inference result; a clipping module, used to clip, for each target word, the first importance sampling weight of the target word based on the dynamic clipping threshold of the target word to determine the second importance sampling weight of the target word; a determining module, used to determine the current objective function value of the language model to be trained based on at least one of the second importance sampling weight of each target word and the first probability of each target word determined by the language model to be trained, and the reward value corresponding to the first inference result; and an update module, used to calculate a gradient value based on the objective function value, and to update the parameters of the language model to be trained according to the gradient value to obtain the target text language model.