Training method and device of dialogue model, equipment, storage medium and program product

By dynamically adjusting reward parameters and constraint thresholds, the method addresses the insufficient stability of data generated by large language models across different business scenarios. This improves the stability and controllability of dialogue model training and supports effective convergence during training.

CN122242631A · Pending · Publication Date: 2026-06-19 · KINGDEE SOFTWARE(CHINA) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KINGDEE SOFTWARE(CHINA) CO LTD
Filing Date
2026-03-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, fixed reward functions become less applicable when business scenarios differ, so the data generated by large language models is insufficiently stable and reliable across business scenarios.

Method used

The training progress of the initial dialogue model is acquired, and the types of reward parameters and their constraint thresholds are dynamically adjusted accordingly, so that the reward judgment criteria change with the training process. Early training is lenient and later training is strict; this progressive learning strategy enhances the effectiveness of the policy gradient and the stability of training.

Benefits of technology

It significantly improves the stability and controllability of dialogue model training, ensures the discriminative power of reward values, enhances the model's performance across various reward parameters, and achieves more stable and efficient training convergence.



Abstract

This application relates to a training method, apparatus, computer device, computer-readable storage medium, and computer program product for a dialogue model. The method includes: obtaining the training progress of an initial dialogue model; determining a reward parameter corresponding to the training progress and a constraint threshold for the reward parameter; the number of reward parameter types is positively correlated with the training progress; the constraint threshold for the reward parameter is positively correlated with the training progress; determining the reward value of the initial dialogue model for the reward parameter; and optimizing the parameters of the initial dialogue model based on the reward value and the constraint threshold to obtain a dialogue model. This method can improve training stability and reliability.

Description

Technical Field

[0001] This application relates to the field of dialogue model training, and in particular to a method, apparatus, computer device, computer-readable storage medium, and computer program product for training a dialogue model.

Background Technology

[0002] In the application scenario of generating structured queries from natural language, existing reinforcement learning training schemes for large language models generate candidate expressions from natural language input and then iteratively optimize the policy model with a reward signal through a reinforcement learning algorithm.

[0003] In traditional techniques, the reward function and its reward parameters are typically determined before training begins and remain unchanged throughout reinforcement learning training, so the reward criteria follow the same set of fixed rules from start to finish. However, when business scenarios differ, the applicability of these fixed rules decreases, leading to poor stability and insufficient reliability of the data generated by the model.

Summary of the Invention

[0004] Therefore, it is necessary to provide a training method, apparatus, computer equipment, computer-readable storage medium, and computer program product for dialogue models that can improve training stability and reliability in response to the above-mentioned technical problems.

[0005] This application provides a method for training a dialogue model, including:

[0006] Obtain the training progress of the initial dialogue model;

[0007] Determine the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter; the number of reward parameter types is positively correlated with the training progress; the constraint threshold of the reward parameter is positively correlated with the training progress.

[0008] Determine the reward value of the initial dialogue model in the reward parameter;

[0009] The parameters of the initial dialogue model are optimized based on the reward value and the constraint threshold to obtain the dialogue model.

[0010] This application also provides a training device for a dialogue model, comprising:

[0011] The progress acquisition module is used to obtain the training progress of the initial dialogue model;

[0012] A constraint module is used to determine the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter; the number of reward parameter types is positively correlated with the training progress; the constraint threshold of the reward parameter is positively correlated with the training progress.

[0013] The reward value acquisition module is used to determine the reward value of the initial dialogue model in the reward parameter;

[0014] The model optimization module is used to optimize the parameters of the initial dialogue model based on the reward value and the constraint threshold to obtain the dialogue model.

[0015] This application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the training steps of the dialogue model in any of the above embodiments.

[0016] This application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program thereon, which, when executed by a processor, implements the training steps of the dialogue model in any of the above embodiments.

[0017] This application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the training steps of the dialogue model in any of the above embodiments.

[0018] In the above training method, apparatus, computer device, computer-readable storage medium, and computer program product for a dialogue model, the reward parameters and their constraint thresholds are determined based on the training progress of the initial dialogue model. The reward judgment criteria therefore change dynamically with the training process, and the model faces different numbers of reward categories in the early and later stages of training. This significantly improves the discriminative power of the reward values, thereby enhancing the effectiveness of the policy gradient and making training more stable and easier to converge. It also allows different types of reward parameters to have differentiated learning rhythms, so that the dialogue model performs well across all reward parameters. Raising the constraint thresholds of the reward parameters as training proceeds maintains the discriminative power of the reward signals throughout training. This scheduling method keeps the reward parameters lenient in the early stages of training and strictly constrained in the later stages, thus achieving progressive learning and significantly improving the stability and controllability of reinforcement learning training.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a diagram illustrating the application environment of a training method for a dialogue model in one embodiment;

[0021] Figure 2 is a flowchart illustrating the training method of a dialogue model in one embodiment;

[0022] Figure 3 is a flowchart illustrating the process of determining reward parameters and constraint thresholds in one embodiment;

[0023] Figure 4 is a flowchart illustrating the training method for the dialogue model in another embodiment;

[0024] Figure 5 is a schematic diagram of the structure and process of a training device for a dialogue model in one embodiment;

[0025] Figure 6 is a schematic diagram illustrating the change in the constraint threshold in one embodiment;

[0026] Figure 7 is a structural block diagram of a training device for a dialogue model in another embodiment;

[0027] Figure 8 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0028] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0029] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various types of data, but these data are not limited by these terms. These terms are only used to distinguish between the first type of data and the second type of data. The term "multiple" used in this application refers to two or more.

[0030] The training method for the dialogue model provided in the embodiments of this application can be applied to the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process. The data storage system can be integrated onto server 104, or it can be located in the cloud or on another network server.

[0031] The terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle systems, and projection devices. Portable wearable devices can include smartwatches, smart bracelets, and head-mounted displays. Head-mounted displays can be virtual reality (VR) devices, augmented reality (AR) devices, and smart glasses. The server 104 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.

[0032] In one exemplary embodiment, as shown in Figure 2, a training method for a dialogue model is provided. Taking the application of the method to server 104 in Figure 1 as an example, the method includes the following steps 202 to 208:

[0033] Step 202: Obtain the training progress of the initial dialogue model.

[0034] The initial dialogue model is a trainable intelligent model that supports natural language interaction. Users pose business questions to the initial dialogue model through dialogue, and the model automatically parses the natural language requirements into a Domain-Specific Language (DSL) that can execute data queries and ultimately returns structured analysis results, eliminating the need to write queries manually. The domain-specific language is a structured interaction language used by the initial dialogue model in a specific business domain; it can be a SQL-like structured query language designed for enterprise data analysis scenarios.

[0035] Optionally, the initial dialogue model can be used to obtain a conversational business intelligence system (ChatBI). In this case, the objective of reinforcement learning training is for the initial dialogue model to acquire semantic recognition capabilities based on domain knowledge, so that it can identify domain-specific semantics such as business dimensions, metrics, time conditions, filtering logic, and aggregation granularity. Business problem categories include, but are not limited to, cost analysis, dimensional comparison, and metric queries. In the conversational business intelligence system, the domain-specific language is a structured query statement used to express queries in a specific business domain. Specific business domains include, but are not limited to, finance, operational analysis, and manufacturing, and each business domain can be further refined into specific business scenarios.

[0036] Training progress is a control variable used during parameter tuning of the initial dialogue model. It schedules the reward parameters used in training the initial dialogue model, dynamically determining the types of reward parameters and their constraint thresholds. Training progress includes, but is not limited to, the training steps, training epochs, number of parameter updates, predefined training phase numbers, or external control signals of the initial dialogue model. Among these, training steps have a finer granularity, which helps to adjust the types of reward parameters more precisely, while training phase numbers have a coarser granularity, which helps to control the types of reward parameters in stages.

[0037] In some embodiments, the number of training steps can be determined from the batches of samples processed by the initial dialogue model. Alternatively, the step count can be incremented after each forward and backward propagation of the model, and the accumulated number of training steps represents the training progress. A training round (epoch) is completed each time all training data in the dataset has undergone one forward and backward propagation, and the accumulated number of completed training rounds can likewise represent the training progress.
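The step and round counting described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the class name `ProgressTracker`, its fields, and the normalization of progress to the range [0, 1] are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class ProgressTracker:
    """Counts training steps and rounds; all names here are hypothetical."""
    batches_per_epoch: int
    total_steps: int
    step: int = 0

    def on_batch_done(self) -> None:
        # One forward and backward propagation over a batch counts as one step.
        self.step += 1

    @property
    def epoch(self) -> int:
        # A full pass over the dataset completes one training round.
        return self.step // self.batches_per_epoch

    @property
    def progress(self) -> float:
        # Normalized progress in [0, 1]; the normalization is an assumption.
        return min(self.step / self.total_steps, 1.0)
```

Either `step`, `epoch`, or `progress` could then serve as the control variable that drives the reward schedule.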

[0038] Step 204: Determine the reward parameters and constraint thresholds corresponding to the training progress; the number of reward parameter types is positively correlated with the training progress; the constraint thresholds of the reward parameters are also positively correlated with the training progress.

[0039] Reward parameters are the evaluation criteria used in the model training process and represent the aspects of the generated data that need to be adjusted. The types of reward parameters indicate which reward parameters are used at a given point in training, so the set of reward parameters can be adapted to each training stage of the initial dialogue model. The number of types is positively correlated with the training progress: training initially focuses on a few types of reward parameters and later covers more types, so the model is first trained on core reward parameters and then gradually extended to non-core reward parameters.

[0040] The constraint threshold is a criterion for evaluating the reward value of a reward parameter. It determines whether the reward value is valid, thus assessing the performance of the initial dialogue model. The constraint threshold gradually increases from the initial threshold as training progresses. Because the constraint threshold is positively correlated with training progress, for the same reward parameter, within a certain range, the earlier the training progress, the smaller the constraint threshold, and the more relaxed the training process; conversely, the later the training progress, the larger the constraint threshold, and the more rigorous the training process, thus creating different training intensities.
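One possible shape for a threshold that rises from an initial value as training progresses is a linear ramp. The sketch below assumes progress is normalized to [0, 1]; the function name and the default values `initial=0.2` and `final=0.9` are illustrative assumptions, and the source does not prescribe a linear form.

```python
def constraint_threshold(progress: float, initial: float = 0.2,
                         final: float = 0.9) -> float:
    """Linear ramp from `initial` to `final` (assumed values, assumed shape)."""
    # Clamp progress so the threshold never leaves the [initial, final] range.
    p = min(max(progress, 0.0), 1.0)
    return initial + (final - initial) * p
```

Any monotonically non-decreasing function of progress would satisfy the stated positive correlation equally well.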

[0041] The reward parameter provides a quality evaluation criterion, while the constraint threshold represents the numerical degree of that criterion at each stage of training. Dynamically adjusting the constraint threshold allows for adjustments to the penalty intensity, which can be made according to a stage-specific constraint strategy through module gating of the reward parameter.

[0042] Since the number of reward parameters is positively correlated with the training progress, differentiated constraints are imposed on the learning pace of different reward parameters to prevent some modules from being severely penalized too early before they have been learned, while others may be treated leniently for a long time, thus ensuring that the dialogue model performs well in each reward parameter.

[0043] In some embodiments, the reward parameters for each training stage are determined based on the training stage corresponding to the training progress; the number of categories corresponding to each training stage is positively correlated with the order of the training stages from first to last.

[0044] In an exemplary embodiment, when the training progress is in the first stage, a first reward parameter is determined to train the initial dialogue model using the first reward parameter; when the training progress is in the second stage, the first reward parameter and a second reward parameter are determined to train the initial dialogue model using both the first reward parameter and the second reward parameter; wherein the first stage and the second stage are training stages arranged in chronological order, and the ranges of the first reward parameter and the second reward parameter are different. Thus, the staged scheduling of the reward constraint strength gradually adjusts from the training stage of core constraints to the training stage of non-core elements, allowing the model to obtain a larger exploration space in the early stages of training and achieving fine alignment in the later stages of training.
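The two-stage example above can be sketched as a cumulative lookup. The stage boundary of 0.5 and the parameter names are placeholders chosen for illustration only.

```python
from typing import List, Tuple

# (stage start as a fraction of training, parameters added at that stage).
# The 0.5 boundary and the names are assumptions, not values from the source.
STAGES: List[Tuple[float, List[str]]] = [
    (0.0, ["first_reward_parameter"]),
    (0.5, ["second_reward_parameter"]),
]


def active_reward_parameters(progress: float) -> List[str]:
    """Return every reward parameter whose stage has already started."""
    active: List[str] = []
    for start, added in STAGES:
        if progress >= start:
            active.extend(added)
    return active
```

Because later stages only add parameters, the number of active types never decreases as progress grows.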

[0045] In some embodiments, the way the constraint threshold grows can be determined by multiple scheduling control parameters corresponding to the reward parameters. The constraint threshold of each reward parameter is determined separately, forming an asynchronous scheduling scheme across multiple modules. That is, each type of reward parameter can have its own scheduling control parameters, such as a threshold starting point and an adjustment speed, so that the constraint thresholds of different reward parameters grow according to their own constraint change processes.

[0046] In some embodiments, each reward parameter has an independent semantic module. Multiple semantic modules are designed with independent scheduling functions and scheduling control parameters. Based on their semantic complexity and learning difficulty, different adjustment speeds and activation times are used during training. This avoids the over-penalty problem caused by simultaneous tightening of all modules, allowing the model to gradually master the capabilities of each module in an order more consistent with the semantic structure of the DSL, thereby improving overall learning efficiency. Simultaneously, this process employs a multi-module asynchronous scheduling mechanism, avoiding erroneous coupling between different semantic modules caused by simultaneous activation of reward constraints, ensuring that the learning difficulty of a single module does not prematurely drag down the overall training effect.
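A minimal sketch of such multi-module asynchronous scheduling is given below. It assumes normalized progress in [0, 1] and a linear ramp per module; the class `ModuleSchedule`, the module names, and every numeric value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ModuleSchedule:
    start: float        # threshold starting point
    speed: float        # threshold increase per unit of progress
    activate_at: float  # progress at which the module's constraint switches on
    cap: float = 1.0    # upper bound on the threshold

    def threshold(self, progress: float) -> Optional[float]:
        if progress < self.activate_at:
            return None  # module is not yet constrained
        return min(self.start + self.speed * (progress - self.activate_at), self.cap)


# Assumed values: structure and time tighten early, metric and filter later.
SCHEDULES: Dict[str, ModuleSchedule] = {
    "structure": ModuleSchedule(start=0.3, speed=0.5, activate_at=0.0),
    "time": ModuleSchedule(start=0.2, speed=0.5, activate_at=0.0),
    "metric": ModuleSchedule(start=0.1, speed=1.0, activate_at=0.4),
    "filter": ModuleSchedule(start=0.1, speed=1.0, activate_at=0.6),
}


def thresholds_at(progress: float) -> Dict[str, float]:
    """Thresholds of all modules that are active at the given progress."""
    active: Dict[str, float] = {}
    for name, schedule in SCHEDULES.items():
        value = schedule.threshold(progress)
        if value is not None:
            active[name] = value
    return active
```

Giving each module its own activation time and speed is what keeps a hard-to-learn module from being penalized before easier modules have stabilized.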

[0047] In some embodiments, different reward parameters correspond to different semantic modules. The order in which module constraints are activated can be controlled by setting priorities for different semantic modules, without explicitly relying on specific mathematical scheduling functions, thereby updating the constraint thresholds. Alternatively, multiple semantic modules can be divided into several groups, with the same scheduling strategy applied to modules within the same group, and asynchronous scheduling used between different groups.
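Priority-based activation without a mathematical scheduling function might look like the sketch below. The priority values and module names are assumptions; modules that share a priority form a group that is activated together.

```python
from typing import Dict, List

# Lower number = activated earlier. Values and names are hypothetical.
PRIORITY: Dict[str, int] = {"structure": 0, "time": 0, "metric": 1, "filter": 2}


def activated_modules(stage_index: int) -> List[str]:
    """Modules whose priority has been reached at the given stage index."""
    return sorted(name for name, prio in PRIORITY.items() if prio <= stage_index)
```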

[0048] Step 206: Based on the reward parameters, evaluate the data generated by the initial dialogue model at this training progress to obtain the reward value.

[0049] The reward value is the evaluation result of the initial dialogue model at the aforementioned training progress. Reward values are organized by type of reward parameter, so a reward value can be determined for each reward parameter at that training progress. The reward value is the immediate feedback generated from the execution result after the initial dialogue model performs the task corresponding to the reward parameter. The data generated by the initial dialogue model at this training progress is the model's output at that point in training: the answer the model predicts for an input question using its own model mechanism.

[0050] In some embodiments, the reward value is obtained by evaluating the output of the initial dialogue model with a scorer or validator. The reward value can represent the quality of the generated domain-specific language at the training progress. It can be a per-module score of the output or an inspection result, for example an inspection of time, metrics, filtering, or structure.

[0051] In some embodiments, a similarity score is calculated between the sentence structure generated by the initial dialogue model and a reference sentence structure, and a reward value positively correlated with this similarity score is then determined. The score can also be a measure such as accuracy, similarity, or the inverse of the loss value.
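As a rough illustration of a similarity-based reward, the sketch below compares whitespace-separated tokens with Python's standard `difflib.SequenceMatcher`. The choice of token-level matching and the function name `similarity_reward` are assumptions; the source does not specify how similarity is computed.

```python
from difflib import SequenceMatcher


def similarity_reward(generated: str, reference: str) -> float:
    """Token-level similarity in [0, 1]; a higher similarity gives a higher reward."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    # ratio() returns 2 * matches / (total tokens in both sequences).
    return SequenceMatcher(None, gen_tokens, ref_tokens).ratio()
```

A production scorer would more likely compare parsed query structures than raw tokens, but the monotonic relationship between similarity and reward is the same.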

[0052] Step 208: Optimize the parameters of the initial dialogue model based on the reward value and the constraint threshold to obtain the dialogue model.

[0053] The dialogue model is a trained model where both the reward value and the constraint threshold satisfy the training stop condition. Because the number and types of reward parameters increase, the scope of model training can be gradually expanded. Since the constraint threshold is dynamically changing, the standards or penalty intensity can be gradually tightened, thus making the reward judgment criteria dynamically change with the training process. This explicitly models the relationship between reward parameters and training progress, enabling the model to maintain effective discrimination of reward signals throughout the training process, even when facing different intensities of reward constraints in the early and later stages of training. This significantly improves the stability and controllability of reinforcement learning training.

[0054] In some embodiments, if the reward value of a reward parameter is less than or equal to the constraint threshold of the same reward parameter, the parameters of the initial dialogue model are adjusted with the goal of increasing the reward value, yielding an initial dialogue model with adjusted parameters. When the reward value of each reward parameter is greater than the constraint threshold of the same reward parameter, the trained dialogue model is determined from the initial dialogue model with the adjusted parameters.
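The comparison logic above can be sketched as a loop that keeps updating until every reward exceeds its threshold. The function `toy_update` is only a stand-in for a real policy update and simply nudges rewards upward; all names and values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Rewards = Dict[str, float]


def unmet(rewards: Rewards, thresholds: Rewards) -> List[str]:
    """Reward parameters whose reward is less than or equal to the threshold."""
    return [name for name, thr in thresholds.items() if rewards[name] <= thr]


def optimize_until_met(rewards: Rewards, thresholds: Rewards,
                       update: Callable[[Rewards, List[str]], Rewards],
                       max_iters: int = 100) -> Tuple[Rewards, int]:
    """Apply `update` until all rewards exceed their thresholds."""
    for iteration in range(max_iters):
        pending = unmet(rewards, thresholds)
        if not pending:
            return rewards, iteration
        rewards = update(rewards, pending)
    return rewards, max_iters


def toy_update(rewards: Rewards, pending: List[str]) -> Rewards:
    # Stand-in for a policy-gradient step: raise only the pending rewards.
    return {n: (v + 0.1 if n in pending else v) for n, v in rewards.items()}
```

In real training the thresholds themselves would also move with progress; they are held fixed here to keep the comparison step visible.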

[0055] In the training method of the aforementioned dialogue model, the reward parameters and their constraint thresholds are determined based on the training progress of the initial dialogue model. The reward judgment criteria therefore change dynamically with the training process, and the model faces different numbers of reward categories in the early and later stages of training, which preserves the discriminative power of the reward values. This, in turn, enhances the effectiveness of the policy gradient, making training more stable and easier to converge. It also allows different reward parameters to have differentiated learning paces, so that the dialogue model performs well across all reward parameters. Raising the constraint thresholds of the reward parameters as training proceeds maintains the discriminative power of the reward signals throughout training. This scheduling method keeps the reward parameters lenient in the early stages of training and strictly constrained in the later stages, thus achieving progressive learning and significantly improving the stability and controllability of reinforcement learning training.

[0056] In some embodiments, as shown in Figure 3, determining the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter includes steps 302 and 304, wherein:

[0057] Step 302: When the training progress corresponds to the structure training stage, determine the structure parameters and determine the constraint threshold of the structure parameters based on the training progress; the structure parameters are used to train the sentence structure of the initial dialogue model generated data so that the sentence structure matches the business scenario.

[0058] The structure training phase is the stage for optimizing the sentence structure of the initial dialogue model-generated data. The structure training phase can be derived through a numerical mapping of the training progress; for example, it can be determined based on the interval in which the training progress falls. During the structure training phase, the reward parameters may only include structure parameters. Matching the sentence structure to the business scenario means that the sentence structure of the initial dialogue model-generated data conforms to the sentence standards of that business scenario.

[0059] Structural parameters are used to instruct the training of statement structure. When using structural parameters, the training objective is to optimize the statement structure. Statement structure ensures that the data generated by the initial dialogue model is parsable, grammatically compliant, and structurally complete, so that the data output by the initial dialogue model conforms to distribution requirements. Statement structure indicates that the data generated by the initial dialogue model and the business scenario have a matching structural form to ensure reasonable distribution. The business scenario is the context in which the dialogue model is used, forming a domain-specific language within that scenario. The constraint thresholds of the structural parameters are the constraint standards for the statement structure. Each type of structural parameter has its own constraint threshold, and each constraint threshold has an independent adjustment method.

[0060] In some embodiments, the structural parameters include structure and syntactic validity parameters (S) and may also include temporal semantic parameters (T). Applying constraint thresholds to both S and T during training ensures that the sentence structure of the data generated by the initial dialogue model matches the business scenario. The structure and syntactic validity parameters guarantee that the generated data is parsable, grammatically compliant, and structurally complete. They ensure that the form of the sentence structure is correct, but not its business semantics; in effect, they establish an executable shell. They determine whether the generated data can be understood and executed by the system, which facilitates subsequent verification and execution. The temporal semantic parameters control whether the time slices in the sentence structure are reasonable, so that an effective result distribution can be produced. They represent the main-axis or slice conditions of common queries in various business scenarios; incorrect temporal parameters often lead to empty query results and indeterminate aggregation granularity. Therefore, S and T can be constrained first, so that the generated data is at least parsable, executable, or stably evaluable and its time slices are basically reasonable. The reward distribution then changes from almost all zeros to a graded, hierarchical one, which ensures that all samples receive distinguishable positive feedback and keeps the computation stable during training.
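To illustrate how a structure check can yield graded rather than all-or-nothing rewards, the sketch below awards partial credit for simple structural properties of a SQL-like statement. The four checks are illustrative assumptions only; the source does not list the actual validity rules, and a real validator would parse the DSL.

```python
import re


def structure_score(dsl: str) -> float:
    """Fraction of simple structural checks passed (checks are assumptions)."""
    text = dsl.strip()
    checks = [
        bool(re.match(r"(?i)^select\b", text)),      # starts with a SELECT clause
        bool(re.search(r"(?i)\bfrom\s+\w+", text)),  # names a data source
        text.count("(") == text.count(")"),          # parentheses are balanced
        text.count("'") % 2 == 0,                    # string quotes are closed
    ]
    return sum(checks) / len(checks)
```

Because a nearly correct statement scores higher than a badly malformed one, samples that would all receive zero under a strict pass/fail rule become distinguishable.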

[0061] Step 304: When the training progress corresponds to the semantic training stage, determine the structural parameters and semantic parameters, and determine the constraint thresholds of the structural parameters and semantic parameters based on the training progress; the semantic parameters are used to train the sentence semantics of the initial dialogue model generated data so that the sentence semantics match the business scenario; the structural training stage is the training stage that precedes the semantic training stage.

[0062] The semantic training phase is the phase in which the semantics of the data generated by the initial dialogue model are optimized. Because the structural training phase precedes it, the semantic training phase involves not only structural parameters but also semantic parameters. The semantic training phase can be derived through a numerical mapping of the training progress. For example, it can be determined from the interval in which the training progress falls, or it can be determined as the phase that follows the structural training phase according to a preset phase order. In the semantic training phase, the reward parameters may include only structural and semantic parameters. Since previous training phases already used structural parameters as reward parameters, the adjustment of the semantic parameters in this phase is relatively large. Sentence semantics represent the content output by the initial dialogue model.

[0063] In some embodiments, during the structural training phase, the initial dialogue model is trained using structural parameters, so that in the early training phase it can fully explore reasonable domain-specific language and avoid the problem of receiving no usable feedback. In the later training phase, the model is gradually aligned with strict business semantics so that it outputs statement semantics appropriate to the corresponding business scenarios.

[0064] In this embodiment, phased training is first performed with the structural parameters to avoid abnormal sentence structures caused by grammatical errors, inconsistent time fields, and inconsistent time ranges. This ensures that the dynamic thresholds during subsequent training are traceable, allows the initial dialogue model to receive distinguishable positive and negative feedback, and keeps training stable. Once the sentence structure score is deemed adequate for the training progress, semantic training is performed so that the model generates professional dialogue relevant to the business scenario. Thus, in the early stages of training, when the policy model has not yet mastered the basic structure of the domain-specific language, most generated results can still receive differentiated reward values. This significantly enhances the discriminative power of the reward values and keeps the policy gradient effective, resulting in stable training and easier convergence.

[0065] In some embodiments, the semantic training phase includes a statement semantic training phase and a scenario semantic training phase following the statement semantic training phase; the semantic parameters include statement semantic parameters and scenario semantic parameters; the statement semantic parameters are used to train the statement semantic content of the data generated by the initial dialogue model so that the statement semantic content matches the business scenario; the scenario semantic parameters are used to train the data used by the business scenario in the filtering process generated by the initial dialogue model.

[0066] The sentence semantic training phase is the content optimization phase of the data generated by the initial dialogue model. The sentence semantic training phase can be obtained by numerical mapping of the training progress, or it can be determined according to the preset phase order. For example, the sentence semantic training phase can be determined according to the interval of the training progress.

[0067] During the sentence semantic training phase, the reward parameters may consist only of structural parameters and sentence semantic parameters. Sentence semantic parameters are used to evaluate the semantic content of the data generated by the initial dialogue model. Matching sentence semantic content with the business scenario means that the sentence semantic content of the generated data conforms to the content standards of that business scenario and can form correct information. The match can be established when the similarity between the sentence semantic content and the content samples corresponding to the business scenario is greater than a preset value, or it can be determined by a neural network model used for evaluation.

[0068] The scene semantic training phase is the optimization phase for content filtering of the data generated by the initial dialogue model. The scene semantic training phase can be obtained by numerical mapping of the training progress, or it can be determined according to the preset phase order. For example, the scene semantic training phase can be determined according to the interval of the training progress.

[0069] The filtering process involves selecting data from multiple perspectives. Due to the large number of combinations of these perspectives and the complex nesting of data logic, the data evaluation criteria during the scenario semantic training phase are quite complex. Therefore, after training with structural parameters and statement semantic parameters, the data is then processed in the scenario semantic training phase to ensure that the corresponding samples receive distinguishable positive and negative feedback. Matching the scenario semantic content with the business scenario means that the various data used in the filtering process can express the various details within that business scenario, thus achieving semantic alignment at the level of detail.

[0070] In some embodiments, the filtering process logic includes, but is not limited to: nested multi-condition AND / OR, numerical thresholds, ranges, enumerated IN, and strong coupling between conditions and dimensions / metrics. When these kinds of logic are used, training the initial dialogue model is difficult and noisy in the early stages, making it hard to obtain correct feedback; however, by the scene semantic training phase, the modules corresponding to the structural parameters and sentence semantic parameters are already close to high scores, so the adjustment process is relatively stable.

[0071] In some embodiments, the filtering process logic includes complex screening conditions, business scope limitations, and combinations of multiple conditions; complex screening conditions include the organization, product, customer, channel, contract, and status of the business scenario; business scope limitations include, but are not limited to, only "received", only "not invoiced", and only "approved"; combinations of multiple conditions include, but are not limited to, "financing time in Q1 and actual disbursement in March and status = disbursed".
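As an illustration of why such filtering logic is hard to score early in training, the nested AND / OR conditions described above can be sketched as a small tree that is evaluated against a record. This is a minimal sketch only: the class names, operator names, and field names (`financing_quarter`, `disbursement_month`, `status`) are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass(frozen=True)
class Condition:
    """A single comparison on one column (all names are illustrative)."""
    column: str
    op: str  # one of the keys in OPS
    value: Any


@dataclass(frozen=True)
class Group:
    """A nested AND / OR combination of conditions or other groups."""
    logic: str  # "AND" or "OR"
    children: Tuple[Union["Condition", "Group"], ...]


# Supported operators: equality, enumerated IN, and range bounds.
OPS = {
    "eq": lambda a, b: a == b,
    "in": lambda a, b: a in b,
    "ge": lambda a, b: a >= b,
    "le": lambda a, b: a <= b,
}


def evaluate(node: Union[Condition, Group], record: dict) -> bool:
    """Recursively evaluate a filter tree against one record."""
    if isinstance(node, Condition):
        return OPS[node.op](record[node.column], node.value)
    if node.logic not in ("AND", "OR"):
        raise ValueError(f"unknown logic: {node.logic}")
    results = [evaluate(child, record) for child in node.children]
    return all(results) if node.logic == "AND" else any(results)


# "financing time in Q1 and actual disbursement in March and status = disbursed"
example_filter = Group("AND", (
    Condition("financing_quarter", "eq", "Q1"),
    Condition("disbursement_month", "eq", 3),
    Condition("status", "in", ("disbursed",)),
))
```

Even this three-condition filter has many ways to be partially wrong, which is consistent with deferring its reward until structure and sentence semantics are already stable.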

[0072] When the training progress corresponds to the semantic training phase, structural parameters and semantic parameters are determined, and constraint thresholds for structural parameters and semantic parameters are determined based on the training progress. This includes: when the training progress corresponds to the statement semantic training phase, determining structural parameters and statement semantic parameters, and determining constraint thresholds for structural parameters and statement semantic parameters based on the training progress; and when the training progress corresponds to the scene semantic training phase, determining structural parameters, statement semantic parameters, and scene semantic parameters, and determining constraint thresholds for structural parameters, statement semantic parameters, and scene semantic parameters based on the training progress.
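The phase-dependent selection of reward parameters described above can be sketched as follows. This is a minimal sketch under stated assumptions: the progress is taken to be a fraction in [0, 1], and the phase boundaries (0.3 and 0.7) and parameter names are hypothetical values chosen for illustration.

```python
from enum import Enum


class Phase(Enum):
    STRUCTURAL = 1
    STATEMENT_SEMANTIC = 2
    SCENE_SEMANTIC = 3


# Hypothetical upper bounds of each phase, as a fraction of total training.
PHASE_BOUNDARIES = [
    (0.3, Phase.STRUCTURAL),
    (0.7, Phase.STATEMENT_SEMANTIC),
    (1.0, Phase.SCENE_SEMANTIC),
]

# Each later phase keeps the earlier reward parameters and adds one more,
# so the number of active types never decreases as training advances.
ACTIVE_PARAMETERS = {
    Phase.STRUCTURAL: ["structural"],
    Phase.STATEMENT_SEMANTIC: ["structural", "statement_semantic"],
    Phase.SCENE_SEMANTIC: ["structural", "statement_semantic", "scene_semantic"],
}


def phase_for_progress(progress: float) -> Phase:
    """Map a progress fraction to a training phase by interval lookup."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be within [0, 1]")
    for upper_bound, phase in PHASE_BOUNDARIES:
        if progress <= upper_bound:
            return phase
    return Phase.SCENE_SEMANTIC


def active_reward_parameters(progress: float) -> list:
    """Return the reward parameters that take effect at this progress."""
    return list(ACTIVE_PARAMETERS[phase_for_progress(progress)])
```

A threshold for each returned parameter would then be computed separately from the same progress value.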

[0073] In some embodiments, the statement semantic parameters include entity / dimension matching parameters (E), metric matching parameters (M), and granularity parameters (G), each with its own constraint threshold; the structural parameters include temporal semantic parameters (T). The entity / dimension matching parameters indicate by which information dimension to query, the metric matching parameters indicate which metrics to query, the granularity parameters indicate the granularity of the data to be queried, and the temporal semantic parameters indicate the time period, thus controlling the initial dialogue model to output an executable structure containing these four parameters.

[0074] In some embodiments, sentence semantic parameters are used to train the semantic alignment capability of the initial dialogue model, and the semantic alignment capability is used to determine key semantics. For example, in Natural Language to Domain-Specific Language (NL2DSL) for a certain business scenario, the natural language intent is decomposed into four aspects: metrics, dimensions, granularity, and time. It can be an executable structure of M (what metrics to look up) + E (by what dimension) + G (by what granularity) + T (at what time). By using the reward values and constraint thresholds of E / M / G / T under the same reward parameters, it is possible to ensure that the semantics of the sentences generated by the initial dialogue model match the semantics of the corresponding business scenario after the semantic training phase.
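The M + E + G + T decomposition can be pictured as a small record with one field per module, compared field by field against a reference. This is a minimal sketch: the class name, field names, and the use of exact string equality as the per-module check are assumptions for illustration, not the scoring rule of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryIntent:
    metric: str       # M: what metric to look up
    dimension: str    # E: by what entity / dimension
    granularity: str  # G: at what aggregation granularity
    time_range: str   # T: over what time period


def module_matches(predicted: QueryIntent, reference: QueryIntent) -> dict:
    """Compare each module independently so each can be rewarded separately."""
    return {
        "M": predicted.metric == reference.metric,
        "E": predicted.dimension == reference.dimension,
        "G": predicted.granularity == reference.granularity,
        "T": predicted.time_range == reference.time_range,
    }


reference = QueryIntent("profit", "organization", "month", "2025-01..2025-02")
predicted = QueryIntent("profit", "organization", "quarter", "2025-01..2025-02")
matches = module_matches(predicted, reference)  # only the G module differs
```

Keeping the four results separate is what allows each module to have its own reward value and constraint threshold.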

[0075] In some embodiments, based on the statement semantic parameters, the scene semantic training phase involves filtering data to optimize the model using the constraint thresholds of the filter logic (F), so that the granularity of the data generated by the initial dialogue model is highly aligned with the business scenario.

[0076] In some embodiments, given that the structural parameters and sentence semantic parameters are determined, the initial dialogue model can be optimized based on the reward values of the initial model in the structural parameters and the reward values of the initial model in the sentence semantic parameters. Then, the initial dialogue model can be optimized based on the reward values of the structural parameters and the constraint thresholds, and the reward values of the sentence semantic parameters and the constraint thresholds, respectively. Since the structural parameters are used for training first, the initial dialogue model is mainly optimized based on the sentence semantic parameters during the sentence semantic training stage.

[0077] In some embodiments, when structural parameters, statement semantic parameters, and scene semantic parameters are determined, since the structural parameters and statement semantic parameters are used for training first, the initial dialogue model is mainly optimized based on filtering logic during the scene semantic training stage.

[0078] In this embodiment, the initial dialogue model is first trained using structural parameters to ensure that its output data has the basic linguistic framework of the business scenario. Then, key semantics are aligned using sentence semantic parameters to form data with correct meaning. Subsequently, due to the large number of content combinations and the complex nesting of data logic in the filtering process, once the basic framework and basic semantics are determined, corresponding constraint thresholds are dynamically adjusted to avoid problems such as difficulty in convergence and high noise in the early stages of training.

[0079] In some embodiments, as shown in Figure 4, determining the reward parameters and their constraint thresholds based on the training progress includes:

[0080] Step 402: Determine the reward parameters based on the training progress, and obtain the initial threshold, key progress nodes, and adjustment speed of the reward parameters.

[0081] The steps for determining the reward parameters can be referred to steps 302 to 304 and the corresponding embodiments. After determining the reward parameters through step 302 or step 304, the subsequent steps of step 402 can be performed using each reward parameter. That is, the reward parameters in step 402 can be structural parameters from the structural training phase, or structural and grammatical validity parameters, or temporal semantic parameters; the reward parameters in step 402 can also be semantic parameters from the semantic training phase, or statement semantic parameters and scene semantic parameters, and can be further refined into corresponding parameters.

[0082] The initial threshold is the minimum constraint threshold for the reward parameter and the starting point for the constraint threshold to increase. Key progress nodes are preset progresses used to control the degree of change in the constraint threshold, allowing for more precise control over the rate of change of the constraint threshold in conjunction with adjustment speed.

[0083] The adjustment speed refers to the rate at which the constraint threshold changes. It can be weighted and controlled by the progress difference between the training progress and key progress nodes to form a multi-dimensional tightening trend of the constraint threshold. With the training progress and key progress nodes remaining constant, a larger adjustment speed results in a faster increase in the constraint threshold, a greater rate of increase in the standard for stopping training of the reward value of that reward parameter, and a faster tightening trend as the initial dialogue model becomes more stringent under that reward parameter. Conversely, a smaller adjustment speed results in a slower increase in the constraint threshold, a smaller rate of increase in the standard for stopping training of the reward value of that reward parameter, and a slower tightening trend as the initial dialogue model becomes more stringent under that reward parameter.

[0084] The initial threshold, key progress nodes, and adjustment speed all correspond to the types of reward parameters, and each reward parameter has its own corresponding initial threshold, key progress nodes, and adjustment speed.

[0085] In some embodiments, as the number of reward parameter types representing training progress increases, the reward parameters for the newly added types are obtained; based on the reward parameters for the newly added types, the initial threshold, key progress nodes, and adjustment speed are obtained. Thus, as the number of reward parameter types increases, three types of data are obtained for the corresponding reward parameter types to efficiently adjust the constraint threshold.

[0086] Step 404: Determine the progress difference between the training progress and the key progress nodes.

[0087] The progress difference is a value used as a reference to a key progress node; it represents the numerical difference between the training progress and the key progress node. Since the key progress node is determined by the type of reward parameter, the progress difference varies depending on the reward parameter. When using the progress difference, the training progress and the rate of threshold increase exhibit a non-linear, positively correlated trend.

[0088] In some embodiments, when both the training progress and key progress nodes are training steps, the progress difference is the difference in training steps; when both the training progress and key progress nodes are training rounds, the progress difference is the difference in training rounds.

[0089] Step 406: Determine the threshold increment based on the adjustment speed and the progress difference.

[0090] The threshold increment is the amount by which the initial threshold is increased. Since the progress difference is a relative value, the threshold increment exhibits a non-linear growth trend at the same adjustment rate, in order to better match the changing trend of the reward parameter, thus forming a gradual increase for that reward parameter.

[0091] In some embodiments, the threshold increment is determined based on a combination of the adjustment speed and the progress difference. For example, the absolute value of the progress difference can be determined first, and the threshold increment can be determined based on the product of the absolute value and the adjustment speed; alternatively, the progress difference can be non-linearly processed according to a threshold related to the business scenario to obtain the threshold increment; non-linear processing can be implemented using a smoothing function.

[0092] Step 408: Adjust the initial threshold based on the threshold increment to obtain the constraint threshold of the reward parameter.

[0093] In some embodiments, the constraint threshold can be obtained based on the sum of the threshold increment and the initial threshold, or it can be obtained based on the product of the threshold increment and the initial threshold.
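Steps 404 to 408 can be sketched in a few lines. This is a minimal sketch under stated assumptions: a logistic sigmoid is used as the smoothing function, the increment is added to (rather than multiplied with) the initial threshold, and the `scale` argument and all numeric values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def constraint_threshold(progress: float, initial_threshold: float,
                         key_node: float, speed: float, scale: float) -> float:
    """Compute one reward parameter's constraint threshold at a given progress."""
    # Step 404: signed distance between current progress and the key node.
    progress_difference = progress - key_node
    # Step 406: non-linear increment driven by speed and progress difference.
    increment = scale * sigmoid(speed * progress_difference)
    # Step 408: raise the initial threshold by the increment (additive variant).
    return initial_threshold + increment


# Example with progress measured in training steps (illustrative values only).
early = constraint_threshold(100, 0.2, 500, 0.01, 0.6)
at_node = constraint_threshold(500, 0.2, 500, 0.01, 0.6)
late = constraint_threshold(900, 0.2, 500, 0.01, 0.6)
```

With this choice the threshold changes slowly far from the key node and fastest around it, which gives the loose-early, strict-late behavior the steps aim for.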

[0094] In some embodiments, when the constraint threshold of the first reward parameter is less than the upper limit of the constraint threshold of the target reward parameter, the initial threshold of the first reward parameter is adjusted based on the threshold increment to obtain the constraint threshold of the first reward parameter under the training progress; when the constraint threshold of the first reward parameter is greater than or equal to the upper limit of the constraint threshold of the target reward parameter, the initial threshold of the second reward parameter is adjusted based on the threshold increment to obtain the constraint threshold of the second reward parameter under the training progress, until the constraint threshold of each reward parameter reaches its respective upper limit. The first reward parameter is the parameter whose constraint threshold is adjusted before the second reward parameter. For example, if the first reward parameter is a structural parameter, the second reward parameter is a semantic parameter; if the first reward parameter is a statement semantic parameter, the second reward parameter is a scene semantic parameter.
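The sequential tightening described above, where the next reward parameter is only tightened once the previous one has reached its upper limit, can be sketched as follows. This is a simplified sketch: it applies a fixed additive increment to the current threshold, and the function name, parameter names, and values are hypothetical.

```python
def tighten_next(thresholds: dict, upper_limits: dict,
                 order: list, increment: float) -> dict:
    """Tighten the first parameter in `order` that is still below its limit."""
    updated = dict(thresholds)
    for name in order:
        if updated[name] < upper_limits[name]:
            # Clip so the threshold never exceeds its own upper limit.
            updated[name] = min(updated[name] + increment, upper_limits[name])
            break
    return updated


order = ["structural", "semantic"]
limits = {"structural": 0.9, "semantic": 0.9}
state = {"structural": 0.8, "semantic": 0.3}

state = tighten_next(state, limits, order, 0.2)  # structural reaches its limit
state = tighten_next(state, limits, order, 0.2)  # semantic starts tightening
```

Once every parameter has reached its limit, further calls leave the thresholds unchanged.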

[0095] In an exemplary embodiment, optionally, the initial threshold, key progress nodes, and adjustment speed are three parameters within the scheduling function, which adjusts the corresponding constraint thresholds through steps 404, 406, and 408. Steps 402-408 can be represented by the following expression:

[0096] $$\tau_k(s) = \tau_k^{0} + a \cdot \sigma\big(\beta_k \, (s - s_k)\big)$$

[0097] where $\tau_k(s)$ is the constraint threshold for the k-th reward parameter; $\tau_k^{0}$ is the initial threshold for the k-th reward parameter; $a$ is the parameter used for adjustment, which can be a constant or a dynamic value; $\sigma(\cdot)$ represents a smoothing function; $\beta_k$ is the adjustment speed; $s$ represents the training progress for the k-th reward parameter; $s_k$ is the key progress node for the k-th reward parameter; and $s - s_k$ is the progress difference at that training progress.

[0098] In some embodiments, the module-level reward scheduling function adopts a smooth sigmoid form to achieve a continuous transition of the reward threshold with the number of training steps. In another embodiment, the reward scheduling function may adopt, but is not limited to, a piecewise linear schedule, an exponential or power schedule, or a stage-constant (step-wise) schedule.

[0099] Piecewise linear functions refer to dividing the training process into several intervals and using a linearly varying threshold scheduling method within different intervals. For example, maintaining a fixed, lenient threshold in the early stages of training, linearly increasing it in the middle stages, and maintaining a strict threshold in the later stages. This method is simple to implement and suitable for application scenarios where the training phases are clearly defined.

[0100] Exponential or power function scheduling refers to using exponential or power function forms to accelerate the tightening of reward constraints in the later stages of training, while changing them more slowly in the early stages of training, in order to achieve non-linear scheduling characteristics different from Sigmoid.

[0101] A stage constant function refers to a function that directly switches to a preset threshold at different training stages, without requiring continuous changes. This approach is suitable for scenarios where training stages are explicitly defined by external strategies or human experience.
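The three alternative schedules can be sketched side by side. This is a minimal sketch assuming the progress is a fraction in [0, 1]; the function names, the default exponent, and the example stage table are hypothetical.

```python
def piecewise_linear(progress: float, low: float, high: float,
                     start: float, end: float) -> float:
    """Hold `low` before `start`, rise linearly, then hold `high` after `end`."""
    if progress <= start:
        return low
    if progress >= end:
        return high
    return low + (high - low) * (progress - start) / (end - start)


def power_schedule(progress: float, low: float, high: float,
                   exponent: float = 3.0) -> float:
    """Change slowly early and tighten faster late (for exponent > 1)."""
    clipped = min(max(progress, 0.0), 1.0)
    return low + (high - low) * clipped ** exponent


def stage_constant(progress: float, stages: list) -> float:
    """Switch directly to a preset threshold at each stage boundary.

    `stages` is a list of (start_progress, threshold) sorted by start.
    """
    current = stages[0][1]
    for start, threshold in stages:
        if progress >= start:
            current = threshold
    return current


stages = [(0.0, 0.2), (0.4, 0.5), (0.8, 0.9)]
```

All three share the same interface idea, progress in and threshold out, so a scheduler could swap one for another per reward parameter.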

[0102] In this embodiment, the progress difference between the training progress and the key progress node is used to make the growth relationship between the training progress and the constraint threshold change non-linearly. This allows for the construction of a progressive constraint threshold adjustment method for the reward parameter, resulting in a non-linear change in the reward constraint being loose in the early stage of training and strict in the later stage of training.

[0103] In some embodiments, obtaining the initial threshold, key progress nodes, and adjustment speed of the reward parameters includes: obtaining the model score of the initial dialogue model during the pre-training phase of the reward parameters; determining the initial threshold positively correlated with the model score; determining the key progress nodes negatively correlated with the model score; and determining the adjustment speed positively correlated with the model score.

[0104] The pre-training phase is a parameter tuning phase for dynamically adjusting constraint thresholds. The pre-training phase for reward parameters is not used to adjust the initial threshold, key progress nodes, or adjustment speed of that reward parameter. Pre-training phases can be categorized according to the type of reward parameter; from a categorical perspective, the adjustment process for one reward parameter can be the pre-training phase for a subsequent parameter. Pre-training phases can also be categorized according to the training phase; from a training phase perspective, one training phase can be the pre-training phase for another subsequent training phase, and separate pre-training phases can exist between different training phases.

[0105] In some embodiments, the pre-training phase of the structural training phase is the training phase before the structural training phase. The pre-training phase of the semantic training phase can be the structural training phase or the training phase between the semantic training phase and the structural training phase. The semantic training phase includes the statement semantic training phase and the scene semantic training phase after the statement semantic training phase. In this case, the pre-training phase of the statement semantic training phase can be the structural training phase or the training phase between the statement semantic training phase and the structural training phase. The pre-training phase of the scene semantic training phase can be the statement semantic training phase, the structural training phase, or a separate phase.

[0106] The model score represents the initial dialogue model's capability during the pre-training phase for that particular reward parameter, reflecting the difficulty of training the initial dialogue model. The model score can be the reward value of the initial dialogue model for that reward parameter, or it can be a separate evaluation system. Optionally, the model score can be determined based on the average score curve from the pre-training phase.

[0107] In some embodiments, the model score can reflect the model training difficulty of the business scenario. Based on the model training difficulty, the correspondence between the training progress and the number of reward parameter types, as well as the correspondence between the training progress and the constraint threshold, are determined, thereby obtaining the initial threshold, key progress nodes, and adjustment speed of the reward parameter. In this way, how easy or difficult it is for the initial dialogue model to satisfy this reward parameter is determined from semantic complexity and learning difficulty.

[0108] Semantic complexity represents the inherent complexity of a task, while learning difficulty is a relative attribute of how difficult it is for a model to learn it. Learning difficulty dynamically changes with data, model, or training strategy. Given data, model, hints, and training methods, learning difficulty determines how long the model needs to train and how slow its convergence speed is to achieve a high score. It depends not only on semantic complexity but also on: the number and depth of rules (e.g., the number of filtering conditions, nesting levels), semantic dependencies (indicator caliber depends on dimensions / time / organizational level), the size of the constraint space (the number of optional fields / enumerated values), the coverage of training data (whether there are enough samples), the degree of ambiguity of synonyms / calibers (whether they are prone to misbinding), the noise and discriminability of the reward signal (whether it can provide stable feedback), and the model's capabilities (parameter size / proficiency in structured generation).

[0109] In some embodiments, both the initial threshold and the adjustment speed are positively correlated with the model score. That is, when the model score is low, the initial threshold and the adjustment speed are small, so the constraint threshold of the reward parameter rises to its upper limit slowly, which gives the initial dialogue model enough time to achieve a good adjustment effect on that reward parameter; conversely, when the model score is high, the constraint threshold rises to its upper limit quickly, which gives the initial dialogue model a faster adjustment speed on that reward parameter while keeping a moderate adjustment effect.

[0110] In some embodiments, the progress nodes are negatively correlated with the model score. That is, when the model score is low, delaying the progress node where the constraint threshold changes drastically allows for a longer adjustment phase to the upper limit of the constraint threshold for that reward parameter, thus ensuring that the initial dialogue model adjusts that reward parameter more effectively. Conversely, when the model score is high, advancing the progress node where the constraint threshold changes drastically allows for a shorter adjustment phase to the upper limit of the constraint threshold for that reward parameter, thus ensuring that the initial dialogue model adjusts that reward parameter more quickly and effectively.

[0111] In this embodiment, the model score from the pre-training phase is used to pre-assess the training difficulty of the initial dialogue model under this reward parameter, allowing for more flexible determination of the corresponding adjustment parameters and clarifying the evaluation criteria for the initial dialogue model. The initial threshold and adjustment speed are both positively correlated with the model score, which controls how fast the constraint threshold of this reward parameter rises to its upper limit. The key progress nodes are negatively correlated with the model score, so a low-scoring model stays longer in the stage where the constraint threshold is looser than its upper limit, thus ensuring that the initial dialogue model performs well in adjusting this reward parameter.
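The correlations described in this embodiment can be sketched as a simple mapping from model score to the three schedule parameters. This is a minimal sketch: the linear form, the assumption that the score lies in [0, 1], and every coefficient are hypothetical; only the signs of the correlations follow the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduleParams:
    initial_threshold: float
    key_node: float  # as a fraction of total training progress
    speed: float


def params_from_model_score(model_score: float) -> ScheduleParams:
    """Derive schedule parameters from a pre-training model score."""
    score = min(max(model_score, 0.0), 1.0)
    return ScheduleParams(
        initial_threshold=0.1 + 0.4 * score,  # positively correlated
        key_node=0.8 - 0.5 * score,           # negatively correlated
        speed=5.0 + 10.0 * score,             # positively correlated
    )


weak = params_from_model_score(0.2)
strong = params_from_model_score(0.9)
```

A weaker model therefore starts from a looser threshold, tightens later, and tightens more gradually than a stronger one.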

[0112] In some embodiments, determining the threshold increment based on the adjustment speed and the progress difference includes: obtaining the upper limit of the constraint threshold of the reward parameter, and determining the threshold difference between the initial threshold and the upper limit of the constraint threshold; the upper limit of the constraint threshold is positively correlated with the risk of the business scenario; and performing non-linear processing on the threshold difference based on the progress difference to obtain the threshold increment.

[0113] The upper limit of the constraint threshold is a preset maximum value related to the business scenario. Before the constraint threshold reaches the upper limit, the threshold increment can continue to be calculated so that the constraint threshold keeps increasing. The initial threshold is less than the upper limit of the constraint threshold, and the initial threshold can be the minimum constraint threshold.

[0114] The threshold difference is the difference between the initial threshold and the upper limit of the constraint threshold. The threshold difference is a parameter used for adjustment, which is dynamically adjusted according to the risk of the business scenario. When the initial threshold is positively correlated with the model score, the threshold difference is also dynamically adjusted based on the difficulty of model training and the risk of the business scenario.

[0115] In some embodiments, the threshold increment can be obtained based on the product of the progress difference and the threshold difference.

[0116] In this embodiment, a risk setting limit value is set based on the business scenario risk, and a corresponding threshold difference is determined so that the risk of the business scenario is reflected in the threshold difference, thereby controlling the degree of nonlinear change of the threshold increment.
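How the business risk can bound the schedule is sketched below. This is a minimal sketch under stated assumptions: the risk is a number in [0, 1], the mapping from risk to upper limit is linear with hypothetical coefficients, and a logistic sigmoid of the scaled progress difference serves as the non-linear processing.

```python
import math


def upper_limit_for_risk(risk: float) -> float:
    """Higher-risk scenarios get a stricter (higher) upper limit."""
    clipped = min(max(risk, 0.0), 1.0)
    return 0.7 + 0.25 * clipped  # hypothetical coefficients


def bounded_threshold(progress: float, initial_threshold: float,
                      upper_limit: float, key_node: float, speed: float) -> float:
    """Scale a smooth gate in [0, 1] by the threshold difference."""
    threshold_difference = upper_limit - initial_threshold
    x = speed * (progress - key_node)
    gate = 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))
    return initial_threshold + threshold_difference * gate


high_risk_limit = upper_limit_for_risk(1.0)
low_risk_limit = upper_limit_for_risk(0.0)
midpoint = bounded_threshold(0.5, 0.3, high_risk_limit, 0.5, 10.0)
```

Because the gate never exceeds 1, the constraint threshold never exceeds the upper limit, and the threshold difference sets how far the threshold can move in total.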

[0117] In one embodiment, in the application scenario of generating structured queries from natural language, the existing reinforcement learning training scheme for large language models has the following steps: First, the policy model generates candidate DSL expressions based on natural language input; then, the quality of the generated results is evaluated through a predefined reward function; finally, the reinforcement learning algorithm (such as PPO, GRPO, DAPO, etc.) uses the reward signal to iteratively optimize the policy model.

[0118] In this process, the traditional reward function and its reward parameters remain fixed throughout the training process. That is, the structure, threshold, and penalty rules of the reward function are determined before training begins and remain unchanged throughout the reinforcement learning training. Regardless of whether the model is in the early, middle, or late stages of training, the same set of fixed rules is used for reward determination. This type of reward function is usually based on rule matching, semantic consistency verification, or human experience. However, on the one hand, since the initial dialogue model of the policy has not yet mastered the sentence structure or semantic constraints of the domain-specific language, applying strict reward determination criteria to all semantic modules simultaneously can easily lead to most generated results receiving extremely low or identical reward values. This phenomenon significantly reduces the discriminative power of the reward values, thus affecting the effectiveness of the policy gradient, causing training instability or even failure to converge. On the other hand, since the learning difficulty of each semantic module varies significantly, existing technologies using a uniform reward standard cannot differentiate constraints based on the learning pace of different modules. This results in some modules being penalized too early before they have learned, while others may be treated leniently for a long time. For example, time and structural modules corresponding to structural parameters are generally easier to learn; the selection and filtering logic of indicators corresponding to statement semantic parameters often rely on complex business semantics, and their training process depends on the statement structure; the multi-condition combination logic corresponding to scenario semantic parameters is particularly difficult in the early stages of training. 
Therefore, in this embodiment, methods such as curriculum-style gated release make the number of reward parameter types positively correlated with the training progress: as training advances, more reward parameters and their constraint thresholds take effect, which gradually widens the range of active reward parameters and forms differentiated constraint thresholds.

[0119] Moreover, traditional multi-semantic modules adopt a unified reward judgment criterion. However, in tasks in the dialogue scenario, such as the domain-specific language generation task in the ChatBI scenario, there are significant differences in semantic complexity and learning difficulty among different semantic modules. If a unified reward threshold and punishment intensity are used to evaluate all semantic modules, then during reward judgment every module takes effect, tightens, and is punished synchronously. This design assumes that the learning progress and error tolerance of different modules are the same, which often does not hold in actual training. The semantic modules include, but are not limited to, those corresponding to reward parameters such as time conditions, indicator selection, filtering logic, aggregation granularity, and structural legality. In this black-box reward reinforcement learning framework, the reward signal participates in policy updates only as a scalar weight, and the distribution and change rhythm of the reward values are used to control training stability; however, a fixed constraint threshold for each reward parameter leaves the training stages without the ability to adjust thresholds dynamically, making it difficult to fully exploit this type of reinforcement learning method for generation tasks in complex domain-specific languages. Therefore, in this embodiment, the standard or punishment intensity is raised as training proceeds, so that the constraint threshold of each reward parameter is positively correlated with the training progress: the thresholds of the constraints in effect gradually tighten, forming a differentiated constraint threshold change mechanism.

[0120] In the ChatBI scenario, domain-specific languages generally include a date processing module, an information extraction and matching module, and a number query domain-specific language module; and the number query domain-specific language module includes modules such as QUERY / FILTER / AGGREGATE: <Date Processing>, <Information Extraction and Matching>, <Number Query DSL>, QUERY, FILTER, AGGREGATE

[0121] An example of a domain-specific language:

[0122] <Date Processing>{"Date filtering condition": "Date >= KC_StartOfYear(2025) AND Date <= KC_EndOfMonth(02)", "Time aggregation granularity": "month"}</Date Processing>

[0123] <Information Extraction and Matching>[ "<Organization, Organization, Dimension Name, EVERY>", "<Profit,, UNMATCHED,>", "<Worst three, back row, function back row, KC_BottomN>" ]</Information Extraction and Matching>

[0124] <Number Query DSL Generation>UNMATCHED profit

[0125] QUERY date, organization

[0126] FILTER Date >= KC_StartOfYear(2025) AND Date <= KC_EndOfMonth(02) AND Organization = 'EVERY'

[0127] AGGREGATE Month</Number Query DSL Generation>

[0128] In an exemplary embodiment, a training method based on multi-module asynchronous reward scheduling and reward-constrained Curriculum Learning is proposed and applied to the NL2DSL reinforcement learning task in the ChatBI scenario. Without changing the internal scoring logic of the reward function (such as module scoring / rule checker), this training method introduces an outer Scheduler that is aware of the training phase, so that the constraint thresholds of the semantic modules corresponding to different reward parameters tighten gradually at different rhythms during training. The Curriculum mechanism controls the strategy of phased release and phased tightening, so that the number of reward parameter types is positively correlated with the training progress, thereby mitigating problems such as reward collapse in the initial stage of training, inconsistent module learning rhythms, and insufficient alignment in the later stage.

[0129] As shown in Figure 5, the overall system of this embodiment consists of the following modules:

[0130] Policy Model 502, which is the initial dialogue model before or during training and the dialogue model after training, and is used to generate and output candidate domain-specific languages according to the input natural language question Q.

[0131] Step Controller 504, which is used to determine the current training step s and the total number of training steps; the training step represents the training progress.

[0132] Asynchronous Schedule Manager 506, which is used to set an independent constraint threshold of the reward parameter, that is, a strictness parameter, for the k-th semantic module. The reward parameter corresponding to the structure module S is the structure and grammar legality parameter; that of the time module T is the time semantic parameter; that of the dimension module E is the dimension / entity matching parameter; that of the metric module M is the metric matching parameter; that of the granularity module G is the aggregation granularity parameter; and that of the filtering module F is the filtering logic parameter.

[0133] The Curriculum Controller 508 is used to determine the current training phase and control which semantic modules' reward parameters take effect or enter strict mode.

[0134] The reward calculation module 510 (Reward Evaluator) is used to receive domain-specific language and scheduling parameters, and output the final reward value R.

[0135] The reinforcement learning trainer 512 (RL Trainer), whose algorithm may involve PPO / GRPO / DAPO, is used to filter available reward values using the constraint threshold of the reward parameter, and to update the policy of the initial dialogue model based on the available reward values.
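
One possible reading of how the reward calculation module 510 and the trainer 512 use the thresholds is sketched below. The function names (`gated_reward`, `group_advantages`), the [0, 1] score range, the pass / fail gating rule, and the mean aggregation are assumptions for illustration; the text does not fix them. The advantage computation is a plain group normalization in the style of GRPO, not a full trainer.

```python
from statistics import mean, pstdev


def gated_reward(scores: dict[str, float], thresholds: dict[str, float]) -> float:
    """Aggregate per-module scores into one scalar reward R.

    Assumption: a module's score counts only if it meets that module's
    current constraint threshold; otherwise it contributes 0.
    Only modules that currently have a threshold are evaluated.
    """
    active = [k for k in thresholds if k in scores]
    if not active:
        return 0.0
    gated = [scores[k] if scores[k] >= thresholds[k] else 0.0 for k in active]
    return mean(gated)


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-normalized advantages: (r - mean) / std over one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All candidates scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

With lenient thresholds most candidates keep a non-zero, varied reward, so the group still carries a usable signal; with strict thresholds applied too early, many rewards collapse to the same low value and the advantages vanish, which is the reward collapse the scheduling is meant to avoid.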

[0136] In one exemplary embodiment, the constraints on domain-specific language quality are divided into multiple sets of semantic modules:

[0137] $\{T, E, M, F, G, S\}$

[0138] The T module is used to obtain the constraint threshold for time semantics, the E module is used to obtain the constraint threshold for dimension / entity matching, the M module is used to obtain the constraint threshold for metric matching, the F module is used to obtain the constraint threshold for filter logic, the G module is used to obtain the constraint threshold for aggregation granularity, and the S module is used to obtain the constraint threshold for structure and syntax validity.

[0139] These semantic modules are scheduled separately, and an independent scheduling function expression is defined for the k-th module as follows:

[0140]

[0141] In the scheduling function, the following quantities appear for the k-th module: the constraint threshold of the k-th reward parameter; the initial threshold of the k-th reward parameter, which is the lenient threshold (low strictness) adopted by the module in the early stage of training; the upper limit of the constraint threshold of the k-th reward parameter, which is the strict threshold (high strictness) adopted by the module in the later stage of training; the threshold difference between the upper limit and the initial threshold; a smoothing function that keeps the schedule continuous and smooth; the adjustment speed, also called the tightening speed or steepness; the training progress of the k-th reward parameter; the key progress node of the k-th reward parameter; and the progress difference between the training progress and the key progress node.
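
Since the formula itself does not survive in the extracted text, a form consistent with the quantities listed above can be written as follows. The notation ($\tau_k$, $\tau_k^{0}$, $\tau_k^{\max}$, $\Delta_k$, $\sigma$, $\alpha_k$, $p_k$, $c_k$), the normalization of progress by an assumed total step count $S$, and the choice of the Sigmoid as the smoothing function are assumptions, not the patent's own symbols:

```latex
\tau_k(s) = \tau_k^{0} + \Delta_k \cdot \sigma\bigl(\alpha_k\,(p_k(s) - c_k)\bigr),
\qquad \Delta_k = \tau_k^{\max} - \tau_k^{0},
\qquad p_k(s) = \frac{s}{S},
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

Here $\tau_k(s)$ is the constraint threshold, $\tau_k^{0}$ the initial threshold, $\tau_k^{\max}$ the upper limit, $\Delta_k$ the threshold difference, $\alpha_k$ the adjustment speed, $c_k$ the key progress node, and $p_k(s) - c_k$ the progress difference. Under this form the threshold approaches $\tau_k^{0}$ well before the key progress node and $\tau_k^{\max}$ well after it, and equals their midpoint at $p_k(s) = c_k$.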

[0142] As shown in Figure 6, as the number of training steps s increases, the module-level threshold transitions smoothly from the initial threshold to the upper limit of the constraint threshold. This scheduling method ensures lenient reward constraints in the early stages of training and strict constraints in the later stages, thereby achieving progressive learning.

[0143] Asynchronous scheduling allows different modules to have different tightening starting points, different tightening speeds, and different final strictness levels. For example, the time module T uses an early tightening strategy to correct query biases caused by time misalignment as early as possible, while the metric module M uses mid-term tightening to strictly align metrics after the model has grasped the basic structure; in addition, the filtering logic module F and the structure module S use late-stage strictness to avoid directly suppressing a large number of exploration samples to extremely low rewards in the early stages.
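
The asynchronous scheduling described above can be sketched as one schedule object per module. This is a minimal illustration assuming a Sigmoid smoothing function and progress normalized to [0, 1]; the class name `ModuleSchedule`, the field names, and all numeric settings are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleSchedule:
    initial: float   # lenient threshold used early in training
    upper: float     # strict upper limit approached late in training
    speed: float     # tightening speed (steepness)
    key_node: float  # normalized progress at which tightening is centred

    def threshold(self, step: int, total_steps: int) -> float:
        progress = step / total_steps
        # Smooth gate in (0, 1) driven by the progress difference.
        gate = 1.0 / (1.0 + math.exp(-self.speed * (progress - self.key_node)))
        return self.initial + (self.upper - self.initial) * gate


# Hypothetical settings: T tightens early, M mid-training, F late.
SCHEDULES = {
    "T": ModuleSchedule(initial=0.3, upper=0.9, speed=12.0, key_node=0.2),
    "M": ModuleSchedule(initial=0.3, upper=0.9, speed=12.0, key_node=0.5),
    "F": ModuleSchedule(initial=0.2, upper=0.95, speed=12.0, key_node=0.8),
}


def thresholds_at(step: int, total_steps: int) -> dict[str, float]:
    """Current constraint threshold of every module at the given step."""
    return {k: sch.threshold(step, total_steps) for k, sch in SCHEDULES.items()}
```

At 35% of training, for example, `thresholds_at(350, 1000)` gives a T threshold already close to its upper limit, an M threshold still near its initial value, and an F threshold essentially unchanged, which is the staggered tightening the paragraph describes.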

[0144] In some embodiments, a set of training phases is defined:

[0145] $\{\text{Early}, \text{Mid}, \text{Late}\}$

[0146] The Curriculum controller outputs a stage signal that controls the order in which the constraints of the modules are released:

[0147] In the Early phase, only core constraints (such as S, T) are enabled, while complex modules such as M, F are treated more leniently, and there are fewer types of reward parameters.

[0148] In the Mid phase, strict constraints of E, M, and G are gradually implemented, and the types of reward parameters are increased.

[0149] In the Late phase, all modules are strictly constrained, emphasizing end-to-end consistency, and the variety of reward parameters is further increased.

[0150] The formal description can be:

[0151]

[0152] Here, the stage-dependent term can be implemented as stage gating: for example, certain modules are allowed to use a relaxed setting directly in the Early stage, or have their weight / penalty intensity reduced, and the complete strict scheduling is gradually restored after entering the Mid / Late stages.
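
The stage gating described above can be sketched as follows. The phase boundaries (0.3 and 0.7 of normalized progress), the names `Phase`, `STRICT_FROM`, `strict_modules`, and `gated_threshold`, and the hard switch between a lenient value and the scheduled value are assumptions; the module-to-phase mapping follows the Early / Mid / Late description in [0147] to [0149].

```python
from enum import Enum


class Phase(Enum):
    EARLY = 0
    MID = 1
    LATE = 2


def current_phase(progress: float, mid_start: float = 0.3, late_start: float = 0.7) -> Phase:
    """Map normalized training progress in [0, 1] to a training phase."""
    if progress < mid_start:
        return Phase.EARLY
    if progress < late_start:
        return Phase.MID
    return Phase.LATE


# Phase from which each module's strict schedule is released.
STRICT_FROM = {
    "S": Phase.EARLY, "T": Phase.EARLY,
    "E": Phase.MID, "M": Phase.MID, "G": Phase.MID,
    "F": Phase.LATE,
}


def strict_modules(phase: Phase) -> list[str]:
    """Modules whose reward parameters are strictly constrained in this phase."""
    return sorted(m for m, p in STRICT_FROM.items() if p.value <= phase.value)


def gated_threshold(module: str, progress: float, scheduled: float, lenient: float) -> float:
    """Use the scheduled threshold once the module's phase is reached, else the lenient one."""
    if current_phase(progress).value >= STRICT_FROM[module].value:
        return scheduled
    return lenient
```

Under these assumptions the number of strictly constrained modules grows from two in the Early phase to five in Mid and six in Late, matching the stated positive correlation between the number of reward parameter types and the training progress.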

[0153] Therefore, by applying the Curriculum algorithm to the reward parameters and their constraint thresholds, the structure training phase in the early stage of training allows full exploration and avoids reward collapse; the statement semantic training phase in the middle stage quickly establishes key semantic capabilities; and the scene semantic training phase in the later stage uses strong constraints and fine alignment to improve the final usability.

[0154] Based on this, this embodiment introduces training steps or training stages as input variables for reward construction. This makes the reward criterion no longer a static, fixed value, but rather dynamically changes with the training process. By explicitly modeling the relationship between reward parameters and training progress, the model faces different levels of reward constraints in the early and later stages of training, which is a fundamental condition for achieving progressive learning. Furthermore, the focus of Curriculum Learning is shifted from training data or task order to the reward determination logic itself. By controlling the gradual evolution of the reward constraint strength, a progressive training process from lenient to strict is achieved to solve the problems of training instability and reward collapse. Based on these two mechanisms, the independent scheduling functions of multiple semantic modules form a multi-module asynchronous scheduling mechanism.

[0155] The scheduling parameters for each semantic module, such as initial threshold, upper limit of constraint threshold, adjustment speed, and key progress nodes, can be set through configuration files or parameter tables to adapt to different business scenarios, model sizes, or data distributions. Module-level scheduling functions can be implemented in various smoothing or piecewise forms, including but not limited to the Sigmoid function, piecewise linear functions, and exponential functions. The scheduling function and Curriculum mechanism in this embodiment are independent of the syntax and business rule set of the domain-specific language, and can directly adapt to domain-specific language generation tasks in different domains without modifying the reward structure.
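
As an illustration of configuring per-module schedules and swapping the schedule shape, the sketch below loads a small parameter table and supports Sigmoid, piecewise linear, and exponential forms. The JSON layout, the key names (`shape`, `initial`, `upper`, `speed`, `key_node`, `start`, `end`), and all values are hypothetical; in practice the table would be read from a configuration file.

```python
import json
import math

# Hypothetical parameter table, inlined here instead of read from a file.
CONFIG_JSON = """
{
  "T": {"shape": "sigmoid", "initial": 0.3, "upper": 0.9, "speed": 12.0, "key_node": 0.2},
  "M": {"shape": "linear", "initial": 0.3, "upper": 0.9, "start": 0.4, "end": 0.6},
  "F": {"shape": "exponential", "initial": 0.2, "upper": 0.95, "speed": 3.0}
}
"""


def gate(cfg: dict, progress: float) -> float:
    """Fraction of the threshold difference applied at this progress, in [0, 1]."""
    shape = cfg["shape"]
    if shape == "sigmoid":
        return 1.0 / (1.0 + math.exp(-cfg["speed"] * (progress - cfg["key_node"])))
    if shape == "linear":
        # Flat before `start`, linear ramp, flat after `end`.
        frac = (progress - cfg["start"]) / (cfg["end"] - cfg["start"])
        return min(1.0, max(0.0, frac))
    if shape == "exponential":
        # Fast tightening at first, saturating below 1 for finite speed.
        return 1.0 - math.exp(-cfg["speed"] * progress)
    raise ValueError(f"unknown schedule shape: {shape}")


def threshold(cfg: dict, progress: float) -> float:
    """Constraint threshold = initial threshold + threshold difference * gate."""
    return cfg["initial"] + (cfg["upper"] - cfg["initial"]) * gate(cfg, progress)


SCHEDULE_TABLE = json.loads(CONFIG_JSON)
```

Because only `gate` depends on the shape, the reward scoring code never needs to change when a module is moved to a different schedule form or given different parameters.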

[0156] This embodiment significantly improves the stability, convergence quality, and engineering feasibility of reinforcement learning training without increasing training complexity by modularizing and phasing the reward constraint itself. It has clear technological advancements and practical application value, as detailed below:

[0157] First, it significantly improves the stability of the reinforcement learning training process. Specifically, this embodiment introduces a reward scheduling mechanism that is aware of the training phase, allowing the reward determination criteria to gradually transition from lenient to strict as the training progresses. This avoids the reward collapse problem caused by applying high-intensity reward constraints at the beginning of training in existing technologies. Compared to existing solutions with fixed reward function structures, this embodiment can maintain the effective discriminative power of reward signals throughout the entire training process, thereby significantly improving the stability and controllability of reinforcement learning training.

[0158] Secondly, differentiated learning pace control is achieved for different semantic modules. Specifically, this embodiment designs independent scheduling functions for multiple semantic modules, allowing different modules to adopt different tightening speeds and activation times during training based on their semantic complexity and learning difficulty. This technique avoids the excessive penalty problem caused by "synchronous tightening of all modules" in existing technologies, enabling the model to gradually master the capabilities of each module in an order that better conforms to the semantic structure of the domain-specific language, thereby improving overall learning efficiency.

[0159] Furthermore, Curriculum Learning is elevated from the data layer to the reward constraint layer. Specifically, existing Curriculum Learning methods mainly operate on training sample ranking or task difficulty classification, while this embodiment innovatively applies the Curriculum Learning mechanism directly to the reward determination logic itself. Through the phased scheduling of the reward constraint strength, this embodiment achieves a truly progressive learning process "from lenient to strict," enabling the model to gain a larger exploration space in the early stages of training and achieve fine alignment in the later stages.

[0160] Moreover, it effectively alleviates the problem of erroneous coupling in complex domain-specific language generation tasks. Specifically, decoupling occurs at the reward structure level: the quality assessment of domain-specific language is broken down into modules or sub-items of reward parameters to form interpretable sub-scores and aggregations, reducing the strong coupling problem where a single total score mixes all errors together; decoupling also exists at the training process level: multiple modules are independently scheduled, avoiding synchronous tightening of modules and reducing coupling and mutual drag during the training phase. Therefore, this embodiment avoids the erroneous coupling phenomenon caused by the synchronous activation of reward constraints between different semantic modules, ensuring that the learning difficulty of a single module does not prematurely drag down the overall training effect. This technology is particularly effective in complex business domain-specific language or multi-rule query scenarios.

[0161] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0162] Based on the same inventive concept, this application also provides a training apparatus for a dialogue model to implement the training method for the dialogue model described above. The solution provided by this apparatus is similar to the solution described in the above method; therefore, the specific limitations of one or more dialogue model training apparatus embodiments provided below can be found in the limitations of the dialogue model training method described above, and will not be repeated here.

[0163] In one exemplary embodiment, as shown in Figure 7, a training device for a dialogue model is provided, comprising:

[0164] The progress acquisition module 702 is used to acquire the training progress of the initial dialogue model;

[0165] The constraint module 704 is used to determine the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter; the number of reward parameter types is positively correlated with the training progress; the constraint threshold of the reward parameter is positively correlated with the training progress.

[0166] The reward value acquisition module 706 is used to determine the reward value of the initial dialogue model in the reward parameter;

[0167] The model optimization module 708 is used to optimize the parameters of the initial dialogue model based on the reward value and the constraint threshold to obtain the dialogue model.

[0168] In one embodiment, the constraint module 704 is configured to:

[0169] When the training progress corresponds to the structure training phase, the structure parameters are determined, and the constraint threshold of the structure parameters is determined based on the training progress; the structure parameters are used to train the statement structure of the initial dialogue model to make the statement structure match the business scenario;

[0170] When the training progress corresponds to the semantic training phase, the structural parameters and semantic parameters are determined, and the constraint thresholds of the structural parameters and semantic parameters are determined based on the training progress; the semantic parameters are used to train the phrasal semantics of the initial dialogue model generated data so that the phrasal semantics match the business scenario; the structural training phase is a training phase that precedes the semantic training phase.

[0171] In one embodiment, the semantic training phase includes a statement semantic training phase and a scenario semantic training phase following the statement semantic training phase; the semantic parameters include statement semantic parameters and scenario semantic parameters; the statement semantic parameters are used to train the initial dialogue model to generate statement semantic content in the data so that the statement semantic content matches the business scenario; the scenario semantic parameters are used to train the initial dialogue model to generate data used by the business scenario in the filtering process.

[0172] The constraint module 704 is used for:

[0173] When the training progress corresponds to the statement semantic training stage, the structural parameters and the statement semantic parameters are determined, and the constraint thresholds of the structural parameters and the constraint thresholds of the statement semantic parameters are determined based on the training progress.

[0174] When the training progress corresponds to the scene semantic training stage, the structural parameters, the statement semantic parameters, and the scene semantic parameters are determined, and the constraint thresholds of the structural parameters, the statement semantic parameters, and the scene semantic parameters are determined based on the training progress.

[0175] In one embodiment, the constraint module 704 is configured to:

[0176] The reward parameters are determined based on the training progress, and the initial threshold, key progress nodes, and adjustment speed of the reward parameters are obtained.

[0177] Determine the progress difference between the training progress and the key progress node;

[0178] The threshold increment is determined based on the adjustment speed and the progress difference.

[0179] Based on the threshold increment, the initial threshold is adjusted to obtain the constraint threshold of the reward parameter.

[0180] In one embodiment, the constraint module 704 is configured to:

[0181] Obtain the model score of the initial dialogue model during the pre-training phase of the reward parameters;

[0182] Determine an initial threshold that is positively correlated with the model score;

[0183] Identify key progress nodes that are negatively correlated with the model score;

[0184] Determine the adjustment speed that is positively correlated with the model score.
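
The three correlations above can be illustrated with simple linear maps from a pre-training score to schedule parameters. The function name `derive_schedule_params`, the [0, 1] score range, the linear form, and the numeric ranges are assumptions chosen only to show the direction of each correlation.

```python
def derive_schedule_params(score: float) -> dict[str, float]:
    """Map a pre-training model score in [0, 1] to schedule parameters.

    Assumed linear maps; only the signs of the slopes come from the text.
    """
    score = min(1.0, max(0.0, score))  # clamp out-of-range scores
    return {
        "initial": 0.2 + 0.3 * score,   # higher score -> stricter starting threshold
        "key_node": 0.8 - 0.6 * score,  # higher score -> tightening starts earlier
        "speed": 6.0 + 12.0 * score,    # higher score -> tightening proceeds faster
    }
```

A module the model already handles well before reinforcement learning therefore starts stricter, tightens sooner, and tightens faster, while a weak module is given a longer lenient period.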

[0185] In one embodiment, the constraint module 704 is configured to:

[0186] Obtain the upper limit of the constraint threshold for the reward parameter, and determine the threshold difference between the initial threshold and the upper limit of the constraint threshold; the upper limit of the constraint threshold is positively correlated with the risk of the business scenario;

[0187] Based on the progress difference, the threshold difference is processed non-linearly to obtain the threshold increment.

[0188] The modules in the training device for the aforementioned dialogue model can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0189] In one exemplary embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in Figure 8. This computer device includes a processor, memory, input / output (I / O) interfaces, and a communication interface. The processor, memory, and I / O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I / O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and databases. The internal memory provides the environment for the operating system and computer programs stored in the non-volatile storage media to run. The I / O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements a training method for a dialogue model.

[0190] Those skilled in the art will understand that the structure shown in Figure 8 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0191] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0192] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0193] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0194] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0195] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0196] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0197] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A method for training a dialogue model, characterized in that the method includes: obtaining the training progress of an initial dialogue model; determining the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter, wherein the number of reward parameter types is positively correlated with the training progress and the constraint threshold of the reward parameter is positively correlated with the training progress; evaluating the initial dialogue model based on the reward parameter during the training process to obtain a reward value; and optimizing the parameters of the initial dialogue model based on the reward value and the constraint threshold to obtain the dialogue model.

2. The method according to claim 1, characterized in that, Determining the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter includes: When the training progress corresponds to the structure training phase, the structure parameters are determined, and the constraint threshold of the structure parameters is determined based on the training progress; the structure parameters are used to train the statement structure of the data generated by the initial dialogue model so that the statement structure matches the business scenario; When the training progress corresponds to the semantic training phase, the structural parameters and semantic parameters are determined, and the constraint thresholds of the structural parameters and semantic parameters are determined based on the training progress; the semantic parameters are used to train the phrasal semantics of the initial dialogue model generated data so that the phrasal semantics match the business scenario; the structural training phase is a training phase that precedes the semantic training phase.

3. The method according to claim 2, characterized in that, The semantic training phase includes a statement semantic training phase and a scenario semantic training phase following the statement semantic training phase; the semantic parameters include statement semantic parameters and scenario semantic parameters; the statement semantic parameters are used to train the statement semantic content of the data generated by the initial dialogue model so that the statement semantic content matches the business scenario. The scene semantic parameters are used to train the initial dialogue model to generate the data used in the filtering process of the business scenario; The step of determining the structural parameters and semantic parameters when the training progress corresponds to the semantic training phase, and determining the constraint thresholds for the structural parameters and semantic parameters based on the training progress, includes: When the training progress corresponds to the statement semantic training stage, the structural parameters and the statement semantic parameters are determined, and the constraint thresholds of the structural parameters and the constraint thresholds of the statement semantic parameters are determined based on the training progress. When the training progress corresponds to the scene semantic training stage, the structural parameters, the statement semantic parameters, and the scene semantic parameters are determined, and the constraint thresholds of the structural parameters, the statement semantic parameters, and the scene semantic parameters are determined based on the training progress.

4. The method according to claim 1, characterized in that determining the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter includes: determining the reward parameter based on the training progress, and obtaining the initial threshold, key progress node, and adjustment speed of the reward parameter; determining the progress difference between the training progress and the key progress node; determining the threshold increment based on the adjustment speed and the progress difference; and adjusting the initial threshold based on the threshold increment to obtain the constraint threshold of the reward parameter.

5. The method according to claim 4, characterized in that obtaining the initial threshold, key progress node, and adjustment speed of the reward parameter includes: obtaining the model score of the initial dialogue model for the reward parameter during the pre-training phase; determining an initial threshold that is positively correlated with the model score; determining a key progress node that is negatively correlated with the model score; and determining an adjustment speed that is positively correlated with the model score.

6. The method according to claim 4, characterized in that determining the threshold increment based on the adjustment speed and the progress difference includes: obtaining the upper limit of the constraint threshold of the reward parameter, and determining the threshold difference between the initial threshold and the upper limit of the constraint threshold, wherein the upper limit of the constraint threshold is positively correlated with the risk of the business scenario; and processing the threshold difference non-linearly based on the progress difference to obtain the threshold increment.

7. A training device for a dialogue model, characterized in that, The device includes: The progress acquisition module is used to obtain the training progress of the initial dialogue model; A constraint module is used to determine the reward parameter corresponding to the training progress and the constraint threshold of the reward parameter; the number of reward parameter types is positively correlated with the training progress; the constraint threshold of the reward parameter is positively correlated with the training progress. The reward value acquisition module is used to evaluate the data generated by the initial dialogue model during the training progress according to the reward parameters, and obtain the reward value. The model optimization module is used to optimize the parameters of the initial dialogue model based on the reward value and the constraint threshold to obtain the dialogue model.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.