Method and device for evaluating language generation model, electronic equipment and storage medium

A first model compares multiple language generation models and is iteratively updated using the evaluations of a second model. This addresses the unreliability and high cost of existing evaluation methods and enables efficient, accurate evaluation of natural language generation models.

CN116150317B · Active · Publication Date: 2026-06-26 · MASHANG CONSUMER FINANCE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
MASHANG CONSUMER FINANCE CO LTD
Filing Date
2022-11-18
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing evaluation methods for natural language generation technologies rely on automated indicators that are easy to calculate but unreliable. Manual evaluation methods are costly and prone to bias, while pairwise comparison model evaluations increase the workload of humans and have low reliability.

Method used

A first model compares multiple language generation models, a second model evaluates the results of the first model, and the first model is iteratively updated based on those evaluations, reducing manual annotation and improving evaluation accuracy.

Benefits of technology

This approach enables highly reliable evaluation of natural language generation models, reducing manual costs and improving the accuracy and efficiency of model evaluation.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a language generation model evaluation method and device, electronic equipment and a storage medium. The method comprises the following steps: outputting model comparison results of a plurality of language generation models by a first model, and selecting two language generation models from the plurality of language generation models according to the model comparison results; obtaining a target test text, and obtaining two to-be-tested texts corresponding to the target test text by the two language generation models respectively; splicing the target test text and the two to-be-tested texts into input data and inputting the input data into a second model, and obtaining a model evaluation result according to an output result; updating the first model according to the model evaluation result; and evaluating the plurality of language generation models according to the model comparison result output by the updated first model. The method compares the plurality of language generation models by the first model, evaluates the comparison result of the first model by the second model, and continuously updates the first model to improve the accuracy, so as to obtain a more reliable natural language generation model evaluation result.

Description

Technical Field

[0001] This application relates to the field of natural language processing, and in particular to an evaluation method, apparatus, electronic device, and storage medium for a language generation model.

Background Technology

[0002] Natural Language Generation (NLG) technology refers to the automatic generation of comprehensible natural language text using artificial intelligence and linguistic methods. With the training of large-scale language models using massive datasets as samples, NLG technology has developed rapidly and is applied in areas such as copywriting and poetry generation. Early NLG focused on automated evaluation metrics, which were easy to calculate but unreliable. Therefore, in real-world applications, more reliable human evaluation methods are often used to assess NLG technology. However, this approach also suffers from high costs, annotator bias, high variance, and sequence effects (the current evaluation is influenced by previous evaluations). To address these issues, comparative methods are often used to evaluate multiple different NLG models, but this requires extensive manual annotation of the target text, still resulting in high labor costs and low reliability.

Summary of the Invention

[0003] This application provides a method, apparatus, electronic device, and storage medium for evaluating language generation models. This application achieves the comparison of multiple language generation models using a first model, and the evaluation of the comparison results of the first model using a second model, thereby continuously updating the first model to improve accuracy and obtain more reliable natural language generation model evaluation results.

[0004] Firstly, this application provides a method for evaluating language generation models, comprising the following steps:

[0005] A first model outputs the model comparison result between any two language generation models among the multiple language generation models to be evaluated, yielding multiple model comparison results. The first language generation model and the second language generation model are selected from the multiple language generation models based on the multiple model comparison results. Each model comparison result characterizes how the text generation accuracy of the two language generation models compares.

[0006] The target test text is obtained, and a first test text corresponding to the target test text is obtained through a first language generation model, and a second test text corresponding to the target test text is obtained through a second language generation model.

[0007] The target test text, the first test text, and the second test text are concatenated as input data. The input data is then fed into the second model. Based on the output of the second model, the model evaluation results for the first language generation model and the second language generation model are obtained during this evaluation process. The model evaluation results are used to characterize the accuracy comparison results between the first test text generated by the first language generation model and the second test text generated by the second language generation model.

[0008] The first model is updated based on the model evaluation results;

[0009] The multiple language generation models are evaluated based on the model comparison results, corresponding to the multiple language generation models, that are output by the updated first model.

[0010] Secondly, this application provides an evaluation device for a language generation model, comprising:

[0011] The selection module is used to output the model comparison results between any two language generation models among the multiple language generation models to be evaluated through the first model, obtain multiple model comparison results, and select the first language generation model and the second language generation model from the multiple language generation models based on the multiple model comparison results. The model comparison results are used to characterize the comparison results between the text generation accuracy of any two language generation models.

[0012] The acquisition module is used to acquire the target test text, obtain the first test text corresponding to the target test text through the first language generation model, and obtain the second test text corresponding to the target test text through the second language generation model.

[0013] The input module is used to concatenate the target test text, the first test text, and the second test text into input data. The input data is then input into the second model. Based on the output of the second model, the model evaluation results for the first language generation model and the second language generation model are obtained during this evaluation process. The model evaluation results are used to characterize the accuracy comparison results between the first test text generated by the first language generation model and the second test text generated by the second language generation model.

[0014] The update module is used to update the first model based on the model evaluation results;

[0015] The evaluation module is used to evaluate the multiple language generation models based on the model comparison results, corresponding to the multiple language generation models, that are output by the updated first model.

[0016] Thirdly, this application provides an electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the evaluation method of the above-described language generation model.

[0017] Fourthly, this application provides a computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor / processor core, implements the evaluation method for the above-mentioned language generation model.

[0018] In the language generation model evaluation method provided in this application, firstly, a first model outputs pairwise comparison results between multiple language generation models, and two language generation models are selected based on these comparison results. Secondly, the target test text is fed into each of the two language generation models to obtain two test texts. Then, the target test text and the two test texts are concatenated as input data and input into the second model. The model evaluation result is obtained based on the output result, and the first model is updated accordingly. Finally, the multiple language generation models are evaluated based on the model comparison result output by the updated first model. Therefore, this method compares multiple language generation models using the first model and evaluates the comparison results of the first model using the second model, thereby achieving continuous updating and iteration of the first model, improving its accuracy, and obtaining a highly reliable evaluation result for the natural language generation model. Using the second model instead of manual evaluation of the first model reduces the amount of manual annotation and saves labor costs.

[0019] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this application, nor is it intended to limit the scope of this application. Other features of this application will become readily apparent from the following description.

Attached Figure Description

[0020] The accompanying drawings are provided to further illustrate the present application and form part of the specification. They are used together with the embodiments of the present application to explain the application and do not constitute a limitation thereof. The above and other features and advantages will become more apparent to those skilled in the art from the detailed example embodiments described with reference to the accompanying drawings, in which:

[0021] Figure 1 is a flowchart illustrating an evaluation method for a language generation model provided in Embodiment 1 of this application;

[0022] Figure 2 is a flowchart illustrating an evaluation method for a language generation model provided in Embodiment 2 of this application;

[0023] Figure 3 is a block diagram of an evaluation device for a language generation model provided in Embodiment 3 of this application;

[0024] Figure 4 is a block diagram of an electronic device provided in Embodiment 4 of this application.

Detailed Implementation

[0025] To enable those skilled in the art to better understand the technical solutions of this application, exemplary embodiments of this application are described below in conjunction with the accompanying drawings, including various details of the embodiments of this application to aid understanding. These should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. Similarly, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.

[0026] Where there is no conflict, the various embodiments of this application and the features thereof may be combined with each other.

[0027] As used herein, the term “and / or” includes any and all combinations of one or more related enumerated entries.

[0028] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used herein, the singular forms “a” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that when the terms “comprising” and / or “made of” are used in this specification, the presence of the stated feature, integer, step, operation, element, and / or component is specified, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof is not excluded. Terms such as “connected” or “linked” are not limited to physical or mechanical connections but can include electrical connections, whether direct or indirect.

[0029] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this application, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.

[0030] The evaluation method for the language generation model according to the embodiments of this application can be executed by electronic devices such as terminal devices or servers. The terminal device can be an in-vehicle device, user equipment (UE), mobile device, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (PDA), handheld device, computing device, wearable device, etc. The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. Specifically, the method can be implemented by a processor calling a computer program stored in memory.

[0031] In related technologies, extensive manual annotation of the target text to be evaluated is required, and reliability is low due to limitations in the training corpus. To address these issues, this application provides an evaluation method for language generation models. This method compares multiple language generation models using a first model, and evaluates the comparison results of the first model using a second model. This allows for continuous updating and iteration of the first model, improving its accuracy and yielding a highly reliable evaluation result for the natural language generation model. Furthermore, using a second model instead of manual evaluation of the first model reduces the amount of manual annotation and saves labor costs.

[0032] Embodiment 1

[0033] Figure 1 is a flowchart illustrating an evaluation method for a language generation model provided in Embodiment 1 of this application. Referring to Figure 1, the method includes:

[0034] Step S110: The model comparison results between any two language generation models among the multiple language generation models to be evaluated are output by the first model, resulting in multiple model comparison results. Based on these multiple model comparison results, the first language generation model and the second language generation model are selected from the multiple language generation models. The model comparison results are used to characterize the comparison results between the text generation accuracy of any two language generation models.

[0035] The language generation model is used to automatically generate understandable natural language text from the target text. The first model is used to obtain the comparison results between multiple language generation models. The model comparison results are used to characterize the comparison results between the text generation accuracy of any two language generation models. The function of the language generation model is to automatically generate natural language text that is semantically close to and easy to understand from the input text. Correspondingly, the text generation accuracy of the language generation model refers to the accuracy of the natural language text generated by the language generation model from the input text. Generally, the higher the similarity and / or the closer the semantics between the generated natural language text and the input text, the higher the text generation accuracy of the language generation model; conversely, the lower the similarity and / or the greater the semantic difference between the generated natural language text and the input text, the lower the text generation accuracy of the language generation model. Accordingly, the model comparison results can be used to characterize whether the text generation accuracy of one language generation model is higher or lower than that of another language generation model.

[0036] Specifically, multiple language generation models are compared pairwise to obtain multiple model comparison results. Based on the comparison results with the highest accuracy probability, two language generation models are selected as the first and second language generation models in this evaluation process. In one specific implementation, the initial model comparison result is randomly generated, so the first model does not need to be pre-trained. Subsequent model comparison results are obtained based on the evaluation results of the second model, thereby gradually adjusting the accuracy of the model comparison results output by the first model based on the evaluation results.
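The random initialization described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `init_comparison_results`, the dictionary layout, and the assumption that the two orderings of a pair carry complementary probabilities are all hypothetical.

```python
import itertools
import random

def init_comparison_results(model_names, seed=None):
    """Randomly initialize pairwise model comparison results.

    Returns a dict mapping an ordered pair (a, b) to a hypothetical
    probability that model a generates more accurate text than model b.
    """
    rng = random.Random(seed)
    results = {}
    for a, b in itertools.combinations(model_names, 2):
        p = rng.random()
        results[(a, b)] = p          # assumed P(a more accurate than b)
        results[(b, a)] = 1.0 - p    # assumption: orderings are complementary
    return results

results = init_comparison_results(["model_A", "model_B", "model_C"], seed=0)
```

Because the starting values are random, no pre-training of the first model is needed; the values only become meaningful once they are corrected by the second model's evaluations.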

[0037] Step S120: Obtain the target test text, and obtain the first test text corresponding to the target test text through the first language generation model, and obtain the second test text corresponding to the target test text through the second language generation model.

[0038] Step S120 is the natural language generation process. The target test text is used as a sample and put into the first language generation model and the second language generation model respectively to obtain the first test text and the second test text corresponding to the target test text. The first test text and the second test text are the understandable natural language text generated by natural language generation.

[0039] Step S130: Concatenate the target test text, the first test text, and the second test text into input data, input the input data into the second model, and obtain the model evaluation results for the first language generation model and the second language generation model in this evaluation process based on the output results of the second model.

[0040] The method of concatenating the target test text, the first test text, and the second test text can be flexibly set by those skilled in the art when implementing the method, depending on the specific circumstances, and is not limited here; in a specific implementation, the target test text, the first test text, and the second test text can be concatenated sequentially to obtain the input data of the second model.
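A minimal sketch of sequential concatenation is shown below. The separator string `SEP` and the helper name `build_input` are assumptions made for illustration; the source does not specify a separator or any particular encoding.

```python
SEP = " [SEP] "  # hypothetical separator; the source does not specify one

def build_input(target_text, first_text, second_text, sep=SEP):
    """Sequentially concatenate the three texts into one input string."""
    return sep.join([target_text, first_text, second_text])

example = build_input(
    "The cat sat on the mat.",
    "A cat is sitting on a mat.",
    "Dogs bark loudly.",
)
print(example)
```

Keeping the order fixed (target, first, second) matters in this sketch, because the second model's output is read as a statement about the first test text relative to the second.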

[0041] The second model evaluates the input data and outputs results reflecting the similarity between the first and second test texts and the target test text. This similarity allows for the evaluation of the accuracy of the model comparison results output by the first model. Therefore, the model evaluation results in this embodiment characterize the accuracy comparison between the first test text generated by the first language generation model and the second test text generated by the second language generation model. Specifically, the accuracy comparison results indicate whether the first test text generated by the first language generation model is more accurate than the second test text generated by the second language generation model.

[0042] In step S110, the text generation accuracy (predicted by the first model) characterizes the comprehensive ability of the language generation model to generate natural language text. This comprehensive ability reflects the average accuracy of multiple natural language texts generated by the two language generation models during use. Conversely, the accuracy comparison result in this step (obtained through evaluation by the second model) characterizes the ability of the first and second language generation models to generate corresponding natural language texts (i.e., the first test text and the second test text) for the target test text. Therefore, the accuracy comparison result only characterizes the accuracy of the first and second test texts generated by the first and second language generation models for the target test text. For example, if the first test text has a higher similarity to the target test text and / or is semantically closer, it indicates that the first language generation model has higher accuracy in this text generation process. Accordingly, the model evaluation result is that the first language generation model has higher accuracy in this text generation process than the second language generation model.

[0043] Step S140: Update the first model based on the above model evaluation results.

[0044] In one specific implementation, the model comparison result between the first language generation model and the second language generation model is updated based on the above model evaluation result, and the updated model comparison result is used as a training sample to update the first model.
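One way such an update could look is sketched below. The source does not state the update rule, so the exponential-moving-average form, the `learning_rate` parameter, and the function name `update_comparison` are assumptions chosen only to make the idea concrete.

```python
def update_comparison(prior, evaluation, learning_rate=0.3):
    """Move a stored comparison probability toward the second model's evaluation.

    prior:      probability currently held by the first model
    evaluation: probability reported by the second model for this test text
    """
    if not 0.0 <= learning_rate <= 1.0:
        raise ValueError("learning_rate must lie in [0, 1]")
    return (1.0 - learning_rate) * prior + learning_rate * evaluation

# The first model believed 0.8, the second model reported 0.4 for this text.
updated = update_comparison(prior=0.8, evaluation=0.4, learning_rate=0.5)
```

With these inputs the stored value moves from 0.8 to roughly 0.6, halfway toward the evaluation. In the described method the updated comparison result would then serve as a training sample for the first model rather than being stored directly.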

[0045] Step S150: Evaluate the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the updated first model.

[0046] Specifically, through the interaction between the second and first models, accurate comparison results among multiple models are gradually obtained, leading to a more reliable first model. This first model is then used to evaluate multiple language generation models and determine the evaluation results for each model. The primary function of the first model is to output the model comparison results between any two language generation models (i.e., the comparison results between text generation accuracy). The primary function of the second model is to evaluate the accuracy of the test text generated by the two language generation models based on the target test text, thus obtaining an evaluation result of the two language generation models in actual use. Based on this evaluation result, the first model is corrected, resulting in more accurate model comparison results output by the corrected first model in subsequent comparisons.

[0047] Each of the above steps can be executed either once or multiple times. For example, in one optional implementation, to improve accuracy, steps S110 to S140 can be executed multiple times, so that the first model is continuously updated through multiple evaluations of the second model, thereby making the output of the first model more accurate. Optionally, during the multiple executions, the target test text used each time is of a different type, so as to conduct a comprehensive and accurate evaluation of the text generation accuracy of the language generation model.
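The repeated execution of steps S110 to S140, followed by S150, can be sketched as a loop. Everything here is a toy stand-in under stated assumptions: the first model is reduced to a table of pairwise probabilities, the language generation models are plain functions, `toy_judge` replaces the second model with a word-overlap rule, and the final ranking by mean pairwise probability is one possible reading of step S150.

```python
import itertools
import random

def toy_judge(target, first_text, second_text):
    """Stand-in for the second model: 1.0, 0.0 or 0.5 by shared word count."""
    words = set(target.split())
    s1 = len(words & set(first_text.split()))
    s2 = len(words & set(second_text.split()))
    return 0.5 if s1 == s2 else float(s1 > s2)

def run_evaluation(generators, judge, test_texts, rounds=20, lr=0.3, seed=0):
    rng = random.Random(seed)
    names = sorted(generators)
    table = {}  # toy "first model": randomly initialised pairwise probabilities
    for a, b in itertools.combinations(names, 2):
        p = rng.random()
        table[(a, b)], table[(b, a)] = p, 1.0 - p

    for i in range(rounds):
        # S110: select the ordered pair with the highest probability.
        first, second = max(table, key=table.get)
        # S120: generate one test text per selected model.
        target = test_texts[i % len(test_texts)]
        t1, t2 = generators[first](target), generators[second](target)
        # S130: the judge evaluates the pair for this target text.
        value = judge(target, t1, t2)
        # S140: move the stored comparison result toward the evaluation.
        p = (1.0 - lr) * table[(first, second)] + lr * value
        table[(first, second)], table[(second, first)] = p, 1.0 - p

    # S150: rank models by their mean probability of beating the others.
    score = {n: sum(table[(n, m)] for m in names if m != n) / (len(names) - 1)
             for n in names}
    return sorted(names, key=score.get, reverse=True), table

generators = {
    "echo": lambda t: t,                                   # copies the input
    "truncate": lambda t: " ".join(t.split()[: len(t.split()) // 2]),
    "noise": lambda t: "lorem ipsum",                      # ignores the input
}
ranking, table = run_evaluation(
    generators, toy_judge, ["the cat sat on the mat", "a quick brown fox"]
)
```

Cycling through `test_texts` loosely mirrors the optional use of a different target test text in each execution.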

[0048] In summary, the language generation model evaluation method provided in this embodiment firstly outputs pairwise comparison results between multiple language generation models using a first model, and selects two language generation models based on these results. Secondly, the target test text is fed into each of the two language generation models to obtain two test texts. Then, the target test text and the two test texts are concatenated as input data and input into the second model. The model evaluation result is obtained based on the output result, and the first model is updated accordingly. Finally, the multiple language generation models are evaluated based on the model comparison results output by the updated first model. Therefore, this method compares multiple language generation models using a first model and evaluates the comparison results of the first model using a second model, thereby achieving continuous updating and iteration of the first model, improving its accuracy, and obtaining a highly reliable natural language generation model evaluation result. Using the second model instead of manual evaluation of the first model reduces the amount of manual annotation and saves labor costs.

[0049] Embodiment 2

[0050] Figure 2 is a flowchart illustrating an evaluation method for a language generation model provided in Embodiment 2 of this application. Referring to Figure 2, the method includes:

[0051] Step S210: The first model outputs the model comparison results between any two language generation models among the multiple language generation models to be evaluated, yielding multiple model comparison results, and the first language generation model and the second language generation model are selected from the multiple language generation models based on the multiple model comparison results.

[0052] The language generation model is used to automatically generate understandable natural language text based on the target text. The first model is used to obtain the model comparison results between multiple language generation models. Specifically, multiple language generation models are compared pairwise to obtain multiple model comparison results. In one specific implementation, the initial model comparison result is generated randomly, so the first model does not need to be pre-trained. Subsequent model comparison results are obtained based on the evaluation results of the second model, thereby gradually adjusting the correctness of the model comparison results output by the first model based on the evaluation results.

[0053] In one specific implementation, the first model is used to determine the model score corresponding to multiple language generation models, and outputs the model comparison results corresponding to multiple language generation models based on the model scores corresponding to multiple language generation models. Then, based on the model comparison results, two language generation models are selected as the first language generation model and the second language generation model in this evaluation process.
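The source does not say how model scores are turned into comparison results. One common choice, assumed here purely for illustration, is a Bradley-Terry-style logistic function of the score difference; the function name `comparison_from_scores` and the example scores are hypothetical.

```python
import math

def comparison_from_scores(scores):
    """Map per-model scores to pairwise probabilities.

    Assumption: P(a more accurate than b) = sigmoid(score_a - score_b).
    """
    return {
        (a, b): 1.0 / (1.0 + math.exp(-(sa - sb)))
        for a, sa in scores.items()
        for b, sb in scores.items()
        if a != b
    }

probs = comparison_from_scores({"A": 1.2, "B": 0.3, "C": -0.5})
```

Under this mapping a larger score gap gives a probability further from 0.5, and the two orderings of any pair always sum to 1.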

[0054] In one alternative implementation, when selecting the first language generation model and the second language generation model from multiple language generation models based on the model comparison results, the selection is specifically achieved in the following way:

[0055] First, obtain the ranking probability parameter corresponding to the model comparison result between each pair of language generation models. This ranking probability parameter characterizes the probability that, given a specified ranking of the two language generation models, the text generation accuracy of the first-ranked model is higher than that of the second-ranked model. For example, if language generation model A is ranked first and language generation model B is ranked second, and the probability that the text generation accuracy of language generation model A is higher than that of language generation model B is 0.8, then the ranking probability parameter for language generation models A and B is 0.8. Accordingly, the model comparison result for language generation models A and B is: with language generation model A ranked first and language generation model B ranked second, the ranking probability parameter is 0.8.

[0056] Then, the two language generation models corresponding to the model comparison result with the highest ranking probability parameter are designated as the first language generation model and the second language generation model; the first language generation model is the first-ranked model, and the second language generation model is the second-ranked model. For example, from the multiple model comparison results, the model comparison result with the largest value of the ranking probability parameter is determined, and the language generation model ranked first in this model comparison result is designated as the first language generation model, and the language generation model ranked second is designated as the second language generation model.
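The selection rule can be sketched in a few lines. The dictionary layout, keyed by (first-ranked, second-ranked) pairs, and the values other than the 0.8 taken from the example above are assumptions for illustration.

```python
def select_pair(comparison_results):
    """Return (first_model, second_model, probability) for the largest
    ranking probability parameter."""
    (first, second), prob = max(comparison_results.items(), key=lambda item: item[1])
    return first, second, prob

comparison_results = {
    ("A", "B"): 0.8, ("B", "A"): 0.2,
    ("A", "C"): 0.6, ("C", "A"): 0.4,
    ("B", "C"): 0.55, ("C", "B"): 0.45,
}
first, second, prob = select_pair(comparison_results)
print(first, second, prob)  # A B 0.8
```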

[0057] Step S220: Obtain the target test text, and obtain the first test text corresponding to the target test text through the first language generation model, and obtain the second test text corresponding to the target test text through the second language generation model.

[0058] Step S220 is the natural language generation process. The target test text is used as a sample and put into the first language generation model and the second language generation model respectively to obtain the first test text and the second test text corresponding to the target test text. The first test text and the second test text are the understandable natural language text generated by natural language generation.

[0059] Step S230: Concatenate the target test text, the first test text, and the second test text into input data. Input the input data into the second model using multiple input methods and obtain multiple evaluation values of the second model corresponding to the multiple input methods.

[0060] The method of concatenating the target test text, the first test text, and the second test text can be flexibly set by those skilled in the art when implementing the method, depending on the specific circumstances, and is not limited here; in a specific implementation, the target test text, the first test text, and the second test text can be concatenated sequentially to obtain the input data of the second model; the output result of the second model is the evaluation value for evaluating the model comparison result output by the first model.

[0061] In one specific implementation, the input data is fed into the second model through multiple different input methods, and multiple evaluation values are obtained from the second model corresponding to the different input methods. Specifically, in each input method, a portion of the network nodes in the second model are randomly deactivated using a preset sampling method, and the input data is then fed into the second model with these network nodes deactivated. The types and number of randomly deactivated network nodes vary depending on the input method.

[0062] The preset sampling methods include Monte Carlo sampling. Monte Carlo sampling, also known as statistical simulation or statistical experimentation, is a numerical simulation method that takes probabilistic phenomena as its research object. It is a calculation method that uses sampling surveys to obtain statistical values to estimate unknown characteristic quantities, and is suitable for computational simulation experiments on discrete systems. In computational simulation, by constructing a probabilistic model that approximates the system performance and conducting random experiments on a digital computer, the stochastic characteristics of the system can be simulated. Therefore, by randomly deactivating some network nodes in the second model using Monte Carlo sampling, multiple randomly distributed evaluation values corresponding to the various input methods can be obtained after inputting the input data into the second model with these randomly deactivated network nodes.
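Randomly deactivating nodes at inference time and repeating the forward pass is commonly called Monte Carlo dropout. The sketch below shows the idea on a tiny one-hidden-layer scorer written in plain Python. The network shape, the weights, the `features` vector standing in for an encoded input, and the `drop_rate` are all hypothetical; the second model in the source is not specified at this level of detail.

```python
import math
import random

def mc_dropout_evaluations(features, weights_hidden, weights_out,
                           n_passes=10, drop_rate=0.5, seed=0):
    """Run several stochastic forward passes, one evaluation value per pass.

    Each pass randomly deactivates hidden nodes, so the same input
    produces a spread of values in (0, 1). Requires drop_rate < 1.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_passes):
        hidden = []
        for row in weights_hidden:
            act = max(0.0, sum(w * x for w, x in zip(row, features)))  # ReLU
            keep = rng.random() >= drop_rate       # randomly deactivate this node
            # Rescale kept nodes so the expected activation is unchanged.
            hidden.append(act / (1.0 - drop_rate) if keep else 0.0)
        logit = sum(w * h for w, h in zip(weights_out, hidden))
        values.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid to a probability
    return values

features = [0.9, 0.2]  # hypothetical encoding of the concatenated input data
weights_hidden = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]
weights_out = [1.5, -0.5, 0.7]
values = mc_dropout_evaluations(features, weights_hidden, weights_out)
mean_value = sum(values) / len(values)
```

The spread of `values` gives a rough sense of how uncertain the scorer is about this input, while `mean_value` is one simple way to summarize the passes.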

[0063] Each evaluation value represents the probability that the text generation accuracy of the first language generation model is better than that of the second language generation model. Specifically, since the input data consists of the target test text, the first test text, and the second test text concatenated sequentially, the second model can compare, based on the input data, the first accuracy of the first test text relative to the target test text with the second accuracy of the second test text relative to the target test text. The evaluation value is obtained from this comparison: if the first accuracy is significantly greater than the second accuracy, the probability that the first language generation model is better than the second language generation model is higher, and the evaluation value is higher; conversely, if the first accuracy is significantly lower than the second accuracy, that probability is lower, and the evaluation value is lower. The evaluation value is therefore the numerical representation of the final model evaluation result output by the second model, and that result is obtained through the evaluation value. Furthermore, the physical meaning of the model evaluation result output by the second model is similar to that of the model comparison result output by the first model: both are used to evaluate the text generation accuracy of the two language generation models. The difference is that the model comparison result output by the first model is predicted from the model parameters inside the first model (the prediction is affected by those parameters and may not be accurate enough), while the model evaluation result output by the second model is based on the accuracy of the test texts generated by the two language generation models for a specific target test text. Therefore, the model evaluation result output by the second model is more objective and accurate than the model comparison result output by the first model.

[0064] Step S240: Based on the dispersion index of multiple evaluation values, determine the true evaluation result corresponding to the input data.

[0065] The dispersion index reflects the reliability of the evaluation values; examples include the variance and the standard deviation of the multiple evaluation values. The index can be set flexibly by those skilled in the art when implementing this method and is not restricted here. As mentioned above, each evaluation value is obtained from the accuracy of the test texts generated by the two language generation models for a specific target test text. However, since the evaluation of this accuracy depends on the accuracy of the second model, a second model with low accuracy leads to inaccurate and subjective evaluation values. To solve this problem and obtain evaluation results that are as objective as possible, in a specific implementation, step S240 is achieved through the following steps:

[0066] Step 1: Calculate the dispersion index of multiple evaluation values.

[0067] Statistical calculations are performed on the multiple evaluation values output by the second model in step S230. For example, the variance or the standard deviation of the multiple evaluation values is calculated and used as the dispersion index. The dispersion index thus characterizes the degree of deviation among the multiple evaluation values, and this degree of deviation reflects their reliability.

[0068] Step 2: If the dispersion index is greater than the preset dispersion threshold, obtain the auxiliary annotation results triggered for the input data.

[0069] A larger dispersion index indicates a greater difference among the multiple evaluation values, which means the reliability of the second model's evaluation result is low. Manual evaluation is therefore needed to correct it and obtain a more reliable evaluation result. When the dispersion index exceeds the preset dispersion threshold, manual annotation results are obtained through manual evaluation and used as the auxiliary annotation results. The preset dispersion threshold can be set flexibly by those skilled in the art when implementing this method and is not limited here. The auxiliary annotation results are thus used to correct the specific numerical values of the evaluation values. Besides manual annotation, auxiliary annotation can also be performed by a third-party model. In short, any approach that improves the accuracy of the evaluation results is acceptable, and this application does not limit the specific implementation of the auxiliary annotation.

[0070] Step 3: Based on the auxiliary annotation results, determine the true evaluation results corresponding to the input data.

[0071] In one alternative implementation, the results of manual evaluation and annotation are used as the auxiliary annotation results, and these are taken as the true evaluation results corresponding to the input data.

[0072] Step S250: Correct the output result of the second model based on the true evaluation result, and use the corrected result as the model evaluation result for the first language generation model and the second language generation model in this evaluation process.

[0073] If the dispersion index of the multiple evaluation values output by the second model in step S230 does not exceed the preset dispersion threshold, the output result of the second model has high credibility and is used as the model evaluation result in this evaluation process. If the dispersion index is greater than the preset dispersion threshold, the output result of the second model has low credibility, and the true evaluation result corresponding to the input data, obtained by manual evaluation, is used as the model evaluation result in this evaluation process.
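The routing between the second model's output and the auxiliary annotation can be sketched as follows. This is a minimal illustration that assumes the variance is used as the dispersion index and the mean of the evaluation values is used as the second model's output; the function name `resolve_evaluation`, the callable `request_annotation`, and the threshold value are hypothetical.

```python
from statistics import mean, pvariance


def resolve_evaluation(evaluation_values, dispersion_threshold, request_annotation):
    """Return (model evaluation result, whether auxiliary annotation was used).

    `request_annotation` is a hypothetical callable standing in for manual
    evaluation or a third-party model; it is called only when needed.
    """
    dispersion = pvariance(evaluation_values)   # dispersion index (variance)
    if dispersion > dispersion_threshold:
        # Low credibility: fall back to the auxiliary annotation result.
        return request_annotation(), True
    # High credibility: keep the second model's own output.
    return mean(evaluation_values), False


# Consistent values are accepted; scattered values trigger annotation.
stable = resolve_evaluation([0.80, 0.82, 0.78], 0.01, lambda: 1.0)
scattered = resolve_evaluation([0.1, 0.9, 0.5], 0.01, lambda: 1.0)
```

Because the annotation callable is invoked only on the high-dispersion branch, manual work is requested only for inputs on which the second model is unreliable.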

[0074] In one alternative implementation, after step S250, the second model is updated based on the true evaluation results. That is, the true evaluation results are used as training samples to update the second model, so that manual evaluation results gradually iterate the second model into a more accurate model.

[0075] In one alternative implementation, the second model needs to be pre-trained before steps S230-S250.

[0076] Specifically, the second model is trained in the following way:

[0077] Step 1: Select the first text from the text set as the reference text; where the text set includes a set of sentences or a set of paragraphs.

[0078] Step 2: Select a second text from the text set as the negative text corresponding to the reference text. For example, randomly select a second text from all texts in the text set other than the first text as the negative text corresponding to the reference text.

[0079] Step 3: Perform preset processing on the first text, and obtain the positive text corresponding to the reference text from the processed first text; wherein the preset processing includes at least one of the following: random deletion, replacement, and back translation.

[0080] Step 4: Generate a set of sample pairs based on the reference text, the negative text, and the positive text; wherein the reference text and the negative text have opposite semantics, and the reference text and the positive text have roughly the same semantics; in an optional implementation, the sample pairs can be generated in the following way:

[0081] By sequentially concatenating the reference text, the positive text, and the negative text, a set of positive sample pairs is obtained.

[0082] By sequentially concatenating the reference text, the negative text, and the positive text, a set of negative sample pairs is obtained.

[0083] Step 5: Train the second model using a sample set consisting of multiple sample pairs, so that the trained second model outputs a first output result that matches the reference text and the positive text, and / or outputs a second output result that does not match the reference text and the negative text.

[0084] Step S260: Update the first model based on the above model evaluation results.

[0085] In one specific implementation, the model comparison result between the two language generation models is output through the model score. Therefore, step S260 specifically includes:

[0086] Based on the model evaluation result in this evaluation process, the comparison result between the first language generation model and the second language generation model is determined, and the model scores of the two models are updated accordingly. Specifically, based on the model evaluation result, at least one model parameter in the first model is updated so as to update the position probability parameter corresponding to the target model comparison result, where the target model comparison result is the model comparison result between the first and second language generation models output by the first model. If the model evaluation result matches the target model comparison result output by the first model, the value of the position probability parameter corresponding to that model comparison result is increased; if the model evaluation result does not match the target model comparison result, the value of that position probability parameter is decreased. In this way, the subsequent output results of the first model better match the real situation, which improves the accuracy of the first model.

[0087] Step S270: Evaluate the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the updated first model.

[0088] Since the model comparison results output by the updated first model are closer to the real situation, the text generation accuracy of the multiple language generation models can be evaluated directly from the model comparison results that the updated first model outputs for them. The language generation model with higher text generation accuracy can thus be identified among the multiple language generation models and used in subsequent applications to automatically generate understandable natural language text from the target text.

[0089] In one optional implementation, to improve the accuracy of the evaluation results of the first model, steps S210-S260 are executed repeatedly to update and iterate the first model multiple times. Accordingly, step S270 specifically includes:

[0090] Determine whether the updated first model satisfies the loop termination condition;

[0091] If so, the multiple language generation models are evaluated based on the model comparison results that the first model outputs for them after this update. That is, those model comparison results are taken as the final model comparison results.

[0092] If not, the above steps are repeated using the updated first model until the updated first model satisfies the loop termination condition. The multiple language generation models are then evaluated based on the model comparison results output for them by the first model that satisfies the loop termination condition.

[0093] The loop termination conditions include: the number of loops reaches a preset number; or, the model comparison results output by the updated first model corresponding to multiple language generation models satisfy a preset convergence condition. The preset number of loops and the preset convergence condition can be flexibly set by those skilled in the art when implementing this method, and are not limited here. For example, the preset convergence condition includes: among the position probability parameters output by the first model corresponding to the model comparison results between every two language generation models, the position probability parameter with the largest parameter value matches a preset convergence threshold. The position probability parameter with the largest parameter value corresponds to the set of model comparison results with the highest confidence level, and the preset convergence threshold can be flexibly set by those skilled in the art.

[0094] Optionally, to make the model evaluation results of the second model more comprehensive during multiple iterations, the target test texts corresponding to each iteration are different. That is, a test text library is generated in advance, and each time step S220 is executed, a text is randomly extracted from the test text library without replacement as the target test text.

[0095] In summary, in the language generation model evaluation method provided in this embodiment, a first model first outputs model comparison results for multiple language generation models, and two language generation models are selected based on these results. Next, the target test text is fed into each of the two language generation models to obtain two test texts. The target test text and the two test texts are then concatenated as input data and input into the second model. The model evaluation result is obtained from the output result, and the first model is updated accordingly. Finally, the multiple language generation models are evaluated based on the model comparison results output by the updated first model. This method thus compares multiple language generation models using a first model and evaluates the comparison result of the first model using a second model, thereby continuously updating and iterating the first model, improving its accuracy, and obtaining a highly reliable natural language generation model evaluation result. Using a second model instead of manual evaluation of the first model reduces the amount of manual annotation and saves labor costs. At the same time, the reliability of the second model's output result is calculated, and output results with lower reliability are corrected by manual evaluation, which ensures the reliability of the evaluation result.

[0096] To facilitate understanding, a specific example will be used below to explain the detailed implementation of the above embodiments.

[0097] In recent years, thanks to the use of massive datasets to train large-scale language models, natural language generation technology has developed rapidly, with applications such as copywriting and poetry generation. However, the evaluation system for natural language generation technology still faces significant challenges.

[0098] Early natural language generation (NLG) work generally relied on automated evaluation metrics such as ROUGE. Such metrics are easy to calculate but unreliable. For example, "I like to eat apples" and "I don't like to eat apples" are very similar under ROUGE, yet they convey opposite meanings. Furthermore, some practical applications have shown that these evaluation metrics are entirely unrelated to semantic similarity. In real-world applications, the more reliable human evaluation is therefore often used to evaluate NLG technologies. However, human evaluation is costly and time-consuming, and relevant research has shown that it suffers from annotator bias, high variance, and sequence effects (the current evaluation is influenced by previous items). To address these issues, many related technologies use pairwise comparison to evaluate two different NLG models. Pairwise comparison is less difficult than direct human scoring and does not suffer from the aforementioned problems, but it introduces more human workload. For example, with K NLG models, direct scoring requires scoring only K results, while pairwise comparison requires scoring K*(K-1) results.
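The two quantitative points above can be checked with a short sketch. It assumes a simplified unigram-overlap score (a ROUGE-1 style F1 over whitespace tokens) rather than any particular ROUGE implementation; the function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())        # clipped matching unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def ordered_pair_count(k):
    """Number of ordered model pairs that pairwise comparison must cover."""
    return k * (k - 1)


score = rouge1_f1("I like to eat apples", "I don't like to eat apples")
pairs = ordered_pair_count(5)
```

Here five of the six candidate tokens match the reference, so `score` equals 10/11 (about 0.91) even though the two sentences have opposite meanings, and five models already require 20 ordered comparisons instead of 5 direct scores.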

[0099] To address the above issues, this specific example provides an evaluation method for language generation models. The method includes: configuring a test dataset and a set of natural language generation models; selecting two models from the set using a first model (i.e., the first language generation model and the second language generation model in this example); randomly selecting a test data point (i.e., the target test text in this example) from the test dataset and inputting it into these two models; inputting the results of these two models into the second model to obtain an uncertainty score and a first predicted value (equivalent to the dispersion index and the evaluation value in this example); if the uncertainty score is greater than a first threshold (i.e., the preset dispersion threshold in this example), submitting the result to annotators for scoring, updating the first predicted value based on the human score (i.e., the auxiliary annotation result in this example), and also updating the second model; using the first predicted value to update the first model; and repeating this process until specified conditions are met, at which point the optimal natural language generation model is output.

[0100] The second model consists of an encoding network and an evaluation network; the encoding network is a pre-trained language model, and the evaluation network is a feedforward network. The input of the second model is formed by concatenating, in sequence, the test text and the texts generated by the natural language generation models.

[0101] The training data for the second model comes from publicly available Chinese corpora and is processed accordingly based on the natural language generation task.

[0102] For sentence-level generation tasks, training data is constructed using the following method:

[0103] 1. Segment the dataset to obtain the full set of sentences (equivalent to the reference text in the example);

[0104] 2. Traverse all sentences to obtain the current sentence, sample a specified number of sentences other than the current sentence from all sentences as negative samples (equivalent to the negative text in the example), and construct reference-negative sample sentence pairs;

[0105] 3. Traverse the reference-negative sample sentence pairs, and obtain the positive sample sentence (equivalent to the positive text in the example) by performing partial random deletion, replacement, back translation, etc. on the reference sentence, and construct the reference-positive-negative sample sentence triplet;

[0106] 4. Traverse the reference-positive-negative sentence triplets and remove the low-difficulty reference-positive-negative sentence triplets.

[0107] For document-level generation tasks, training data is constructed using the following method:

[0108] 1. Segment the dataset to obtain the full paragraphs (equivalent to the reference text in the example);

[0109] 2. Traverse all paragraphs to obtain the current paragraph, sample a specified number of paragraphs other than the current paragraph from all paragraphs as negative samples (equivalent to the negative text in the example), and construct reference-negative sample pairs;

[0110] 3. Traverse the reference-negative sample pairs, and obtain the positive sample paragraph (equivalent to the positive text in the example) by performing partial random deletion, replacement, back translation, etc. on the sentences in the reference paragraph, and construct the reference-positive-negative sample paragraph triplet;

[0111] 4. Traverse the reference-positive-negative sample triplet and remove the low-difficulty reference-positive-negative sample triplets.

[0112] The "low-difficulty reference-positive-negative sample triplet" must meet the following conditions:

[0113] 1. The ROUGE score between the positive and negative samples is less than the second threshold;

[0114] 2. The BLEU score between the positive and negative samples is less than the third threshold;

[0115] 3. The similarity between the representation vectors of the positive and negative samples is less than or greater than the fourth threshold;

[0116] The second, third, and fourth thresholds are preset by those skilled in the art when implementing the method, depending on the specific circumstances. The purpose of this step is to eliminate sample groups where the positive and negative samples are very similar, as these samples have no comparative value.

[0117] The "representation vectors between positive and negative samples" are calculated by a representation model, which is a general-domain trained representation model based on the BERT network architecture. In this method, it is only used to score the training samples and is not involved in the specific training process.

[0118] The second sentence-level model is trained using the following method:

[0119] 1. Randomly select a reference-positive-negative sample triplet, and use special characters to sequentially concatenate the reference-positive-negative sample into the input text, and set the regression value to 1.0 (the regression value is the actual evaluation result in the example);

[0120] 2. The triplet is concatenated into the input text using special characters in the order of reference-negative sample-positive sample, and the regression value is set to 0.0;

[0121] 3. The input text is formed by concatenating, with special characters, in the order reference, positive sample, other positive sample, and the regression value is set to 0.5; the other positive sample is obtained by positive sampling based on the reference.

[0122] 4. The triplet is concatenated into the input text using special characters in the order of reference-negative sample-other negative samples, and the regression value is set to 0.5; among them, other negative samples are obtained by negative sampling based on the reference.

[0123] 5. After word segmentation, the above input text is input into the second model to obtain the predicted value (i.e., the evaluation value in the example). The MSE loss and the gradient of the model parameters are calculated using the predicted value and the regression value. The model parameters are then updated by the optimizer.

[0124] 6. Repeat steps 1-5 until the model converges.
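The construction of labelled inputs in steps 1-4 and the MSE loss in step 5 can be sketched as follows. The separator token `[SEP]` and the function names are hypothetical, and step 3 is read here as pairing the positive sample with another positive sample; the forward pass and optimizer update are omitted.

```python
SEP = "[SEP]"  # hypothetical special character used as the delimiter


def build_training_examples(reference, positive, negative,
                            other_positive, other_negative):
    """Return (input text, regression value) pairs for one triplet."""
    def join(*parts):
        return SEP.join(parts)

    return [
        (join(reference, positive, negative), 1.0),        # first text is better
        (join(reference, negative, positive), 0.0),        # second text is better
        (join(reference, positive, other_positive), 0.5),  # both good: tie
        (join(reference, negative, other_negative), 0.5),  # both bad: tie
    ]


def mse_loss(predictions, targets):
    """Mean squared error between predicted values and regression values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


examples = build_training_examples("ref", "pos", "neg", "pos2", "neg2")
```

The 0.5 labels give the second model explicit examples of ties, so that two equally good or equally poor texts are not forced toward either extreme.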

[0125] The second model at the paragraph level is similar to that at the sentence level, except that it requires the following processing for reference, positive sample, negative sample, other positive sample, and other negative sample:

[0126] 1. If the number of characters in a paragraph is greater than the first hyperparameter, take a number of characters equal to the second hyperparameter from the beginning of the paragraph and a number of characters equal to the third hyperparameter from the end of the paragraph and concatenate them, using special characters as delimiters;

[0127] 2. If the number of characters in a paragraph is less than the first hyperparameter, then use special characters to expand the number of characters in the current paragraph to the first hyperparameter;

[0128] The first hyperparameter, the second hyperparameter, and the third hyperparameter are preset by those skilled in the art when implementing the method, depending on the specific circumstances.
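The paragraph-level length handling can be sketched as follows. The sketch works on a list of characters so that each special character counts as one unit; the tokens `[SEP]` and `[PAD]`, the function name, and the example hyperparameter values are hypothetical.

```python
def normalize_paragraph(text, max_len, head_len, tail_len,
                        sep="[SEP]", pad="[PAD]"):
    """Truncate a long paragraph to head + tail, or pad a short one.

    max_len, head_len and tail_len play the roles of the first, second and
    third hyperparameters respectively.
    """
    chars = list(text)
    if len(chars) > max_len:
        # Keep the beginning and the end, joined by a delimiter.
        return chars[:head_len] + [sep] + chars[len(chars) - tail_len:]
    # Expand short paragraphs with padding up to max_len units.
    return chars + [pad] * (max_len - len(chars))


long_result = normalize_paragraph("abcdefghij", max_len=6, head_len=2, tail_len=3)
short_result = normalize_paragraph("abc", max_len=6, head_len=2, tail_len=3)
```

A paragraph whose length exactly equals the first hyperparameter is returned unchanged in this sketch, since the text above does not specify that case.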

[0129] The first model mentioned above selects model pairs in the following manner:

[0130] 1. Initialize K×(K-1) parameter tuples (used to represent the position probability parameters at the specified positions mentioned above), each indicating, for one ordered pair of models, that natural language generation model 1 is superior to natural language generation model 2;

[0131] 2. Traverse the parameter pairs and perform beta distribution sampling based on the parameter pairs to obtain the sampling probability;

[0132] 3. Select the generative model pair corresponding to the parameter pair with the highest sampling probability.
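The selection procedure above can be sketched with the standard library's Beta sampler. The initial value of each tuple (here 1.0 for both entries) and the function names are assumptions made for illustration; the text does not state how the tuples are initialized.

```python
import itertools
import random


def init_parameter_tuples(k, prior=1.0):
    """One [a, b] tuple per ordered pair of K models, K*(K-1) in total."""
    return {pair: [prior, prior]
            for pair in itertools.permutations(range(k), 2)}


def select_model_pair(params, rng):
    """Sample Beta(a, b) for every ordered pair and pick the largest sample."""
    samples = {pair: rng.betavariate(a, b) for pair, (a, b) in params.items()}
    return max(samples, key=samples.get)


params = init_parameter_tuples(4)
chosen = select_model_pair(params, random.Random(0))
```

Sampling rather than taking the pair with the largest mean keeps some randomness in the choice, so pairs with few observations still get selected occasionally.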

[0133] The second model obtains the uncertainty score and the first predicted value through the following process:

[0134] 1. The reference text and the results of the natural language generation model are concatenated in a specific order using special characters and then segmented to obtain the input;

[0135] 2. Compute the evaluation result of the second model repeatedly, the number of repetitions being given by the third hyperparameter (this number equals the number of different input methods mentioned above). In each computation, some network nodes are randomly deactivated by Monte Carlo sampling;

[0136] 3. Calculate the variance of the above results as the uncertainty score (equivalent to the dispersion index), and the mean as the first predicted value (i.e., the regression value);

[0137] The process of updating the second model based on the annotation results from the annotators includes the following steps:

[0138] 1. Receive the reference (i.e., the target test text), the first test text, the second test text, and the corresponding regression values;

[0139] 2. Concatenate the reference, the first test text, and the second test text in that order using special characters to construct a training text, and use the regression value as its label;

[0140] 3. Concatenate the reference, the second test text, and the first test text in that order using special characters to construct a training text, and use 1 minus the regression value as its label;

[0141] 4. Update the network parameters of the second model according to the aforementioned method.
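Steps 2 and 3 above amount to building a pair of mirrored training examples from one annotation, which can be sketched as follows. The separator token and the function name are hypothetical, and the parameter update of step 4 is omitted.

```python
def annotation_to_training_examples(reference, first_text, second_text,
                                    regression_value, sep="[SEP]"):
    """Build the two mirrored (training text, label) pairs for one annotation."""
    forward = (sep.join([reference, first_text, second_text]),
               regression_value)
    swapped = (sep.join([reference, second_text, first_text]),
               1.0 - regression_value)
    return [forward, swapped]


pairs = annotation_to_training_examples("q", "a", "b", 0.75)
```

Adding the swapped example encourages the second model to give consistent judgments regardless of the order in which the two test texts appear.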

[0142] Updating the first model parameters using the first predicted value includes the following three cases:

[0143] 1. If the first predicted value is greater than 0.5, the fourth hyperparameter (i.e., the first preset value) is added to the first-position parameter in the parameter tuple of the first natural language generation model and the second natural language generation model under the first position arrangement.

[0144] At the same time, the fifth hyperparameter (i.e., the second preset value) is added to the second-position parameter in the parameter tuple of the second natural language generation model and the first natural language generation model under the second position arrangement.

[0145] It should be noted that in the first position arrangement the first natural language generation model is in the first position and the second natural language generation model is in the second position, while in the second position arrangement the first natural language generation model is in the second position and the second natural language generation model is in the first position. The parameter tuple has the form (a, b), where a is the first-position parameter and b is the second-position parameter, and the relative magnitude of a and b represents the magnitude of the position probability parameter. Since the position probability parameter represents the probability that, under a specified position arrangement, the text generation accuracy of the natural language generation model in the first position is higher than that of the model in the second position, interchanging the positions of the two models also changes the relative relationship between a and b. Accordingly, the first model can be updated by increasing the first-position parameter in the parameter tuple under the first position arrangement and / or by increasing the second-position parameter in the parameter tuple under the second position arrangement; the two approaches are essentially the same.

[0146] 2. Similarly, if the first predicted value is less than 0.5, the fifth hyperparameter is added to the second-position parameter in the parameter tuple of the first natural language generation model and the second natural language generation model in the first model.

[0147] At the same time, the fourth hyperparameter is added to the first-position parameter in the parameter tuple of the second natural language generation model and the first natural language generation model.

[0148] 3. If the first predicted value is 0.5, no parameter update is performed;

[0149] The fourth and fifth hyperparameters are preset by those skilled in the art when implementing the method, depending on the specific circumstances.
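The three update cases can be sketched as follows, assuming the parameter tuples are stored in a dictionary keyed by ordered model pairs. The function name, the dictionary layout, and the example values of the fourth and fifth hyperparameters are hypothetical.

```python
def update_first_model(params, first, second, predicted_value,
                       fourth_hp=1.0, fifth_hp=1.0):
    """Update the (a, b) tuples of both orderings of one model pair in place."""
    forward = params[(first, second)]   # first position arrangement
    reverse = params[(second, first)]   # second position arrangement
    if predicted_value > 0.5:
        forward[0] += fourth_hp         # first model judged better
        reverse[1] += fifth_hp
    elif predicted_value < 0.5:
        forward[1] += fifth_hp          # second model judged better
        reverse[0] += fourth_hp
    # predicted_value == 0.5: tie, no parameter update
    return params


params = {(0, 1): [1.0, 1.0], (1, 0): [1.0, 1.0]}
update_first_model(params, 0, 1, 0.8, fourth_hp=2.0, fifth_hp=3.0)
```

Updating both orderings keeps the two tuples for a model pair consistent with each other, so the same observation is reflected whichever ordering is sampled later.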

[0150] The specified condition mentioned above refers to a parameter pair (a, b) in the first model satisfying the following condition:

[0151] 1. It is the largest among all parameter pairs and is greater than 0.5;

[0152] 2. It is less than the fifth hyperparameter.

[0153] The aforementioned optimal natural language generation model is the natural language generation model in the first position of the parameter tuple in the first model that satisfies the specified condition.

[0154] To clearly and completely describe the method in this specific example, the implementation process of this method in the evaluation scenario of similar question generation model in the insurance industry will be further explained in detail below.

[0155] This example includes two processes: training the second model and evaluating the natural language generation models. Optionally, the second model in this example consists of an encoding network and an evaluation network. The network structure and weights of the encoding network are taken from the open-source pre-trained language model chinese-roberta-wwm-ext. The evaluation network is a single-layer feedforward network. The first model does not require training.

[0156] For constructing the training data, the open-source CLUE database was chosen, and the relevant training data was constructed according to the following method:

[0157] 1. Segment the dataset to obtain the complete set of sentences;

[0158] 2. Traverse all sentences to obtain the current sentence, and sample 10 non-current sentences from all sentences as negative samples to construct a reference-negative sample sentence pair;

[0159] 3. Traverse the reference-negative sample sentence pairs, and operate on the reference sentence using the following method to obtain the positive sample sentence and construct the reference-positive-negative sample sentence triplet:

[0160] a) Deletion: Randomly delete a portion of the string. The deletion rate is sampled uniformly from 10%-30%, and the length of the deleted text is calculated from the deletion rate. The deletion starting point is then sampled uniformly from 0 to the sentence length minus the length of the deleted text. The text from the deletion starting point to the starting point plus the length of the deleted text is deleted to obtain the sample sentence.

[0161] b) Replacement: Randomly replace a portion of the text. The replacement rate is sampled uniformly from 10%-70%, and the replacement text length is calculated from the replacement rate. The replacement starting point is then sampled uniformly from 0 to the sentence length minus the replacement text length. The text from the replacement starting point to the starting point plus the replacement text length is taken as the replacement text. An insertion point in the remaining text is sampled uniformly, and the replacement text is inserted there to obtain the sample sentence.

[0162] c) Back translation: Translate the text into English, and then translate the English back into Chinese to obtain the sample sentence;

[0163] 4. Traverse the reference-positive-negative sentence triplets and remove the low-difficulty reference-positive-negative sentence triplets.
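The deletion and replacement operations of step 3 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `delete_span` and `move_span`, the rounding of the span length with `int`, and the use of character positions are all assumptions. Back translation is omitted because it depends on an external translation system.

```python
import random


def delete_span(sentence: str, rng: random.Random) -> str:
    """Deletion: remove one contiguous span covering 10%-30% of the sentence."""
    rate = rng.uniform(0.10, 0.30)
    span = int(len(sentence) * rate)  # assumption: round the length down
    # Start is sampled uniformly from 0 to (sentence length - span length).
    start = rng.randint(0, len(sentence) - span)
    return sentence[:start] + sentence[start + span:]


def move_span(sentence: str, rng: random.Random) -> str:
    """Replacement: cut a span covering 10%-70% and reinsert it elsewhere."""
    rate = rng.uniform(0.10, 0.70)
    span = int(len(sentence) * rate)  # assumption: round the length down
    start = rng.randint(0, len(sentence) - span)
    moved = sentence[start:start + span]
    rest = sentence[:start] + sentence[start + span:]
    # Insertion point is sampled uniformly over the remaining text.
    insert_at = rng.randint(0, len(rest))
    return rest[:insert_at] + moved + rest[insert_at:]
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which makes it easier to regenerate the same training set.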

[0164] The low-difficulty reference-positive-negative sentence triplet is defined as follows:

[0165] 1. The ROUGE score between the positive and negative samples is less than 0.1;

[0166] 2. The BLEU score between the positive and negative samples is less than 0.15;

[0167] 3. The cosine similarity between the representation vectors of the positive and negative samples is less than 0.05; the representation vectors are calculated using the open-source SimCSE model.
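A hedged sketch of the low-difficulty filter is shown below. The three scorers are passed in as hypothetical callables (`rouge`, `bleu`, `cosine`), since the text names the metrics but not a specific library. The text also does not say whether all three conditions must hold or whether any one suffices, so the `require_all` flag is an assumption that exposes both readings.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (reference, positive, negative)
Scorer = Callable[[str, str], float]  # hypothetical similarity function


def is_low_difficulty(positive: str, negative: str,
                      rouge: Scorer, bleu: Scorer, cosine: Scorer,
                      require_all: bool = True) -> bool:
    """Check the three thresholds given in the text for one triplet."""
    checks = [
        rouge(positive, negative) < 0.1,
        bleu(positive, negative) < 0.15,
        cosine(positive, negative) < 0.05,
    ]
    # Assumption: the source does not state how the conditions combine.
    return all(checks) if require_all else any(checks)


def drop_low_difficulty(triplets: List[Triplet],
                        rouge: Scorer, bleu: Scorer, cosine: Scorer,
                        require_all: bool = True) -> List[Triplet]:
    """Keep only triplets whose positive and negative samples are hard to separate."""
    return [t for t in triplets
            if not is_low_difficulty(t[1], t[2], rouge, bleu, cosine, require_all)]
```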

[0168] After obtaining the training data, the second model is trained using the following method:

[0169] 1. Randomly select a reference-positive-negative sample triplet, concatenate the reference-positive-negative sample triplet into the input text using special characters, and set the regression value to 1.0;

[0170] 2. The triplet is concatenated into the input text using special characters in the order of reference-negative sample-positive sample, and the regression value is set to 0.0;

[0171] 3. The triplet is concatenated into the input text using special characters in the order of reference-positive sample-other positive sample, and the regression value is set to 0.5; here, the other positive sample is obtained by positive sampling based on the reference.

[0172] 4. The triplet is concatenated into the input text using special characters in the order of reference-negative sample-other negative samples, and the regression value is set to 0.5; among them, other negative samples are obtained by negative sampling based on the reference.

[0173] 5. After word segmentation of the above input text, input it into the second model to obtain the predicted value, and calculate the MSE loss and the gradient of the model parameters using the predicted value and the regression value. Then, update the model parameters using the AdamW optimizer.

[0174] 6. Repeat steps 1-5 until the model converges; specifically, determine whether the model has converged by observing the MSE loss value of the model on the validation set. That is, if the loss value on the validation set does not decrease within the set number of steps, the model is considered to have converged.
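Steps 1-4 amount to building four labelled input texts from each triplet. The sketch below illustrates that construction together with the MSE loss; it leaves out tokenization, the network itself, and the AdamW update. The separator `"[sep]"` is borrowed from the evaluation stage, the helper names are hypothetical, and reading step 3 as a pair of two positive samples is an assumption.

```python
from typing import List, Sequence, Tuple

SEP = "[sep]"  # assumption: same special character as in the evaluation stage


def build_training_examples(reference: str, positive: str, negative: str,
                            other_positive: str,
                            other_negative: str) -> List[Tuple[str, float]]:
    """Return (input text, regression target) pairs for one triplet."""
    return [
        (SEP.join([reference, positive, negative]), 1.0),        # step 1
        (SEP.join([reference, negative, positive]), 0.0),        # step 2
        (SEP.join([reference, positive, other_positive]), 0.5),  # step 3
        (SEP.join([reference, negative, other_negative]), 0.5),  # step 4
    ]


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and target regression values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```

The targets 1.0 and 0.0 teach the second model that the order of the two candidates matters, while the 0.5 targets teach it to stay neutral when both candidates are of similar quality.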

[0175] After training the second model, the natural language generation models can be evaluated. In this example, there are 20 natural language generation models to be evaluated, and the specific process is as follows:

[0176] 1. Initialize the first model parameters (as shown in Table 1), resulting in 380 parameter pairs, all with an initial value of 1;

[0177] Table 1

[0178]
Assumption | Parameter 1 | Parameter 2
Model 1 is better than Model 2 | 1 | 1
Model 1 is better than Model 3 | 1 | 1
… | … | …
Model k-1 is better than Model k | 1 | 1

[0179] 2. Use the first model to calculate a probability for each ordered pair of natural language generation models, and select the ordered pair with the highest probability as the models to be compared (i.e., the first language generation model and the second language generation model in the embodiment), denoted as model A and model B.

[0180] 3. Sample one test data point from the test data without replacement, denoted as text 0, and input text 0 into model A to generate text A. Similarly, generate text B.

[0181] 4. Concatenate the text 0, text A, and text B sequentially using the special character "[sep]", and then segment them using a word segmenter to obtain the input sequence A of the second model;

[0182] 5. Perform 20 Monte Carlo sampling passes; in each pass, randomly deactivate some network nodes of the second model, input sequence A, and calculate the corresponding prediction result, yielding a set of 20 prediction results;

[0183] 6. Take the variance of the 20 prediction results as the uncertainty score and their average as the first predicted value;

[0184] 7. If the uncertainty score is greater than 0.03, the result will be given to the annotators, who will score the confidence level of the hypothesis that "Model A is better than Model B". The score range is 0-1, where 0 indicates that the hypothesis that "Model A is better than Model B" is very unlikely to be true, and 1 indicates that the hypothesis that "Model A is better than Model B" is very likely to be true. The annotators' scores will then be fed back to the second model to update it. At the same time, the first predicted value will be updated to the manually scored annotated result.

[0185] 8. Update the first model using the first predicted value, and determine whether the parameter state of the first model meets the termination condition. If it does, output the best natural language generation model; otherwise, repeat steps 2-8.
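Steps 5-7 can be sketched as below. The second model is represented by a hypothetical zero-argument callable `predict_once` that runs one forward pass with random deactivation and returns a score in [0, 1]. Using the population variance is an assumption, since the text does not say which variance estimator is used.

```python
import statistics
from typing import Callable, Tuple


def mc_dropout_estimate(predict_once: Callable[[], float],
                        n_samples: int = 20) -> Tuple[float, float]:
    """Run n stochastic forward passes and return (mean, variance)."""
    predictions = [predict_once() for _ in range(n_samples)]
    mean = statistics.mean(predictions)          # first predicted value
    variance = statistics.pvariance(predictions)  # uncertainty score
    return mean, variance


def needs_annotation(variance: float, threshold: float = 0.03) -> bool:
    """Route the sample to human annotators when uncertainty exceeds the threshold."""
    return variance > threshold
```

When `needs_annotation` returns true, the annotator's score replaces the mean as the first predicted value and is also used to update the second model.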

[0186] Specifically, the probability that the first model calculates for each ordered pair of natural language generation models is obtained by substituting the corresponding parameter pair of the first model into a beta distribution and sampling randomly from it; in the first model, each parameter pair corresponds to the hypothesis that one natural language generation model is better than another.
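A minimal sketch of the first model's initialization and pair selection follows. It keeps one Beta parameter pair per ordered pair of models, which gives 20 × 19 = 380 pairs for 20 models. The dictionary layout and function names are assumptions. Drawing one sample per hypothesis and taking the maximum is similar in spirit to Thompson sampling.

```python
import itertools
import random
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (i, j) stands for the hypothesis "model i is better than model j"


def init_first_model(n_models: int) -> Dict[Pair, List[int]]:
    """One [parameter 1, parameter 2] pair per ordered model pair, all set to 1."""
    return {pair: [1, 1] for pair in itertools.permutations(range(n_models), 2)}


def select_pair(params: Dict[Pair, List[int]], rng: random.Random) -> Pair:
    """Sample each hypothesis from Beta(p1, p2) and return the highest draw."""
    draws = {pair: rng.betavariate(p1, p2) for pair, (p1, p2) in params.items()}
    return max(draws, key=draws.get)
```

With all parameters at 1 every hypothesis is sampled from a uniform distribution, so early comparisons are spread across pairs; as counts accumulate, the draws concentrate on the pairs that the evidence favours.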

[0187] Specifically, the "updating of the second model" mentioned above includes the following process:

[0188] 1. Receive reference text (text 0), first predicted text (text A), second predicted text (text B), and the corresponding regression value (first predicted value);

[0189] 2. Construct training text by splicing the reference text, the first predicted text, and the second predicted text in that order using special characters ("[sep]"), and use the regression value (first predicted value) as the value of the input text;

[0190] 3. Construct a second training text by splicing the reference text, the second predicted text, and the first predicted text in that order using the special character ("[sep]"), and use 1 minus the regression value as the value of this input text;

[0191] 4. After the two input texts are segmented, they are input into the second model to obtain the predicted values. The MSE loss and the gradient of the model parameters are calculated using the predicted values and the regression values. The model parameters are then updated using the AdamW optimizer.
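The two training texts built in steps 2 and 3 of the second-model update can be illustrated as follows. The function name is hypothetical, and the gradient step itself is left out.

```python
from typing import List, Tuple


def build_update_examples(reference: str, text_a: str, text_b: str,
                          value: float,
                          sep: str = "[sep]") -> List[Tuple[str, float]]:
    """Return the forward and swapped (input text, target) pairs."""
    forward = (sep.join([reference, text_a, text_b]), value)
    # Swapping the two candidates mirrors the target around 0.5.
    swapped = (sep.join([reference, text_b, text_a]), 1.0 - value)
    return [forward, swapped]
```

Training on both orderings encourages the second model to give complementary scores when the two candidate texts are swapped.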

[0192] Specifically, the aforementioned "update of the first model" includes the following situations:

[0193] 1. If the first predicted value is greater than 0.5, add 1 to the first position parameter in the parameter tuple of the first natural language generation model and the second natural language generation model in the first model;

[0194] At the same time, add 1 to the second position parameter in the parameter tuple of the second natural language generation model and the first natural language generation model;

[0195] 2. If the first predicted value is less than 0.5, add 1 to the second position parameter in the parameter tuple of the first natural language generation model and the second natural language generation model in the first model;

[0196] At the same time, add 1 to the first position parameter in the parameter tuple of the second natural language generation model and the first natural language generation model;

[0197] 3. If the first predicted value is 0.5, no parameter update is performed.
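The three update cases can be sketched as a single function. The dictionary keyed by ordered model pairs is an assumed representation of the first model, in which key `(a, b)` holds the two parameters for the hypothesis "model a is better than model b".

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def update_first_model(params: Dict[Pair, List[int]],
                       a: int, b: int, value: float) -> None:
    """Update the parameter tuples for (a, b) and (b, a) in place."""
    if value > 0.5:
        params[(a, b)][0] += 1  # evidence for "a is better than b"
        params[(b, a)][1] += 1  # evidence against "b is better than a"
    elif value < 0.5:
        params[(a, b)][1] += 1  # evidence against "a is better than b"
        params[(b, a)][0] += 1  # evidence for "b is better than a"
    # value == 0.5: no parameter update
```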

[0198] In summary, this method uses a second model based on a large-scale pre-trained model to automatically score samples with low uncertainty, saving manual cost, and uses the first model to find the optimal model with less manual annotation. The manual annotation that remains is a pairwise comparison of generated texts, which is less difficult, produces less noise, and yields more reliable evaluations. Therefore, this method has the beneficial effects of high reliability and low manual labor.

[0199] It is understood that the various method embodiments mentioned above in this application can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this application will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.

[0200] Example 3

[0201] Figure 3 is a block diagram of an evaluation device 30 for a language generation model provided in Embodiment 3 of this application. Referring to Figure 3, the evaluation device 30 for the language generation model includes:

[0202] The selection module 31 is used to output the model comparison results between any two language generation models among the multiple language generation models to be evaluated through the first model, obtain multiple model comparison results, and select the first language generation model and the second language generation model from the multiple language generation models based on the multiple model comparison results. The model comparison results are used to characterize the comparison results between the text generation accuracy of any two language generation models.

[0203] The acquisition module 32 is used to acquire the target test text, obtain the first test text corresponding to the target test text through the first language generation model, and obtain the second test text corresponding to the target test text through the second language generation model.

[0204] The input module 33 is used to concatenate the target test text, the first test text, and the second test text into input data, input the input data into the second model, and obtain the model evaluation results for the first language generation model and the second language generation model in this evaluation process based on the output results of the second model. The model evaluation results are used to characterize the accuracy comparison results between the first test text generated by the first language generation model and the second test text generated by the second language generation model.

[0205] Update module 34 is used to update the first model based on the model evaluation results;

[0206] Evaluation module 35 is used to evaluate multiple language generation models based on the model comparison results corresponding to multiple language generation models output by the updated first model.

[0207] Optionally, the selection module 31 is specifically used to: obtain the position probability parameter corresponding to the model comparison result between each pair of language generation models; wherein, the position probability parameter corresponding to the model comparison result between each pair of language generation models is used to characterize the probability that the text generation accuracy of the first-ranked language generation model is higher than that of the second-ranked language generation model when the two corresponding language generation models are arranged in a specified position; and take the two language generation models corresponding to the comparison result with the highest probability as the first language generation model and the second language generation model; wherein, the first language generation model is the first-ranked model and the second language generation model is the second-ranked model.

[0208] Optionally, the input module 33 is specifically used for: inputting input data into the second model through various different input methods, and obtaining multiple evaluation values output by the second model corresponding to the various different input methods; wherein each evaluation value is used to characterize the probability that the text generation accuracy of the first language generation model is better than that of the second language generation model; calculating the dispersion index of the multiple evaluation values; if the dispersion index is greater than a preset dispersion threshold, obtaining the auxiliary annotation result triggered by the input data; determining the true evaluation result corresponding to the input data based on the auxiliary annotation result; correcting the output result of the second model based on the true evaluation result, and using the corrected result as the model evaluation result for the first language generation model and the second language generation model in this evaluation process; and updating the second model based on the true evaluation result.

[0209] Optionally, the input module 33 is specifically used to: in each input method, randomly deactivate some network nodes in the second model using a preset sampling method, and input the input data into the second model after randomly deactivating some network nodes; wherein, the types and number of randomly deactivated network nodes are different in different input methods; wherein, the preset sampling method includes: Monte Carlo sampling method.

[0210] Optionally, the update module 34 is specifically used to: update at least one model parameter in the first model according to the model evaluation result, so as to update the position probability parameter corresponding to the target model comparison result, wherein the target model comparison result is the model comparison result between the first language generation model and the second language generation model output by the first model; wherein, the specific method of updating at least one model parameter in the first model includes: if the model evaluation result matches the target model comparison result output by the first model, then the parameter value of the position probability parameter corresponding to the updated model comparison result output by the first model is increased; if the model evaluation result does not match the target model comparison result output by the first model, then the parameter value of the position probability parameter corresponding to the updated model comparison result output by the first model is decreased.

[0211] Optionally, the evaluation module 35 is specifically used to: determine whether the first model after this update satisfies the loop termination condition;

[0212] If yes, evaluate the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the updated first model; otherwise, repeatedly execute the above steps using the updated first model until the updated first model meets the loop termination condition, and evaluate the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the first model that meets the loop termination condition.

[0213] Optionally, the loop termination condition includes: the number of loops reaches a preset number; or, the model comparison results of the updated first model output corresponding to multiple language generation models satisfy a preset convergence condition; wherein, the preset convergence condition includes: among the position probability parameters output by the first model corresponding to the model comparison results between every two language generation models, the position probability parameter with the largest parameter value matches a preset convergence threshold.
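One possible reading of the loop termination condition is sketched below. It rests on several assumptions: the probability of each hypothesis is approximated by the Beta mean p1 / (p1 + p2), "matches a preset convergence threshold" is read as "is greater than or equal to the threshold", and the names `should_stop`, `max_loops`, and `threshold` are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def should_stop(params: Dict[Pair, List[int]], loops_done: int,
                max_loops: int, threshold: float) -> bool:
    """Stop when the loop budget is used up or the best hypothesis is confident enough."""
    if loops_done >= max_loops:
        return True
    # Assumption: use the Beta mean as the probability of each hypothesis.
    best = max(p1 / (p1 + p2) for p1, p2 in params.values())
    return best >= threshold
```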

[0214] Optionally, the input module 33 is also used for training the second model, which is trained in the following way: selecting a first text from the text set as a reference text; randomly selecting, from the texts in the text set other than the first text, a second text as the negative text corresponding to the reference text; performing preset processing on the first text, and obtaining the positive text corresponding to the reference text based on the processed first text, wherein the preset processing includes at least one of the following: random deletion processing, substitution processing, and back translation processing; generating a set of sample pairs based on the reference text, the negative text, and the positive text; and training the second model with a sample set consisting of multiple sets of sample pairs, so that the trained second model outputs a result indicating that the reference text and the positive text match and/or that the reference text and the negative text do not match.

[0215] The specific structure and working principle of each of the above modules can be found in the descriptions of the corresponding parts of Method Embodiment 1 and Embodiment 2, and will not be repeated here.

[0216] Example 4

[0217] Figure 4 is a block diagram of an electronic device 40 provided in Embodiment 4 of this application. Referring to Figure 4, the electronic device includes:

[0218] At least one processor 401; at least one memory 402; and one or more I/O interfaces 403 connected between the processor 401 and the memory 402; wherein the memory 402 stores one or more computer programs 405 that can be executed by at least one processor 401, and the one or more computer programs 405 are executed by at least one processor 401 to perform the evaluation method of the above-mentioned language generation model.

[0219] This application also provides a computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor/processor core, implements the above-described evaluation method for the language generation model. The computer-readable storage medium may be volatile or non-volatile.

[0220] This application also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code is run in the processor of an electronic device, the processor in the electronic device executes the evaluation method of the above-mentioned language generation model.

[0221] Those skilled in the art will understand that all or some of the steps, systems, and apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software can be distributed on a computer-readable storage medium, which may include computer storage media (or non-transitory media) and communication media (or transient media).

[0222] As is known to those skilled in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable program instructions, data structures, program modules, or other data). Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technologies, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer. Furthermore, it is known to those skilled in the art that communication media typically contain computer-readable program instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.

[0223] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.

[0224] The computer program instructions used to perform the operations of this application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuits, such as programmable logic circuits, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), are personalized by utilizing state information from the computer-readable program instructions. These electronic circuits can execute the computer-readable program instructions to implement various aspects of this application.

[0225] The computer program product described herein can be implemented specifically through hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is specifically embodied in a computer storage medium; in another alternative embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.

[0226] Various aspects of this application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0227] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0228] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.

[0229] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0230] Example embodiments have been disclosed herein, and while specific terminology has been used, it is for general illustrative purposes only and should not be construed as limiting. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in conjunction with particular embodiments may be used alone, or in combination with features, characteristics, and / or elements described in conjunction with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of this application as set forth by the appended claims.

Claims

1. An evaluation method for a language generation model, characterized in that it includes:
outputting, through a first model, the model comparison results between any two language generation models among multiple language generation models to be evaluated, to obtain multiple model comparison results, and selecting a first language generation model and a second language generation model from the multiple language generation models based on the multiple model comparison results, wherein the model comparison results are used to characterize the comparison results between the text generation accuracy of any two language generation models;
obtaining a target test text, obtaining a first test text corresponding to the target test text through the first language generation model, and obtaining a second test text corresponding to the target test text through the second language generation model;
concatenating the target test text, the first test text, and the second test text into input data, inputting the input data into a second model, and obtaining, based on the output of the second model, the model evaluation results for the first language generation model and the second language generation model in this evaluation process, wherein the model evaluation results are used to characterize the accuracy comparison results between the first test text generated by the first language generation model and the second test text generated by the second language generation model;
updating the first model based on the model evaluation results; and
evaluating the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the updated first model;
wherein the step of selecting the first language generation model and the second language generation model from the multiple language generation models based on the model comparison results includes: obtaining a position probability parameter corresponding to the model comparison result between each pair of language generation models, wherein the position probability parameter corresponding to the model comparison result between each pair of language generation models is used to characterize the probability that, given a specified positional arrangement of the two corresponding language generation models, the text generation accuracy of the first-ranked language generation model is higher than that of the second-ranked language generation model; and selecting the two language generation models corresponding to the comparison result with the highest probability as the first language generation model and the second language generation model.

2. The method according to claim 1, characterized in that the first language generation model is ranked first, and the second language generation model is ranked second.

3. The method according to claim 1, characterized in that obtaining the model evaluation results for the first language generation model and the second language generation model based on the output of the second model includes: inputting the input data into the second model through various different input methods, and obtaining multiple evaluation values output by the second model corresponding to the various different input methods, wherein each evaluation value is used to characterize the probability that the text generation accuracy of the first language generation model is better than that of the second language generation model; calculating the dispersion index of the multiple evaluation values; if the dispersion index is greater than a preset dispersion threshold, obtaining the auxiliary annotation result triggered for the input data; determining, based on the auxiliary annotation result, the true evaluation result corresponding to the input data; correcting the output of the second model based on the true evaluation result, and using the corrected result as the model evaluation result for the first language generation model and the second language generation model in this evaluation process; and updating the second model based on the true evaluation result.

4. The method according to claim 3, characterized in that inputting the input data into the second model through various different input methods includes: in each input method, randomly deactivating some network nodes in the second model using a preset sampling method, and inputting the input data into the second model after the network nodes have been randomly deactivated; wherein the types and numbers of randomly deactivated network nodes differ between input methods, and the preset sampling method includes Monte Carlo sampling.

5. The method according to any one of claims 2-4, characterized in that the step of updating the first model based on the model evaluation result includes: updating, based on the model evaluation result, at least one model parameter in the first model so as to update the position probability parameter corresponding to the target model comparison result, wherein the target model comparison result is the model comparison result between the first language generation model and the second language generation model output by the first model; and the specific method for updating at least one model parameter in the first model includes: if the model evaluation result matches the target model comparison result output by the first model, increasing the parameter value of the position probability parameter corresponding to the updated target model comparison result output by the first model; and if the model evaluation result does not match the target model comparison result output by the first model, decreasing the parameter value of the position probability parameter corresponding to the updated target model comparison result output by the first model.

6. The method according to claim 2, characterized in that evaluating the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the updated first model includes: determining whether the updated first model satisfies a loop termination condition; if so, evaluating the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the first model after this update; if not, repeating the above steps using the updated first model until the updated first model satisfies the loop termination condition, and then evaluating the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the first model that satisfies the loop termination condition.

7. The method according to claim 6, characterized in that the loop termination condition includes: the loop count has reached a preset number; or, the model comparison results corresponding to the multiple language generation models output by the updated first model satisfy a preset convergence condition; wherein the preset convergence condition includes: among the ranking probability parameters output by the first model and corresponding to the model comparison results between each two language generation models, the ranking probability parameter with the largest parameter value matches a preset convergence threshold.
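The two termination conditions of claim 7 can be sketched as a single predicate. The claim says the largest ranking probability parameter "matches" the convergence threshold without defining the comparison; reading it as "is at least the threshold" is an assumption, as are the function and parameter names.

```python
def should_stop(loop_count, max_loops, ranking_probs, convergence_threshold):
    """Return True when either termination condition holds.

    ranking_probs: non-empty dict mapping an ordered model pair, such as
                   ("A", "B"), to its ranking probability parameter.
    """
    # Condition 1: the loop count has reached the preset number.
    if loop_count >= max_loops:
        return True
    # Condition 2 (assumed reading of "matches"): the largest ranking
    # probability parameter is at least the convergence threshold.
    return max(ranking_probs.values()) >= convergence_threshold
```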

8. The method according to claim 1, characterized in that the second model is trained in the following way: selecting a first text from a text set as a reference text; randomly selecting, from the texts in the text set other than the first text, a second text as the negative text corresponding to the reference text; subjecting the first text to preset processing, and obtaining the positive text corresponding to the reference text based on the processed first text, wherein the preset processing includes at least one of the following: random deletion processing, substitution processing, and back-translation processing; generating a set of sample pairs based on the reference text, the negative text, and the positive text; and training the second model on a sample set consisting of multiple sets of sample pairs, such that the trained second model outputs a first output result indicating that the reference text and the positive text match, and/or outputs a second output result indicating that the reference text and the negative text do not match.
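The sample construction in claim 8 can be sketched with random deletion as the only preset processing step; substitution and back translation are omitted. The function names, the `drop_rate` and `seed` parameters, and the 1/0 labels are assumptions for illustration.

```python
import random


def random_deletion(text, drop_rate, rng):
    """Drop each word with probability drop_rate, keeping at least one."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_rate]
    return " ".join(kept or words[:1])


def build_sample_pairs(texts, drop_rate=0.2, seed=0):
    """Build (reference, other, label) samples; label 1 = match, 0 = no match.

    Requires at least two texts so that a negative text can be drawn.
    """
    rng = random.Random(seed)
    samples = []
    for i, reference in enumerate(texts):
        # Negative text: a different text chosen at random from the set.
        negative = rng.choice(texts[:i] + texts[i + 1:])
        # Positive text: a lightly perturbed copy of the reference text.
        positive = random_deletion(reference, drop_rate, rng)
        samples.append((reference, positive, 1))
        samples.append((reference, negative, 0))
    return samples
```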

9. An evaluation device for a language generation model, characterized by comprising: a selection module, used to output, through a first model, the model comparison result between any two language generation models among multiple language generation models to be evaluated, obtain multiple model comparison results, and select a first language generation model and a second language generation model from the multiple language generation models based on the multiple model comparison results, wherein each model comparison result characterizes the comparison between the text generation accuracy of the corresponding two language generation models; an acquisition module, used to acquire a target test text, obtain a first test text corresponding to the target test text through the first language generation model, and obtain a second test text corresponding to the target test text through the second language generation model; an input module, used to concatenate the target test text, the first test text, and the second test text into input data, input the input data into a second model, and obtain, based on the output of the second model, the model evaluation result for the first language generation model and the second language generation model in this evaluation process, wherein the model evaluation result characterizes the accuracy comparison between the first test text generated by the first language generation model and the second test text generated by the second language generation model; an update module, used to update the first model based on the model evaluation result; and an evaluation module, used to evaluate the multiple language generation models based on the model comparison results corresponding to the multiple language generation models output by the updated first model.
Specifically, the selection module is used to: obtain the ranking probability parameter corresponding to the model comparison result between each pair of language generation models, wherein the ranking probability parameter corresponding to the model comparison result between each pair of language generation models characterizes the probability that, when the corresponding two language generation models are arranged in a specified ranking, the text generation accuracy of the first-ranked language generation model is higher than that of the second-ranked language generation model; and select the two language generation models corresponding to the model comparison result with the highest probability as the first language generation model and the second language generation model.
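The selection step of the selection module reduces to picking the pair with the largest ranking probability parameter. A minimal sketch follows, assuming the parameters are held in a dictionary keyed by ordered model pairs; the data layout and the function name are hypothetical.

```python
def select_pair(ranking_probs):
    """Return the (first-ranked, second-ranked) pair with the highest
    ranking probability parameter.

    ranking_probs: non-empty dict mapping an ordered pair of model names
                   to the probability that the first-ranked model is more
                   accurate than the second-ranked model.
    """
    return max(ranking_probs, key=ranking_probs.get)
```

For example, given `{("A", "B"): 0.55, ("A", "C"): 0.70, ("B", "C"): 0.40}`, the pair `("A", "C")` would be selected as the first and second language generation models.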

10. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor, and the one or more computer programs, when executed by the at least one processor, enable the at least one processor to perform the evaluation method for a language generation model as described in any one of claims 1-8.

11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the evaluation method for a language generation model as described in any one of claims 1-8.