Embodiment 1
 As shown in Figure 1, a manual evaluation method for dialogue systems with comprehension assistance mainly includes the following steps:
 Step 11: Select several dialogue evaluation criteria from the existing criteria, construct an evaluation criteria framework, and generate a basic evaluation template.
 Step 12: Design reading questions by referring to reading comprehension assessment methods, embed the reading questions into the dialogue content to be assessed on the basic evaluation template, generate an evaluation template containing the reading questions, and provide it to the workers participating in the manual evaluation of the dialogue system.
 In the embodiment of the present invention, a missing sentence selection strategy and a dialogue content sorting strategy are designed. The two strategies correspond to different reading questions: a strategy can be selected according to the dialogue content to be evaluated, and the corresponding reading questions are generated from the selected strategy and embedded in the dialogue content.
Step 13: Receive the evaluation templates containing reading questions filled in by the workers, extract each worker's answers to the reading questions, screen the workers by these answers, and take the dialogue evaluation results extracted from the templates filled in by the selected workers as the results of the manual evaluation.
 In this embodiment of the present invention, the workers' comprehension level is evaluated according to the reading questions set in Step 12, and only the evaluation results submitted by workers who pass the comprehension test are used as the results of manual evaluation. A performance analysis is given later to illustrate the advantages of the present invention.
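Step 13 can be summarized as a simple filtering operation. The following is a minimal sketch in Python; the field names (`worker_id`, `reading_answer`, `scores`) and the data layout are illustrative assumptions, not the invention's actual template format.

```python
# Hypothetical sketch of the worker-screening step (Step 13): keep only the
# evaluation scores submitted by workers whose reading-question answer
# matches the correct answer (the original sentence or ordering).

def screen_workers(submissions, correct_answer):
    accepted = []
    for sub in submissions:
        if sub["reading_answer"] == correct_answer:
            accepted.append(sub["scores"])
    return accepted

subs = [
    {"worker_id": "w1", "reading_answer": "A", "scores": {"Readability": 4}},
    {"worker_id": "w2", "reading_answer": "B", "scores": {"Readability": 2}},
]
kept = screen_workers(subs, "A")  # only w1's scores survive the screening
```

Only the retained scores then enter the data analysis described below.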
 For ease of understanding, the following describes the main principles of the above solutions.
 In order to determine the evaluation criteria and their definitions, the present invention surveyed 105 related papers from several major conferences in the field of natural language processing published between 2016 and 2020 and, using a grouping analysis, took the 27 criteria used in them as research objects. In addition, to better classify and define the criteria, their definitions and usage scenarios were also examined in dictionaries and linguistics papers, dividing the 27 criteria into the 7 groups shown in Table 1.
 Group 1 (sentence quality evaluation): Fluency, Grammaticality, Correctness, Readability, Understandable
 Group 2 (correlation with the dialogue history): Relevance, Coherence, Consistency, Sensibleness, Listening, Maintain Context, Logic
 Group 3 (generic or repetitive sentence evaluation): Informativeness, Diversity, Specificity, Proactivity, Flexible
 Group 4 (overall sentence quality under the dialogue history): Overall Quality, Appropriateness, Naturalness, Humanness, Adequacy
 Group 5 (interactive experience evaluation): Engagement, Interestingness
 Group 6 (emotional experience evaluation): Empathy, Emotion
 Group 7 (others): /
 Table 1 Grouping of the evaluation criteria
 To choose the final evaluation criteria, the definition of each criterion and its use in the dialogue field must be considered. For example, in the first group, "Grammaticality" and "Correctness" have the same definition: both focus on conformance to grammar rules, and this type of annotation does not require manual labor. "Readability" is generally considered preferable to "Grammaticality" because it emphasizes how easily sentences are understood. In addition, although "Fluency" is the most frequently used criterion in this group, it emphasizes the language ability of a "human" or "machine", and "Readability" is more appropriate when assessing sentences. Therefore, "Readability" was selected as one of the dialogue criteria. Finally, five evaluation criteria (Readability, Relevance, Consistency, Informativeness, Naturalness) were selected for evaluating dialogue responses in the subsequent experiments. Apart from Naturalness, which represents overall sentence quality, the guiding principle is that the selected criteria have non-overlapping definitions and together cover all aspects of dialogue response evaluation. The five dialogue evaluation criteria selected by the present invention and their definitions are shown in Table 2:
 Readability: the quality of a response being easily understood
 Relevance: the quality of a response connecting with the context
 Consistency: the quality of a response agreeing with the known information
 Informativeness: the quality of a response providing new information
 Naturalness: the plausibility that the response was generated by a human
 Table 2 Selected evaluation criteria and their definitions
 Based on the above five dialogue evaluation criteria, an evaluation criteria framework is constructed and a basic evaluation template is generated. Figure 2 shows an example of a basic evaluation template: the upper part contains the dialogue history and dialogue responses (i.e., the dialogue content), and the lower part is the scoring area.
 Assuming that workers have basic language knowledge and reading ability, the present invention provides a comprehension assistance strategy suited to small-talk dialogue, helping workers understand the dialogue history and thereby improving the evaluation results.
 The present invention summarizes 7 kinds of text selection tasks that can be applied to chat-type dialogue and considers whether each task type requires additional manual annotation; the analysis results are given in Table 3. It can be seen that missing text selection and sorting selection can be carried out without manual annotation or additional question design, so they are suitable to be added to the dialogue-history comprehension assistance scheme as reading tasks.
 Detail comprehension selection: additional manual annotation required
 Topic summary selection: additional manual annotation required
 Sentence comprehension selection: additional manual annotation required
 Inference judgment selection: additional manual annotation required
 Attitude/emotion selection: additional manual annotation required
 Missing text selection: no additional annotation required
 Sorting selection: no additional annotation required
 Table 3 Analysis of text selection reading tasks
Conventional reading proficiency assessments rely on post-reading comprehension (such as multiple-choice questions placed after the dialogue content), requiring workers to answer questions after reading the text. However, comprehension occurs during the reading process, and answering several discrete comprehension questions after reading ends makes it harder for workers to reason about the information in the material and also increases the cost of evaluation and annotation. Therefore, in the dialogue-history comprehension assistance scheme, the present invention helps workers better understand long dialogues by embedding questions into the reading process itself. A long dialogue differs from other texts in that it cannot be divided into paragraphs or chapters; it is a continuous process, and its coherence is broken when independent questions are inserted. Therefore, during embedding, direct questions (such as "Which sentence do you think should be inserted in the blank?") are omitted, and the reading comprehension task is integrated directly with the dialogue content. The specific strategies and front-end interface design are as follows:
 1) Missing sentence selection strategy (Strategy 1 for short): while reading the dialogue history in a single task, workers make a single-choice selection for the missing sentence in the dialogue, and then score the sentences.
 Specifically, referring to the reading comprehension assessment method used in English tests, the sentence at a designated position A in the dialogue content to be assessed is removed and turned into a single-choice question. The options include the original sentence from position A and a sentence randomly selected from the dataset. The reading question under this strategy expects workers to accurately select the original sentence for position A. Figure 3 provides an example of the missing sentence selection strategy; in Figure 3, the missing sentence is the first sentence of the third dialogue turn. Two options are set, namely the original sentence and a random sentence from the dataset, presented in random order. In the front-end page implementation, before the worker selects, the selection box shows a gray prompt, "Please select the proper sentence"; when the worker clicks the selection box, two options pop up: a sentence randomly drawn from the dataset and the original sentence from the conversation. The background of the option under the mouse turns orange, and after the worker clicks, the chosen option's text replaces the gray prompt.
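The construction of such a single-choice question can be sketched as follows. This is an illustrative simplification, assuming the dialogue is a flat list of sentences; the position choice and distractor sourcing stand in for the front-end logic described above.

```python
import random

# Hedged sketch of Strategy 1 (missing sentence selection): remove the
# sentence at `position`, pair it with one random distractor from the
# dataset, and shuffle the two options.

def build_missing_sentence_question(dialogue, position, corpus, rng=random):
    original = dialogue[position]
    distractor = rng.choice([s for s in corpus if s != original])
    blanked = dialogue[:position] + ["____"] + dialogue[position + 1:]
    options = [original, distractor]
    rng.shuffle(options)               # present the options in random order
    return blanked, options, original

dialogue = ["Hi!", "Hello, how are you?", "Fine, thanks.", "Great weather today."]
corpus = ["I bought a new car.", "The train leaves at six."]
blanked, options, answer = build_missing_sentence_question(dialogue, 2, corpus)
```

A worker passes the correctness check only when the selected option equals `answer`.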
 2) Dialogue content sorting strategy (Strategy 2 for short): while reading the dialogue history in a single task, workers reorder the randomly scrambled turns in the dialogue, and then score the sentences.
 Specifically, referring to the reading comprehension assessment method used in English tests, part of the dialogue content to be evaluated is randomly scrambled and workers are asked to reorder it. The reading question under this strategy expects workers to restore the scrambled content to its original order. Considering that scrambling and sorting individual sentences is very difficult, the dialogue turn is taken as the unit, as shown in Figure 4: the three middle turns are scrambled, the turns to be reordered are marked in green with a text prompt, and they can be reordered by drag and drop; after the confirm button is clicked, they can no longer be dragged. In the front-end implementation, the following check is set: if a worker scores directly without dragging, a prompt pops up, "You should drag and sort the above dialogue turns!", and the subsequent scoring task cannot proceed, thus ensuring that workers perform sentence evaluation only after completing the reading task.
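The scrambling and correctness check of Strategy 2 can be sketched as below; the segment boundaries and function names are illustrative assumptions rather than the actual front-end code.

```python
import random

# Hedged sketch of Strategy 2 (dialogue content sorting): shuffle the middle
# turns at turn granularity, and accept a submission only if it restores the
# original order.

def scramble_middle_turns(turns, start=1, count=3, rng=random):
    segment = turns[start:start + count]
    shuffled = segment[:]
    while shuffled == segment:         # re-shuffle until the order changes
        rng.shuffle(shuffled)
    return turns[:start] + shuffled + turns[start + count:]

def passes_sorting_check(submitted_turns, original_turns):
    return submitted_turns == original_turns

turns = ["t1", "t2", "t3", "t4", "t5"]
scrambled = scramble_middle_turns(turns)
```

Workers whose submission satisfies `passes_sorting_check` are retained for the data analysis.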
 Since embedding the reading questions of both strategies in one dialogue would make it difficult to read, one of the strategies is selected in practice, thereby achieving the purpose of helping workers understand the dialogue history. Specifically, when the number of turns in the dialogue history is less than a set value, the missing sentence selection strategy (Strategy 1) is recommended; when the number of turns is greater than or equal to the set value, either strategy can be used. For example, the set value can be 4. Under Strategy 1, since a single-choice completion of the missing sentence is added while reading the dialogue history, a correctness check can be carried out against the correct answer (the sentence in the original dialogue), and the workers who answer correctly are screened for subsequent data analysis. Under Strategy 2, since workers are asked to sort the dialogue while reading it, the correctness check is carried out against the correct ordering (the order of the original dialogue turns), and the workers who sort correctly are screened. The present invention takes only the evaluation results provided by the screened workers as the results of manual evaluation and performs the subsequent data analysis on them.
 Compared with the existing manual evaluation process for chat-type dialogue, the above solution in the embodiment of the present invention has the following advantages: (1) it improves the details of the manual evaluation process for chat-type dialogue; (2) it verifies that adding a comprehension assistance strategy to the basic template to improve workers' understanding can improve the consistency of manual annotation in dialogue evaluation.
 In order to verify the technical effect and performance of the above solution of the present invention, an experiment is described below.
 1. Experimental setup.
 To study the advantages of each scheme and strategy in more detail, the following settings are made on the basis of the basic evaluation template, as shown in Table 4: Setting 1 is the basic evaluation template; Settings 2 and 3 add the missing sentence selection strategy and the dialogue content sorting strategy, respectively, on top of the basic template.
 Table 4 Experimental setup
 Both the missing sentence selection strategy + basic evaluation template and the dialogue content sorting strategy + basic evaluation template belong to the evaluation templates containing reading questions defined in Step 12 above.
 To verify the advantages of the present invention, each dialogue history and its corresponding response are published on the AMT (Amazon Mechanical Turk) platform as the content of one task, more than 20 workers are recruited for each task, and only workers meeting the following conditions are eligible to participate: (1) the worker's country is one of US (United States), CA (Canada), or AU (Australia), to ensure as far as possible that the worker's daily language is English; (2) the worker's HIT approval rate (the proportion of approved tasks among all tasks the worker has submitted on the platform) meets the set threshold; (3) the worker's number of approvals (the total number of approved tasks among all tasks the worker has submitted on the platform) is greater than 100. Finally, the workers who meet these conditions and pass the correctness check under each setting are selected for subsequent data analysis, as shown in Table 5.
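The eligibility conditions above amount to a simple predicate. The following sketch is a hypothetical rendering: the record fields are assumptions, and the approval-rate threshold is left as a parameter since its value is not specified here.

```python
# Hedged sketch of the worker eligibility filter; field names and the
# approval-rate threshold are placeholders, not AMT API values.

ALLOWED_COUNTRIES = {"US", "CA", "AU"}

def is_eligible(worker, approval_rate_threshold):
    return (
        worker["country"] in ALLOWED_COUNTRIES
        and worker["hit_approval_rate"] >= approval_rate_threshold
        and worker["num_approvals"] > 100
    )

worker = {"country": "CA", "hit_approval_rate": 0.98, "num_approvals": 250}
```

Passing this filter is necessary but not sufficient: the correctness check on the reading questions is applied afterwards.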
 Table 5 Distribution of the number of participants (number of participants per HIT)
 This experiment is based on the selected experimental dataset DailyDialog and 4 mainstream dialogue generation models (HRED, GPT, Blender, DialoGPT). Using the dialogue data obtained and the basic template, three front-end interfaces are constructed in combination with the solution of Embodiment 1 to collect and observe workers' evaluation scores and submitted answers. Considering that magnitude estimation and comparative assessment are not widely used in dialogue evaluation, this experiment uses a 5-point Likert scale.
 2. Improvement in worker consistency.
 In human annotation experiments without standard responses as a reference, worker consistency is often used to evaluate data validity. This experiment uses the intra-class correlation coefficient to measure the consistency of worker ratings.
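As a minimal worked sketch, the intra-class correlation coefficient can be computed from a targets-by-raters score matrix. The variant below is ICC(2,1) in Shrout-Fleiss notation (two-way random effects, single rater); this is an assumption for illustration, since the exact ICC variant is not restated here.

```python
import numpy as np

# Hedged sketch: ICC(2,1) from a two-way ANOVA decomposition of an
# n x k matrix (n dialogues, k workers).

def icc_2_1(x):
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)         # per-dialogue means
    col_means = x.mean(axis=0)         # per-worker means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

ratings = [[4, 4, 5], [2, 3, 2], [5, 5, 5], [1, 2, 1]]  # 4 dialogues, 3 workers
icc = icc_2_1(ratings)
```

Higher values indicate stronger agreement; values above roughly 0.6 are conventionally read as good consistency, which is the interpretation used for the results below.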
The focus of the experiments is whether the above scheme provided by the present invention has a positive effect on consistency among workers. The intra-class correlation coefficients were calculated for N in the interval [3, 20] (N is the number of workers), and the four evaluation criteria Readability, Relevance, Informativeness, and Consistency were analyzed at N = 6; the analysis results are shown in Table 6. It can be seen that under Setting 1 (the basic template) the consistency of the evaluation results on the five criteria is not high and the reliability of the results is low. After the comprehension assistance strategies are added to the basic template and reading questions are embedded in the dialogue history, the consistency of Settings 2 and 3 improves on every criterion. In particular, on Relevance and Consistency, the two criteria belonging to the "correlation with the dialogue history" group of the framework, the consistency of Setting 3 is above 0.6, indicating that the drag-and-drop sorting strategy is very effective: sorting enhances workers' understanding of the dialogue history, which improves the evaluation results. The experiments verify that adding missing-sentence-selection or drag-and-drop-sorting reading questions to the basic evaluation template can effectively improve worker consistency, as shown in Table 6.
 Table 6 Consistency results under different standards and settings
 3. Average score analysis.
 This experiment compares the average scores of the four dialogue systems, as shown in Table 7, where Human denotes the original replies in the dialogue dataset. The experiments show that in small-talk conversations the responses generated by the GPT and DialoGPT models are better than those of the HRED and Blender models, and their scores on the Readability criterion are even higher than those of human responses, i.e., these models can generate highly readable replies. On the Relevance and Consistency criteria, the scores of human responses far exceed those of the dialogue models, indicating that the dialogue models still need improvement on the criteria measuring the relation to the dialogue history.
 Table 7 Average scores of dialogue models under different criteria
 4. Time cost analysis.
 Because Settings 2 and 3 add the two additional reading tasks of selection and sorting, we considered whether the time spent by workers under each setting affected the results. Unlike previous evaluation processes that ignore time or use only the average time for an entire response, this experiment focuses on two main indicators, as shown in Table 8: the time workers spend on reading comprehension under the different settings (reading time) and the time spent on scoring (answer time), both obtained by collecting the timestamps returned in the front-end code. In Settings 2 and 3, the time spent processing the dialogue history (selection, sorting) is also included in the reading time. Separating the two parts of the time for statistics helps distinguish the difficulty of reading the context from the difficulty of evaluating the dialogue criteria.
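Splitting the two indicators from front-end timestamps can be sketched as follows; the event names (`page_load`, `start_scoring`, `submit`) are assumptions, not the actual front-end fields.

```python
# Hedged sketch: derive reading time and answer time from per-task
# timestamps, then average them over a setting's submissions.

def split_times(events):
    # reading time: page load until scoring starts (includes any
    # selection/sorting interaction); answer time: scoring until submit
    reading_time = events["start_scoring"] - events["page_load"]
    answer_time = events["submit"] - events["start_scoring"]
    return reading_time, answer_time

def averages(all_events):
    pairs = [split_times(e) for e in all_events]
    n = len(pairs)
    return (sum(r for r, _ in pairs) / n, sum(a for _, a in pairs) / n)

logs = [
    {"page_load": 0.0, "start_scoring": 42.0, "submit": 70.0},
    {"page_load": 5.0, "start_scoring": 35.0, "submit": 55.0},
]
avg_reading, avg_answer = averages(logs)
```

Computing the averages per setting gives the entries compared in Table 8.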
 Table 8 Time spent under different settings
 Comparing the reading time and answer time under the different settings, the average answer time is shortest for Setting 1, while the answer times for Settings 2 and 3 are longer. Combined with the differences in intra-class correlation coefficients, it can be clearly seen that after the comprehension assistance strategies are added, workers score the criteria more carefully than before. Since the reading time includes both the reading task and the dialogue history, the statistics show that the reading time of Setting 2 is longer than that of Setting 3, indicating that missing sentence selection costs more time as a reading task than the dialogue content sorting strategy. Combining the answer times under the three settings with the worker consistency results shows that workers give more careful and consistent evaluation results after understanding the dialogue criteria and the dialogue history.