Agent evaluation method and device, storage medium and program product

By combining multiple evaluators with machine learning models and human evaluation, the problem of incomplete evaluation of complex business agents was solved, enabling comprehensive evaluation and optimization iteration of agents, thereby improving service quality and user experience.

CN122240438A · Pending · Publication Date: 2026-06-19 · JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
Filing Date
2026-03-19
Publication Date
2026-06-19

Abstract

This disclosure provides an agent evaluation method and apparatus, storage medium, and program product, relating to the field of computer technology. The agent evaluation method includes: generating multiple evaluators, where the nth evaluator corresponds to the nth evaluation index of the agent under test; using the agent under test to process each problem in a test set, obtaining the output result for each problem; using the nth evaluator to evaluate the output result based on the prompt words of the nth evaluator, obtaining the nth machine evaluation result of the output result; acquiring multiple human evaluation results of the output result; determining the optimization prompt words of the nth evaluator based on the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation index from among the multiple human evaluation results; and using N evaluators to evaluate the output result of the agent under test based on their corresponding optimization prompt words, obtaining the evaluation result of the agent under test.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to an intelligent agent evaluation method and apparatus, storage medium and program product.

Background

[0002] With the rapid development of large model technology, intelligent agents built on large models, as AI (Artificial Intelligence) systems with autonomous decision-making, environmental interaction, and task execution capabilities, have been widely applied in government affairs, public services, and other fields. The performance of an intelligent agent directly affects its reliability and user experience. Scientific and comprehensive evaluation of the intelligent agent during the R&D process can effectively improve its service quality, reliability, and user experience.

[0003] By evaluating agents, we can identify the problems that agents currently have, thereby promoting their rapid iteration and improvement.

Summary of the Invention

[0004] The inventors noted that current evaluations of intelligent agents mostly target general-purpose agents and primarily focus on end-to-end performance, using pure machine evaluation, manual sampling evaluation, or a combination of the two. These methods are not suitable for relatively complex business-oriented intelligent agents, such as knowledge base question-answering agents. Because such agents involve multiple stages such as intent recognition, retrieval, reranking, and generation, end-to-end performance evaluation alone is insufficient to support subsequent optimization and iteration.

[0005] Accordingly, this disclosure provides an agent evaluation method that can comprehensively and accurately evaluate agents, thereby effectively supporting their subsequent optimization and iteration.

[0006] In a first aspect of this disclosure, an agent evaluation method is provided, comprising: generating a plurality of evaluators, wherein the nth evaluator corresponds to the nth evaluation metric of the agent under test, 1 ≤ n ≤ N, and N is the total number of evaluation metrics; processing each question in a test set with the agent under test to obtain an output result for each question; evaluating the output result with the nth evaluator based on the prompt of the nth evaluator to obtain the nth machine evaluation result; obtaining multiple human evaluation results for the output result; determining an optimized prompt for the nth evaluator based on the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation metric among the multiple human evaluation results; and evaluating the output result of the agent under test with the N evaluators based on their corresponding optimized prompts to obtain the evaluation result of the agent under test.

[0007] In some embodiments, obtaining the evaluation result of the agent under test includes: using the nth evaluator to evaluate the output result of the agent under test based on the optimized prompt words of the nth evaluator, and obtaining the nth evaluation result; and determining the evaluation result of the agent under test based on the obtained N evaluation results.

[0008] In some embodiments, obtaining multiple human evaluation results of the output results includes: sampling questions in the test set according to preset rules to generate a question sample set; selecting multiple human evaluators from a personnel information database; and obtaining multiple sample evaluation results fed back by the multiple human evaluators for each question sample.

[0009] In some embodiments, among the acquired multiple sample evaluation results, the consistency rate between the evaluation score and the scoring reason for each sample evaluation result is detected; sample evaluation results whose consistency rate between the evaluation score and the scoring reason is less than a predetermined consistency rate threshold are deleted.

[0010] In some embodiments, the preset rule includes at least one of a first sub-rule that samples based on problem confidence, a second sub-rule that samples based on problem type, and a third sub-rule that samples based on the consistency of evaluation results.

[0011] In some embodiments, sampling questions in the test set according to the first sub-rule includes: evaluating the output result of the m-th question in the test set using the N evaluators to obtain N machine evaluation results, where 1 ≤ m ≤ M and M is the total number of questions; determining whether the N machine evaluation results are consistent; and, if the N machine evaluation results are inconsistent, taking the m-th question as a question sample to generate the question sample set.

[0012] In some embodiments, sampling questions in the test set according to the second sub-rule includes: clustering all questions in the test set according to question type to obtain multiple question sets; and sampling questions in each question set according to a preset ratio to generate the question sample set.
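The second sub-rule is essentially stratified sampling. The sketch below is a minimal illustration, not the disclosed implementation: it assumes each question already carries a `type` label (the disclosure does not name a clustering algorithm), and the helper `sample_by_type`, the ratio of 0.5, and the floor of one question per type are all hypothetical choices.

```python
import math
import random
from collections import defaultdict


def sample_by_type(questions, ratio, seed=0):
    """Group questions by their 'type' label and sample `ratio` of each group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for question in questions:
        groups[question["type"]].append(question)

    sample = []
    for qtype in sorted(groups):
        members = groups[qtype]
        # Assumption: keep at least one question per type, never more than exist.
        k = min(len(members), max(1, math.ceil(len(members) * ratio)))
        sample.extend(rng.sample(members, k))
    return sample


questions = [
    {"id": 1, "type": "policy"}, {"id": 2, "type": "policy"},
    {"id": 3, "type": "policy"}, {"id": 4, "type": "policy"},
    {"id": 5, "type": "pricing"}, {"id": 6, "type": "pricing"},
]
sample = sample_by_type(questions, ratio=0.5)
```

Sampling each type at the same ratio keeps rare question types represented in the question sample set instead of letting frequent types dominate it.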

[0013] In some embodiments, sampling questions in the test set according to the third sub-rule includes: evaluating the output result of the m-th question in the test set using the N evaluators to obtain N machine evaluation results, wherein each machine evaluation result includes an evaluation score and a reason for the score, 1 ≤ m ≤ M, and M is the total number of questions; and, if any of the N machine evaluation results has an evaluation score that is inconsistent with its reason for the score, taking the m-th question as a question sample to generate the question sample set.

[0014] In some embodiments, selecting the plurality of human evaluators from the personnel information database includes: calculating the keyword similarity between each question sample in the question sample set and each piece of personnel information in the personnel information database; calculating the semantic representation similarity between each question sample and each piece of personnel information; calculating the weighted sum of the keyword similarity and the semantic representation similarity to determine the matching degree between each question sample and each piece of personnel information; and, for each question sample, selecting as human evaluators the persons whose personnel information has a matching degree greater than a predetermined matching degree threshold.
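The matching degree described above can be sketched as follows. This is an illustration under stated assumptions: the disclosure does not specify the similarity measures, so Jaccard overlap stands in for keyword similarity and cosine similarity over precomputed embedding vectors stands in for semantic representation similarity. The weights, the threshold, and all function and field names are hypothetical.

```python
import math


def jaccard(a, b):
    """Keyword similarity as set overlap (an assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(u, v):
    """Semantic similarity between two embedding vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def matching_degree(question, person, w_keyword=0.4, w_semantic=0.6):
    """Weighted sum of keyword similarity and semantic similarity."""
    keyword_sim = jaccard(question["keywords"], person["keywords"])
    semantic_sim = cosine(question["embedding"], person["embedding"])
    return w_keyword * keyword_sim + w_semantic * semantic_sim


def select_human_evaluators(question, people, threshold=0.5):
    """Keep the persons whose matching degree exceeds the threshold."""
    return [p["name"] for p in people if matching_degree(question, p) > threshold]


question = {"keywords": ["tax", "refund"], "embedding": [1.0, 0.0]}
people = [
    {"name": "A", "keywords": ["tax", "refund"], "embedding": [1.0, 0.0]},
    {"name": "B", "keywords": ["logistics"], "embedding": [0.0, 1.0]},
]
selected = select_human_evaluators(question, people)
```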

[0015] In some embodiments, determining the optimized prompt word for the nth evaluator includes: optimizing the prompt word for the nth evaluator based on the deviation between the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation index among the plurality of human evaluation results; repeatedly evaluating the output result using the nth evaluator based on the prompt word for the nth evaluator until a predetermined termination condition is met; and using the current prompt word of the nth evaluator as the optimized prompt word for the nth evaluator.

[0016] In some embodiments, the predetermined termination condition includes: the number of iterations equals a predetermined iteration number threshold, or the evaluation consistency rate between the nth machine evaluation result and the multiple human evaluation results corresponding to the nth evaluation index is greater than a predetermined evaluation consistency rate threshold.
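The iterative prompt optimization and its two termination conditions can be sketched as a simple loop. This is a minimal sketch, not the disclosed implementation: `evaluate` and `revise_prompt` are hypothetical callables standing in for the evaluator call and the prompt-revision step, and the consistency rate is assumed here to be the fraction of samples on which machine and human scores are equal.

```python
def consistency_rate(machine_scores, human_scores):
    """Fraction of samples where the machine score equals the human score."""
    matches = sum(m == h for m, h in zip(machine_scores, human_scores))
    return matches / len(human_scores)


def optimize_prompt(prompt, outputs, human_scores, evaluate, revise_prompt,
                    max_iters=5, target_rate=0.9):
    """Revise the prompt until the rate exceeds target_rate or max_iters is reached."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    for iteration in range(1, max_iters + 1):
        machine_scores = [evaluate(prompt, out) for out in outputs]
        rate = consistency_rate(machine_scores, human_scores)
        # Termination: consistency is high enough, or the iteration budget is spent.
        if rate > target_rate or iteration == max_iters:
            return prompt, rate, iteration
        prompt = revise_prompt(prompt, machine_scores, human_scores)


# Toy stand-ins (hypothetical): a "lenient" prompt always scores 2.
def toy_evaluate(prompt, output):
    return 2 if "lenient" in prompt else {"a": 1, "b": 2}[output]


def toy_revise(prompt, machine_scores, human_scores):
    return prompt.replace("lenient", "strict")


final_prompt, final_rate, iterations = optimize_prompt(
    "lenient rubric", ["a", "b"], [1, 2], toy_evaluate, toy_revise)
```

Returning the rate together with the prompt means the value reported at termination always describes the prompt that is actually kept.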

[0017] In some embodiments, when the predetermined termination condition is met, the evaluation consistency rate of the nth machine evaluation result and the multiple human evaluation results corresponding to the nth evaluation index is calculated as the human-machine evaluation consistency rate of the nth evaluator; the weighted sum of the human-machine evaluation consistency rates of the N evaluators is calculated as the confidence level of the evaluation result of the agent under test.
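As an illustration of the confidence calculation, the sketch below computes the weighted sum of per-evaluator human-machine consistency rates. The function name, the weights, and the example rates are hypothetical; the disclosure does not state how the weights are chosen.

```python
def evaluation_confidence(consistency_rates, weights):
    """Weighted sum of the N human-machine evaluation consistency rates."""
    if len(consistency_rates) != len(weights):
        raise ValueError("one weight is required per evaluator")
    return sum(w * r for w, r in zip(weights, consistency_rates))


# Three evaluators; weights are illustrative and sum to 1.
rates = [0.9, 0.8, 1.0]
weights = [0.5, 0.25, 0.25]
confidence = evaluation_confidence(rates, weights)
```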

[0018] In some embodiments, generating multiple evaluators includes: obtaining descriptive information of the agent under test; determining multiple evaluation metrics based on a first machine learning model according to the descriptive information; and generating an evaluator for each evaluation metric based on a second machine learning model and the descriptive information.

[0019] In some embodiments, determining the evaluation metrics based on the first machine learning model according to the description information includes: determining the intermediate links of the agent under test according to the description information; and determining multiple evaluation metrics based on the first machine learning model when the number of intermediate links is greater than a predetermined number, wherein each evaluation metric corresponds to at least one intermediate link, and the multiple evaluation metrics correspond to all the intermediate links.

[0020] In some embodiments, when the number of intermediate links is less than or equal to the predetermined number, an evaluation metric and a prompt word are generated based on the description information using a third machine learning model, wherein the evaluator includes the evaluation metric and the prompt word.

[0021] In some embodiments, generating an evaluator for each evaluation metric based on a second machine learning model includes: determining prompt words for each evaluation metric based on the description information according to the second machine learning model, wherein the prompt words include the input requirements of the intermediate link corresponding to the evaluation metric, the output requirements of the intermediate link corresponding to the evaluation metric, the evaluation criteria, the scoring rules, and the output requirements of the evaluator, and the evaluator includes the evaluation metric and the prompt words.

[0022] In a second aspect of this disclosure, an agent evaluation apparatus is provided, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory to implement the agent evaluation method as described in any of the above embodiments.

[0023] In a third aspect of this disclosure, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores computer instructions that, when executed by a processor, implement the agent evaluation method as described in any of the above embodiments.

[0024] In a fourth aspect of this disclosure, a computer program product is provided, including computer instructions, wherein the computer instructions, when executed by a processor, implement the agent evaluation method as described in any of the above embodiments.

[0025] Other features and advantages of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief Description of the Drawings

[0026] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0027] Figure 1 is a schematic flowchart of an agent evaluation method according to an embodiment of the present disclosure;

Figure 2 is a schematic flowchart of an agent evaluation method according to another embodiment of the present disclosure;

Figure 3 is a schematic structural diagram of an agent evaluation device according to an embodiment of the present disclosure.

Detailed Description

[0028] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.

[0029] Unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of this disclosure.

[0030] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0031] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0032] In all examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0033] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0034] It should be noted that in related technologies, agent evaluation mainly focuses on end-to-end performance evaluation, such as the GAIA (General AI Assistants) benchmark, or some general evaluations related to safety and compliance.

[0035] The inventors noted that current methods for evaluating intelligent agents can be categorized into the following types.

[0036] 1) Pure machine evaluation: The evaluation results have some reference value, but their reliability is unclear and their practicality is insufficient.

[0037] 2) Purely manual evaluation: This method has high labor costs and is not practical enough.

[0038] 3) Machine evaluation + human evaluation: This type of method is highly practical, but the current evaluation process is relatively simple: the machine evaluation results are sampled and distributed to human evaluators for labeling. It does not take the evaluators' background information into account to match them accurately to the samples, and it is not suitable for relatively complex business intelligent agents.

[0039] For example, specific business application agents, such as RAG (Retrieval-Augmented Generation) intelligent question answering, often involve complex process calls. Evaluators in the related art can hardly help locate the weak points in the agent's performance and offer little reference value for improving the agent.

[0040] Accordingly, this disclosure provides an agent evaluation method that can comprehensively and accurately evaluate agents, thereby effectively supporting their subsequent optimization and iteration.

[0041] Figure 1 is a schematic flowchart of an agent evaluation method according to an embodiment of the present disclosure. In some embodiments, the following agent evaluation method is performed by an agent evaluation device and includes steps 11-16.

[0042] In step 11, multiple evaluators are generated, where the nth evaluator corresponds to the nth evaluation metric of the agent under test, 1 ≤ n ≤ N, and N is the total number of evaluation metrics.

[0043] For example, the output of an agent includes results from multiple stages, such as retrieval results, reranking results, and generated results. The evaluators therefore need to assess metrics such as retrieval relevance, retrieval comprehensiveness, reranking accuracy, generation relevance, and generation correctness.

[0044] In some embodiments, the step of generating multiple evaluators includes steps S101-S103.

[0045] S101. Obtain the description information of the agent under test.

[0046] In some embodiments, the description information includes a functional introduction of the agent under test. When the processing of the agent under test includes multiple intermediate links (or intermediate modules), the description information includes an introduction to the processing, such as a full-link description of the agent under test, which includes a description of the functions performed by each intermediate link of the agent under test.

[0047] In some embodiments, the description information includes evaluation requirements for the agent under test. These evaluation requirements may match the functional description of the agent under test; for example, they may include evaluation requirements for intermediate links in the processing of the agent under test, or requirements for the agent under test as a whole.

[0048] In some embodiments, users can provide evaluation requirements as needed. For example, users can specify that multiple intermediate links should be evaluated as a whole, or that some or all intermediate links should be evaluated independently. This approach achieves a balance between the evaluator's efficiency and comprehensiveness, meeting the user's needs and increasing the user's control over the evaluator's generation process.

[0049] In some embodiments, the description information may also include the identifier of the agent under test, such as the name of the agent under test, so as to identify and distinguish the agents under test corresponding to different evaluators, thereby facilitating the accurate use of the evaluator when different agents under test exist, and improving the accuracy of the evaluation.

[0050] S102. Based on the description information, determine multiple evaluation metrics based on the first machine learning model.

[0051] It should be noted that the evaluation metrics correspond to some or all of the processing steps (intermediate links) of the agent under test. An evaluation metric is a relatively independent evaluation operation, targeting some or all of the intermediate links of the agent under test. In some embodiments, the evaluation metric can be the name of the evaluator to be generated later, such as a retrieval completeness evaluator or a reordering relevance evaluator.

[0052] For evaluation tasks, a key issue is how to design evaluation metrics. From the perspective of response type, current agents can be divided into generative and numerical responses. For numerical responses, there are standard quantitative judgments of whether they are correct or not. However, agents that respond with generative information lack clear, quantitative metrics. The method in this disclosure can automatically generate evaluation metrics based on descriptive information using a machine learning model, and is suitable for evaluating generative agents.

[0053] The first machine learning model can be a model with semantic analysis capabilities, which improves the efficiency of deriving evaluation metrics from the descriptive information. For example, the first machine learning model can be a large language model, which eliminates the need for task-specific training. Leveraging the large language model's broad knowledge and semantic understanding reduces implementation difficulty and improves deployment efficiency.

[0054] In some embodiments, intermediate links of the agent under test are determined based on the description information. If the number of intermediate links is greater than a predetermined number, multiple evaluation metrics are determined based on a first machine learning model, wherein each evaluation metric corresponds to at least one intermediate link, and the multiple evaluation metrics correspond to all intermediate links.

[0055] The predetermined quantity is a positive integer greater than 1, for example, 2. For the same agent, the smaller the predetermined quantity, the more detailed the evaluation result of the generated evaluator is for the agent, and the better the effect of locating defects; the larger the predetermined quantity, the more general the evaluation result of the generated evaluator is for the agent, but the higher the generation efficiency of the evaluator.

[0056] In some embodiments, when the number of intermediate links is less than or equal to a predetermined number, an evaluation metric and a prompt word are generated based on the description information using a third machine learning model, wherein the evaluator includes the evaluation metric and the prompt word.
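The branching on the number of intermediate links, together with the coverage requirement on the metrics, can be sketched as follows. This is only an illustration: the function names, the strategy labels, and the metric names are hypothetical, and the default of 2 follows the example value given for the predetermined number.

```python
def choose_generation_strategy(intermediate_links, predetermined_number=2):
    """'staged': metrics first, then prompts; 'direct': metric and prompt together."""
    if len(intermediate_links) > predetermined_number:
        return "staged"
    return "direct"


def metrics_cover_all_links(metric_to_links, intermediate_links):
    """Each metric maps to at least one link, and together they cover every link."""
    if any(not links for links in metric_to_links.values()):
        return False
    covered = set().union(*metric_to_links.values())
    return covered == set(intermediate_links)


links = ["retrieval", "reranking", "generation"]
metric_to_links = {
    "retrieval comprehensiveness": ["retrieval"],
    "reranking relevance": ["reranking"],
    "answer correctness": ["generation"],
}
strategy = choose_generation_strategy(links)
covered = metrics_cover_all_links(metric_to_links, links)
```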

[0057] S103. Based on the second machine learning model, generate an evaluator for each evaluation metric according to the description information.

[0058] In some embodiments, based on a second machine learning model, prompt words for each evaluation metric are determined according to descriptive information. The prompt words include the input requirements of the intermediate link corresponding to the evaluation metric, the output requirements of the intermediate link corresponding to the evaluation metric, the evaluation criteria, the scoring rules, and the output requirements of the evaluator. The evaluator includes the evaluation metric and the prompt words.

[0059] It should be noted that the second machine learning model can be a model with semantic analysis capabilities, which improves the efficiency of processing the descriptive information. For example, the second machine learning model can be a large language model, which eliminates the need for task-specific training. Leveraging the large language model's broad knowledge and semantic understanding reduces implementation difficulty and improves deployment efficiency. In some embodiments, the first and second machine learning models are the same model, thereby reducing the number of machine learning models required in the evaluator generation process and simplifying device configuration requirements.

[0060] The input and output requirements of the intermediate link can be expressed through textual descriptions and function calls. For example, the input of the intermediate link is: user question {query}, and the output of the intermediate link is: retrieved content {retrieved_context}.

[0061] The evaluation criteria can include multiple categories of standards, along with descriptive information for each category and scoring rules. For example, evaluation criteria might include coverage and comprehensiveness. Coverage descriptions include whether the retrieved content relates to the core elements of the problem, with evaluation rules including 0 = no coverage, 1 = partial coverage, and 2 = complete coverage. Comprehensiveness descriptions include whether multiple relevant perspectives were retrieved, with evaluation rules including 0 = single perspective, 1 = partially multiple perspectives, and 2 = comprehensive multiple perspectives. This method allows for the generation of scores for each category of standards based on the evaluation criteria, improving the intuitiveness of the evaluator's output.

[0062] The scoring rules include rules for summarizing scores from multiple categories, such as weighted averages or score conversions. This approach allows for the aggregation of evaluation results from different perspectives, building upon individual assessments and further enhancing the intuitiveness of the output results.
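A weighted-average aggregation of per-category scores might look like the sketch below. The category names and the 0-2 scale come from the example criteria above; the equal weights, the conversion to a 0-1 range, and the function name are assumptions made for illustration.

```python
def aggregate_scores(category_scores, weights, max_score=2):
    """Return the weighted average and its conversion to a 0-1 range."""
    total_weight = sum(weights[name] for name in category_scores)
    weighted_average = sum(
        category_scores[name] * weights[name] for name in category_scores
    ) / total_weight
    return weighted_average, weighted_average / max_score


# Coverage fully met (2), comprehensiveness partially met (1).
category_scores = {"coverage": 2, "comprehensiveness": 1}
weights = {"coverage": 0.5, "comprehensiveness": 0.5}
weighted_average, normalized = aggregate_scores(category_scores, weights)
```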

[0063] The evaluator's output requirements include both content and format requirements. For example, the output content includes a score and evaluation reason, output in a format such as {{"score": score, "reason": "detailed evaluation reason"}}. Detailed evaluation reasons can be generated based on the evaluation results for each category in the evaluation criteria. This approach not only provides users with intuitive evaluation scores but also facilitates analysis of the reasons behind those scores, further helping users identify performance weaknesses, improve the effective utilization of information, and enhance the efficiency of agent improvement.
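Because the evaluator is asked to return a fixed JSON structure, its output can be validated before use. The following is a minimal sketch using Python's standard `json` module; the helper name and the allowed score set (taken from the 0/1/2 example rules) are assumptions.

```python
import json


def parse_evaluator_output(raw, allowed_scores=(0, 1, 2)):
    """Parse the evaluator's JSON reply and check the score and reason fields."""
    data = json.loads(raw)
    score, reason = data["score"], data["reason"]
    if score not in allowed_scores:
        raise ValueError(f"unexpected score: {score!r}")
    if not isinstance(reason, str) or not reason.strip():
        raise ValueError("reason must be a non-empty string")
    return score, reason


score, reason = parse_evaluator_output(
    '{"score": 2, "reason": "Retrieved content covers all core elements."}'
)
```

Rejecting malformed replies early keeps missing or out-of-range scores from silently entering later consistency calculations.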

[0064] The following example, using a business knowledge question-answering AI agent, illustrates the evaluator generation method disclosed herein.

[0065] 1. Input the agent's description information. The description information includes a basic description of the agent, a complete description of the agent's lifecycle, and a brief description of the evaluation requirements.

[0066] Basic description of the intelligent agent: a business knowledge question-answering intelligent agent.

[0067] Full-link description: This agent is a RAG-type agent. After the user inputs a question, it retrieves relevant content from the knowledge base, rearranges it, and then generates an answer based on the relevant content using a large model.

[0068] Evaluation Requirements: Evaluate the business knowledge question-answering AI agent by designing comprehensive evaluation metrics to assess the agent's overall performance, including three stages: retrieval, reordering, and generation. The retrieval results should comprehensively retrieve relevant content. The reordering stage should further select the most relevant content set. The generation stage should generate comprehensive, accurate, and user-friendly answers based on the reordering results.

[0069] 2. Based on the input agent description, a large model is used to generate an evaluator set, including evaluator names, evaluator prompts, and input parameters. Depending on the problem complexity, if there are few intermediate links, evaluators are generated directly. If there are many intermediate links, to improve performance, they are generated in steps: first, evaluation metrics (i.e., evaluator names) are generated, and then prompts are generated for each metric.

[0070] Based on the above requirements, the commands for calling the large model are as follows. These commands are merely examples and do not constitute an undue limitation on this disclosure.

[0071] You are a professional AI evaluator design expert. Please design multiple high-quality evaluators for the "Business Knowledge Question Answering Agent".

[0072] ## Agent Information

[0073] - Basic description: {request.agent_name}

[0074] - Full-link description: {request.agent_desp}

[0075] - Evaluation requirement: {request.generate_desp}

[0076] ## Design Requirements

[0077] ### 1. Number of evaluators

[0078] Please generate multiple evaluators with different dimensions, covering the agent's core capabilities and key metrics.

[0079] ### 2. Evaluator Naming

[0080] - Use clear and professional naming conventions, such as "Accuracy Evaluator," "Logical Reasoning Evaluator," and "Verbal Fluency Evaluator."

[0081] - The name should directly reflect the dimensions and objectives of the assessment.

[0082] ### 3. Prompt Design

[0083] - Use Python string formatting syntax: {{variable name}} represents the parameter to be filled in.

[0084] - The prompt should include:

  - Clear evaluation criteria and dimensions
  - Specific scoring rules (e.g., 0, 1, 2 points)
  - Detailed evaluation steps and considerations
  - Output format requirements

### 4. Input Field Design

- Design reasonable input parameters based on the characteristics of the intelligent agent.

- Commonly used fields: query (user question), answer (agent answer), context, etc.

- Field names should be in English, and descriptions should be in Chinese.

### 5. Output Field Specifications

- score: rating (0, 1, or 2 points, where 0 is the worst and 2 is the best)

- reason: detailed evaluation reasons and analysis

### Please ensure:

- Each evaluator has unique evaluation dimensions.

- Prompts are detailed and actionable.

- The input field name is exactly the same as the variable name in the prompt.

- The output format is standardized.

## Output Format

Please return the JSON in strict accordance with the following format:

{{
  "evaluators": [
    {{
      "name": "Specific evaluator name",
      "prompt": "You are a professional evaluation expert. Please conduct an evaluation of the following: User issue: {{query}}",
      "input_fields": {{
        "query": "user question",
        "answer": "The agent's response"
      }},
      "output_fields": {{
        "score": "score (a floating-point number between 0 and 1)",
        "reason": "Detailed evaluation reasons and analysis"
      }}
    }}
  ]
}}

For the above input example, the following evaluator is obtained:

[0085] In the examples above, expressions in curly braces "{}" are placeholders that are filled in at run time. For example, when the evaluators are generated, the agent's basic description, full-link description, and evaluation requirements are filled in through {request.agent_name}, {request.agent_desp}, and {request.generate_desp}, respectively. Similarly, when an evaluator evaluates the agent, {query}, {knowledge_base}, {retrieved_context}, {reranked_context}, and {answer} are filled in with the user's question, knowledge base content, retrieved content, reranked content, and agent's answer, respectively.
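The two levels of placeholders can be illustrated with Python's `str.format`, in which doubled braces are escapes that produce literal braces. In this sketch the template text and the `request` object are hypothetical stand-ins; only the placeholder names `{request.agent_name}`, `{query}`, and `{answer}` come from the example above.

```python
from types import SimpleNamespace

# Hypothetical request object carrying the agent description.
request = SimpleNamespace(agent_name="Business Knowledge Question Answering Agent")

# Stage 1: build the generation instruction. {request.agent_name} is filled now,
# while {{query}} is escaped and survives as the literal text {query}.
generation_template = (
    "- Basic description: {request.agent_name}\n"
    "- Evaluator prompts must reference the user question as {{query}}."
)
generation_prompt = generation_template.format(request=request)

# Stage 2: an evaluator prompt is filled with a concrete test case.
evaluator_prompt = "User question: {query}\nAgent answer: {answer}"
filled_prompt = evaluator_prompt.format(
    query="How do I apply for a refund?", answer="Submit form A."
)
```

Escaping the inner placeholders is what allows one template to describe another: the generation instruction is rendered once, and the placeholders it mentions remain available for the later evaluation call.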

[0086] In step 12, the agent under test processes each question in the test set to obtain the output result for each question.

[0087] In some embodiments, the test set includes subfields such as questions, standard answers, and question types.

[0088] For example, the assessment set for RAG-type knowledge questions is as follows:

[0089] In step 13, the output result is evaluated using the nth evaluator based on the prompt words of the nth evaluator, and the nth machine evaluation result of the output result is obtained.

[0090] For example, each machine evaluation result includes an evaluation score and a reason for the score. The reason for the score resembles a reasoning trace: outputting it can improve the quality of the evaluation, allows the logical consistency between the score and its reason to be checked, and provides additional explanatory information that can be compared with the subsequent human evaluation results, facilitating subsequent optimization.

[0091] For example, an exemplary machine evaluation result is as follows.

[0092]

[0093] In step 14, obtain multiple human evaluation results of the output.

[0094] In some embodiments, the step of obtaining multiple manual evaluation results of the output results includes steps S201-S203.

[0095] S201. Sample problems from the test set according to preset rules to generate a problem sample set.

[0096] It should be noted that by sampling, representative problem samples are selected from the test set to help obtain more accurate evaluator prompts.

[0097] In some embodiments, the preset rule includes at least one of a first sub-rule that samples based on problem confidence, a second sub-rule that samples based on problem type, and a third sub-rule that samples based on the consistency of evaluation results.

[0098] For example, the step of sampling problems in the test set according to the first sub-rule includes steps S301-S303.

[0099] S301. Use the N evaluators to evaluate the output result of the m-th question in the test set to obtain N machine evaluation results, where 1 ≤ m ≤ M and M represents the total number of questions.

[0100] S302. Determine whether the N machine evaluation results are consistent.

[0101] S303. If the N machine evaluation results are inconsistent, the m-th question is taken as a problem sample to generate the problem sample set.

[0102] It should be noted here that if the agent under test processes problem A to obtain output result B, and the evaluation results of N evaluators for output result B are inconsistent, it indicates that the confidence level of output result B is not high.
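A minimal sketch of the first sub-rule, assuming that "consistent" means all N evaluators return the same score (the disclosure does not define the comparison). The function name and the sample scores are hypothetical.

```python
from typing import Dict, List


def sample_low_confidence(scores_by_question: Dict[str, List[int]]) -> List[str]:
    """Return the questions whose N machine scores are not all identical."""
    return [
        question
        for question, scores in scores_by_question.items()
        if len(set(scores)) > 1
    ]


# Hypothetical scores from N = 3 evaluators for M = 3 questions.
scores_by_question = {
    "q1": [2, 2, 2],
    "q2": [2, 0, 1],
    "q3": [1, 1, 2],
}
sample_set = sample_low_confidence(scores_by_question)
print(sample_set)  # ['q2', 'q3']
```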

[0103] For example, the steps of sampling problems in the test set according to the second sub-rule include steps S401-S402.

[0104] S401. Cluster all the questions in the test set according to the question type to obtain multiple question sets.

[0105] S402. Sample problems from each problem set according to a preset ratio to generate a problem sample set.

[0106] It should be noted that the above processing can make the sampling results consistent with the actual problem distribution, thereby simulating the real application scenario as closely as possible.
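The second sub-rule can be sketched as stratified sampling. The sketch assumes each question already carries a type label, so "clustering" reduces to grouping by that label. The `type` field name, the rule of keeping at least one question per type, and the fixed seed are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(questions: List[dict], ratio: float, seed: int = 0) -> List[dict]:
    """Group questions by their 'type' field and sample `ratio` of each group."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    groups: Dict[str, List[dict]] = defaultdict(list)
    for question in questions:
        groups[question["type"]].append(question)

    rng = random.Random(seed)
    sample: List[dict] = []
    for members in groups.values():
        # Keep at least one question per type so no type vanishes from the sample.
        k = max(1, math.ceil(len(members) * ratio))
        sample.extend(rng.sample(members, k))
    return sample


questions = (
    [{"id": f"b{i}", "type": "billing"} for i in range(6)]
    + [{"id": f"l{i}", "type": "login"} for i in range(3)]
    + [{"id": "r0", "type": "refund"}]
)
sample = stratified_sample(questions, ratio=0.5)
print(len(sample))  # 6 = 3 billing + 2 login + 1 refund
```

Because every type is sampled at the same ratio, the sample roughly preserves the type distribution of the full test set, which is the property the paragraph above relies on.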

[0107] For example, the steps for sampling problems in the test set according to the third sub-rule include steps S501-S502.

[0108] S501. Use the N evaluators to evaluate the output result of the m-th question in the test set to obtain N machine evaluation results, where each machine evaluation result includes an evaluation score and a reason for the score, 1 ≤ m ≤ M, and M represents the total number of questions.

[0109] S502. If there are machine evaluation results among the N machine evaluation results where the evaluation score and the reason for the score are inconsistent, then the m-th problem will be used as a problem sample to generate a problem sample set.

[0110] It should be noted that if the evaluation score and the reason for the score are inconsistent, it indicates that the large model has difficulty processing the relevant information for this problem.
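The third sub-rule can be sketched with a pluggable consistency check. In the disclosure this check is performed by a large model; the keyword heuristic below is only a naive stand-in so that the example runs on its own, and all names and sample data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Evaluation = Tuple[int, str]  # (evaluation score, reason for the score)


def naive_consistency_check(score: int, reason: str) -> bool:
    """Stand-in for the large-model check: a top score must not come with a
    negative reason, and a bottom score must not come with a positive one."""
    text = reason.lower()
    negative = any(word in text for word in ("incorrect", "wrong", "missing"))
    positive = any(word in text for word in ("fully correct", "complete and accurate"))
    if score == 2 and negative:
        return False
    if score == 0 and positive:
        return False
    return True


def sample_inconsistent(
    evaluations: Dict[str, List[Evaluation]],
    is_consistent: Callable[[int, str], bool] = naive_consistency_check,
) -> List[str]:
    """Return questions with at least one score/reason mismatch among N results."""
    return [
        question
        for question, results in evaluations.items()
        if any(not is_consistent(score, reason) for score, reason in results)
    ]


evaluations = {
    "q1": [(2, "The answer is complete and accurate."), (2, "Cites the right policy.")],
    "q2": [(2, "The answer gives the wrong refund period."), (1, "Partly relevant.")],
}
inconsistent_questions = sample_inconsistent(evaluations)
print(inconsistent_questions)  # ['q2']
```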

[0111] In some embodiments, the inconsistency analysis can be performed by a large model: a prompt is input into the large model to identify the machine evaluation results whose evaluation score and reason for the score are inconsistent. The corresponding prompt is as follows.

[0112]

[0113] S202. Select multiple evaluators from the personnel information database.

[0114] In some embodiments, the step of selecting multiple evaluators from the personnel information database includes steps S601-S604.

[0115] S601. Calculate the keyword similarity between each question sample in the question sample set and each piece of information about a person in the personnel information database.

[0116] In some embodiments, keywords can be extracted from the question type and data description information of the question sample. The data description information refers to a description of the test set. For example, the test set is an evaluation dataset about module XXX of product XX, covering XXX sub-functional points, with a total of XXX data entries, etc. For example, the question type can be used as the first keyword of the question sample. Second keywords are extracted from the data description information. Keywords for the question sample are generated based on the first and second keywords.

[0117] In addition, personnel information includes the evaluator's basic background, work experience, areas of expertise, types of tasks they are proficient in, and business modules. Keywords for personnel information can be extracted from this information. For example, keyword extraction methods such as TF-IDF can be used.

[0118] Next, Jaccard similarity can be used to calculate the similarity S1 between the keyword set of the question sample and the keyword set of personnel information.
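As a brief illustration, Jaccard similarity is the size of the intersection of two keyword sets divided by the size of their union. The keyword sets below are hypothetical, and returning 0.0 for two empty sets is a convention chosen for this sketch.

```python
from typing import Set


def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


question_keywords = {"rag", "retrieval", "knowledge base", "faq"}
person_keywords = {"rag", "retrieval", "search", "ranking", "faq"}
s1 = jaccard_similarity(question_keywords, person_keywords)
print(s1)  # 0.5 (3 shared keywords out of 6 distinct ones)
```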

[0119] S602. Calculate the semantic representation similarity between each question sample and each person's information.

[0120] It should be noted that, in order to improve deep matching, a deep semantic representation vector is calculated for each question sample and each person's information. A pre-trained semantic model (such as BGE or Sentence-BERT) is used to obtain the deep semantic representation vectors for each question sample and each person's information, and then cosine similarity is used to calculate the semantic representation similarity S2 between each question sample and each person's information.

[0121] S603. Calculate the weighted sum of keyword similarity and semantic representation similarity to determine the matching degree between each question sample and each person's information.

[0122] It's important to note that the weights can be adjusted for different task types. For simple tasks, the keyword similarity weight can be increased. For more complex labeled data, such as those requiring deeper business background and experience, the semantic similarity weight should be increased.

[0123] S604. For each problem sample, the personnel whose matching degree is greater than the predetermined matching degree threshold are selected as the evaluators.

[0124] By using the keyword + semantic matching method described above, accurate matching between data and evaluators can be achieved.
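Steps S602-S604 can be sketched as follows. The sketch assumes the embedding vectors have already been produced by a semantic model (the two-dimensional vectors here are toy values, not BGE or Sentence-BERT output) and that the two weights sum to 1, which the disclosure does not require. All names, weights, and thresholds are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def matching_degree(s1: float, s2: float, keyword_weight: float) -> float:
    """Weighted sum of keyword similarity S1 and semantic similarity S2."""
    return keyword_weight * s1 + (1.0 - keyword_weight) * s2


def select_evaluators(
    question_vec: Sequence[float],
    people: Dict[str, dict],
    keyword_weight: float,
    threshold: float,
) -> List[str]:
    """Return the people whose matching degree exceeds the threshold."""
    selected = []
    for name, info in people.items():
        s2 = cosine_similarity(question_vec, info["vector"])
        if matching_degree(info["s1"], s2, keyword_weight) > threshold:
            selected.append(name)
    return selected


people = {
    "alice": {"s1": 0.5, "vector": [1.0, 0.0]},  # same direction as the question
    "bob": {"s1": 0.2, "vector": [0.0, 1.0]},    # orthogonal to the question
}
selected = select_evaluators([1.0, 0.0], people, keyword_weight=0.4, threshold=0.6)
print(selected)  # ['alice']
```

Lowering `keyword_weight` shifts the decision toward semantic similarity, which corresponds to the adjustment described for tasks that need deeper business background.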

[0125] S203. Obtain multiple sample evaluation results given by the multiple evaluators for the output result of each problem sample.

[0126] In some embodiments, among the acquired multiple sample evaluation results, the consistency rate between the evaluation score and the reason for the score is detected for each sample evaluation result. Sample evaluation results with a consistency rate between the evaluation score and the reason for the score that is less than a predetermined consistency rate threshold are deleted.

[0127] It's important to note that in daily work, due to the diverse backgrounds of different personnel, human evaluation results are often more prone to logical inconsistencies. Therefore, conducting consistency analysis on human evaluation results and filtering out samples with inconsistent evaluation scores and reasons for scoring helps in the subsequent accurate evaluation of the agent.

[0128] For example, when performing consistency analysis using a large model, the prompts used can refer to the inconsistency analysis prompts mentioned above.

[0129] In step 15, based on the nth machine evaluation result and the human evaluation result corresponding to the nth evaluation index among multiple human evaluation results, the optimization prompt word for the nth evaluator is determined.

[0130] In some embodiments, the step of determining the optimization prompt word for the nth evaluator includes steps S701-S703.

[0131] S701. Based on the deviation between the nth machine evaluation result and the human evaluation result corresponding to the nth evaluation index among multiple human evaluation results, optimize the prompt words of the nth evaluator.

[0132] In some embodiments, a difference analysis is performed on samples where human evaluation results and machine evaluation results are inconsistent to analyze the problems of each evaluator, and then the prompt words are optimized.

[0133] For example, an exemplary prompt for the difference analysis is as follows.

[0134]

[0135] Accordingly, an exemplary prompt for optimizing the prompt words is as follows.

[0136]

[0137] S702. Repeatedly evaluate the output result using the nth evaluator based on the prompt of the nth evaluator until the predetermined termination condition is met.

[0138] In some embodiments, the predetermined termination condition includes: the number of iterations equals a predetermined iteration number threshold, or the evaluation consistency rate between the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation index among the multiple human evaluation results (that is, the human-machine evaluation consistency rate) is greater than a predetermined evaluation consistency rate threshold.

[0139] S703. Use the current prompt word of the nth evaluator as the optimized prompt word of the nth evaluator.

[0140] In some embodiments, when the predetermined termination condition is met, the evaluation consistency rate between the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation index among the multiple human evaluation results is calculated and used as the human-machine evaluation consistency rate of the nth evaluator. Next, the weighted sum of the human-machine evaluation consistency rates of the N evaluators is calculated as the confidence level of the evaluation result of the agent under test.
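The termination test and the confidence calculation can be sketched as below. The sketch assumes the consistency rate is the fraction of samples on which the machine and human scores are equal; the disclosure does not fix this definition. It also uses `>=` rather than strict equality for the iteration limit as a defensive choice. The function names, weights, and thresholds are hypothetical.

```python
from typing import Sequence


def consistency_rate(machine: Sequence[int], human: Sequence[int]) -> float:
    """Share of samples on which machine and human scores are equal."""
    if len(machine) != len(human) or not machine:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(m == h for m, h in zip(machine, human)) / len(machine)


def should_stop(iteration: int, rate: float, max_iterations: int, rate_threshold: float) -> bool:
    """Stop when the iteration limit is reached or the rate exceeds the threshold."""
    return iteration >= max_iterations or rate > rate_threshold


def confidence(rates: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the per-evaluator human-machine consistency rates."""
    return sum(w * r for w, r in zip(weights, rates))


rate_1 = consistency_rate([2, 1, 0, 2], [2, 1, 1, 2])
print(rate_1)                                    # 0.75
print(should_stop(3, rate_1, 5, 0.9))            # False
print(confidence([rate_1, 1.0], [0.5, 0.5]))     # 0.875
```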

[0141] In step 16, N evaluators are used to evaluate the output of the agent under test based on the corresponding optimization prompts, and the evaluation results of the agent under test are obtained.

[0142] In some embodiments, the output of the agent under test is evaluated using the nth evaluator based on the optimized prompts of the nth evaluator, resulting in the nth evaluation result. Next, based on the obtained N evaluation results, the evaluation result of the agent under test is determined.

[0143] For example, the evaluation score of the agent under test can be obtained by calculating the weighted sum of the evaluation scores in N evaluation results.

[0144] For example, by outputting the evaluation results of the agent under test and the confidence level of the evaluation results, users can gain a more accurate understanding of the evaluation status of the agent under test.
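One way to compute the weighted evaluation score is sketched here. It assumes each evaluator's scores are first averaged over the test questions and that the weighted sum is normalised by the total weight; both choices, along with the evaluator names and weights, are illustrative assumptions rather than details from the disclosure.

```python
from statistics import mean
from typing import Dict, List


def agent_score(scores: Dict[str, List[float]], weights: Dict[str, float]) -> float:
    """Weighted sum of each evaluator's mean score, normalised by total weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * mean(values) for name, values in scores.items())
    return weighted / total_weight


# Hypothetical scores from two evaluators over four questions (0-2 scale).
scores = {"accuracy": [2, 2, 1, 1], "relevance": [2, 2, 2, 2]}
weights = {"accuracy": 0.5, "relevance": 0.5}
overall = agent_score(scores, weights)
print(overall)  # 1.75
```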

[0145] In the agent evaluation method provided in the above embodiments of this disclosure, the prompt words of each evaluator are optimized by combining the machine evaluation results produced by the evaluator with the human evaluation results provided by the evaluation personnel. The agent can thus be evaluated comprehensively and accurately, which effectively supports its subsequent optimization and iteration.

[0146] Figure 2 is a flowchart illustrating an agent evaluation method according to another embodiment of the present disclosure, including steps 21-211.

[0147] In step 21, prepare the evaluation set and the agent to be tested. Use the agent to be tested to process each question in the evaluation set and obtain the output result for each question.

[0148] In step 22, an evaluator set is generated.

[0149] For example, multiple evaluators are generated according to the steps S101-S103 described above.

[0150] In step 23, the i-th evaluator in the evaluator set uses its own prompt words to evaluate the output of each question and obtain the machine evaluation result.

[0151] In step 24, determine whether the iteration termination condition is met.

[0152] For example, the step of determining whether the iteration termination condition is met is as shown in step S702 above.

[0153] If the iteration termination condition is not met, proceed to step 25. If the iteration termination condition is met, proceed to step 211.

[0154] In step 25, data sampling is performed using sampling rules.

[0155] For example, data sampling is performed according to step S201 above.

[0156] In step 26, the data is sent to the matched evaluator.

[0157] For example, matching evaluators are selected according to the matching rules used in steps S601-S604 above.

[0158] In step 27, the output of each question is manually evaluated by the evaluator.

[0159] In step 28, obtain the manual evaluation results from the evaluators.

[0160] In step 29, a consistency analysis of the evaluation scores and scoring reasons is performed on the manual evaluation results, and samples with inconsistent evaluation scores and scoring reasons are filtered out.

[0161] In step 210, the prompts for the evaluator are optimized based on the machine evaluation results and the human evaluation results, and the process returns to step 23 to achieve iterative optimization of the prompts for the evaluator.

[0162] In step 211, exit the iteration loop and output the evaluation result of the agent under test, as well as the confidence level of the evaluation result of the agent under test.

[0163] For example, the evaluation result of the agent to be tested is determined according to step 16 above.

[0164] In some embodiments, the confidence level of the evaluation result of the agent under test is obtained by calculating the weighted sum of the human-machine evaluation consistency rates of multiple evaluators in the evaluator set.
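The iteration loop of Figure 2 can be sketched for a single evaluator as follows. This is a simplification under stated assumptions: human scores are collected once up front instead of being sampled and dispatched on every pass (steps 25-29), and the machine evaluation and prompt refinement are toy stand-ins for the large-model calls. Every function name and value is hypothetical.

```python
from typing import Callable, List, Tuple


def optimize_evaluator_prompt(
    prompt: str,
    outputs: List[str],
    human_scores: List[int],
    machine_evaluate: Callable[[str, List[str]], List[int]],
    refine_prompt: Callable[[str, List[int], List[int]], str],
    max_iterations: int = 5,
    rate_threshold: float = 0.9,
) -> Tuple[str, float]:
    """Evaluate, compare with human scores, and refine until termination."""
    if max_iterations < 1 or not outputs:
        raise ValueError("need at least one iteration and one output")
    for iteration in range(1, max_iterations + 1):
        machine_scores = machine_evaluate(prompt, outputs)            # step 23
        matches = sum(m == h for m, h in zip(machine_scores, human_scores))
        rate = matches / len(outputs)
        if rate > rate_threshold or iteration == max_iterations:      # step 24
            break
        prompt = refine_prompt(prompt, machine_scores, human_scores)  # step 210
    return prompt, rate                                               # step 211


def toy_machine_evaluate(prompt: str, outputs: List[str]) -> List[int]:
    # Toy rule: the prompt ends with a minimum answer length; longer answers score 2.
    min_len = int(prompt.rsplit("=", 1)[1])
    return [2 if len(output) >= min_len else 0 for output in outputs]


def toy_refine_prompt(prompt: str, machine: List[int], human: List[int]) -> str:
    # Toy rule: relax the length requirement by 5 characters per iteration.
    head, min_len = prompt.rsplit("=", 1)
    return f"{head}={max(0, int(min_len) - 5)}"


outputs = ["ok", "a longer answer", "a very detailed and long answer"]
human_scores = [0, 2, 2]
final_prompt, final_rate = optimize_evaluator_prompt(
    "min_len=25", outputs, human_scores, toy_machine_evaluate, toy_refine_prompt
)
print(final_prompt, final_rate)  # min_len=15 1.0
```

The returned rate is the human-machine evaluation consistency rate of this evaluator at exit, which is the quantity that would feed the confidence level described above.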

[0165] Figure 3 is a schematic diagram of the structure of an agent evaluation device according to an embodiment of the present disclosure.

[0166] As shown in Figure 3, the agent evaluation device 30 can take the form of a general-purpose computing device. The agent evaluation device 30 includes a memory 31, a processor 32, and a bus 33 connecting different system components.

[0167] The memory 31 may include, for example, system memory and non-volatile storage media. System memory may store, for example, an operating system, application programs, a boot loader, and other programs. System memory may include volatile storage media, such as random access memory (RAM) and/or cache memory. Non-volatile storage media may store, for example, instructions for executing a corresponding embodiment of the agent evaluation method. Non-volatile storage media include, but are not limited to, disk storage, optical storage, and flash memory.

[0168] Processor 32 can be implemented using a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Accordingly, each module, such as the acquisition module, calculation module, and adjustment module, can be implemented either by a central processing unit (CPU) executing instructions stored in memory to perform the corresponding steps, or by dedicated circuitry that performs the corresponding steps.

[0169] For example, processor 32 is configured to execute, based on instructions stored in the memory, the method of any of the embodiments shown in Figures 1 and 2.

[0170] Bus 33 can use any of the various bus architectures. For example, bus architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, and the Peripheral Component Interconnect (PCI) bus.

[0171] The interfaces 34, 35, and 36 of the intelligent agent evaluation device 30, as well as the memory 31 and processor 32, can be connected via bus 33. Input / output interface 34 provides a connection interface for input / output devices such as a monitor, mouse, and keyboard. Network interface 35 provides a connection interface for various networked devices. Storage interface 36 provides a connection interface for external storage devices such as floppy disks, USB flash drives, and SD cards.

[0172] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus, and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations thereof, can be implemented by computer-readable program instructions.

[0173] These computer-readable program instructions are provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable device to produce a machine, such that execution of the instructions by the processor produces means for implementing the functions specified in one or more boxes of the flowchart and / or block diagram.

[0174] These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause a computer to work in a particular manner to produce an article of manufacture, including instructions that implement the functions specified in one or more boxes in a flowchart and / or block diagram.

[0175] This disclosure may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects.

[0176] This disclosure also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the method of any of the embodiments shown in Figures 1 and 2.

[0177] This disclosure also provides a computer program product including computer instructions, wherein the computer instructions, when executed by a processor, implement the method of any of the embodiments shown in Figures 1 and 2.

[0178] By implementing the above embodiments of this disclosure, the following beneficial effects can be obtained.

[0179] 1) To effectively address the issue of unreliable evaluation results, this disclosure provides a human-machine collaborative evaluation method. The consistency rate between machine and human evaluation results, i.e., the human-machine consistency rate, is used as an indicator of the reliability of the evaluation results. Furthermore, based on comparative analysis of human and machine evaluations, the evaluator is continuously iterated and optimized to improve the reliability of the evaluation.

[0180] 2) To achieve efficient evaluation and reduce costs, this disclosure employs data sampling and precise data distribution to achieve accurate annotation with minimal manpower, thereby improving annotation efficiency. Furthermore, the annotation information can be accumulated into a high-quality evaluation training set for iterative optimization of the evaluator, maximizing its value.

[0181] In some embodiments, the functional units described above may be implemented as general-purpose processors, programmable logic controllers (PLCs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or any suitable combination thereof for performing the functions described herein.

[0182] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.

[0183] The description in this disclosure is provided for illustrative and descriptive purposes only and is not intended to be exhaustive or to limit the disclosure to its forms. Many modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to better illustrate the principles and practical application of this disclosure and to enable those skilled in the art to understand this disclosure and to design various embodiments with various modifications suitable for a particular purpose.

Claims

1. An agent evaluation method, comprising: generating multiple evaluators, wherein the nth evaluator corresponds to the nth evaluation index of the agent under test, 1 ≤ n ≤ N, and N represents the total number of evaluation indices; using the agent under test to process each question in a test set to obtain an output result for each question; evaluating the output result using the nth evaluator based on the prompt words of the nth evaluator to obtain the nth machine evaluation result of the output result; obtaining multiple human evaluation results of the output result; determining optimization prompt words of the nth evaluator based on the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation index among the multiple human evaluation results; and evaluating the output result of the agent under test using the N evaluators based on their corresponding optimization prompt words to obtain an evaluation result of the agent under test.

2. The agent evaluation method according to claim 1, wherein obtaining the evaluation result of the agent under test comprises: evaluating the output result of the agent under test using the nth evaluator based on the optimization prompt words of the nth evaluator to obtain the nth evaluation result; and determining the evaluation result of the agent under test based on the N evaluation results obtained.

3. The agent evaluation method according to claim 1, wherein obtaining the multiple human evaluation results of the output result comprises: sampling problems from the test set according to preset rules to generate a problem sample set; selecting multiple evaluators from a personnel information database; and obtaining multiple sample evaluation results given by the multiple evaluators for the output result of each problem sample.

4. The agent evaluation method according to claim 3, further comprising: detecting, for each of the obtained sample evaluation results, the consistency rate between the evaluation score and the reason for the score; and deleting the sample evaluation results for which the consistency rate between the evaluation score and the reason for the score is less than a predetermined consistency rate threshold.

5. The agent evaluation method according to claim 3, wherein, The preset rules include at least one of the following: a first sub-rule that samples based on problem confidence, a second sub-rule that samples based on problem type, and a third sub-rule that samples based on the consistency of evaluation results.

6. The agent evaluation method according to claim 5, wherein sampling problems in the test set according to the first sub-rule comprises: evaluating the output result of the m-th question in the test set using the N evaluators to obtain N machine evaluation results, wherein 1 ≤ m ≤ M and M represents the total number of questions; determining whether the N machine evaluation results are consistent; and if the N machine evaluation results are inconsistent, taking the m-th question as a problem sample to generate the problem sample set.

7. The agent evaluation method according to claim 5, wherein, Problem sampling in the test set according to the second sub-rule includes: All questions in the test set are clustered according to question type to obtain multiple question sets; Problems are sampled from each problem set according to a preset ratio to generate the problem sample set.

8. The agent evaluation method according to claim 5, wherein sampling problems in the test set according to the third sub-rule comprises: evaluating the output result of the m-th question in the test set using the N evaluators to obtain N machine evaluation results, wherein each machine evaluation result includes an evaluation score and a reason for the score, 1 ≤ m ≤ M, and M represents the total number of questions; and if there is a machine evaluation result among the N machine evaluation results in which the evaluation score and the reason for the score are inconsistent, taking the m-th question as a problem sample to generate the problem sample set.

9. The agent evaluation method according to claim 3, wherein, The selection of the multiple evaluators from the personnel information database includes: Calculate the keyword similarity between each question sample in the question sample set and each piece of information about a person in the personnel information database; Calculate the semantic representation similarity between each question sample and each person's information; Calculate the weighted sum of the keyword similarity and the semantic representation similarity to determine the matching degree between each question sample and each person's information; For each problem sample, the personnel whose matching degree is greater than a predetermined matching degree threshold are selected as evaluators.

10. The agent evaluation method according to claim 1, wherein determining the optimization prompt words of the nth evaluator comprises: optimizing the prompt words of the nth evaluator based on the deviation between the nth machine evaluation result and the human evaluation results corresponding to the nth evaluation index among the multiple human evaluation results; repeating the step of evaluating the output result using the nth evaluator based on the prompt words of the nth evaluator until a predetermined termination condition is met; and using the current prompt words of the nth evaluator as the optimization prompt words of the nth evaluator.

11. The agent evaluation method according to claim 10, wherein, The predetermined termination conditions include: the number of iterations equals a predetermined iteration number threshold, or the evaluation consistency rate between the nth machine evaluation result and the multiple human evaluation results corresponding to the nth evaluation index is greater than a predetermined evaluation consistency rate threshold.

12. The agent evaluation method according to claim 10, further comprising: When the predetermined termination condition is met, the evaluation consistency rate of the nth machine evaluation result and the multiple human evaluation results corresponding to the nth evaluation index is calculated, and it is used as the human-machine evaluation consistency rate of the nth evaluator. The weighted sum of the human-machine evaluation consistency rates of the N evaluators is calculated as the confidence level of the evaluation result of the agent under test.

13. The agent evaluation method according to any one of claims 1-12, wherein generating the multiple evaluators comprises: obtaining description information of the agent under test; determining multiple evaluation indices based on a first machine learning model according to the description information; and generating an evaluator for each evaluation index based on a second machine learning model according to the description information.

14. The agent evaluation method according to claim 13, wherein determining the multiple evaluation indices based on the first machine learning model according to the description information comprises: determining the intermediate links of the agent under test according to the description information; and when the number of intermediate links is greater than a predetermined number, determining the multiple evaluation indices based on the first machine learning model, wherein each evaluation index corresponds to at least one of the intermediate links, and the multiple evaluation indices correspond to all of the intermediate links.

15. The agent evaluation method according to claim 14, further comprising: If the number of intermediate links is less than or equal to the predetermined number, an evaluation index and a prompt word are generated based on the description information using a third machine learning model, wherein the evaluator includes the evaluation index and the prompt word.

16. The agent evaluation method according to claim 13, wherein, The generation of evaluators for each evaluation metric based on the second machine learning model includes: Based on the second machine learning model, prompt words for each evaluation indicator are determined according to the description information. The prompt words include the input requirements of the intermediate link corresponding to the evaluation indicator, the output requirements of the intermediate link corresponding to the evaluation indicator, the evaluation criteria, the scoring rules, and the output requirements of the evaluator. The evaluator includes the evaluation indicator and the prompt words.

17. An agent evaluation device, comprising: a memory; and a processor coupled to the memory, the processor being configured to implement, based on instructions stored in the memory, the agent evaluation method according to any one of claims 1-16.

18. A computer-readable storage medium, wherein, The computer-readable storage medium stores computer instructions that, when executed by a processor, implement the agent evaluation method as described in any one of claims 1-16.

19. A computer program product comprising computer instructions, wherein the computer instructions, when executed by a processor, implement the agent evaluation method as described in any one of claims 1-16.