Model evaluation method, device, apparatus and storage medium
By parsing the evaluation scenario description input by the user and matching it against a skill library, an evaluation task is generated. This addresses the gap between existing model evaluation systems and real business scenarios and enables scenario-based, accurate model evaluation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA MERCHANTS BANK
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-23
Figure CN122262004A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of model evaluation technology, and in particular to a model evaluation method, apparatus, device and storage medium.
Background Technology
[0002] The large language model evaluation paradigm consists of three steps: first, select an appropriate benchmark evaluation set based on the model capabilities to be evaluated; second, run the model to obtain the response results on the benchmark evaluation set; and third, select evaluation metrics and calculate the model score.
[0003] Current evaluation systems can only assess whether a model is solid in basic capabilities such as language understanding, logical reasoning, knowledge reserves, code generation, and mathematical ability; they do not reflect how the model performs in users' real-world business scenarios.
Summary of the Invention
[0004] The main purpose of this application is to provide a model evaluation method, apparatus, device and storage medium, which aims to solve the technical problem that there is a gap between the current model evaluation system and real business scenarios.
[0005] To achieve the above objectives, this application proposes a model evaluation method, which includes: parsing the evaluation scenario description input by the user to obtain a parsing result; determining the evaluation requirements based on the parsing result, and determining the capability dimensions to be evaluated based on the evaluation requirements; determining the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and evaluating the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain an evaluation result.
[0006] Optionally, determining the evaluation skills corresponding to the capability dimension to be evaluated based on a preset skill library includes: obtaining the skill description of each skill in the preset skill library; determining the semantic similarity between each skill description and the capability dimension to be evaluated; and determining the evaluation skill corresponding to the capability dimension to be evaluated based on the semantic similarity.
[0007] Optionally, determining the evaluation skill corresponding to the capability dimension to be evaluated based on the semantic similarity includes: if the semantic similarity is less than a preset similarity threshold, determining skill generation prompt words based on the capability dimension to be evaluated and the evaluation scenario description; invoking a preset skill generation model based on the skill generation prompt words to generate a target skill; receiving feedback information from the user on the target skill; and determining the evaluation skill corresponding to the capability dimension to be evaluated based on the feedback information.
[0008] Optionally, determining the evaluation skill corresponding to the capability dimension to be evaluated based on the feedback information includes: if the feedback information indicates a custom skill, displaying a skill template; obtaining the user's custom skill based on the skill template; and determining the evaluation skill corresponding to the capability dimension to be evaluated based on the custom skill.
[0009] Optionally, after determining the evaluation skill corresponding to the capability dimension to be evaluated based on the custom skill, the method includes: expanding the custom skill to obtain expanded skills; determining the skill similarity between each expanded skill and each skill in the preset skill library; determining the skills to be added to the library based on the skill similarity; and, if a skill to be added meets the preset admission conditions, adding it to the preset skill library.
[0010] Optionally, evaluating the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain an evaluation result includes: generating an evaluation task based on the evaluation skills and the evaluation requirements, where the evaluation task includes input prompts and reference answers; scheduling the model to be evaluated to execute the evaluation task and collecting the evaluation response data; and determining the evaluation result based on the evaluation response data.
[0011] Optionally, determining the evaluation result based on the evaluation response data includes: determining the evaluation score corresponding to each evaluation task based on the evaluation response data; determining the evaluation weight corresponding to each evaluation task based on the evaluation requirements; and determining the evaluation result of the model to be evaluated based on the evaluation scores and the evaluation weights.
[0012] Furthermore, to achieve the above objectives, this application also proposes a model evaluation apparatus, which includes: a parsing module, used to parse the evaluation scenario description input by the user and obtain a parsing result; a determination module, used to determine the evaluation requirements based on the parsing result and to determine the capability dimensions to be evaluated based on the evaluation requirements; a matching module, used to determine the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and an evaluation module, used to evaluate the model to be evaluated based on the evaluation skills and the evaluation requirements and obtain an evaluation result.
[0013] In addition, to achieve the above objectives, this application also proposes a model evaluation device, the device comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the model evaluation method as described above.
[0014] In addition, to achieve the above objectives, this application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, it implements the steps of the model evaluation method described above.
[0015] In addition, to achieve the above objectives, this application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the steps of the model evaluation method described above.
[0016] This application parses the evaluation scenario description input by the user to obtain the parsing result; determines the evaluation requirements based on the parsing result, and determines the capability dimensions to be evaluated based on the evaluation requirements; determines the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and evaluates the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain the evaluation result. Because this application evaluates model performance in a specific scenario based on the evaluation scenario description input by the user, the above method, compared to the existing unified model performance evaluation system, moves the evaluation dimensions from general capabilities to scenario competence, directly measuring the efficiency and effectiveness of the model in solving practical problems and more realistically reflecting the model's practical application value.
Description of the Drawings
[0017] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0018] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a flowchart of Embodiment 1 of the model evaluation method of this application;
Figure 2 is a flowchart of Embodiment 2 of the model evaluation method of this application;
Figure 3 is a schematic diagram of the skill matching process provided in Embodiment 2 of the model evaluation method of this application;
Figure 4 is a flowchart of Embodiment 3 of the model evaluation method of this application;
Figure 5 is a schematic diagram of the overall process of the model evaluation method in Embodiment 3 of this application;
Figure 6 is a schematic diagram of the module structure of the model evaluation apparatus in an embodiment of this application;
Figure 7 is a schematic diagram of the hardware operating environment involved in the model evaluation method in an embodiment of this application.
[0020] The purpose, features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0021] It should be understood that the specific embodiments described herein are merely illustrative of the technical solutions of this application and are not intended to limit this application.
[0022] To better understand the technical solution of this application, a detailed description will be provided below in conjunction with the accompanying drawings and specific implementation methods.
[0023] It should be noted that the executing entity in this embodiment can be a computing service device with data processing, network communication, and program execution functions, such as a tablet computer, personal computer, or mobile phone, or an electronic device or model evaluation device capable of performing the above functions. The following description uses a model evaluation device as an example to illustrate this embodiment and the subsequent embodiments.
[0024] Based on this, embodiments of this application provide a model evaluation method. Referring to Figure 1, Figure 1 is a flowchart illustrating the first embodiment of the model evaluation method in this application.
[0025] In this embodiment, the model evaluation method includes the following steps: Step S10: Parse the evaluation scenario description input by the user to obtain the parsing result. It should be noted that the evaluation scenario description can be the evaluation needs, domain, background, and conditions that the user expresses in unstructured natural language. For example: "I want to test whether this conversational AI is rigorous and professional enough in answering medical and health questions, and at the same time evaluate its ability to cope when deliberately challenged by users." Parsing the evaluation scenario description transforms vague user needs into structured evaluation requirements. The parsing result includes information such as the application domain, target users, and capability dimensions. Capability dimensions are the aspects of model performance to be evaluated, such as intent understanding, credit card knowledge, and professional response.
[0026] In practical implementation, the model evaluation device includes a scenario parsing module, which serves as the system's entry point. This module calls the parsing model to analyze the natural language scenario description input by the user and transforms vague user needs into structured evaluation requirements, including the application domain, target users, and capability dimensions. For example, the scenario description input by the user might be: "I want to test the performance of a large language model in credit card business, requiring the model to understand customer questions about bills, points, and overdue payments, and provide professional answers according to bank regulations." The output of this module could be:
{
  "domain": "Finance_Banking_Credit Card",  // Application domain
  "user": "Credit card customer",  // Target user
  "capabilities": ["Intent understanding", "Credit card knowledge", "Professional response"]  // Capability dimensions
}
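As an illustration of how the structured output of the scenario parsing module might be consumed, the following is a minimal sketch. The `ParseResult` class and `load_parse_result` function are hypothetical names introduced here; only the three field names come from the example output, and the parsing model itself is not shown.

```python
import json
from dataclasses import dataclass


@dataclass
class ParseResult:
    """Structured evaluation requirement; field names follow the example output."""
    domain: str
    user: str
    capabilities: list[str]


def load_parse_result(raw: str) -> ParseResult:
    """Validate the parsing model's JSON output and convert it to a ParseResult."""
    data = json.loads(raw)
    missing = {"domain", "user", "capabilities"} - set(data)
    if missing:
        raise ValueError(f"parse result is missing fields: {sorted(missing)}")
    if not data["capabilities"]:
        raise ValueError("at least one capability dimension is required")
    return ParseResult(data["domain"], data["user"], list(data["capabilities"]))


# Hypothetical raw output of the parsing model for the credit card scenario.
raw_output = json.dumps({
    "domain": "Finance_Banking_Credit Card",
    "user": "Credit card customer",
    "capabilities": ["Intent understanding", "Credit card knowledge", "Professional response"],
})
result = load_parse_result(raw_output)
print(result.capabilities)
```

Rejecting incomplete output at this point keeps later steps from running on a requirement with no capability dimensions; that check is a design choice of this sketch, not something the application prescribes.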
[0027] Step S20: Determine the evaluation requirements based on the parsing result, and determine the capability dimensions to be evaluated based on the evaluation requirements. It should be noted that the parsing result contains the user's evaluation requirements. Continuing the above example, the evaluation requirements could be: the application domain is finance / banking / credit cards, the target users are credit card customers, and the capability dimensions are "intent understanding," "credit card knowledge," and "professional response." The capability dimensions to be evaluated can be the capability dimensions in the evaluation requirements.
[0028] Step S30: Determine the evaluation skills corresponding to the capability dimension to be evaluated based on the preset skill library. It should be noted that the preset skill library can be a pre-maintained, extensible, and standardized "skill" definition system, where each skill is a standardized description of a basic capability of the model. The preset skill library can be expanded in three ways:
First, dynamic expansion based on user feedback: skills manually added by users during skill matching are automatically stored in the preset skill library for later reuse.
Second, automatic discovery of new skills based on evaluation scenarios: if existing skills cannot cover user needs, a large language model is used in conjunction with the user's scenario description to generate candidate skills, which are stored in the preset skill library after user confirmation.
Third, generalization based on existing skills: when a new skill is added to the library, related skills can be automatically recommended.
Each skill includes the following attributes:
(1) Skill ID: a unique identifier for the skill;
(2) Skill Name: the name of the skill;
(3) Skill Description: a detailed explanation of the skill's essence;
(4) Task Template: a standardized task template for generating the evaluation dataset. The template is defined in JSON format and includes input field constraints, output field constraints, and reference instructions;
(5) Related evaluation metrics: specific metrics in the evaluation metric library, along with the weight corresponding to each metric. The evaluation metric library can include various metrics used to evaluate model performance, such as factual accuracy, response consistency, and compliance.
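The skill attributes described in this paragraph could be modelled in code roughly as follows. This is a sketch under assumptions: the `Skill` class and `skill_from_dict` helper are hypothetical, and it assumes metric weights arrive as `[metric, weight]` pairs with the weight given as a string.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """One entry of the preset skill library (attribute names are illustrative)."""
    skill_id: str
    skill_name: str
    skill_description: str
    task_template: dict
    evaluation_metrics: dict[str, float] = field(default_factory=dict)


def skill_from_dict(raw: dict) -> Skill:
    """Build a Skill, converting [metric, weight] pairs into a name-to-weight map."""
    metrics = {name: float(weight) for name, weight in raw.get("evaluation_metrics", [])}
    return Skill(
        skill_id=raw["skill_id"],
        skill_name=raw["skill_name"],
        skill_description=raw["skill_description"],
        task_template=raw["task_template"],
        evaluation_metrics=metrics,
    )


# Abbreviated, hypothetical library entry used only to exercise the helper.
skill = skill_from_dict({
    "skill_id": "FIN_QA_001",
    "skill_name": "Financial Knowledge Q&A",
    "skill_description": "Answer users' financial questions accurately and clearly.",
    "task_template": {"input_schema": {}, "output_schema": {}, "generation_instruction": ""},
    "evaluation_metrics": [["METRIC_ACCURACY", "4"], ["METRIC_COHERENCE", "2"]],
})
print(skill.evaluation_metrics)
```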
[0029] An example skill is shown below:
{
  "skill_id": "FIN_QA_001",  // Skill ID
  "skill_name": "Financial Knowledge Q&A",  // Skill name
  "skill_description": "Based on professional knowledge in the financial field (covering banking, insurance, funds, wealth management, loans, credit cards, etc.), accurately and clearly answer users' financial-related questions, including but not limited to explanations of concepts, business processes, and interpretations of policies and regulations. Answers should be easy to understand while ensuring the accuracy and timeliness of the information.",  // Skill description
  "task_template": {  // Task template
    "input_schema": {  // Input fields
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "Financial questions raised by users"
        }
      },
      "required": ["query"]  // Required field: query
    },
    "output_schema": {  // Output fields
      "type": "object",
      "properties": {
        "answer": {
          "type": "string",
          "description": "A detailed answer to the question"
        }
      },
      "required": ["answer"]  // Required field: answer
    },
    "generation_instruction": "Based on the financial questions raised by users and incorporating relevant financial knowledge, generate accurate, clear, and comprehensive answers. Answers should be based on facts and avoid subjective assumptions. If specific data or regulations are involved, provide sources whenever possible. Answers should be easy to understand and suitable for the average user."  // Reference instructions
  },
  "evaluation_metrics": [
    ["METRIC_ACCURACY", "4"],  // Evaluation metric: factual accuracy, weight 4
    ["METRIC_COHERENCE", "2"],  // Evaluation metric: response coherence, weight 2
    ["METRIC_Compliance", "4"]  // Evaluation metric: compliance, weight 4
  ]
}
It should be noted that the evaluation skills corresponding to the capability dimensions to be evaluated can be determined by performing similarity matching between each capability dimension to be evaluated and each skill in the preset skill library. For each capability dimension, the skill with the highest similarity in the preset skill library is taken as the skill corresponding to that dimension, and the skills corresponding to all capability dimensions to be evaluated together form the evaluation skills. For example, the "intent understanding" capability is matched with the "intent recognition" skill in the preset skill library, and the "credit card knowledge" capability is matched with the "financial knowledge Q&A" skill in the preset skill library.
[0030] Step S40: Evaluate the model to be evaluated based on the evaluation skills and evaluation requirements to obtain the evaluation results.
[0031] It should be noted that the model to be evaluated can be evaluated as follows: an evaluation task, which can include evaluation questions and reference answers, is generated based on the evaluation requirements and evaluation skills. The generated evaluation questions are input into the model to be evaluated to obtain the actual answers output by the model. Each actual answer is compared with the corresponding reference answer to obtain the evaluation result of the model to be evaluated; for example, the similarity between the actual answer and the reference answer can be used as the accuracy of the model to be evaluated.
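The comparison described in this paragraph can be sketched as follows. The application does not name a similarity measure for this step, so the character-level ratio from Python's `difflib` is used purely as a stand-in; the task keys `question` and `answer` and the `echo_model` callable are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable


def answer_similarity(actual: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for any text similarity measure."""
    return SequenceMatcher(None, actual.lower(), reference.lower()).ratio()


def evaluate(model: Callable[[str], str], tasks: list[dict]) -> float:
    """Average similarity between the model's answers and the reference answers."""
    scores = [answer_similarity(model(task["question"]), task["answer"]) for task in tasks]
    return sum(scores) / len(scores)


# Hypothetical evaluation task and a trivial stand-in for the model to be evaluated.
tasks = [{
    "question": "Does paying only the minimum payment accrue interest?",
    "answer": "Yes, interest accrues on the full balance.",
}]


def echo_model(question: str) -> str:
    return "Yes, interest accrues on the full balance."


print(evaluate(echo_model, tasks))
```

A surface-level ratio rewards wording overlap rather than correctness, so in practice a semantic measure or metric-specific scoring would likely replace it.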
[0032] This embodiment parses the user-inputted evaluation scenario description to obtain the parsing result; determines the evaluation requirements based on the parsing result, and determines the capability dimensions to be evaluated based on the evaluation requirements; determines the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and evaluates the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain the evaluation result. Since this embodiment evaluates the model performance in a specific scenario based on the user-inputted evaluation scenario description, compared to the existing unified model performance evaluation system, the above method in this embodiment can achieve a leap in model evaluation dimensions from general capabilities to scenario competence, directly measuring the efficiency and effectiveness of the model in solving practical problems, and more realistically reflecting the model's practical application value.
[0033] The second embodiment of this application builds on the first embodiment; for content that is the same as or similar to the first embodiment, refer to the above description, which is not repeated here. Referring to Figure 2, Figure 2 is a flowchart illustrating the second embodiment of the model evaluation method of this application. Step S30 further includes the following steps: Step S301: Obtain the skill description of each skill in the preset skill library. It should be noted that, as in the skill example of the above embodiment, each skill has a corresponding skill description. For example, for the financial knowledge Q&A skill, the skill description is: Based on professional knowledge in the financial field (covering banking, insurance, funds, wealth management, loans, credit cards, etc.), accurately and clearly answer users' financial-related questions, including but not limited to explanations of concepts, business processes, and interpretations of policies and regulations. The answers should be easy to understand, while ensuring the accuracy and timeliness of the information.
[0034] Step S302: Determine the semantic similarity between the skill description and the dimension of the ability to be evaluated; It should be noted that determining the semantic similarity between the skill description and the dimension of the ability to be evaluated can be achieved by calling the semantic model Sentence-BERT to calculate the semantic similarity between each parsed dimension of the ability to be evaluated and each skill description in the preset skill library.
[0035] Step S303: Determine the evaluation skills corresponding to the dimension of ability to be evaluated based on the semantic similarity.
[0036] It should be noted that determining the evaluation skill corresponding to the capability dimension to be evaluated based on the semantic similarity can be done by selecting the skill with the highest semantic similarity as the evaluation skill corresponding to the capability dimension to be evaluated.
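Steps S301 to S303 can be sketched as below. The application computes similarity with Sentence-BERT; to keep the example self-contained, a toy bag-of-words vector is substituted for the sentence embedding, so the numbers it produces are not what Sentence-BERT would return. The function names and the two-entry library are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a Sentence-BERT vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_skill(dimension: str, library: dict[str, str]) -> tuple[str, float]:
    """Return (skill name, similarity) for the closest skill description."""
    query = embed(dimension)
    scored = {name: cosine(query, embed(desc)) for name, desc in library.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


# Hypothetical skill descriptions keyed by skill name.
library = {
    "Intent recognition": "Identify the intent behind a customer question",
    "Financial knowledge Q&A": "Answer questions about banking, loans and credit card knowledge",
}
print(best_skill("credit card knowledge", library))
```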
[0037] Furthermore, to make the selected evaluation skills better match the user's evaluation needs, step S303 may include: when the semantic similarity is less than a preset similarity threshold, determining skill generation prompt words based on the capability dimension to be evaluated and the evaluation scenario description; invoking a preset skill generation model based on the skill generation prompt words to generate a target skill; receiving feedback information from the user on the target skill; and determining the evaluation skill corresponding to the capability dimension to be evaluated based on the feedback information.
[0038] It should be noted that the preset similarity threshold can be a pre-set similarity threshold used to determine the assessment skills corresponding to the capability dimension to be evaluated. If the semantic similarity is less than the preset similarity threshold, it can be determined that the skills in the current preset skill library do not match the capability dimension to be evaluated. In this case, a target skill can be generated through model generation. Specifically, skill generation prompts are determined based on the capability dimension to be evaluated and the assessment scenario description input by the user. The skill generation prompts are used to instruct the preset skill generation model to generate the target skill corresponding to the capability dimension to be evaluated based on the assessment scenario description. The preset skill generation model can be a pre-trained large language model for generating skills, which has pre-learned the basic templates of skills. The generated target skill is sent to the user so that the user can select or adjust it to obtain the final assessment skill.
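The threshold fallback described above might look like the following sketch. The threshold value, the prompt wording, and the names `build_skill_prompt`, `resolve_skill`, and `fake_generator` are all assumptions made for illustration; the application does not specify them, and a real deployment would call the preset skill generation model where the stub is used.

```python
from typing import Callable

SIMILARITY_THRESHOLD = 0.6  # assumed value; the source does not fix a number


def build_skill_prompt(dimension: str, scenario: str) -> str:
    """Compose skill generation prompt words from the dimension and the scenario."""
    return (
        "You are designing an evaluation skill.\n"
        f"Evaluation scenario: {scenario}\n"
        f"Capability dimension without a matching skill: {dimension}\n"
        "Return one skill with skill_name, skill_description, task_template "
        "and evaluation_metrics."
    )


def resolve_skill(
    dimension: str,
    scenario: str,
    best_match: tuple[str, float],
    generate: Callable[[str], str],
) -> dict:
    """Use the library match if it reaches the threshold, otherwise generate a candidate."""
    name, similarity = best_match
    if similarity >= SIMILARITY_THRESHOLD:
        return {"source": "library", "skill": name, "needs_user_feedback": False}
    candidate = generate(build_skill_prompt(dimension, scenario))
    # A generated skill is only a candidate until the user selects or adjusts it.
    return {"source": "generated", "skill": candidate, "needs_user_feedback": True}


def fake_generator(prompt: str) -> str:
    """Stand-in for the preset skill generation model."""
    return "Bank customer service response"


scenario = "Evaluate a large language model in credit card business"
print(resolve_skill("Professional response", scenario, ("Financial knowledge Q&A", 0.31), fake_generator))
```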
[0039] Furthermore, if the skills generated by the model still fail to meet user expectations, the user can define a custom skill. The step of determining the evaluation skill corresponding to the capability dimension to be evaluated based on the feedback information includes: if the feedback information indicates a custom skill, displaying a skill template; obtaining the user's custom skill based on the skill template; and determining the evaluation skill corresponding to the capability dimension to be evaluated based on the custom skill.
[0040] It should be noted that the skill template can refer to the attribute information included in each skill described in Embodiment 1, and the skill template includes the aforementioned attribute information. The custom skill can be a skill filled in by the user based on the skill template. Determining the assessment skill corresponding to the ability dimension to be evaluated based on the custom skill can be done by using the custom skill as the assessment skill.
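Before a custom skill filled in from the skill template is used as an evaluation skill, it could be checked for completeness. The sketch below is one possible check, not part of the described method: the `REQUIRED_ATTRIBUTES` mapping and `validate_custom_skill` function are hypothetical, and the attribute names are taken from the skill example in Embodiment 1.

```python
# Attribute names follow the skill example; the expected types are an assumption.
REQUIRED_ATTRIBUTES = {
    "skill_id": str,
    "skill_name": str,
    "skill_description": str,
    "task_template": dict,
    "evaluation_metrics": list,
}


def validate_custom_skill(skill: dict) -> list[str]:
    """Return a list of problems; an empty list means the template is fully filled in."""
    problems = []
    for name, expected in REQUIRED_ATTRIBUTES.items():
        if name not in skill:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(skill[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
        elif not skill[name]:
            problems.append(f"{name} must not be empty")
    return problems


# Hypothetical user draft that leaves out the evaluation metrics.
draft = {
    "skill_id": "CUSTOM_001",
    "skill_name": "Bank customer service response",
    "skill_description": "Reply to customer questions in a professional service tone.",
    "task_template": {"generation_instruction": "Answer politely and accurately."},
}
print(validate_custom_skill(draft))
```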
[0041] Furthermore, to expand the preset skill library, after determining the evaluation skill corresponding to the capability dimension to be evaluated based on the custom skill, the method includes: expanding the custom skill to obtain expanded skills; determining the skill similarity between each expanded skill and each skill in the preset skill library; determining the skills to be added to the library based on the skill similarity; and, if a skill to be added meets the preset admission conditions, adding it to the preset skill library.
[0042] It should be noted that the skill expansion of the custom skill can be achieved by calling the preset skill generation model to expand the custom skill, generating multiple variant skills, i.e., expanded skills. For example, skills related to the custom skill "bank customer service response" include "customer complaint emotion recognition," "complaint handling," and "script generation." Determining the skill similarity between the expanded skill and each skill in the preset skill library can be achieved by calculating the textual or semantic similarity between the skill description of the expanded skill and the skill descriptions of each skill in the preset skill library. If the skill similarity is less than a preset similarity threshold, the skill is added to the preset skill library. If the skill similarity is greater than or equal to the preset similarity threshold, it indicates that a skill with high similarity already exists in the preset skill library, and therefore, it does not need to be added.
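The admission rule in this paragraph (add an expanded skill only when no sufficiently similar skill already exists) can be sketched as follows. The paragraph allows textual or semantic similarity; this sketch uses the textual ratio from `difflib`. The threshold value, function names, and sample descriptions are assumptions.

```python
from difflib import SequenceMatcher

ADMISSION_THRESHOLD = 0.8  # assumed value; the source only says "preset similarity threshold"


def text_similarity(a: str, b: str) -> float:
    """Textual similarity in [0, 1] between two skill descriptions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def admit_new_skills(expanded: dict[str, str], library: dict[str, str]) -> list[str]:
    """Add expanded skills that have no close match in the library; return their names."""
    admitted = []
    for name, description in expanded.items():
        highest = max(
            (text_similarity(description, existing) for existing in library.values()),
            default=0.0,
        )
        if highest < ADMISSION_THRESHOLD:
            library[name] = description
            admitted.append(name)
    return admitted


# Hypothetical library and expanded skills derived from "bank customer service response".
library = {"Complaint handling": "Resolve customer complaints politely"}
expanded = {
    "Complaint resolution": "Resolve customer complaints politely",
    "Script generation": "Generate response scripts for service agents",
}
admitted = admit_new_skills(expanded, library)
print(admitted)
```

Because admitted skills are inserted immediately, later candidates in the same batch are also compared against them, which prevents two near-identical variants from entering together.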
[0043] In practical implementation, the model evaluation device also includes a skill matching module. This module obtains skill information from the preset skill library and matches the evaluation requirements parsed by the scenario parsing module against the existing skills in the preset skill library. The matching mechanism is as follows (see Figure 3, which is a schematic diagram of the skill matching process provided in Embodiment 2 of the model evaluation method of this application):
(1) Call the semantic model SBERT (Sentence-BERT) to calculate the semantic similarity between each parsed capability dimension to be evaluated and the description of each skill in the preset skill library.
(2) For each capability dimension to be evaluated, filter the skills whose similarity reaches the preset similarity threshold, and return the qualifying skills to the user in descending order of similarity for selection and adjustment. If the user selects a skill, the match succeeds (e.g., the "intent understanding" capability matches the "intent recognition" skill in the preset skill library, and the "credit card knowledge" capability matches the "financial knowledge Q&A" skill in the preset skill library). If the user considers that none of the provided skills match the capability dimension to be evaluated, the user skips the selection, and the match fails.
(3) If any capability dimension remains unmatched, construct prompt words by combining the capability dimension with the evaluation scenario description entered by the user, and call the large language model (the preset skill generation model) to generate a new skill (for example, combining the user's evaluation scenario description and the unmatched capability "professional response" to generate the new skill "bank customer service response"). After the user confirms it, the new skill can be added to the preset skill library. If the user does not accept the generated skill, the user can add a skill manually.
(4) If new skills are added to the preset skill library through manual expansion by the user or through model-driven generation, a large language model (which can be the preset skill generation model) can be invoked to generate related recommended skills based on the new skills (e.g., skills related to the new skill "bank customer service response" include "customer complaint emotion recognition," "complaint handling," and "script generation"). The semantic similarity between the skill description of each recommended skill and the skill description of each skill in the preset skill library is calculated, and recommended skills whose similarity is below a threshold are presented to the user. If the user accepts a recommended skill, it is added to the library. This mechanism not only helps complete the skill set for the current evaluation but also continuously enriches the preset skill library and improves the efficiency of subsequent evaluations.
(5) After skill matching is completed, the system calls the large language model to assign a weight to each skill according to the user's natural language scenario description, and returns the result to the user for confirmation and adjustment, so that the weights reflect the relative importance of each skill in the scenario. The subsequent evaluation and scoring module calculates the weighted score of the model to be evaluated in the scenario based on these weights.
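The candidate filtering in item (2) and the skill weighting in item (5) of the matching mechanism can be illustrated with the sketch below. The function names and sample values are hypothetical, and normalizing the weights so that they sum to 1 is an assumption of this sketch rather than something the application states.

```python
def rank_candidates(similarities: dict[str, float], threshold: float) -> list[tuple[str, float]]:
    """Keep skills whose similarity reaches the threshold, highest similarity first."""
    kept = [(name, score) for name, score in similarities.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def normalize_weights(raw_weights: dict[str, float]) -> dict[str, float]:
    """Scale skill weights so they sum to 1 (an assumed convention)."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: weight / total for name, weight in raw_weights.items()}


# Hypothetical similarities for one capability dimension.
similarities = {"Intent recognition": 0.82, "Financial knowledge Q&A": 0.64, "Code review": 0.12}
print(rank_candidates(similarities, threshold=0.6))

# Hypothetical weights proposed by the large language model and adjusted by the user.
print(normalize_weights({"Intent recognition": 3, "Financial knowledge Q&A": 5, "Bank customer service response": 2}))
```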
[0044] This embodiment obtains the skill descriptions of each skill in a preset skill library; determines the semantic similarity between the skill descriptions and the dimension of the ability to be evaluated; and determines the evaluation skill corresponding to the dimension of the ability to be evaluated based on the semantic similarity. This embodiment maintains an expandable preset skill library, realizes the automated construction of scenario-based evaluation datasets, avoids the data pollution problem of public evaluation benchmarks, and achieves fitting to real application scenarios.
[0045] The third embodiment of this application builds on the above embodiments; for content that is the same as or similar to the above embodiments, refer to the above description, which is not repeated here. Referring to Figure 4, Figure 4 is a flowchart illustrating the third embodiment of the model evaluation method of this application. Step S40 further includes the following steps: Step S401: Generate an evaluation task based on the evaluation skills and the evaluation requirements, where the evaluation task includes input prompts and reference answers. It should be noted that multiple evaluation tasks, each containing an input prompt and a reference answer, can be generated from the evaluation requirements and the evaluation skills. The input prompts can be user questions related to the evaluation requirements.
[0046] In practical implementation, the model evaluation device also includes an automated evaluation set generation module. Based on the evaluation requirements and the matched skills, this module calls a generative model (an AI model, such as a large language model, that can create evaluation tasks from given evaluation requirements and specific skills) to automatically generate high-quality evaluation tasks and construct a customized evaluation set (a collection of test questions consisting of multiple evaluation tasks). To keep the evaluation set from being homogeneous, the module diversifies the data through methods such as synonym substitution, sentence transformation, and domain transfer. Each evaluation task consists of an input prompt and a reference answer. Taking the "financial knowledge Q&A" skill as an example, an example evaluation task is as follows:
{ "input": "Based on the financial questions raised by users and combined with relevant financial knowledge, generate accurate, clear, and comprehensive answers. Answers should be based on facts and avoid subjective assumptions. If specific data or regulations are involved, provide the source whenever possible. Answers should be easy to understand and suitable for ordinary users. Output schema (format) is as follows:\n{\"type\":\"object\",\"properties\":{\"answer\":{\"type\":\"string\",\"description\":\"Detailed answer to the question\"}}}\nUser question (automatically generated): If I only pay the minimum payment on my credit card, will the remaining balance accrue interest?", "output": { "answer": "If you pay only the minimum payment on a credit card, you do not receive the interest-free period: interest accrues on all transactions from the billing date. For the amount already paid, interest is calculated from the date of the purchase to the day before your payment. For the outstanding amount, interest is calculated from the date of the purchase to the day you actually pay it off. The interest rate is typically 0.05% per day." } }
Step S402: Schedule the model to be evaluated to execute the evaluation task and collect the evaluation response data. It should be noted that the evaluation response data may include the raw response data produced when the model to be evaluated executes the evaluation task. The raw response data includes the raw output of the model to be evaluated together with metadata such as response time and number of calls, providing a basis for subsequent analysis.
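Step S402 can be illustrated with a minimal sketch. It is not the implementation of this application: the names `EvalTask`, `EvalResponse`, `run_tasks`, and `stub_model` are hypothetical, and the model to be evaluated is represented by a plain Python callable rather than a call to a configured API interface.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalTask:
    """One evaluation task: an input prompt plus a reference answer."""
    prompt: str
    reference: str


@dataclass
class EvalResponse:
    """Raw response data for one task (hypothetical layout)."""
    task: EvalTask
    raw_output: str
    response_time_s: float
    call_count: int


def run_tasks(model: Callable[[str], str], tasks: List[EvalTask]) -> List[EvalResponse]:
    """Run each task through the model and record the output plus metadata."""
    responses = []
    for task in tasks:
        start = time.perf_counter()
        output = model(task.prompt)  # exactly one call per task in this sketch
        elapsed = time.perf_counter() - start
        responses.append(EvalResponse(task, output, elapsed, call_count=1))
    return responses


def stub_model(prompt: str) -> str:
    """Stand-in for the model to be evaluated; a real system would call its API."""
    return '{"answer": "Yes, interest accrues."}'


tasks = [EvalTask("Will the remaining balance accrue interest?", "Yes, interest accrues.")]
responses = run_tasks(stub_model, tasks)
```

Keeping the raw output and the metadata in one record lets later scoring steps compute both quality metrics and performance metrics from the same collected data.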
[0047] Step S403: Determine the evaluation result based on the evaluation response data.
[0048] It should be noted that determining the evaluation result based on the evaluation response data can be achieved by comparing the original output of the model to be evaluated with the reference answer in the evaluation task to obtain the accuracy of the model's response, and using the accuracy of the response as the evaluation result of the model.
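As one possible illustration of the comparison described above, the sketch below computes response accuracy by normalized exact match. The matching rule is an assumption made for this example; the application does not specify how the raw output is compared with the reference answer, and free-text answers would typically need a softer metric. The names `normalize` and `response_accuracy` are hypothetical.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def response_accuracy(outputs: List[str], references: List[str]) -> float:
    """Fraction of outputs equal to their reference answer after normalization."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    correct = sum(normalize(o) == normalize(r) for o, r in zip(outputs, references))
    return correct / len(outputs)


# One of the two outputs matches its reference once case and spacing are ignored.
accuracy = response_accuracy(
    ["Yes, interest accrues.", "No"],
    ["yes,  interest accrues.", "Yes"],
)
```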
[0049] Furthermore, in order to accurately determine the evaluation result of the model to be evaluated, step S403 may include: determining the evaluation score corresponding to each evaluation task based on the evaluation response data; determining the evaluation weight corresponding to each evaluation task based on the evaluation requirements; and determining the evaluation result of the model to be evaluated based on the evaluation scores and the evaluation weights.
[0050] It should be noted that the skill template contains evaluation indicators and their corresponding weights for each skill. Based on the evaluation response data and the reference answers in the evaluation task, the score for each evaluation indicator in the evaluation task can be determined. Then, based on the scores and weights of the evaluation indicators, the evaluation score for each evaluation task can be obtained. According to the evaluation requirements, corresponding evaluation weights can be assigned to each evaluation task. Based on the evaluation weights and scores for each evaluation task, the evaluation result of the model to be evaluated is determined.
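The two-level weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions: indicator scores are taken to lie in the range 0 to 1, the weights and scores are invented values, and `weighted_average` and `task_score` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def weighted_average(pairs: List[Tuple[float, float]]) -> float:
    """Weighted mean of (score, weight) pairs; weights need not sum to 1."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(score * weight for score, weight in pairs) / total_weight


def task_score(indicator_scores: Dict[str, float], indicator_weights: Dict[str, float]) -> float:
    """Combine the per-indicator scores of one task using the skill template weights."""
    return weighted_average(
        [(indicator_scores[name], indicator_weights[name]) for name in indicator_weights]
    )


# Hypothetical indicator weights, as they might appear in a skill template.
indicator_weights = {"accuracy": 0.6, "relevance": 0.4}

# Level 1: indicator scores -> evaluation score of each task.
task_a = task_score({"accuracy": 1.0, "relevance": 0.5}, indicator_weights)
task_b = task_score({"accuracy": 0.0, "relevance": 0.5}, indicator_weights)

# Level 2: task scores and task weights -> evaluation result of the model.
result = weighted_average([(task_a, 3.0), (task_b, 1.0)])
```

Dividing by the total weight means the task weights derived from the evaluation requirements do not have to be normalized in advance.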
[0051] In practical implementation, the model evaluation device also includes an evaluation and scoring module. This module has a built-in library of evaluation metrics, including scripts for calculating objective metrics (such as accuracy, precision, recall, BLEU, and ROUGE) and performance metrics (such as response speed), as well as evaluation rule templates for subjective metrics (such as relevance and coherence). The module automatically scores the raw response of the model to be evaluated using the evaluation metrics associated with each skill (the skill template contains the evaluation metrics and weights for each skill; see the skill template provided in Embodiment 1 for details). If an evaluation metric is an objective or performance metric, the corresponding script is called directly. If it is a subjective metric, an evaluation prompt is generated from the subjective metric evaluation rule template, and the judge model is then called to score that metric. The weighted average of the metric scores is the score for the skill, and the weighted average of all skill scores is the comprehensive score of the model to be evaluated in the user scenario. To evaluate the overall performance of the model across multiple scenarios, the user can run the complete evaluation process for each scenario in turn and configure a weight for each scenario; the weighted score across the scenarios is the final score. The subjective metric evaluation rule template can be a standardized evaluation framework or question set that turns abstract subjective evaluation dimensions (such as relevance and coherence) into specific, actionable judgment instructions or questions that the judge model can understand and execute. This keeps the evaluation of subjective metrics structured, consistent, and repeatable rather than arbitrary.
For example, for the coherence metric, the template rule could be: "Please judge whether the logic between the sentences in the following answer is coherent, and whether there are any contradictions or information jumps. Scoring standard: 1 point (completely incoherent) to 5 points (completely coherent)." The judge model can be an AI model used to perform subjective evaluations. After the system generates a specific evaluation prompt from the subjective metric evaluation rule template, it inputs the prompt, together with the raw response of the model to be evaluated and the reference answer in the evaluation task, into the judge model. The judge model then scores the subjective metric as the prompt requires, relying on its own understanding and judgment.
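The subjective scoring path can be sketched as follows: fill a rule template, send the resulting prompt to a judge model, and parse the score from its reply. Everything named here is an assumption for illustration: the template text extends the coherence example with placeholder fields, `stub_judge` stands in for a real judge model, and the reply format "Score: N" is invented.

```python
import re
from typing import Callable

# Hypothetical rule template for the coherence metric, based on the wording above.
COHERENCE_TEMPLATE = (
    "Please judge whether the logic between the sentences in the following answer "
    "is coherent, and whether there are any contradictions or information jumps.\n"
    "Scoring standard: 1 point (completely incoherent) to 5 points (completely coherent).\n"
    "Reference answer: {reference}\n"
    "Answer to evaluate: {response}\n"
    "Reply with a single integer from 1 to 5."
)


def build_judge_prompt(template: str, response: str, reference: str) -> str:
    """Turn the rule template into a concrete evaluation prompt."""
    return template.format(response=response, reference=reference)


def parse_score(judge_reply: str, low: int = 1, high: int = 5) -> int:
    """Take the first integer in the reply and check that it lies on the scale."""
    match = re.search(r"\d+", judge_reply)
    if match is None:
        raise ValueError("judge reply contains no score")
    score = int(match.group())
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}-{high}")
    return score


def score_subjective(judge: Callable[[str], str], template: str,
                     response: str, reference: str) -> int:
    """Build the prompt, call the judge model, and return the parsed score."""
    return parse_score(judge(build_judge_prompt(template, response, reference)))


def stub_judge(prompt: str) -> str:
    """Stand-in for a judge model; a real system would call an AI model here."""
    return "Score: 4"


coherence = score_subjective(
    stub_judge, COHERENCE_TEMPLATE,
    "Interest accrues daily.", "Interest accrues from the billing date.",
)
```

Validating the parsed score against the declared scale is one way to keep subjective scoring repeatable, since malformed judge replies are rejected instead of silently scored.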
[0052] The model evaluation device also includes a visualization evaluation report generation module. Based on the evaluation results output by the evaluation and scoring module, the visualization evaluation report generation module calls a drawing tool to display and compare, in the form of charts, the performance of the model to be evaluated on each skill, and calls a large language model to analyze the evaluation results, generate evaluation conclusions, and output an intuitive, easy-to-understand visualization evaluation report to the user.
[0053] In specific implementation, reference can be made to Figure 5, which is a schematic diagram of the overall process of Embodiment 3 of the model evaluation method of this application. The overall process of this embodiment can be as follows:
(1) The user inputs a natural language scenario description that explains the expected application scenario and the evaluation requirements;
(2) The user configures the API interface of the model under test (the model to be evaluated) and the size and number of rows of the evaluation dataset (evaluation set);
(3) The system (the model evaluation device) calls the scenario parsing module to parse the user input from step 1 into structured evaluation requirements, which include the application domain, target users, and capability dimensions;
(4) The system calls the skill matching module to match skills in the preset skill library and returns the list of matched skills to the user for confirmation. If any capability dimension is unmatched, the skill generation process (model-generated skills) is triggered; the user can use the generated skills directly or add skills manually. Once every capability dimension to be evaluated has a matched skill, the system assigns a weight to each skill and returns the list to the user for confirmation and adjustment. Skills added in this step are synchronized to the preset skill library, and related recommended skills are generated; recommended skills adopted by the user are also added to the skill library;
(5) Based on the structured evaluation requirements obtained in step 3 and the matched skills, the system calls the automated evaluation set generation module to generate a customized evaluation set for each skill;
(6) The system calls the automated evaluation execution module to run the test on the model under test and obtain its responses;
(7) The system calls the evaluation and scoring module to score the model responses from step 6 using the evaluation metrics corresponding to the skills;
(8) If other scenarios need to be evaluated, steps 1-7 are repeated to obtain the scoring results of all scenarios, and a weight is set for each scenario. The evaluation and scoring module then calculates the weighted score across all scenarios;
(9) The system calls the visualization evaluation report generation module to generate an evaluation report from the scoring results output by the above steps.
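The matching logic in step (4) can be sketched as below. This is only an illustration under loud assumptions: word-set Jaccard overlap is used as a crude stand-in for semantic similarity (a real system would more likely compare embeddings), and the skill names, descriptions, and the threshold of 0.5 are invented. Dimensions whose best similarity falls below the threshold are returned as unmatched, which is the point at which skill generation would be triggered.

```python
from typing import Dict, List, Optional, Tuple


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a stand-in for semantic similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def match_skills(dimensions: List[str], skill_library: Dict[str, str],
                 threshold: float) -> Tuple[Dict[str, str], List[str]]:
    """Map each capability dimension to its best skill, or mark it unmatched."""
    matched: Dict[str, str] = {}
    unmatched: List[str] = []
    for dimension in dimensions:
        best_name: Optional[str] = None
        best_score = 0.0
        for name, description in skill_library.items():
            score = token_similarity(dimension, description)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score >= threshold:
            matched[dimension] = best_name
        else:
            unmatched.append(dimension)  # would trigger skill generation
    return matched, unmatched


# Hypothetical preset skill library: skill name -> skill description.
library = {
    "financial_knowledge_qa": "answer financial knowledge questions",
    "text_summarization": "summarize long text",
}
matched, unmatched = match_skills(
    ["financial knowledge questions", "image captioning"], library, threshold=0.5
)
```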
[0054] This embodiment generates an evaluation task based on the evaluation skills and the evaluation requirements, where the evaluation task includes an input prompt and a reference answer; schedules the model to be evaluated to execute the evaluation task and collects the evaluation response data; and determines the evaluation result based on the evaluation response data. In this way, model evaluation moves beyond general capabilities to scenario-specific competence, directly measuring the model's efficiency and effectiveness in solving real-world problems and reflecting its practical application value more realistically.
[0055] It should be noted that the above examples are only for understanding this application and do not constitute a limitation on the model evaluation method of this application. Any simple modifications based on this technical concept are within the protection scope of this application.
[0056] This application also provides a model evaluation apparatus. Referring to Figure 6, the model evaluation apparatus includes: a parsing module 10, used to parse the evaluation scenario description input by the user and obtain the parsing result; a determination module 20, used to determine the evaluation requirements based on the parsing result, and to determine the capability dimensions to be evaluated based on the evaluation requirements; a matching module 30, used to determine the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and an evaluation module 40, used to evaluate the model to be evaluated based on the evaluation skills and the evaluation requirements and obtain the evaluation result.
[0057] This embodiment parses the user-inputted evaluation scenario description to obtain the parsing result; determines the evaluation requirements based on the parsing result, and determines the capability dimensions to be evaluated based on the evaluation requirements; determines the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and evaluates the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain the evaluation result. Since this embodiment evaluates the model performance in a specific scenario based on the user-inputted evaluation scenario description, compared to the existing unified model performance evaluation system, the above method in this embodiment can achieve a leap in model evaluation dimensions from general capabilities to scenario competence, directly measuring the efficiency and effectiveness of the model in solving practical problems, and more realistically reflecting the model's practical application value.
[0058] The model evaluation apparatus provided in this application, employing the model evaluation method described in the above embodiments, can solve the technical problem of the gap between the current model evaluation system and real business scenarios. Compared with the prior art, the beneficial effects of the model evaluation apparatus provided in this application are the same as those of the model evaluation method provided in the above embodiments, and the other technical features of the model evaluation apparatus are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.
[0059] This application provides a model evaluation device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the model evaluation method in Embodiment 1 above.
[0060] Referring now to Figure 7, which shows a structural schematic suitable for implementing the model evaluation device of the embodiments of this application. The model evaluation device in the embodiments of this application may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), PMPs (Portable Media Players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The model evaluation device shown in Figure 7 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.
[0061] As shown in Figure 7, the model evaluation device may include a processing unit 1001 (e.g., a central processing unit, a graphics processing unit, etc.), which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the model evaluation device. The processing unit 1001, ROM 1002, and RAM 1004 are interconnected via a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. Typically, the following devices can be connected to the I/O interface 1006: input devices 1007 including, for example, a touchscreen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; output devices 1008 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 1003 including, for example, magnetic tape, hard disk, etc.; and a communication device 1009. The communication device 1009 allows the model evaluation device to exchange data with other devices over a wireless or wired connection. Although the figure shows a model evaluation device with various components, it should be understood that not all of the components shown need to be implemented or present; more or fewer components may be implemented instead.
[0062] Specifically, according to the embodiments disclosed in this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 1009, or installed from the storage device 1003, or installed from the ROM 1002. When the computer program is executed by the processing unit 1001, it performs the functions defined in the methods of the embodiments disclosed in this application.
[0063] The model evaluation device provided in this application, employing the model evaluation method described in the above embodiments, can solve the technical problem of the gap between the current model evaluation system and real business scenarios. Compared with the prior art, the beneficial effects of the model evaluation device provided in this application are the same as those of the model evaluation method provided in the above embodiments, and other technical features in this model evaluation device are the same as those disclosed in the previous embodiment method, and will not be repeated here.
[0064] It should be understood that the various parts disclosed in this application can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more embodiments or examples.
[0065] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0066] This application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon, which are used to execute the model evaluation method in the above embodiments.
[0067] The computer-readable storage medium provided in this application may be, for example, a USB flash drive, and more generally may be, without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this embodiment, the computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (Radio Frequency), etc., or any suitable combination thereof.
[0068] The aforementioned computer-readable storage medium may be included in the model evaluation device; or it may exist independently and not be assembled into the model evaluation device.
[0069] Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof. These programming languages include object-oriented programming languages—such as Python, Java, Smalltalk, and C++—and conventional procedural programming languages—such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a Local Area Network (LAN) or a Wide Area Network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0070] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0071] The modules described in the embodiments of this application can be implemented in software or hardware. The name of a module does not, in itself, limit the functionality of that module.
[0072] The readable storage medium provided in this application is a computer-readable storage medium that stores computer-readable program instructions (i.e., computer programs) for executing the above-described model evaluation method. This addresses the technical problem of the gap between the current model evaluation system and real-world business scenarios. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided in this application are the same as those of the model evaluation method provided in the above embodiments, and will not be elaborated upon here.
[0073] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the model evaluation method described above.
[0074] The computer program product provided in this application can solve the technical problem of the gap between the current model evaluation system and real business scenarios. Compared with the prior art, the beneficial effects of the computer program product provided in this application are the same as those of the model evaluation method provided in the above embodiments, and will not be repeated here.
[0075] The above description is only a part of the embodiments of this application and does not limit the scope of protection of this application. All equivalent structural transformations made under the technical concept of this application and using the content of this application specification and drawings, or direct / indirect applications in other related technical fields, are included in the scope of protection of this application.
Claims
1. A model evaluation method, characterized in that the model evaluation method includes the following steps: parsing the evaluation scenario description input by the user to obtain a parsing result; determining the evaluation requirements based on the parsing result, and determining the capability dimensions to be evaluated based on the evaluation requirements; determining the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and evaluating the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain an evaluation result.
2. The model evaluation method as described in claim 1, characterized in that the step of determining the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library includes: retrieving the skill description of each skill from the preset skill library; determining the semantic similarity between each skill description and the capability dimension to be evaluated; and determining the evaluation skills corresponding to the capability dimension to be evaluated based on the semantic similarity.
3. The model evaluation method as described in claim 2, characterized in that the step of determining the evaluation skills corresponding to the capability dimension to be evaluated based on the semantic similarity includes: if the semantic similarity is less than a preset similarity threshold, determining a skill generation prompt based on the capability dimension to be evaluated and the evaluation scenario description; invoking a preset skill generation model based on the skill generation prompt to generate a target skill; receiving feedback information from the user on the target skill; and determining the evaluation skills corresponding to the capability dimension to be evaluated based on the feedback information.
4. The model evaluation method as described in claim 3, characterized in that the step of determining the evaluation skills corresponding to the capability dimension to be evaluated based on the feedback information includes: if the feedback information indicates a custom skill, displaying a skill template; obtaining the user's custom skill based on the skill template; and determining the evaluation skills corresponding to the capability dimension to be evaluated based on the custom skill.
5. The model evaluation method as described in claim 4, characterized in that, after determining the evaluation skills corresponding to the capability dimension to be evaluated based on the custom skill, the method further includes: expanding the custom skill to obtain an expanded skill; determining the skill similarity between the expanded skill and each skill in the preset skill library; determining a skill to be added to the library based on the skill similarity; and if the skill to be added meets a preset library entry condition, adding the skill to the preset skill library.
6. The model evaluation method according to any one of claims 1-5, characterized in that the step of evaluating the model to be evaluated based on the evaluation skills and the evaluation requirements to obtain an evaluation result includes: generating an evaluation task based on the evaluation skills and the evaluation requirements, wherein the evaluation task includes an input prompt and a reference answer; scheduling the model to be evaluated to execute the evaluation task and collecting the evaluation response data; and determining the evaluation result based on the evaluation response data.
7. The model evaluation method as described in claim 6, characterized in that the step of determining the evaluation result based on the evaluation response data includes: determining the evaluation score corresponding to the evaluation task based on the evaluation response data; determining the evaluation weight corresponding to the evaluation task based on the evaluation requirements; and determining the evaluation result of the model to be evaluated based on the evaluation score and the evaluation weight.
8. A model evaluation apparatus, characterized in that the model evaluation apparatus includes: a parsing module, used to parse the evaluation scenario description input by the user and obtain a parsing result; a determination module, used to determine the evaluation requirements based on the parsing result, and to determine the capability dimensions to be evaluated based on the evaluation requirements; a matching module, used to determine the evaluation skills corresponding to the capability dimensions to be evaluated based on a preset skill library; and an evaluation module, used to evaluate the model to be evaluated based on the evaluation skills and the evaluation requirements and obtain an evaluation result.
9. A model evaluation device, characterized in that, The device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the model evaluation method as described in any one of claims 1 to 7.
10. A storage medium, characterized in that, The storage medium is a computer-readable storage medium, and a computer program is stored on the storage medium. When the computer program is executed by a processor, it implements the steps of the model evaluation method as described in any one of claims 1 to 7.