Artificial intelligence model evaluation optimization method and apparatus based on task network

By constructing a task network and evaluating comprehensive performance indicators, the multi-task evaluation process of artificial intelligence models is optimized, solving the problems of evaluation limitations and high resource consumption in existing technologies, and achieving more efficient and accurate evaluation.

CN122309316A · Pending · Publication Date: 2026-06-30 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-28
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing general AI model evaluation methods have limitations and high computational resource consumption in multi-task evaluation scenarios, making it difficult to reflect the model's comprehensive intelligent performance in real-world application environments, and the evaluation efficiency is low.

Method used

The task similarity between multiple inference tasks of the target artificial intelligence model is obtained and used to construct a task network. The comprehensive performance index of the model is then evaluated on this network and, together with the network structure, used to generate a task evaluation execution strategy that optimizes the evaluation execution process of the model across multiple inference tasks.

Benefits of technology

It improves the evaluation accuracy of the model in multi-task evaluation scenarios, reduces the consumption of computing resources, improves the evaluation efficiency, and systematically describes the comprehensive performance of the model in multi-task evaluation scenarios.


Figure CN122309316A_ABST

Abstract

This application discloses a method and apparatus for evaluating and optimizing artificial intelligence models based on task networks, relating to the field of computer technology. This method constructs a task network for a target artificial intelligence model by leveraging the task similarity between different inference tasks. It then evaluates the comprehensive performance indicators of the target artificial intelligence model across multiple inference tasks based on the basic performance indicators of the inference tasks corresponding to each task node in the task network. This improves upon the limitations of single-task evaluation methods, enhances the accuracy of model evaluation, and generates task evaluation execution strategies for the artificial intelligence model under evaluation across multiple inference tasks using the comprehensive performance indicators and network structure parameters of the task network. This optimizes the control of the model's evaluation execution process across multiple inference tasks, thereby effectively reducing the computational load of the model in multi-task evaluation scenarios while ensuring the model's evaluation effectiveness, and improving the overall evaluation efficiency of the model.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a method and apparatus for evaluating and optimizing artificial intelligence models based on task networks.

Background Technology

[0002] In recent years, with the rapid development of general artificial intelligence models such as Large Language Models (LLMs), significant progress has been made in various independent reasoning tasks, including natural language understanding, code generation, and mathematical reasoning, demonstrating strong single-task adaptability and performance ceilings. However, compared with the speed of model capability evolution, the development of model evaluation systems has lagged significantly, still mainly using a single-point evaluation paradigm centered on fixed datasets and a single metric for model evaluation.

[0003] In related technologies, the evaluation methods for general artificial intelligence models usually treat different inference tasks as independent evaluation objects. The model performance is measured in isolation by accuracy, score or ranking on predefined tasks or datasets. This type of method is effective in characterizing the model's performance in specific task scenarios, but its limitations are becoming increasingly apparent. Moreover, the evaluation process of general artificial intelligence models involves a large number of inference tasks and high computational resource consumption, resulting in low overall evaluation efficiency.

Summary of the Invention

[0004] This application provides a method and apparatus for evaluating and optimizing artificial intelligence models based on task networks, in order to at least solve the problems of limitations of general artificial intelligence models in multi-task evaluation scenarios, high consumption of evaluation computing resources, and low overall evaluation efficiency in related technologies.

[0005] This application provides a method for evaluating and optimizing artificial intelligence models based on task networks, including: obtaining the task similarity between multiple inference tasks of the target artificial intelligence model; constructing, based on the task similarity among the multiple inference tasks, a task network for the target artificial intelligence model, wherein each task node in the task network corresponds to a separate inference task; evaluating the comprehensive performance index of the target artificial intelligence model based on the basic performance index of the inference task corresponding to each task node in the task network; and determining, based on the comprehensive performance index of the target artificial intelligence model and the network structure parameters of the task network, a task evaluation execution strategy for the artificial intelligence model to be evaluated, wherein the task evaluation execution strategy is used to control the evaluation execution process of the artificial intelligence model to be evaluated for multiple inference tasks.

[0006] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for implementing the steps of any of the above-described task network-based artificial intelligence model evaluation and optimization methods when executing the computer program.

[0007] According to the task network-based artificial intelligence model evaluation and optimization method of this application, a task network of the target artificial intelligence model is constructed using the task similarity between different inference tasks. Based on the constructed task network, the comprehensive performance index of the target artificial intelligence model on multiple inference tasks is evaluated according to the basic performance index of the inference task corresponding to each task node in the task network. This overcomes the limitations of single-task model evaluation methods and systematically characterizes the comprehensive performance of the model in multi-task evaluation scenarios from an overall perspective, thereby improving the accuracy of model evaluation. Furthermore, the comprehensive performance index obtained from evaluating the target artificial intelligence model on multiple inference tasks and the network structure parameters of the task network are used to generate a task evaluation execution strategy for the artificial intelligence model to be evaluated, which optimizes and controls its evaluation execution process on multiple inference tasks. Thus, while preserving the evaluation effectiveness for the artificial intelligence model to be evaluated, the method reduces the inference computation required in multi-task evaluation scenarios, reduces the consumption of computing resources, and improves the overall evaluation efficiency of the model.

Attached Figure Description

[0008] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0009] Figure 1 is a schematic diagram illustrating one application scenario in an embodiment of this application;
Figure 2 is a flowchart illustrating an artificial intelligence model evaluation and optimization method based on task networks, provided in an embodiment of this application;
Figure 3 is a flowchart illustrating a method for constructing a task network according to an embodiment of this application;
Figure 4 is a flowchart illustrating a comprehensive performance evaluation method for a target artificial intelligence model in an embodiment of this application;
Figure 5 is a flowchart illustrating another method for comprehensively evaluating the performance of a target artificial intelligence model in an embodiment of this application;
Figure 6 is a flowchart illustrating another method for comprehensively evaluating the performance of a target artificial intelligence model in an embodiment of this application;
Figure 7 is a flowchart illustrating a method for generating a task evaluation and execution strategy in an embodiment of this application;
Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0010] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.

[0011] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.

[0012] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0013] Figure 1 is a schematic diagram of one application scenario in an embodiment of this application. As an optional application scenario, as shown in Figure 1, an artificial intelligence model evaluation and optimization system may include at least one terminal device and at least one server. The example in Figure 1 includes a computer 101, a mobile terminal 102, and a server 103; terminal devices such as the computer 101 and the mobile terminal 102 are connected to the server 103 through a network 110.

[0014] Specifically, the terminal device can be a smartphone, tablet, laptop, PDA, desktop computer, game console, smart TV, smart wearable device, in-vehicle terminal, VR (Virtual Reality) device, AR (Augmented Reality) device, etc. Server 103 can be a standalone physical server, a server cluster, a distributed system, or a cloud server providing cloud services. Network 110 can be a wired or wireless network, examples of which include, but are not limited to, the Internet, corporate intranet, local area network, wide area network, mobile communication network, and combinations thereof.

[0015] In related technologies, the evaluation methods for general artificial intelligence models usually treat different reasoning tasks as independent evaluation objects. The model performance is measured in isolation by accuracy, score or ranking on predefined tasks or datasets. This type of method is effective in characterizing the model's performance in specific task scenarios, but its limitations are becoming increasingly apparent.

[0016] On the one hand, evaluation methods driven by single tasks and static datasets are insufficient to reflect the comprehensive intelligent performance of models in real-world application environments. Real-world intelligent needs often exhibit characteristics of highly coupled tasks, continuous information flow, and dynamic evolution of decisions. Models need to perform knowledge transfer, reasoning connections, and result coordination across multiple related tasks, a process that the evaluation systems of relevant technologies cannot characterize. On the other hand, existing research shows that a model's high score on a particular task category does not necessarily imply stable cross-task generalization ability and reasoning consistency. For example, a model may achieve excellent results in reading comprehension or fact retrieval tasks, but its performance may significantly degrade in problems involving multi-step logical reasoning, multi-domain knowledge fusion, or task-chain decision-making.

[0017] Furthermore, from the perspective of system execution, the evaluation methods of related technologies in multi-task evaluation scenarios typically require the model to perform inference operations on a large number of evaluation tasks one by one. Since there are often strong semantic similarities or functional overlaps between different evaluation tasks, the model needs to repeatedly execute a large number of similar inference tasks, resulting in a huge number of inference tasks, excessive evaluation computation, high consumption of computing resources, and low overall evaluation efficiency.

[0018] Therefore, this application proposes a task-network-based artificial intelligence model evaluation and optimization method and electronic device, which aim to address at least one of the technical problems existing in related technologies.

[0019] This application provides a method for evaluating and optimizing an artificial intelligence model based on a task network, which can be applied to the aforementioned server. Figure 2 is a flowchart of the task-network-based artificial intelligence model evaluation and optimization method provided in this embodiment. As shown in Figure 2, the method includes, but is not limited to, the following steps S201 to S204.

[0020] Step S201: Obtain the task similarity between multiple inference tasks of the target artificial intelligence model.

[0021] The target artificial intelligence model is selected as an advanced or ideal Artificial General Intelligence (AGI) model, which supports task reasoning in different technical fields such as natural language processing, computer vision, speech processing, and multimodal fusion.

[0022] For inference tasks used to evaluate artificial intelligence models, representative tasks can be selected from different technical fields such as natural language processing, computer vision, speech processing, and multimodal fusion, as well as different task types such as classification, regression, generation, reasoning, and decision-making, to serve as inference tasks to be evaluated. The selected tasks should cover different difficulty levels (such as from simple fact-finding to complex multi-step reasoning) and different data scales (such as large-scale datasets and low-resource tasks) to ensure the comprehensiveness and representativeness of model evaluation.

[0023] For example, industry-recognized, publicly released, and widely used benchmark evaluation tasks and datasets for general-purpose artificial intelligence models can be used as the inference tasks and datasets for evaluating artificial intelligence models, including but not limited to the following. For natural language processing: General Language Understanding Evaluation (GLUE), SuperGLUE, Massive Multitask Language Understanding (MMLU), Beyond the Imitation Game Benchmark (BIG-Bench), Stanford Question Answering Dataset (SQuAD), Conversational Question Answering (CoQA), Commonsense Question Answering (CommonsenseQA), Extreme Summarization (XSum), Workshop on Statistical Machine Translation (WMT), and the code understanding and generation benchmark CodeXGLUE. For computer vision: the image classification task and dataset ImageNet, Common Objects in Context (COCO), the object detection benchmark Pascal VOC, the celebrity facial attribute dataset CelebA, and the scene parsing dataset ADE20K. For speech processing: speech recognition tasks and datasets such as LibriSpeech, Common Voice, and Wall Street Journal (WSJ). For multimodal fusion: visual question answering (VQA), graph-based question answering (GQA), and image-text matching (Flickr30k).

[0024] After identifying multiple inference tasks for evaluating the artificial intelligence model, a task set $T = \{T_1, T_2, \ldots, T_N\}$ is constructed, where $N$ is the number of determined inference tasks. Simultaneously, standardized task metadata is constructed and stored for each selected inference task. This task metadata describes the key attributes of the inference task, and the metadata fields include, but are not limited to: task_name: a unique identifier for the task; task_description: a textual description of the task's objectives and content; domain: the technical field to which the task belongs; modality: the data modality type involved in the task; task_type: the specific type of the task; source_paper_url: a link to the paper or official documentation from which the task originates; and evaluation_metrics: core performance metrics used to evaluate the task.
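The metadata record described above can be sketched as a small data structure. This is a minimal illustration only: the field names come from the description, while the class name `TaskMetadata`, the helper `build_task_set`, the sample task values, and the placeholder URLs are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskMetadata:
    """Standardized metadata for one inference task (field names from the description)."""
    task_name: str                       # unique identifier for the task
    task_description: str                # textual description of objectives and content
    domain: str                          # technical field the task belongs to
    modality: str                        # data modality type involved in the task
    task_type: str                       # specific type of the task
    source_paper_url: str                # link to the originating paper or documentation
    evaluation_metrics: tuple[str, ...] = ()  # core metrics used to evaluate the task


def build_task_set(records: list[TaskMetadata]) -> dict[str, TaskMetadata]:
    """Build the task set keyed by task_name, rejecting duplicate identifiers."""
    task_set: dict[str, TaskMetadata] = {}
    for record in records:
        if record.task_name in task_set:
            raise ValueError(f"duplicate task_name: {record.task_name}")
        task_set[record.task_name] = record
    return task_set


# Hypothetical sample entries; the URLs are placeholders, not real sources.
tasks = build_task_set([
    TaskMetadata("squad", "Extractive question answering over passages.",
                 "natural language processing", "text", "question answering",
                 "https://example.org/placeholder-squad", ("f1", "exact_match")),
    TaskMetadata("imagenet", "Image classification over object categories.",
                 "computer vision", "image", "classification",
                 "https://example.org/placeholder-imagenet", ("top1_accuracy",)),
])
```

Keying the set by `task_name` enforces the uniqueness that the description requires of the identifier.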

[0025] In step S201, the task description text of each inference task is extracted from the task metadata of each inference task, and the task similarity between any two inference tasks is calculated using the task description text of each inference task.

[0026] Step S202: Based on the task similarity between multiple inference tasks, construct a task network for the target artificial intelligence model, where each task node in the task network corresponds to a separate inference task.

[0027] In this embodiment of the application, the task network is a graph structure that reflects the relationships between inference tasks. In step S202, each inference task is taken as a task node, and the connections between task nodes are constructed using the task similarity between the inference tasks, thereby constructing the task network of the target artificial intelligence model.

[0028] Step S203: Evaluate the comprehensive performance index of the target artificial intelligence model based on the basic performance index of the inference task corresponding to each task node in the task network.

[0029] The basic performance indicators describe the model's performance on a single inference task; they are obtained by having the target AI model perform inference on the task corresponding to each task node in the task network.

[0030] Specifically, the task set T is traversed, and at each task node the target artificial intelligence model performs zero-shot or few-shot inference tests on the corresponding inference task to obtain the raw performance results of the model on that task, from which the basic performance indicators of the inference task are derived.

[0031] Based on the attributes and evaluation criteria of the different inference tasks, corresponding basic performance indicators are selected for each task node, including but not limited to accuracy, precision, recall, F1 score, and Top-K accuracy.

[0032] Since the dimensions and value ranges of the basic performance indicators of different inference tasks may be different, after obtaining the basic performance indicators of the model on each inference task, the basic performance indicators of each inference task can be standardized and mapped to comparable intervals so that they can be compared between different task types, ensuring the comparability of performance indicators across tasks.
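The description does not state which standardization is used. As one possible reading, the sketch below applies min-max scaling to map each metric onto the interval [0, 1]. The function name `normalize_metric`, the per-metric bounds, and the `higher_is_better` flag (for error-type metrics) are all assumptions made for illustration.

```python
def normalize_metric(value: float, lower: float, upper: float,
                     higher_is_better: bool = True) -> float:
    """Map a raw metric onto [0, 1] by min-max scaling (an assumed scheme)."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    # Clip to the declared range so out-of-range values stay within [0, 1].
    clipped = min(max(value, lower), upper)
    scaled = (clipped - lower) / (upper - lower)
    # For error-type metrics, invert so that larger always means better.
    return scaled if higher_is_better else 1.0 - scaled


# Hypothetical raw results: a percentage accuracy and an error rate.
accuracy_score = normalize_metric(85.0, 0.0, 100.0)
error_rate_score = normalize_metric(0.25, 0.0, 1.0, higher_is_better=False)
```

After this mapping, a percentage accuracy and an error rate can be compared on the same scale.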

[0033] After standardizing the basic performance indicators of each inference task, in step S203, the comprehensive performance indicators of the target artificial intelligence model are evaluated based on the basic performance indicators of the inference tasks corresponding to each task node in the task network, so as to evaluate the overall performance of the model in a multi-task evaluation scenario.

[0034] Step S204: Based on the comprehensive performance index of the target artificial intelligence model and the network structure parameters of the task network, determine the task evaluation execution strategy of the artificial intelligence model to be evaluated. The task evaluation execution strategy is used to control the evaluation execution process of the artificial intelligence model to be evaluated for multiple inference tasks.

[0035] The comprehensive performance indicators obtained by evaluating the target AI model on multiple inference tasks, together with the network structure parameters of the task network, are used to generate a task evaluation execution strategy for the AI model to be evaluated on multiple inference tasks. When the AI model to be evaluated is actually evaluated on multiple inference tasks, the generated strategy can be used to control its evaluation execution process. For example, the evaluation execution order, whether a task participates in the evaluation, and the frequency of evaluation execution can be controlled.

[0036] According to the task network-based artificial intelligence model evaluation and optimization method provided in this application, a task network of the target artificial intelligence model is constructed using the task similarity between different inference tasks. Based on the constructed task network, the comprehensive performance index of the target artificial intelligence model on multiple inference tasks is evaluated according to the basic performance index of the inference task corresponding to each task node in the task network. This overcomes the limitations of single-task model evaluation methods and systematically characterizes the comprehensive performance of the model in a multi-task evaluation scenario from an overall perspective, thereby improving the accuracy of model evaluation. Furthermore, the comprehensive performance index obtained from evaluating the target artificial intelligence model on multiple inference tasks and the network structure parameters of the task network are used to generate a task evaluation execution strategy for the artificial intelligence model to be evaluated, which optimizes and controls its evaluation execution process on multiple inference tasks. Thus, while preserving the evaluation effectiveness for the artificial intelligence model to be evaluated, the method reduces the inference computation required in a multi-task evaluation scenario, reduces the consumption of computing resources, and improves the overall evaluation efficiency of the model.

[0037] Figure 3 is a flowchart illustrating a method for constructing a task network according to an embodiment of this application. In some embodiments, as shown in Figure 3, constructing the task network of the target artificial intelligence model based on the task similarity between multiple inference tasks in step S202 above may further include:
Step S301: Calculate the task similarity between any two reasoning tasks based on the task description text of each reasoning task.

[0038] For any reasoning task $T_i \in T$, the task name, task description, domain, data modality type, task type, and source information are extracted from the task metadata of $T_i$ to form the complete task description text of $T_i$. This text is then cleaned and preprocessed, including removing HTML tags, special characters, and redundant spaces, to obtain the preprocessed task description text $S_i$. Each reasoning task in the task set $T$ is traversed to construct the task description text set $S = \{S_1, S_2, \ldots, S_N\}$.
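The cleaning step can be sketched with the standard library. The helper names `clean_text` and `build_description_text`, the dictionary layout of the metadata, and the choice of which characters count as "special" are assumptions; the description only names the three kinds of residue to remove.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")               # HTML tags such as <p> or </b>
_SPECIAL_RE = re.compile(r"[^\w\s.,;:()\-/]")  # assumed definition of "special characters"
_SPACE_RE = re.compile(r"\s+")                 # runs of whitespace

# Metadata fields concatenated into the description text (assumed order).
_FIELDS = ("task_name", "task_description", "domain",
           "modality", "task_type", "source_paper_url")


def clean_text(raw: str) -> str:
    """Remove HTML tags, special characters, and redundant spaces."""
    text = html.unescape(raw)            # decode entities such as &amp;
    text = _TAG_RE.sub(" ", text)
    text = _SPECIAL_RE.sub(" ", text)
    return _SPACE_RE.sub(" ", text).strip()


def build_description_text(metadata: dict[str, str]) -> str:
    """Concatenate the metadata fields of one task and clean the result."""
    return clean_text(" ".join(metadata.get(name, "") for name in _FIELDS))


cleaned = clean_text("<p>Answer   questions &amp; cite ©sources</p>")
```

Entities are decoded before tag stripping so that escaped markup is removed as well.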

[0039] In step S301, a text representation model, such as Sentence-BERT or an embedding-type language model, is used to map the task description text $S_i \in S$ corresponding to each inference task to a fixed-dimensional vector representation $e_i$, forming the embedding vector set $E = \{e_1, e_2, \ldots, e_N\}$ of the task description texts.

[0040] For any two reasoning tasks $(T_i, T_j)$ in the task set $T$, the cosine similarity between the embedding vector $e_i$ corresponding to $T_i$ and the embedding vector $e_j$ corresponding to $T_j$ is calculated and used as the task similarity between the two tasks. The task similarity between any two reasoning tasks $(T_i, T_j)$ can be expressed as:

$$C(T_i, T_j) = \frac{e_i \cdot e_j}{\lVert e_i \rVert \, \lVert e_j \rVert}$$

[0041] where $C(T_i, T_j)$ represents the task similarity between any two reasoning tasks $(T_i, T_j)$, $\lVert e_i \rVert$ represents the norm of the embedding vector $e_i$ corresponding to reasoning task $T_i$, and $\lVert e_j \rVert$ represents the norm of the embedding vector $e_j$ corresponding to reasoning task $T_j$.
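The cosine similarity between task embeddings can be computed as in the following sketch. The function names and the three-dimensional toy vectors are assumptions made for illustration; real embeddings would come from a text representation model such as Sentence-BERT and have many more dimensions.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pairwise_similarity(embeddings: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Task similarity for every unordered pair, keyed by sorted task ids."""
    return {
        (i, j): cosine_similarity(embeddings[i], embeddings[j])
        for i, j in combinations(sorted(embeddings), 2)
    }


# Made-up toy embeddings, for illustration only.
embeddings = {
    "qa": [0.9, 0.1, 0.0],
    "reading": [0.8, 0.2, 0.1],
    "asr": [0.0, 0.1, 0.9],
}
similarity = pairwise_similarity(embeddings)
```

Each unordered pair is stored once, because the similarity is symmetric.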

[0042] Step S302: Using each reasoning task as a task node, construct an edge structure between any two task nodes based on the task similarity between any two reasoning tasks.

[0043] Each inference task in the task set is defined as a task node in the task network. Each task node contains a unique identifier and task metadata of the corresponding inference task. An edge structure between any two task nodes is constructed based on the task similarity between any two inference tasks.

[0044] In some embodiments, constructing an edge structure between any two task nodes based on the task similarity between any two inference tasks includes: constructing an edge structure between the task nodes of any two inference tasks in response to the task similarity between any two inference tasks being greater than or equal to a first preset similarity threshold; and keeping the task nodes of any two inference tasks disconnected in response to the task similarity between any two inference tasks being less than the first preset similarity threshold.

[0045] A first preset similarity threshold $\theta$ is set. For any two reasoning tasks $(T_i, T_j)$, when the task similarity $C(T_i, T_j) \geq \theta$, an undirected edge structure is established between the task node corresponding to $T_i$ and the task node corresponding to $T_j$, and the weight of this edge structure is set to the task similarity $C(T_i, T_j)$, which represents the strength of the association between the two tasks. When $C(T_i, T_j) < \theta$, no edge structure is established between the task node corresponding to $T_i$ and the task node corresponding to $T_j$, and they remain disconnected.

[0046] Step S303: Construct the task network of the target artificial intelligence model based on the task nodes and edge structure corresponding to each inference task.

[0047] In some embodiments, when the first preset similarity threshold θ=0, the task network is a fully connected network. The sparsity of the task network can be controlled by adjusting the first preset similarity threshold θ to simulate task association structures of different granularities.
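The thresholded edge construction of steps S302 and S303 can be sketched with a plain adjacency dictionary. The function name `build_task_network`, the pair-keyed similarity dictionary, and the sample values are assumptions made for illustration.

```python
def build_task_network(task_ids: list[str],
                       similarity: dict[tuple[str, str], float],
                       theta: float) -> dict[str, dict[str, float]]:
    """Undirected weighted graph: an edge exists where similarity >= theta."""
    adjacency: dict[str, dict[str, float]] = {task: {} for task in task_ids}
    for (i, j), score in similarity.items():
        if score >= theta:
            # The edge weight is the task similarity; store both directions.
            adjacency[i][j] = score
            adjacency[j][i] = score
    return adjacency


# Hypothetical similarity values for three tasks.
sims = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.6}
sparse_network = build_task_network(["a", "b", "c"], sims, theta=0.5)
dense_network = build_task_network(["a", "b", "c"], sims, theta=0.0)
```

With these non-negative sample similarities, a threshold of 0 connects every pair, and raising the threshold to 0.5 drops the weakest edge.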

[0048] By constructing a task network for the target AI model, the relationships between multiple inference tasks of the target AI model are explicitly modeled, so as to conduct an overall performance evaluation of the target AI model in a multi-task evaluation scenario and improve the limitations of the single-task evaluation method.

[0049] Figure 4 is a flowchart illustrating a comprehensive performance evaluation method for a target artificial intelligence model according to an embodiment of this application. In some embodiments, as shown in Figure 4, the comprehensive performance indicators include generality indicators, which measure the range of tasks the model can stably complete in the task network and its overall performance level. In step S203 above, evaluating the comprehensive performance indicators of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network may further include:
Step S401: Based on the basic performance indicators of the inference tasks corresponding to each task node in the task network, evaluate the task coverage of the target artificial intelligence model. The task coverage is the percentage of task nodes whose basic performance indicators reach the performance threshold.

[0050] A performance threshold for the basic performance indicators is preset. In step S401, the number $N_1$ of task nodes in the task network whose basic performance indicator reaches the performance threshold is counted, and the ratio of $N_1$ to the total number $N$ of task nodes in the task network is calculated to obtain the task coverage rate of the target artificial intelligence model. The task coverage rate can be expressed as: $\mathrm{TCR} = N_1 / N$, where $N$ is the total number of task nodes in the task network, $N_1$ is the number of task nodes in the task network whose basic performance index reaches the performance threshold, and $\mathrm{TCR}$ is the task coverage rate, which reflects the range of tasks that the model can effectively handle.
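The task coverage rate is a simple ratio, sketched below. The function name, the sample scores, and the threshold of 0.6 are assumptions; "reaches the threshold" is read here as greater than or equal to it.

```python
def task_coverage_rate(scores: dict[str, float], threshold: float) -> float:
    """Fraction of task nodes whose basic performance reaches the threshold."""
    if not scores:
        raise ValueError("the task network has no task nodes")
    reached = sum(1 for score in scores.values() if score >= threshold)
    return reached / len(scores)


# Hypothetical standardized scores for four task nodes.
node_scores = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.7}
tcr = task_coverage_rate(node_scores, threshold=0.6)
```

In this sample, three of the four nodes reach the threshold.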

[0051] Step S402: Determine the weights of each inference task based on the degree centrality of each task node in the task network and the task similarity corresponding to each edge structure.

[0052] In some embodiments, the weight of each inference task is determined based on the degree centrality of each task node in the task network and the task similarity corresponding to each edge structure, including: for each task node, calculating the average task similarity corresponding to the task node based on the task similarity corresponding to each edge structure of the task node; and summing the degree centrality of the task node and the average task similarity to obtain the weight of the inference task corresponding to the task node.

[0053] Here, complex network analysis is used to calculate the degree centrality $D(T_i)$ of each task node $T_i$ in the task network. Degree centrality is the most direct measure of node centrality in complex network analysis: it measures the number of direct connections a node has with other nodes in the network. The more direct connections a node has, the greater its influence or importance in the network, and the higher its degree centrality.

[0054] For each task node, the edge structure of that task node is the edge structure that directly connects that task node with other task nodes. The task similarity corresponding to the edge structure refers to the task similarity between the reasoning tasks of the two task nodes connected by the edge structure.

[0055] For each task node, the average task similarity is the ratio of the sum of the task similarities of the task node's edge structures to the degree value of the task node (i.e., the number of edge structures directly connected to the task node).

[0056] The centrality of the task node is obtained by summing its degree centrality with the average task similarity. This centrality represents the importance or influence of the task node in the task network, which is also the weight of the inference task corresponding to the task node.

[0057] The centrality (weight) of task node $T_i$ can be expressed as:

$$\lambda_i = D(T_i) + \frac{1}{k_i} \sum_{T_j \in \mathcal{N}(T_i)} C(T_i, T_j)$$

[0058] where $\lambda_i$ represents the centrality (weight) of task node $T_i$; $D(T_i)$ represents the degree centrality of task node $T_i$; $k_i$ represents the degree value of task node $T_i$, i.e., the number of edges directly connected to the task node; $T_j \in \mathcal{N}(T_i)$ indicates that there is a directly connected edge structure between task node $T_j$ and task node $T_i$; and $C(T_i, T_j)$ represents the task similarity corresponding to the edge structure between them.
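The weight described in step S402 (degree centrality plus average task similarity over a node's edges) can be sketched as follows. This is an illustration under stated assumptions: degree centrality is normalized as degree divided by (number of nodes minus 1), which is a common convention but is not specified in the text, and an isolated node is given an average similarity of 0. The names `task_weights`, `edges`, and `lam` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def task_weights(nodes: Iterable[Hashable], sim: Dict[Edge, float]) -> Dict[Hashable, float]:
    """Weight of each task node = degree centrality + average edge similarity."""
    node_list = list(nodes)
    # For every node, collect the similarities of the edges attached to it.
    incident: Dict[Hashable, List[float]] = {n: [] for n in node_list}
    for (a, b), c in sim.items():
        incident[a].append(c)
        incident[b].append(c)
    denom = max(len(node_list) - 1, 1)
    weights = {}
    for n in node_list:
        k = len(incident[n])                 # degree value of the node
        degree_centrality = k / denom        # assumed normalization
        avg_sim = sum(incident[n]) / k if k else 0.0  # assumed 0 for isolated nodes
        weights[n] = degree_centrality + avg_sim
    return weights


# Hypothetical network: A, B, C are mutually similar; D has no edges.
edges = {("A", "B"): 0.9, ("A", "C"): 0.7, ("B", "C"): 0.8}
lam = task_weights(["A", "B", "C", "D"], edges)
```

In this example node A has two of three possible connections and an average similarity of about 0.8, while the isolated node D receives weight 0.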

[0059] Step S403: According to the weight of each inference task, the basic performance indicators of each inference task are weighted and averaged to obtain the weighted average performance indicator of the target artificial intelligence model.

[0060] Based on the importance (weight) of the inference task corresponding to each task node in the network, the basic performance indicators of the model on each inference task are weighted and averaged to obtain the weighted average performance indicator of the target artificial intelligence model.

[0061] The weighted average performance index can be expressed as:

[0062] where $\lambda_i$ is the weight of task node (inference task) $T_i$, $P(T_i)$ is the basic performance indicator of the model on task node (inference task) $T_i$, and $N$ is the total number of task nodes (inference tasks).
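A minimal sketch of the weighted averaging in step S403 is given here. It assumes the average is normalized by the sum of the weights, which is the usual meaning of a weighted average; the exact normalization used in the application is not recoverable from this text. The function and variable names are hypothetical.

```python
from typing import Dict


def weighted_average_performance(weights: Dict[str, float], perf: Dict[str, float]) -> float:
    """Weighted average of per-task basic performance, normalized by total weight."""
    total = sum(weights[t] for t in perf)
    if total == 0:
        raise ValueError("all task weights are zero")
    return sum(weights[t] * perf[t] for t in perf) / total


# Hypothetical weights (centrality) and basic performance scores.
w = {"A": 2.0, "B": 1.0, "C": 1.0}
p = {"A": 0.9, "B": 0.5, "C": 0.7}
ap = weighted_average_performance(w, p)  # (1.8 + 0.5 + 0.7) / 4.0
```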

[0063] Step S404: Based on task coverage and weighted average performance metrics, obtain the generality metrics of the target artificial intelligence model.

[0064] In some embodiments, the generality indicator includes the task coverage and the weighted average performance indicator.

[0065] In some embodiments, the task coverage and weighted average performance metrics are summed to obtain a generality metric for the target AI model. This generality metric can be expressed as:

[0066] Where α1 is a weight parameter used to balance the breadth of task coverage and the overall performance level; TCR represents task coverage rate, AP represents weighted average performance index, and I1 represents generality index; the larger the I1 value, the higher the breadth of task coverage and average performance of the model in the task network, indicating that the model has stronger generality.
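One way to combine the two quantities in step S404 is sketched here. The convex combination $\alpha_1 \cdot TCR + (1 - \alpha_1) \cdot AP$ is an assumption chosen because the text describes $\alpha_1$ as balancing coverage breadth against overall performance; the application's exact expression is not reproduced in this text and may differ. The function name is hypothetical.

```python
def generality_index(tcr: float, ap: float, alpha1: float = 0.5) -> float:
    """Assumed form: I1 = alpha1 * TCR + (1 - alpha1) * AP."""
    if not 0.0 <= alpha1 <= 1.0:
        raise ValueError("alpha1 is expected to lie in [0, 1]")
    return alpha1 * tcr + (1.0 - alpha1) * ap


# With alpha1 = 0.25, coverage contributes a quarter of the index.
i1 = generality_index(tcr=1.0, ap=0.0, alpha1=0.25)
```

Under this assumed form, a larger `alpha1` rewards breadth of coverage, while a smaller one rewards the weighted average performance.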

[0067] In some embodiments, the comprehensive performance index includes a generalization ability index. In step S203 above, evaluating the comprehensive performance index of the target artificial intelligence model based on the basic performance index of the inference task corresponding to each task node in the task network may further include: traversing at least one group of task nodes with edge structure in the task network, and calculating the generalization ability index of the target artificial intelligence model based on the task similarity between task nodes in each group of task nodes and the basic performance index corresponding to each task node.

[0068] The generalization ability metric is used to evaluate the model's ability to transfer knowledge between semantically similar tasks. Two task nodes connected by an edge structure in the task network form one group of task nodes, so each group contains two task nodes. The method traverses at least one group $(T_i, T_j)$ of task nodes with an edge structure in the task network and, combining the task similarity $C(T_i, T_j)$ between the task nodes with the model's basic performance metrics on the corresponding tasks, defines the generalization ability metric $I_2$ as:

[0069] where $C(T_i, T_j)$ represents the task similarity between two task nodes $(T_i, T_j)$ with an edge structure, $P(T_i)$ is the model's basic performance indicator on task node $T_i$, and $P(T_j)$ is the model's basic performance indicator on task node $T_j$.

[0070] Understandably, when a model maintains high performance across multiple semantically similar reasoning tasks, the generalization ability metric I2 takes a large value, indicating that the model has strong cross-task knowledge transfer ability. Conversely, when the model has significant performance differences across multiple semantically similar reasoning tasks, the generalization ability metric I2 takes a small value, indicating that the model has weak cross-task knowledge transfer ability.

[0071] In some embodiments, the basic performance difference between the two task nodes in any group connected by an edge structure can also reflect the model's generalization ability between the inference tasks of the two task nodes. The smaller the difference, the stronger the model's cross-task knowledge transfer ability between the inference tasks of these two task nodes. Conversely, the larger the difference, the weaker the model's cross-task knowledge transfer ability between them. Therefore, in some embodiments, the difference in basic performance indicators between the two task nodes in each group with an edge structure is used as an indicator of the model's generalization ability.
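The per-group performance difference described above can be sketched as a simple pass over the edges of the task network. The use of the absolute difference and the names `edge_performance_gaps`, `sim`, and `perf` are assumptions for illustration.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def edge_performance_gaps(sim: Dict[Edge, float], perf: Dict[str, float]) -> Dict[Edge, float]:
    """For each connected pair of task nodes, return |P(Ti) - P(Tj)|."""
    return {(a, b): abs(perf[a] - perf[b]) for (a, b) in sim}


# Hypothetical edges (similar task pairs) and basic performance scores.
sim = {("A", "B"): 0.9, ("B", "C"): 0.8}
perf = {"A": 0.80, "B": 0.78, "C": 0.50}
gaps = edge_performance_gaps(sim, perf)
# The A-B gap is small (stable transfer); the B-C gap is large (weak transfer).
```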

[0072] Figure 5 is a flowchart illustrating another comprehensive performance evaluation method for a target artificial intelligence model according to an embodiment of this application. In some embodiments, as shown in Figure 5, the comprehensive performance indicators include a domain-specific indicator, which measures the differences in the distribution of model capabilities across different task clusters or domains. In step S203 above, evaluating the comprehensive performance indicators of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network may further include: Step S501: Classify the inference tasks into clusters according to their domains to obtain multiple task clusters.

[0073] Specifically, the domain to which each inference task belongs is determined by the task metadata of the inference task in each task node, and inference tasks in the same domain are divided into the same task cluster, thus obtaining multiple task clusters. A task cluster is a set of tasks with strong internal semantic association in the task network.

[0074] In some embodiments, a community detection algorithm can be used to automatically divide task nodes in the network into multiple task clusters based on the task network structure; or, task clusters can be divided based on a connected subgraph constructed in the task network according to a preset similarity threshold, where the task similarity between tasks within each task cluster is significantly different from the task similarity between tasks within other task clusters.
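The connected-subgraph variant of cluster division can be sketched with a breadth-first search over edges whose similarity reaches a preset threshold. This is only one possible reading of the paragraph above; the function name `task_clusters`, the example task names, and the use of greater than or equal for the threshold comparison are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple


def task_clusters(nodes: List[str], sim: Dict[Tuple[str, str], float], threshold: float) -> List[List[str]]:
    """Group task nodes into connected components of the thresholded network."""
    adj = {n: set() for n in nodes}
    for (a, b), c in sim.items():
        if c >= threshold:  # keep only sufficiently similar pairs
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            cur = queue.popleft()
            component.append(cur)
            for nxt in sorted(adj[cur]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(sorted(component))
    return clusters


names = ["math1", "math2", "law1", "law2", "misc"]
pairs = {("math1", "math2"): 0.9, ("law1", "law2"): 0.85, ("math2", "law1"): 0.3}
clusters = task_clusters(names, pairs, threshold=0.6)
# -> [["math1", "math2"], ["law1", "law2"], ["misc"]]
```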

[0075] Step S502: Obtain the average basic performance index of each task cluster and the mean of the average basic performance index of each task cluster.

[0076] For each task cluster, the average basic performance index of the task cluster is calculated based on the basic performance index of each inference task in the task cluster, and the mean of the average basic performance index of each task cluster is calculated based on the average basic performance index of each task cluster.

[0077] Step S503: Calculate the domain-specific index of the target artificial intelligence model based on the average basic performance index and mean of each task cluster.

[0078] In some embodiments, tasks are divided into several task clusters $\{CL_1, CL_2, \ldots, CL_{N_{CL}}\}$ based on the task network structure or domain labels. The average basic performance index $M(CL_k)$ of the model within each task cluster $CL_k$ is calculated, and the domain-specific indicator $I_3$ is defined as:

[0079] where $M(CL_k)$ is the average basic performance index within task cluster $CL_k$, the mean term is the average of these indices across all task clusters, $N_{CL}$ is the number of task clusters, and $I_3$ indicates the domain specificity of the model. The smaller the $I_3$ value, the more balanced the model's performance is across different domains, indicating strong global generality. A large $I_3$ value indicates that the model's capabilities are concentrated in specific domains.
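Steps S502 and S503 can be illustrated with a dispersion measure over the per-cluster averages. The sketch below assumes the population standard deviation of the cluster means; the application's exact expression for the domain-specific indicator is not reproduced in this text and may use a different dispersion measure. All names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def domain_specificity(clusters: Dict[str, List[str]], perf: Dict[str, float]) -> float:
    """Assumed form: population standard deviation of per-cluster mean performance."""
    # Average basic performance within each task cluster.
    cluster_means = [mean([perf[t] for t in tasks]) for tasks in clusters.values()]
    # Spread of those averages around their overall mean.
    return pstdev(cluster_means)


groups = {"math": ["m1", "m2"], "law": ["l1", "l2"]}
scores = {"m1": 0.9, "m2": 0.7, "l1": 0.5, "l2": 0.3}
i3 = domain_specificity(groups, scores)  # cluster means are about 0.8 and 0.4
```

A model with identical averages in every cluster yields 0 under this assumed form, which matches the description of a balanced, general model.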

[0080] Figure 6 is a flowchart illustrating another comprehensive performance evaluation method for a target artificial intelligence model according to an embodiment of this application. In some embodiments, as shown in Figure 6, the comprehensive performance indicators include a robustness indicator, which evaluates the stability of the model under changes in task difficulty and in specific task scenarios. In step S203 above, evaluating the comprehensive performance indicators of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network may further include: Step S601: Group the inference tasks whose task similarity reaches the second preset similarity threshold and whose corresponding dataset characteristics differ into a task variant group.

[0081] In the task network, inference tasks that are semantically similar but differ in difficulty or dataset characteristics are selected to construct several task variant groups $TV_k$.

[0082] Step S602: Obtain the average base performance index of each task variant group and the mean of the average base performance index of each task variant group.

[0083] For each task variant group, the average basic performance index of the task variant group is calculated based on the basic performance index of each inference task in the task variant group, and the mean of the average basic performance index of each task variant group is calculated based on the average basic performance index of each task variant group.

[0084] Step S603: Calculate the task variant stability index of the target artificial intelligence model based on the average basic performance index and mean of each task variant group.

[0085] In the task network, inference tasks that are semantically similar but differ in difficulty or data characteristics are selected to construct several task variant groups $TV_k$. The performance fluctuation of the model within each task variant group is calculated, and the task variant stability metric TVS is defined as follows:

[0086] where $N_{TV}$ is the number of task variant groups, $P(TV_k)$ is the average basic performance indicator within task variant group $TV_k$, and the mean term is the average basic performance indicator over all task variant groups.

[0087] Step S604: Determine the abnormal task performance index of the target artificial intelligence model based on the difference between the average basic performance index of each inference task and the basic performance index of a specific inference task; the specific inference task is the inference task whose task similarity with multiple inference tasks is lower than the third preset similarity threshold.

[0088] Inference tasks in the task network that differ significantly from most other inference tasks are identified and designated as specific inference tasks. The model's performance on these specific inference tasks is evaluated based on the difference between the average basic performance index of all inference tasks and the basic performance index of each specific inference task. The abnormal task performance index is then defined as:

[0089] where AP is the model's average basic performance metric across all inference tasks in the task network, IS is the set of specific inference tasks, and $P(T_i)$ represents the basic performance indicator of the i-th specific inference task $T_i$ in the set IS.

[0090] Step S605: The stability index of the task variant and the performance index of the abnormal task are weighted and summed to obtain the robustness index of the target artificial intelligence model.

[0091] The robustness index is expressed as:

[0092] where the two weight parameters are the weight of the task variant stability (TVS) index and the weight of the abnormal task performance index, respectively, and $I_4$ represents the robustness index of the model. The smaller the $I_4$ value, the higher the robustness of the model under complex and specific inference task conditions.

[0093] Figure 7 is a flowchart illustrating a method for generating a task evaluation execution strategy according to an embodiment of this application. In some embodiments, as shown in Figure 7, the network structure parameters of the task network include the centrality parameter of each task node; the comprehensive performance indicators include the generality indicator, the generalization ability indicator, the domain-specific indicator, and the robustness indicator; and the task evaluation execution strategy includes the evaluation execution order and the evaluation execution frequency of each inference task. In step S204 above, determining the task evaluation execution strategy of the artificial intelligence model to be evaluated based on the comprehensive performance indicators of the target artificial intelligence model and the network structure parameters of the task network may further include: Step S701: Sort the multiple inference tasks in descending order of the centrality parameters of their task nodes to generate an evaluation execution order for the multiple inference tasks.

[0094] The multiple inference tasks are sorted in descending order of the centrality parameter $\lambda_i$ of each task node to obtain the evaluation execution order. The higher the centrality parameter, the higher the priority of the inference task for evaluation and execution. With this mechanism, when evaluation resources are limited, tasks with greater impact on the overall evaluation are evaluated first, thereby improving evaluation efficiency.
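Step S701 amounts to a descending sort on the centrality parameter. In the sketch below, breaking ties by task name is an assumption added so that the order is deterministic; the text does not say how ties are resolved. The function name is hypothetical.

```python
from typing import Dict, List


def evaluation_order(centrality: Dict[str, float]) -> List[str]:
    """Sort tasks by centrality, highest first; ties broken by name (assumed)."""
    return sorted(centrality, key=lambda t: (-centrality[t], t))


order = evaluation_order({"A": 1.4, "B": 1.5, "C": 1.5, "D": 0.0})
# -> ["B", "C", "A", "D"]
```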

[0095] Initially, the evaluation execution frequency for each inference task is the default base execution frequency, which refers to the evaluation frequency (i.e., inference frequency) of the model on that inference task.

[0096] Step S702: Based on the generalization ability index of the target artificial intelligence model, adjust the evaluation execution frequency of inference tasks whose task similarity is greater than the first preset similarity threshold.

[0097] In some embodiments, the generalization ability metric includes the basic performance difference of inference tasks whose task similarity is greater than a first preset similarity threshold θ. In the task network, if the basic performance difference of the inference tasks whose task similarity exceeds θ satisfies the following condition:

[0098] then the model is judged to have stable generalization ability on the inference tasks whose task similarity exceeds the first preset similarity threshold θ. The bound in the condition is a given performance fluctuation threshold.

[0099] At this point, among the reasoning tasks whose similarity exceeds the first preset similarity threshold θ, the reasoning task with the highest centrality parameter is selected as the representative task for complete reasoning. That is, the evaluation execution frequency of this reasoning task is maintained at the default base execution frequency, while the evaluation execution frequency of the remaining reasoning tasks is adjusted as follows:

[0100] where $F_{base}$ is the base execution frequency, $\eta$ ($0 < \eta < 1$) is the preset execution frequency attenuation coefficient, and the resulting value is the evaluation execution frequency of the remaining inference tasks.

[0101] If the basic performance difference of the inference tasks whose task similarity exceeds the first preset similarity threshold θ does not satisfy the above condition, the evaluation execution frequency of those inference tasks is maintained or increased.

[0102] The above mechanism reduces the computational overhead of repeated inference on semantically similar tasks.
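The frequency adjustment of step S702 can be sketched for one group of mutually similar tasks. Several details are assumptions: the stability test compares the spread (maximum minus minimum) of basic performance against a fluctuation threshold named `eps` here, the comparison is strict, the attenuated frequency is taken as `eta * f_base`, and unstable groups simply keep the base frequency. The function and parameter names are hypothetical.

```python
from typing import Dict, List


def adjust_frequencies(group: List[str], perf: Dict[str, float], centrality: Dict[str, float],
                       f_base: float, eta: float, eps: float) -> Dict[str, float]:
    """Keep the most central task at f_base and attenuate the rest if the group is stable."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must satisfy 0 < eta < 1")
    freq = {t: f_base for t in group}
    spread = max(perf[t] for t in group) - min(perf[t] for t in group)
    if spread < eps:  # assumed stability condition
        representative = max(group, key=lambda t: centrality[t])
        for t in group:
            if t != representative:
                freq[t] = eta * f_base  # assumed attenuation rule
    return freq


perf = {"A": 0.80, "B": 0.78, "C": 0.79}
cent = {"A": 1.4, "B": 1.5, "C": 1.2}
freq = adjust_frequencies(["A", "B", "C"], perf, cent, f_base=100.0, eta=0.2, eps=0.05)
# B is the representative task; A and C are evaluated at a reduced frequency.
```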

[0103] Step S703: In response to the generality index of the target artificial intelligence model satisfying the task pruning condition, inference tasks with centrality parameters below the centrality threshold are removed from the evaluation execution order.

[0104] The task pruning condition includes: the task coverage is greater than a coverage threshold, and the difference in the weighted average performance index between two successive task groups is less than a difference threshold.

[0105] Here, the task coverage is the one used in the generality indicator. The multiple inference tasks to be evaluated are divided into multiple task groups and evaluated sequentially in group order. The two quantities compared are the weighted average performance index of the tasks in the i-th task group and that of the tasks in the (i-1)-th task group, and the bound on their difference is the difference threshold.

[0106] When the generality index of the target artificial intelligence model meets the task pruning conditions, it is determined that the overall performance of the model tends to be stable during the evaluation of grouped tasks. The evaluation pruning mechanism is triggered, and the inference tasks with centrality parameters below the centrality threshold are removed from the evaluation execution order. The removed inference tasks stop executing model inference, thereby terminating redundant evaluation tasks in advance. This mechanism reduces the overall number of inferences while ensuring the validity of the evaluation results.
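The pruning step S703 can be sketched as a guard followed by a filter on the execution order. The parameter names (`tcr_min`, `delta`, `centrality_min`), the strict comparisons, and the use of the absolute difference between the two most recent group averages are assumptions for illustration.

```python
from typing import Dict, List


def prune_order(order: List[str], centrality: Dict[str, float], tcr: float,
                ap_prev: float, ap_curr: float, tcr_min: float,
                delta: float, centrality_min: float) -> List[str]:
    """Drop low-centrality tasks once coverage is high and group averages have stabilized."""
    triggered = tcr > tcr_min and abs(ap_curr - ap_prev) < delta
    if not triggered:
        return list(order)
    # Tasks below the centrality threshold are removed and no longer executed.
    return [t for t in order if centrality[t] >= centrality_min]


cent = {"B": 1.5, "C": 1.5, "A": 1.4, "D": 0.0}
pruned = prune_order(["B", "C", "A", "D"], cent, tcr=0.9, ap_prev=0.75, ap_curr=0.752,
                     tcr_min=0.8, delta=0.01, centrality_min=1.0)
# -> ["B", "C", "A"]
```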

[0107] Step S704: In response to the domain-specific index of the target artificial intelligence model being greater than the domain-specific threshold, the evaluation execution frequency of each inference task in the task cluster whose average basic performance index is lower than the basic performance threshold is increased; the inference tasks in the task cluster belong to the same domain, and the average basic performance index is the mean of the basic performance index of each inference task in the task cluster.

[0108] When the domain-specific index $I_3$ is greater than the given domain-specific threshold $T_s$, the model performs significantly differently across domains. In this case, the evaluation execution frequency $F_{domain_k}$ can be increased for task clusters in domains where the model's performance is weaker:

$$F_{domain_k} = R_k \cdot F_{base}$$

[0109] where $R_k > 1$ is the domain resampling scaling factor, $domain_k$ is a task cluster with weaker performance (average basic performance index below the basic performance threshold), and $F_{base}$ is the base execution frequency. This mechanism allows compensatory evaluation of the areas where the model's capabilities are weak, thereby improving the comprehensiveness of the evaluation.
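A minimal sketch of step S704 follows. It assumes the boosted frequency is the base frequency multiplied by a single scaling factor `r_k` shared by all weak clusters (the text allows a per-cluster factor), and that clusters are given as a mapping from domain name to task list. All names are hypothetical.

```python
from typing import Dict, List


def boost_weak_domains(clusters: Dict[str, List[str]], perf: Dict[str, float],
                       i3: float, t_s: float, perf_min: float,
                       f_base: float, r_k: float) -> Dict[str, float]:
    """Raise the evaluation frequency of tasks in clusters with a low average score."""
    if r_k <= 1.0:
        raise ValueError("r_k must be greater than 1")
    freq = {t: f_base for tasks in clusters.values() for t in tasks}
    if i3 <= t_s:  # performance is balanced across domains: nothing to compensate
        return freq
    for tasks in clusters.values():
        avg = sum(perf[t] for t in tasks) / len(tasks)
        if avg < perf_min:
            for t in tasks:
                freq[t] = r_k * f_base
    return freq


groups = {"math": ["m1", "m2"], "law": ["l1", "l2"]}
scores = {"m1": 0.9, "m2": 0.7, "l1": 0.5, "l2": 0.3}
freq = boost_weak_domains(groups, scores, i3=0.2, t_s=0.1, perf_min=0.6, f_base=10.0, r_k=2.0)
# Tasks in the weaker "law" cluster are evaluated twice as often.
```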

[0110] Step S705: In response to the robustness index of the target artificial intelligence model being greater than the robustness threshold, the evaluation execution frequency of the inference task in the specific inference task and / or task variant group is increased. The specific inference task is the inference task whose task similarity with multiple inference tasks is lower than the third preset similarity threshold. The task variant group includes inference tasks whose task similarity reaches the second preset similarity threshold and whose corresponding dataset characteristics are different.

[0111] When the robustness index satisfies $I_4 > T_r$, where $T_r$ is the robustness threshold, the model is unstable on task variants or specific inference tasks. In this case, task variant groups or specific inference tasks are introduced, and the evaluation execution frequency $F_{variant}$ of the specific inference tasks and/or the inference tasks in the task variant groups is increased. The specific adjustment is as follows:

$$F_{variant} = R_{var} \cdot F_{base}$$

[0112] where the robustness compensation factor $R_{var} > 1$. This mechanism strengthens the evaluation of the model in complex and specific inference task scenarios.

[0113] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.

[0114] Figure 8 is a schematic structural diagram of an electronic device provided in an embodiment of this application. Another embodiment of this application provides an electronic device which, as shown in Figure 8, includes a memory 10 and a processor 20. The memory 10 stores a computer program, and the processor 20 is configured to run the computer program to perform the steps in any of the above embodiments of the task network-based artificial intelligence model evaluation and optimization method.

[0115] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above embodiments of the task network-based artificial intelligence model evaluation and optimization method.

[0116] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0117] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the task network-based artificial intelligence model evaluation and optimization method.

[0118] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above embodiments of the task network-based artificial intelligence model evaluation and optimization method.

[0119] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be executed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microprocessor (MCU), etc. The terms "system," "computing device," or "apparatus" as used herein encompass various means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.

[0120] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0121] The above provides a detailed description of the task network-based artificial intelligence model evaluation and optimization method provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only intended to help understand the method and core ideas of this application. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.

Claims

1. A task network-based artificial intelligence model evaluation optimization method, characterized in that the method comprises: obtaining the task similarity between multiple inference tasks of a target artificial intelligence model; constructing a task network of the target artificial intelligence model based on the task similarity between the multiple inference tasks, wherein each task node in the task network corresponds to one inference task; evaluating a comprehensive performance index of the target artificial intelligence model based on the basic performance index of the inference task corresponding to each task node in the task network; and determining, based on the comprehensive performance index of the target artificial intelligence model and the network structure parameters of the task network, a task evaluation execution strategy for the artificial intelligence model to be evaluated, wherein the task evaluation execution strategy is used to control the evaluation execution process of the artificial intelligence model to be evaluated on the multiple inference tasks.

2. The artificial intelligence model evaluation and optimization method according to claim 1, characterized in that constructing the task network of the target artificial intelligence model based on the task similarity between the multiple inference tasks includes: calculating the task similarity between any two inference tasks based on the task description text of each inference task; taking each inference task as a task node, and constructing an edge structure between any two task nodes based on the task similarity between the corresponding two inference tasks; and constructing the task network of the target artificial intelligence model based on the task nodes and edge structures corresponding to the inference tasks.

3. The artificial intelligence model evaluation and optimization method according to claim 2, characterized in that constructing an edge structure between any two task nodes based on the task similarity between any two inference tasks includes: in response to the task similarity between two inference tasks being greater than a first preset similarity threshold, constructing an edge structure between the task nodes of the two inference tasks; and in response to the task similarity between two inference tasks being less than the first preset similarity threshold, keeping the task nodes of the two inference tasks disconnected.

4. The artificial intelligence model evaluation and optimization method according to claim 2, characterized in that the comprehensive performance indicators include a generality indicator, and evaluating the comprehensive performance indicators of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network includes: evaluating the task coverage of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network, wherein the task coverage is the percentage of task nodes whose basic performance indicators reach the performance threshold; determining the weight of each inference task based on the degree centrality of each task node in the task network and the task similarity corresponding to each edge structure; performing a weighted average of the basic performance indicators of the inference tasks according to the weight of each inference task to obtain the weighted average performance indicator of the target artificial intelligence model; and obtaining the generality indicator of the target artificial intelligence model based on the task coverage and the weighted average performance indicator.

5. The artificial intelligence model evaluation and optimization method according to claim 4, characterized in that determining the weight of each inference task based on the degree centrality of each task node in the task network and the task similarity corresponding to each edge structure includes: for each task node, calculating the average task similarity corresponding to the task node based on the task similarity corresponding to each edge structure of the task node; and summing the degree centrality of the task node and the average task similarity to obtain the weight of the inference task corresponding to the task node.

6. The artificial intelligence model evaluation and optimization method according to claim 2, characterized in that the comprehensive performance indicators include a generalization ability indicator, and evaluating the comprehensive performance indicators of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network includes: traversing at least one group of task nodes with an edge structure in the task network, and calculating the generalization ability indicator of the target artificial intelligence model based on the task similarity between the task nodes in each group and the basic performance indicator corresponding to each task node.

7. The artificial intelligence model evaluation and optimization method according to claim 1, characterized in that the comprehensive performance indicators include a domain-specific indicator, and evaluating the comprehensive performance indicators of the target artificial intelligence model based on the basic performance indicators of the inference tasks corresponding to each task node in the task network includes: classifying the inference tasks into clusters according to their domains to obtain multiple task clusters; obtaining the average basic performance index of each task cluster and the mean of the average basic performance indices of the task clusters; and calculating the domain-specific indicator of the target artificial intelligence model based on the average basic performance index of each task cluster and the mean.

8. The artificial intelligence model evaluation and optimization method according to claim 1, characterized in that the comprehensive performance index includes a robustness index; and the step of evaluating the comprehensive performance index of the target artificial intelligence model based on the basic performance index of the inference task corresponding to each task node in the task network comprises: grouping inference tasks whose task similarity reaches a second preset similarity threshold and whose corresponding datasets have different characteristics into a task variant group; obtaining the average basic performance index of each task variant group and the mean of the average basic performance indices across the task variant groups; calculating a task variant stability index of the target artificial intelligence model based on the average basic performance index of each task variant group and the mean; determining an abnormal task performance index of the target artificial intelligence model based on the difference between the average basic performance index of each inference task and the average basic performance index of a specific inference task, the specific inference task being an inference task whose task similarity with multiple inference tasks is lower than a third preset similarity threshold; and obtaining the robustness index of the target artificial intelligence model as a weighted sum of the task variant stability index and the abnormal task performance index.
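
Illustrative note (not part of the claim): the robustness computation of claim 8 can be sketched as follows. Several details are assumptions rather than claim content: the stability index is taken as the population standard deviation of the variant-group averages, the abnormal task performance index as the absolute gap between the mean over all tasks and the mean over the specific (low-similarity) tasks, and the two weights default to 0.5 each. Under these assumptions a larger value indicates larger performance fluctuation. All names are hypothetical.

```python
from math import sqrt


def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def robustness_index(performance, variant_groups, specific_tasks,
                     w_stability=0.5, w_abnormal=0.5):
    """Weighted sum of a variant stability term and an abnormal-task term.

    performance: dict mapping each task to its basic performance index.
    variant_groups: dict mapping a group name to the list of tasks in that variant group.
    specific_tasks: tasks whose similarity to other tasks is below the third threshold.
    """
    # Average basic performance index per variant group, then the mean of those averages.
    group_avgs = [_mean(performance[t] for t in tasks) for tasks in variant_groups.values()]
    grand_mean = _mean(group_avgs)
    # Assumption: stability term = population standard deviation of the group averages.
    stability = sqrt(_mean((avg - grand_mean) ** 2 for avg in group_avgs))

    # Assumption: abnormal term = |mean over all tasks - mean over specific tasks|.
    abnormal = 0.0
    if specific_tasks:
        overall = _mean(performance.values())
        abnormal = abs(overall - _mean(performance[t] for t in specific_tasks))

    return w_stability * stability + w_abnormal * abnormal
```

The weights `w_stability` and `w_abnormal` stand in for the unspecified coefficients of the weighted sum and would need to be chosen for the evaluation scenario at hand.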

9. The artificial intelligence model evaluation and optimization method according to claim 1, characterized in that the network structure parameters of the task network include a centrality parameter of each task node; the comprehensive performance index includes a generality index, a generalization ability index, a domain-specific index, and a robustness index; the task evaluation execution strategy includes an evaluation execution order and an evaluation execution frequency of each inference task; and the step of determining the task evaluation execution strategy for the artificial intelligence model to be evaluated based on the comprehensive performance index of the target artificial intelligence model and the network structure parameters of the task network comprises: sorting the multiple inference tasks in descending order of their centrality parameters to generate the evaluation execution order of the multiple inference tasks; adjusting, based on the generalization ability index of the target artificial intelligence model, the evaluation execution frequency of inference tasks whose task similarity is greater than a first preset similarity threshold; in response to the generality index of the target artificial intelligence model satisfying a task pruning condition, removing inference tasks whose centrality parameters are below a centrality threshold from the evaluation execution order; in response to the domain-specific index of the target artificial intelligence model being greater than a domain-specificity threshold, increasing the evaluation execution frequency of each inference task in a task cluster whose average basic performance index is lower than a basic performance threshold, wherein the inference tasks in a task cluster belong to the same domain and the average basic performance index is the mean of the basic performance indices of the inference tasks in the task cluster; and in response to the robustness index of the target artificial intelligence model being greater than a robustness threshold, increasing the evaluation execution frequency of the specific inference task and/or of the inference tasks in the task variant group, wherein the specific inference task is an inference task whose task similarity with multiple inference tasks is lower than a third preset similarity threshold, and the task variant group includes inference tasks whose task similarity reaches a second preset similarity threshold and whose corresponding datasets have different characteristics.
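
Illustrative note (not part of the claim): the five strategy steps of claim 9 can be sketched as follows. The claim leaves several points open, so the sketch makes labeled assumptions: every task starts at a base frequency of 2, "adjusting" the frequency of highly similar tasks means lowering it by one when the generalization ability index is at or above a hypothetical threshold, and the task pruning condition is the generality index reaching a hypothetical threshold. The `Thresholds` class, its default values, and all function and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # All values are placeholders chosen for illustration only.
    generalization: float = 0.7      # assumed trigger for lowering redundant-task frequency
    generality_prune: float = 0.8    # assumed task pruning condition: generality >= this
    centrality: float = 0.3
    domain_specificity: float = 0.15
    basic_performance: float = 0.6
    robustness: float = 0.1


def build_strategy(centrality, indices, similar_tasks, cluster_of, cluster_avg,
                   fragile_tasks, th=None, base_frequency=2):
    """Return (execution_order, execution_frequency) for the inference tasks.

    centrality: {task: centrality parameter}.
    indices: dict with keys 'generality', 'generalization', 'domain_specificity', 'robustness'.
    similar_tasks: tasks whose task similarity exceeds the first preset threshold.
    cluster_of / cluster_avg: task -> domain cluster, cluster -> average basic performance.
    fragile_tasks: specific (low-similarity) tasks and/or tasks in variant groups.
    """
    th = th or Thresholds()
    # Step 1: descending order of centrality.
    order = sorted(centrality, key=centrality.get, reverse=True)
    frequency = {task: base_frequency for task in order}

    # Step 2 (assumption): good generalization lets similar tasks run less often.
    if indices["generalization"] >= th.generalization:
        for task in similar_tasks:
            if task in frequency:
                frequency[task] = max(1, frequency[task] - 1)

    # Step 3: prune low-centrality tasks when the assumed pruning condition holds.
    if indices["generality"] >= th.generality_prune:
        order = [t for t in order if centrality[t] >= th.centrality]
        frequency = {t: frequency[t] for t in order}

    # Step 4: strengthen evaluation of weak domain clusters.
    if indices["domain_specificity"] > th.domain_specificity:
        for task in order:
            if cluster_avg[cluster_of[task]] < th.basic_performance:
                frequency[task] += 1

    # Step 5: strengthen evaluation of specific tasks and variant-group tasks.
    if indices["robustness"] > th.robustness:
        for task in fragile_tasks:
            if task in frequency:
                frequency[task] += 1

    return order, frequency
```

In this sketch, pruned tasks are also dropped from the frequency table so that later steps never raise the frequency of a task that will not run; the claim itself only requires removal from the evaluation execution order.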

10. An electronic device, characterized by comprising: a memory configured to store a computer program; and a processor configured to implement, when executing the computer program, the steps of the artificial intelligence model evaluation and optimization method according to any one of claims 1 to 9.