Model evaluation method, apparatus, device, storage medium, and program product
By using multidimensional evaluation to quantitatively assess the maturity of large models through data science, the problems of resource waste and poor performance of large models in vehicle systems are solved, ensuring that model selection matches actual needs and improving the adaptability and security of model deployment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG GEELY HLDG GRP CO LTD
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-30
Abstract
Description
Technical Field
[0001] This application relates to the field of vehicle technology, and in particular to a model evaluation method, apparatus, device, storage medium, and program product.
Background Technology
[0002] With the rapid development of artificial intelligence technology, large models (such as large language models LLM and visual language models VLM) are gradually being introduced into the automotive field to realize functions such as voice interaction, scene recognition, and user intent understanding.
[0003] Currently, the existing technology mainly replaces the original small vehicle-mounted model directly with a large model, and then uses the large model to realize the functions previously handled by the small model.
[0004] However, this approach applies large models indiscriminately. Such "rash" adoption ultimately leaves the large models less effective and less efficient in actual use than the optimized small models.
Summary of the Invention
[0005] This application provides a model evaluation method, apparatus, device, storage medium, and program product that give selection guidance in the early stage of model deployment, ensuring that the model deployed to a device matches actual needs and improving the model deployment effect.
[0006] In a first aspect, embodiments of this application provide a model evaluation method, including:
[0007] Obtaining multidimensional evaluation quantitative data, which includes at least one of the following: the task adaptability of the first model, the model deployment feasibility of deploying the first model to the target device, and the model application value generated by deploying the first model to the target device.
[0008] Determining, based on the multidimensional evaluation quantitative data, the maturity of deploying and applying the first model to the target device.
[0009] In a second aspect, embodiments of this application provide a model evaluation apparatus, comprising:
[0010] The acquisition module is used to acquire multidimensional evaluation quantitative data, which includes at least one of the following: the task adaptability of the first model, the model deployment feasibility of deploying the first model to the target device, and the model application value generated by deploying the first model to the target device.
[0011] The determination module is used to determine the maturity of the first model deployment applied to the target device based on the multidimensional evaluation quantitative data.
[0012] In a third aspect, embodiments of this application provide an electronic device, including: a memory and a processor;
[0013] The memory stores computer-executable instructions;
[0014] The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method described above.
[0015] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the first aspect and / or various possible implementations of the first aspect.
[0016] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the first aspect and / or various possible implementations of the first aspect.
[0017] The model evaluation method, apparatus, device, storage medium, and program product provided in the embodiments of this application quantitatively evaluate the first model along three dimensions: task adaptability, model deployment feasibility, and model application value. Based on the evaluation results, the model can be reasonably selected in the early stage of model deployment, ensuring that the model deployed to the device matches actual needs and avoiding the waste of computing power caused by the rash use of large models.
Attached Figure Description
[0018] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0019] Figure 1 is a schematic diagram of the model evaluation method provided in this application;
[0020] Figure 2 is a flowchart of the task adaptability determination method provided in the embodiments of this application;
[0021] Figure 3 is a schematic diagram of the model deployment feasibility determination process provided in the embodiments of this application;
[0022] Figure 4 is a schematic diagram of the model application value determination process provided in the embodiments of this application;
[0023] Figure 5 is a flowchart of the vehicle-mounted large model evaluation process provided in the embodiments of this application;
[0024] Figure 6 is a schematic structural diagram of the model evaluation apparatus provided in this application;
[0025] Figure 7 is a schematic structural diagram of the electronic device provided in this application.
[0026] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0027] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0028] Large models, such as Large Language Model (LLM) and Vision-Language Model (VLM), are gradually being introduced into the automotive field to realize functions such as voice interaction, scene recognition, and user intent understanding.
[0029] In real-world vehicle scenarios, model deployment faces its own technical constraints and application requirements. For example: (1) the hardware resources of vehicle systems are limited (including computing power, storage, and network bandwidth), making it difficult to support the efficient deployment of large models; (2) the vehicle environment has strict real-time requirements, for example, in autonomous driving or emergency response scenarios the model inference speed directly affects system safety; (3) vehicle functions need to take the user interaction experience into account, for example, voice assistants need natural language understanding and multi-turn dialogue capabilities, while traditional small models, such as convolutional neural networks (CNN), perform poorly on such complex tasks.
[0030] Currently, there is a widespread phenomenon in the automotive field of blindly pursuing the deployment of large models to automotive systems. For example, in traditional perception tasks such as vehicle target detection, the indiscriminate use of large models leads to a waste of computing power and results in inferior performance compared to optimized smaller models. This deviation in technical path not only increases development costs but may also introduce security risks (such as hallucination problems) due to the excessive generalization ability of the model.
[0031] Based on this, embodiments of this application provide a model evaluation method, apparatus, and electronic device. By acquiring multidimensional evaluation quantitative data such as task adaptability, model deployment feasibility, and model application value, the method evaluates model deployment and application on a quantitative basis. It helps the industry clarify where large models have an advantage instead of replacing mature small-model solutions with large models in all scenarios, avoids the "rash" deployment and application of large models in vehicle scenarios, and steers resources towards technologies that can truly create value.
[0032] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems are described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will now be described with reference to the accompanying drawings.
[0033] Figure 1 is a schematic diagram of the model evaluation method provided in this application. This method can be deployed in electronic devices (such as computer devices and mobile terminals). As shown in Figure 1, the method includes:
[0034] Step 110: Obtain multidimensional evaluation quantitative data.
[0035] The multidimensional evaluation quantitative data includes at least one of the following: the task adaptability of the first model, the model deployment feasibility of deploying the first model to the target device, and the model application value generated by deploying the first model to the target device.
[0036] Step 120: Based on the multidimensional evaluation quantitative data, determine the maturity of deploying and applying the first model to the target device.
[0037] The model evaluation method provided in this application can be applied to the model selection stage of in-vehicle intelligent systems (such as voice assistants, scene recognition, and user intent understanding). Its typical network architecture includes the in-vehicle terminal (devices with limited computing power), the cloud (model training and updates), and edge nodes (local caching and security management). For example, in voice interaction scenarios, the system needs to determine whether to use a large language model to support multi-turn dialogue or maintain a small model to ensure real-time performance under limited computing power.
[0038] In this embodiment, the first model may refer to the large model mentioned above, while the second model mentioned later may refer to the small model.
[0039] In addition, the third and fourth models will be mentioned later, both of which can refer to smaller models.
[0040] Compared to small models, large models require more software resources (such as software framework, compilation optimization, model compression, and runtime thread overhead) and hardware resources (such as chip computing power, memory resources, and hard disk capacity). In addition, large models have more model parameters than small models, and require more manpower and resources in the early stages (for example, large models need to collect more training data) to build them.
[0041] Furthermore, the difference between large and small models lies in their model parameter scale: large models have a large number of parameters (e.g., over a billion), rely on large-scale training data for pre-training, and learn general feature representations. Compared to small models, large models offer greater flexibility and generalization ability, but at a higher computational cost. Small models, on the other hand, have a smaller number of parameters (e.g., in the millions to tens of millions). Small models are primarily designed and finely tuned for specific tasks (such as vehicle target detection). For these specific tasks, small models offer high computational efficiency and fast response, making them more suitable for real-time and power-sensitive scenarios.
[0042] Regarding step 110 above, taking the model evaluation method provided in this application as an example applied to the automotive field, the target device can be a device in the vehicle, such as a perception and interaction device (used to realize voice interaction with the driver and vehicle navigation, etc.) and an intelligent driving control device (used to realize intelligent driving).
[0043] First, the multidimensional evaluation quantitative data of the first model is obtained, and the maturity of the first model is determined from it, so as to decide whether the first model is suitable for deployment in the perception and interaction device.
[0044] In this embodiment, task adaptability refers to whether the first model can adapt to the tasks that the perception and interaction device needs to process. The adaptability of the first model can be determined based on the overlap between the functions supported by the first model and the functions required by the task; the higher the overlap, the higher the adaptability.
[0045] For example, the first model supports user voice interaction and text recognition, while the task that the perception and interaction device needs to process is text recognition. The text recognition function of the first model matches the task processing requirements, but the user voice interaction function is redundant for the perception and interaction device. Thus, the overlap is only 50%.
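The overlap in this example can be sketched in a few lines. The text does not give a formula, so this sketch assumes the overlap is the share of the model's supported functions that the task actually requires, which reproduces the 50% figure; the function name and the string labels are hypothetical.

```python
def function_overlap(model_functions, task_functions):
    """Share of the model's functions that the task requires (assumed definition)."""
    model_functions = set(model_functions)
    if not model_functions:
        return 0.0
    needed = model_functions & set(task_functions)
    return len(needed) / len(model_functions)


# Example from the text: two supported functions, one of them required.
overlap = function_overlap(
    {"voice_interaction", "text_recognition"},
    {"text_recognition"},
)
print(f"{overlap:.0%}")  # 50%
```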
[0046] Model deployment feasibility refers to the engineering feasibility of deploying the first model to the vehicle-mounted device. For example, if the hard drive space of the vehicle-mounted device is small and cannot meet the space requirements of the first model, it means that the first model cannot be effectively deployed and applied to the vehicle-mounted device.
[0047] Model application value refers to the added value that a model brings when it is deployed and applied to an in-vehicle device. For example, a small model may be unable to interact with the user via voice, whereas a large model deployed to the in-vehicle device can not only interact with the user via voice but also change its tone according to the user's emotions, providing emotional value to the user. This indicates that the application value of the large model is greater than that of the small model.
[0048] Regarding step 120 above, maturity is a specific numerical value. Depending on its value range, it can be used to determine whether the first model can be deployed and applied to in-vehicle equipment.
[0049] For example, the numerical range can be divided into a first range [0, 30], a second range [30, 60], and a third range [60, 90]. The first range corresponds to "not recommended for deployment", the second range corresponds to "can be deployed", and the third range corresponds to "recommended for deployment". When the maturity level is in the third range, it means that after the first model is deployed to the vehicle device, there will be no "overly aggressive" phenomenon.
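A minimal sketch of mapping a maturity value to a recommendation follows. The three ranges as stated share their boundary values 30 and 60; this sketch assumes, purely for illustration, that a boundary value belongs to the higher range. The function name is hypothetical.

```python
def deployment_recommendation(maturity: float) -> str:
    """Map a maturity value in [0, 90] to one of the three recommendations.

    Assumption: the shared boundaries 30 and 60 are assigned to the higher range.
    """
    if not 0 <= maturity <= 90:
        raise ValueError("maturity is expected to lie in [0, 90]")
    if maturity < 30:
        return "not recommended for deployment"
    if maturity < 60:
        return "can be deployed"
    return "recommended for deployment"


print(deployment_recommendation(75))  # recommended for deployment
```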
[0050] This application embodiment quantifies the first model from three dimensions: task adaptability, model deployment feasibility, and model application value. Based on the evaluation results, the model can be reasonably selected in the early stage of model deployment to ensure that the model deployed to the device matches the actual needs and avoids the phenomenon of "overreaching" and wasting computing power.
[0051] The following examples describe in detail how to obtain the aforementioned multidimensional evaluation and quantitative data.
[0052] Figure 2 is a flowchart of the task adaptability determination method provided in the embodiments of this application. The method shown in Figure 2 can specifically include:
[0053] Step 210: Obtain the first task quantity and the second task quantity;
[0054] Step 220: Determine the task scenario scalability score of the first model based on the first task quantity and the second task quantity;
[0055] Step 230: Determine the task adaptability based on the task scenario scalability score.
[0056] Here, the first task quantity is the number of tasks that the first model can handle in the target device, the second task quantity is the number of tasks that the second model can handle in the target device, and at least one of the number of model parameters, the amount of training data, and the amount of computing resources consumed by the second model is less than that of the first model.
[0057] In this embodiment, the first model refers to the large model, and the second model refers to the small model.
[0058] Smaller models offer greater flexibility in deployment and upgrades, demonstrate higher processing efficiency for specific tasks, and provide accurate results. While larger models offer more functionality and can handle more complex task scenarios, their larger model parameters and training data make deployment and upgrades more time-consuming and labor-intensive. Furthermore, larger models require more computing resources (both software and hardware) during task processing.
[0059] Continuing with the example of in-vehicle devices, assume the large model is a visual language model. It can support left-behind reminders for various types of mobile phones and keys inside the vehicle. Traditional small models can only support predefined tasks, and the model needs to be retrained when the task changes.
[0060] In addition, large models have emergent capabilities, enabling them to handle zero-shot and few-shot tasks and to support new tasks without fine-tuning.
[0061] Zero-shot tasks refer to large models predicting or performing tasks on unseen categories / tasks directly through transfer learning, semantic association, or cross-modal knowledge (such as text-image mapping) without any training samples (i.e., zero labeled samples). Few-shot tasks refer to large models quickly learning the patterns of new tasks or categories and performing predictions on a small number of test samples with a very small number of labeled samples (e.g., 1-5).
[0062] Regarding step 210 above, the first task quantity refers to the number of tasks supported by the first model, and the second task quantity refers to the number of tasks supported by the second model.
[0063] Regarding steps 220 and 230 above, the more tasks the model supports, the better the model's scalability to task scenarios, and the higher the task scenario scalability score.
[0064] The first task quantity can be compared with the second task quantity. The larger the difference, the better the application effect of the first model relative to the second model after deployment to the target device, and the higher the task adaptability.
[0065] It should be noted that the task adaptability determination method provided in this application can be executed by an electronic device. The first task quantity and the second task quantity can be calculated by other devices and input into the electronic device, or the user can directly input the first task quantity and the second task quantity into the electronic device based on expert experience, and then the electronic device can complete the task adaptability calculation based on the first task quantity and the second task quantity.
[0066] Traditional model evaluation methods often focus solely on model accuracy while neglecting task adaptability, leading to the blind deployment of large models that ultimately underperform smaller models, resulting in overly ambitious deployments. This application addresses this issue by providing a task adaptability determination method that quantifies the difference in task scenario coverage by utilizing the number of tasks each large and small model can handle. This allows for a precise assessment of the scalability advantages of large models. For example, in scenarios requiring zero-shot support, large models score significantly higher than small models, guiding selection towards highly scalable large models. This solves the most fundamental and critical adaptability problem in large model selection.
[0067] Furthermore, in some embodiments, the task scenario scalability score can be determined through the following steps:
[0068] Step 11: Determine the first quantity weight based on the first task quantity and the second task quantity;
[0069] Step 12: Obtain the preset second quantity weight and initial scene score;
[0070] Step 13: Determine the task scenario scalability score based on the target quantity weight and the initial scene score. The target quantity weight is one of the first quantity weight and the second quantity weight.
[0071] In this embodiment, the user can determine by enumeration the number of task scenarios n2 that the large model can cover, as the first task quantity, and the number of task scenarios n1 that the small model can cover, as the second task quantity.
[0072] In this embodiment, the first quantity weight can be calculated as follows: (n2-n1) / n1.
[0073] In addition, the second quantity weight can be 1, and the target quantity weight can be the smaller of the first and second quantity weights. For example, the scalable scene weight w1 = min((n2-n1) / n1, 1) is calculated and used as the target quantity weight.
[0074] For example, in the car key left-behind reminder task, the small model only supports predefined key types (n1=5), while the large model supports any key type (n2=10), so w1=1.
[0075] In this embodiment, the initial scene score can be 10, and the task scenario scalability score s1 = 10 * w1 is calculated and used as the task adaptability.
[0076] This application embodiment determines the first quantity weight by utilizing the first task quantity and the second task quantity, which enables dynamic adjustment of the weight. This makes the quantified task scenario expansion score more consistent with the actual application scenario, avoids overestimating the scalability advantage of the large model, and thus more accurately matches the actual task requirements.
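Steps 11 to 13 can be sketched as follows, using the values of the key reminder example (small model n1=5, large model n2=10). The function and parameter names are hypothetical. The text states no lower bound for the weight, so this sketch does not clamp it at zero.

```python
def scalability_score(n_small: int, n_large: int,
                      initial_score: float = 10.0,
                      second_weight: float = 1.0) -> float:
    """Task scenario scalability score s1 = initial_score * w1."""
    if n_small <= 0:
        raise ValueError("the small model must cover at least one task scenario")
    # Step 11: first quantity weight, the relative gain in covered scenarios.
    first_weight = (n_large - n_small) / n_small
    # Steps 12-13: the target weight is the smaller of the two weights.
    w1 = min(first_weight, second_weight)
    return initial_score * w1


print(scalability_score(5, 10))  # 10.0, since w1 = min(1.0, 1) = 1
```

Capping the weight at the preset second quantity weight keeps the score within the initial scene score even when the large model covers many times more scenarios than the small one.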
[0077] Furthermore, in practical applications, there may be specific task requirements. For example, after the model outputs a result, it may need to provide an explanation of that output; or the model may need to interact with the user during task execution. Based on this, task suitability can be further determined through the following steps:
[0078] Step 21: If the target device has a first requirement, obtain the interpretability score;
[0079] Step 22: If the target device has a second requirement, obtain the interactivity score;
[0080] Step 23: Determine the task adaptability based on at least one of the interpretability score and the interactivity score, as well as the task scenario scalability score.
[0081] The first requirement is to explain the model's output; the second requirement is for users to interact with the model.
[0082] In this embodiment, taking the large language model as an example, the large language model has reasoning ability and can output the reasoning process in the form of natural language; while the small model is more like a black box, which can only output the result and does not have the interpretability of the output result.
[0083] In addition, large language models output natural language, which can support long contexts and thus have user-friendly interactivity; while small models output vectors or labels, which users cannot understand and are not convenient for user-model interaction.
[0084] Considering the differences between large and small models, the adaptability of large models can be assessed by whether the interpretability of the output results needs to be provided. This will help determine whether to select a large model for deployment on the target device and avoid "overreaching".
[0085] In this embodiment, taking the navigation task in the target device as an example, the first and second requirements mentioned above exist when the target device performs the navigation task. During the navigation process, the user may say, "Please help me navigate to destination A." At this time, the model needs to interact with the user through voice or image prompts to inform them of the specific navigation route, which constitutes the second requirement.
[0086] In addition, after the model navigation routes are formulated, it is also necessary to provide corresponding explanations for each navigation route. For example, navigation route S1 is the closest, while navigation route S2 takes the least time, which means there is a first requirement.
[0087] For step 21 above, the interpretability score can be calculated as s2 = 10 * w2.
[0088] Here, w2 is the interpretability weight. When the model output needs to be explained (i.e., there is a first requirement), the interpretability weight w2=1, and when there is no first requirement, w2=0.
[0089] Regarding step 22 above, the interactivity score can be calculated as s3 = 10 * w3.
[0090] Here, w3 is the interactivity weight. If the model needs to interact with the user (including one or more rounds of interaction), there is a second requirement and the interactivity weight w3=1; if there is no second requirement, then w3=0.
[0091] For step 23 above, the task scenario adaptability score s = s1 + s2 + s3 can be calculated as the task adaptability. The higher the task adaptability, the more strongly deployment of the model on the target device is recommended.
[0092] This application embodiment uses the interpretability score, the interactivity score, and the task scenario scalability score to determine the task adaptability, which can improve the accuracy of large model adaptability evaluation and avoid the problem of traditional model evaluation focusing only on model accuracy while ignoring task adaptability.
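Steps 21 to 23 can be sketched as follows. The function and parameter names are hypothetical, and the scalability score s1 is passed in as an already computed value.

```python
def task_adaptability(s1: float,
                      needs_explanation: bool,
                      needs_interaction: bool,
                      base_score: float = 10.0) -> float:
    """Task adaptability s = s1 + s2 + s3."""
    w2 = 1 if needs_explanation else 0   # first requirement: explain the output
    w3 = 1 if needs_interaction else 0   # second requirement: user interaction
    s2 = base_score * w2                 # interpretability score
    s3 = base_score * w3                 # interactivity score
    return s1 + s2 + s3


# Navigation example: both requirements exist; s1 = 10 is an assumed input.
print(task_adaptability(10.0, True, True))  # 30.0
```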
[0093] Figure 3 is a schematic diagram of the model deployment feasibility determination process provided in the embodiments of this application. As shown in Figure 3, the process includes:
[0094] Step 310: Obtain the first processing rate, the second processing rate, the data transmission rate of the target device, the first model parameter quantity of the first model, and the third model parameter quantity of the third model.
[0095] Here, the first processing rate refers to the task processing rate of the first model, and the second processing rate refers to the task processing rate of the second model. The first model can still refer to the large model mentioned above, and the third model can still refer to the small model mentioned above. The number of model parameters in the first model (i.e., the first model parameter quantity) is greater than the number of model parameters in the third model (i.e., the third model parameter quantity).
[0096] Step 320: Determine the processing rate score based on the first processing rate and the second processing rate.
[0097] Step 330: Determine the data transmission score based on the data transmission rate, the first model parameter quantity, and the third model parameter quantity.
[0098] Step 340: Determine the model deployment feasibility based on at least one of the processing rate score and the data transmission score.
[0099] In real-world automotive applications, deploying large models presents numerous challenges that directly impact task performance. Specifically, large models involve a significant number of parameters, requiring more network bandwidth for updates and greater computational resources and security measures for inference. However, in practical applications, the computing and network resources of automotive devices are limited.
[0100] In this embodiment, in light of the actual situation, two scores, processing rate score and data transmission score, can be selected to determine whether the large model is more suitable for deployment in vehicle-mounted equipment than the small model.
[0101] For step 310 above, the processing rate of the model mainly depends on the following four factors: the number of model parameters, the model precision, the device computing power, and the number of tokens in a single inference.
[0102] This method is executed by an electronic device. The first processing rate, the second processing rate, the data transmission rate (e.g., network bandwidth), the first model parameter quantity, and the third model parameter quantity are parameters that the user can input directly into the electronic device.
[0103] Additionally, in some embodiments, the processing rate of a large model for task requests, i.e., the first processing rate, can be estimated by the electronic device using the formula: first processing rate = computing power of the target device at the model's precision / (2 * number of model parameters * number of tokens). For example, on a target device with an int8 computing power of 10 trillion operations per second, a large model with 7 billion model parameters can process approximately 0.7 requests per second when each request is 1000 tokens long.
[0104] In this embodiment, whether the processing rate meets the task requirements can be estimated from the hardware and software parameters of the task and the expected number of model parameters.
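The estimate in the example above can be reproduced with a short sketch. The function and parameter names are hypothetical; the formula is the one given in the text.

```python
def estimated_request_rate(ops_per_second: float,
                           num_parameters: float,
                           tokens_per_request: float) -> float:
    """Requests per second = computing power / (2 * parameters * tokens)."""
    return ops_per_second / (2 * num_parameters * tokens_per_request)


# 10 trillion int8 operations per second, 7 billion parameters, 1000 tokens.
rate = estimated_request_rate(10e12, 7e9, 1000)
print(round(rate, 2))  # 0.71
```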
[0105] Regarding step 320 above, the difference between the first processing rate and the second processing rate can be compared. If the difference is not significant, it means that the large model and the small model take similar amounts of time to process tasks. Considering that the large model has a large number of model parameters and requires more network bandwidth during updates, the deployment feasibility of the large model will be reduced. In other words, it is not recommended to deploy the large model to the target device.
[0106] Regarding step 330 above, the data transmission rate can refer to the network bandwidth of the target device. When deploying and updating the model, the larger the number of model parameters and the smaller the network bandwidth, the worse the convenience of model deployment and updating will be, which will also reduce the feasibility of deploying large models.
[0107] For example, at 8-bit integer (int8) precision, a model with 7 billion model parameters requires approximately 7 gigabytes of storage. Therefore, updating the model requires transferring at least 7 gigabytes of data. In vehicle scenarios where network conditions are not ideal, large model updates may fail with a high probability, thus reducing the feasibility of deploying large models.
[0108] The lower the feasibility of deploying a large model, the less recommended it is to deploy and apply a large model on the target device.
[0109] In this embodiment, by considering the data transmission rate and the number of model parameters, the time required to transmit the large model and the small model to the target device, as well as the subsequent update time, can be determined. Based on the transmission and update times, a data transmission score can be determined.
[0110] The longer the transmission duration and update duration, the smaller the data transmission score.
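The storage and transmission arithmetic above can be sketched as follows. This section does not give the formula that turns the durations into a data transmission score, so the sketch stops at the duration. The 50 Mbit/s link speed is a hypothetical value chosen for illustration, and decimal gigabytes (1 GB = 10^9 bytes) are assumed.

```python
def model_size_bytes(num_parameters: float, bits_per_parameter: int = 8) -> float:
    """Approximate weight storage; int8 uses one byte per parameter."""
    return num_parameters * bits_per_parameter / 8


def transfer_seconds(size_bytes: float, bandwidth_bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and retries."""
    return size_bytes * 8 / bandwidth_bits_per_second


size = model_size_bytes(7e9)             # 7e9 bytes, about 7 gigabytes
seconds = transfer_seconds(size, 50e6)   # hypothetical 50 Mbit/s link
print(size / 1e9, seconds)               # 7.0 1120.0
```

Running the same calculation for the small model makes the gap in transmission and update time between the two models concrete.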
[0111] For step 340 above, the higher the score, the greater the feasibility of model deployment, indicating that it is more recommended to deploy the large model on the target device.
[0112] This application embodiment determines the deployment feasibility of a large model by combining processing rate score and data transmission score, which solves the problem of traditional model evaluation ignoring engineering constraints. It ensures that the large model can adapt to the computing power and network limitations of the vehicle environment after deployment, avoids performance degradation caused by resource bottlenecks, and improves the application effect of the model.
[0113] Furthermore, in some embodiments, the processing rate score can be determined by the following steps:
[0114] Step 31: Determine the first rate weight based on the first processing rate and the second processing rate;
[0115] Step 32: Obtain the preset initial rate score and second rate weight;
[0116] Step 33: Determine the processing rate score based on the target rate weight and the initial rate score.
[0117] The target rate weight is one of the first rate weight and the second rate weight.
[0118] In this embodiment, the first processing rate v1 of the large model and the second processing rate v2 of the small model can be estimated based on factors such as the number of model parameters and available computing power.
[0119] Specifically, for step 31 above, the first rate weight can be calculated using the formula: v1 / v2.
[0120] For step 32 above, the initial rate score can be set to 10, and the second rate weight can be set to 1.
[0121] Regarding step 33 above, the processing rate weight w4 = min(v1 / v2, 1) is used as the target rate weight. That is, the target rate weight is the smaller of the first rate weight and the second rate weight.
[0122] The processing rate score is then calculated using the formula p1 = 10 * w4.
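Steps 31 to 33 can be sketched as follows. The function and constant names are hypothetical, and the example rates (in arbitrary units such as tasks per second) are illustrative values rather than data from the description.

```python
INITIAL_RATE_SCORE = 10.0   # preset initial rate score (step 32)
SECOND_RATE_WEIGHT = 1.0    # preset second rate weight (step 32)


def processing_rate_score(v1: float, v2: float) -> float:
    """p1 = 10 * min(v1 / v2, 1); v1 is the large-model rate, v2 the small-model rate."""
    if v2 <= 0:
        raise ValueError("v2 must be positive")
    first_rate_weight = v1 / v2                        # step 31
    w4 = min(first_rate_weight, SECOND_RATE_WEIGHT)    # target rate weight
    return INITIAL_RATE_SCORE * w4                     # step 33


print(processing_rate_score(20.0, 50.0))   # large model slower than the small model
print(processing_rate_score(60.0, 50.0))   # large model faster: weight is capped at 1
```

Capping the weight at 1 means a large model that is faster than the small model earns the full 10 points but no more.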
[0123] This application embodiment transforms the abstract device computing power bottleneck into a calculable first rate weight, thereby achieving an accurate assessment of the feasibility of engineering implementation, providing a reasonable basis for early model selection, avoiding performance degradation caused by resource bottlenecks after model deployment, and further improving the application effect after model deployment.
[0124] Furthermore, in some embodiments, the data transmission score can be determined by the following steps:
[0125] Step 41: Determine the first duration based on the data transmission rate and the number of parameters in the first model. The first duration is the time required to transmit the first model to the target device.
[0126] Step 42: Determine the second duration based on the data transmission rate and the number of parameters in the second model. The second duration is the time required to transmit the second model to the target device.
[0127] Step 43: Determine the first transmission weight based on the first duration and the second duration;
[0128] Step 44: Obtain the preset second transmission weight and initial transmission score;
[0129] Step 45: Determine the data transmission score based on the target transmission weight and the initial transmission score. The target transmission weight is one of the first transmission weight and the second transmission weight.
[0130] For steps 41 and 42 above, the first time t1 required for the large model to be transmitted to the target device and the second time t2 required for the small model to be transmitted to the target device can be estimated based on the bandwidth configured in the vehicle and the number of model parameters.
[0131] For step 43 above, the first transmission weight = second duration t2 / first duration t1.
[0132] For step 44 above, the preset second transmission weight is 1, and the initial transmission score is 10.
[0133] For step 45 above, calculate the model transmission weight w5 = min(t2 / t1, 1), which is used as the target transmission weight. That is, the target transmission weight is the smaller of the first transmission weight and the second transmission weight.
[0134] The model transmission score p2 = 10 * w5 is then calculated and serves as the data transmission score.
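Steps 41 to 45 can be sketched as follows. The helper names, the int8 storage assumption (1 byte per parameter), the 700-million-parameter small model, and the 10 Mbit/s bandwidth are hypothetical choices for illustration.

```python
def transfer_duration(num_params: int, bandwidth_bits_per_s: float,
                      bytes_per_param: int = 1) -> float:
    """Steps 41/42: ideal time to transmit a model's weights to the target device."""
    return num_params * bytes_per_param * 8 / bandwidth_bits_per_s


def data_transmission_score(t1: float, t2: float) -> float:
    """p2 = 10 * min(t2 / t1, 1); t1 is the large-model duration, t2 the small-model one."""
    if t1 <= 0:
        raise ValueError("t1 must be positive")
    w5 = min(t2 / t1, 1.0)     # steps 43-45: smaller of first weight and preset weight 1
    return 10.0 * w5


t1 = transfer_duration(7_000_000_000, 10e6)   # assumed large model
t2 = transfer_duration(700_000_000, 10e6)     # assumed small model
print(data_transmission_score(t1, t2))
```

When both models use the same bandwidth and numeric precision, the ratio t2 / t1 reduces to the ratio of parameter counts, so the score falls as the large model grows relative to the small one.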
[0135] This application embodiment quantifies the impact of network conditions on engineering feasibility by calculating the first transmission weight (for example, a 7 gigabyte model update depends on stable bandwidth), avoiding situations such as task interruption due to model update failure after selection, and further improving the application effect after model deployment.
[0136] In some embodiments, if the task requires risk management of the model (i.e., a third requirement), the feasibility of model deployment can be determined through the following steps:
[0137] Step 51: If the target device has a third requirement, obtain the initial risk control score;
[0138] Step 52: Determine the feasibility of model deployment based on at least one of the processing rate score and the data transmission score, as well as the initial risk control score.
[0139] In this embodiment, if the target device has a need for risk management of the model (i.e., a third need), the risk management weight w6 can be configured as 1; if the task does not have a third need, the risk management weight w6 can be configured as 0.
[0140] The initial risk control score can be preset to 10, from which the risk management score p3 = 10 * w6 is calculated.
[0141] In this embodiment, the engineering feasibility score can be calculated as p = p1 + p2 + p3 to represent the feasibility of model deployment.
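A minimal sketch of steps 51 and 52 combined with the two earlier scores is given below. The function names and example inputs are hypothetical; the formulas p3 = 10 * w6 and p = p1 + p2 + p3 are taken from this embodiment.

```python
INITIAL_RISK_CONTROL_SCORE = 10.0   # preset initial risk control score


def risk_management_score(has_third_requirement: bool) -> float:
    """p3 = 10 * w6, with w6 = 1 if risk management is required and 0 otherwise."""
    w6 = 1.0 if has_third_requirement else 0.0
    return INITIAL_RISK_CONTROL_SCORE * w6


def deployment_feasibility(p1: float, p2: float, has_third_requirement: bool) -> float:
    """p = p1 + p2 + p3, at most 30 points."""
    return p1 + p2 + risk_management_score(has_third_requirement)


# Example inputs: p1 = 4.0 and p2 = 1.0 are illustrative values only.
print(deployment_feasibility(4.0, 1.0, has_third_requirement=True))
```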
[0142] In this embodiment, model risk management requirements refer to a set of systematic and proactive management requirements, processes and measures established throughout the entire lifecycle of a model (especially a large model) in order to prevent, identify, assess, monitor and mitigate various potential hazards and negative impacts that the model may bring.
[0143] Smaller models are trained on specific datasets, have weaker generalization ability, and produce output that largely stays within the scope of the training set. Larger models, on the other hand, have stronger generalization ability but are susceptible to hallucination, and their output is easily influenced by user guidance. Therefore, in practical applications, larger models require more security control measures than smaller models.
[0144] This application embodiment obtains an initial risk control score to determine the feasibility of model deployment. In this way, the tendency to select large models can be suppressed in high-risk scenarios, thereby avoiding security problems caused by blindly selecting large models and improving the application effect after model deployment.
[0145] Figure 4 is a schematic diagram of the model application value determination process provided in the embodiments of this application. As shown in Figure 4, the process includes:
[0146] Step 410: Obtain the first technical indicator, the second technical indicator, the first number of operations, the second number of operations, and the emotional satisfaction.
[0147] Here, the first technical indicator is the task completion indicator of the first model, the second technical indicator is the task completion indicator of the fourth model, the first number of operations is the number of user operation steps when the first model processes the task, the second number of operations is the number of user operation steps when the fourth model processes the task, and the emotional satisfaction is the amount of emotional information output to the user by the first model when processing the task. The first model can still refer to the large model mentioned above, and the fourth model can refer to the small model mentioned above, where at least one of the following is less than that of the first model: number of model parameters, amount of training data, and computational resource consumption.
[0148] Step 420: Determine the target indicator score based on the first technical indicator and the second technical indicator.
[0149] Step 430: Determine the target operation score based on the first operation quantity and the second operation quantity.
[0150] Step 440: Determine the target emotional score based on emotional satisfaction.
[0151] Step 450: Determine the application value of the model based on at least one of the target indicator score, target operation score, and target emotional score.
[0152] In this embodiment, the main purpose of determining the model application value is to establish whether replacing the small model with a large model brings incremental business value, so as to avoid blindly pursuing the large model without understanding the capability boundaries of the large model and the small model, thus avoiding the phenomenon of "overreaching".
[0153] In traditional technologies, when deploying large models to in-vehicle devices, the model evaluation methods mainly focus on inference metrics, such as whether inference speed and accuracy have improved, lacking a systematic evaluation of whether "the task requires a large model." In this embodiment, however, by considering three aspects—technical indicators, the number of user-required steps, and emotional satisfaction—a comprehensive evaluation of the application value of large models can be achieved.
[0154] For step 410 above, different tasks have different task completion metrics. For example, for classification tasks, the corresponding task completion metric is accuracy, while for retrieval (recall) tasks, the corresponding task completion metric is recall rate.
[0155] In addition, taking the example of a driver wanting to navigate to location C, for the large model, after the driver says "go to location C", the large model can directly give the navigation result and start navigation. The only operation steps the driver needs to perform in the whole process are to say this sentence.
[0156] For the small model, it needs to first convert speech into text and perform text recognition, and then interact with the driver to determine whether location C is the destination. At this point, the driver needs to perform a further confirmation operation. After the small model determines the destination, it may also generate several navigation routes for the driver to choose from. Thus, the driver also needs to perform a selection operation. The whole process involves a large number of operation steps.
[0157] Regarding emotional satisfaction, large models can identify a user's current emotion based on the tone of their voice. For example, if a user sounds like they are crying, a large model can recognize that the user is sad and can provide comforting language during the voice interaction (i.e., outputting emotional information to the user). Small models, on the other hand, only provide corresponding voice responses to voice questions without carrying emotional information. Therefore, large models are better able to provide emotional value to users, resulting in greater emotional satisfaction than small models.
[0158] Regarding steps 420 and 430 above, a target indicator score can be given based on the difference between the first technical indicator and the second technical indicator. Similarly, a target operation score can be given based on the difference between the first operation quantity and the second operation quantity.
[0159] Regarding step 440 above, if the large model can indeed provide emotional value to the user when processing the task, a target emotional score greater than zero can be configured for it; if it cannot provide emotional value, that is, the emotional satisfaction is zero, then the corresponding target emotional score is also zero.
[0160] This application's embodiments determine the model's application value by using three metrics: target indicator score, target operation score, and target emotional score, thus achieving a precise evaluation of the model's actual user experience. For example, in scenarios where emotional value is prioritized (such as natural language interaction scenarios), the selection is guided towards models with high emotional value, which can improve the accuracy of early model selection and further ensure the application effect after model deployment.
[0161] Furthermore, the target indicator score can be determined through the following steps:
[0162] Step 61: Determine the weight of the first indicator based on the first technical indicator and the second technical indicator;
[0163] Step 62: Obtain the preset initial indicator score;
[0164] Step 63: Determine the target indicator score based on the weight of the first indicator and the initial indicator score.
[0165] For step 61 above, calculate the technical indicator weight w7=(va1-va2) / va2, and use it as the first indicator weight.
[0166] For step 62 above, the initial indicator score can be set to 10.
[0167] For step 63 above, the technical indicator score vs1 = 10 * w7 can be used as the target indicator score.
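Steps 61 to 63 can be sketched as follows. The function name and the accuracy values 0.95 and 0.90 are hypothetical. Note that the formula as written is not capped: the score exceeds 10 if the large model more than doubles the indicator and is negative if the large model performs worse; this sketch implements the formula unchanged.

```python
INITIAL_INDICATOR_SCORE = 10.0   # preset initial indicator score (step 62)


def technical_indicator_score(va1: float, va2: float) -> float:
    """vs1 = 10 * (va1 - va2) / va2; va1 is the large-model metric, va2 the small-model one."""
    if va2 <= 0:
        raise ValueError("va2 must be positive")
    w7 = (va1 - va2) / va2                 # step 61: relative improvement
    return INITIAL_INDICATOR_SCORE * w7    # step 63


# Illustrative accuracies: 0.95 for the large model, 0.90 for the small model.
print(round(technical_indicator_score(0.95, 0.90), 3))
```

A relative gain of roughly 5.6 percent contributes only about half a point, which reflects the intent that marginal metric gains should not by themselves justify a large model.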
[0168] This application embodiment determines the weight of the first indicator by comparing the technical indicators of the large model and the small model, and determines the target indicator score based on this. This can measure the differences between the large model and the small model in technical indicators such as accuracy and recall, providing more scientific guidance for the early model selection, and avoiding the problem of blindly choosing the large model in scenarios where the technical indicators are only slightly improved, thus causing a waste of resources.
[0169] Furthermore, evaluating a model cannot rely solely on improvements in technical metrics (such as accuracy), as small improvements in technical metrics in automotive mass production may not have a significant impact on user experience. Therefore, in some embodiments, the target operation score can be further determined through the following steps:
[0170] Step 71: Determine the weight of the first operation quantity based on the first operation quantity and the second operation quantity;
[0171] Step 72: Obtain the preset second operation quantity weight and initial operation score;
[0172] Step 73: Determine the target operation score based on the target operation quantity weight and the initial operation score. The target operation quantity weight is one of the first operation quantity weight and the second operation quantity weight.
[0173] For step 71 above, calculate the weight of the first operation quantity = first operation quantity step1 / second operation quantity step2.
[0174] For step 72 above, the weight of the second operation quantity can be set to 1, and the initial operation score can be set to 10.
[0175] For step 73 above, the smaller of the first operation quantity weight and the second operation quantity weight, min(step1 / step2, 1), is taken and subtracted from 1 to give the functional efficiency weight w8 = 1 - min(step1 / step2, 1), which is used as the target operation quantity weight.
[0176] The functional efficiency score vs2 = 10 * w8 is further calculated and used as the target operation score.
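Steps 71 to 73 can be sketched as follows. The function name is hypothetical; the example uses the reduction from 3 operation steps to 1 operation step mentioned in this embodiment.

```python
INITIAL_OPERATION_SCORE = 10.0   # preset initial operation score (step 72)


def operation_score(step1: int, step2: int) -> float:
    """vs2 = 10 * (1 - min(step1 / step2, 1)).

    step1: user operation steps with the large model;
    step2: user operation steps with the small model.
    """
    if step2 <= 0:
        raise ValueError("step2 must be positive")
    w8 = 1.0 - min(step1 / step2, 1.0)    # fraction of steps saved, never negative
    return INITIAL_OPERATION_SCORE * w8


print(round(operation_score(1, 3), 2))   # 3 steps reduced to 1
print(operation_score(3, 3))             # no reduction scores zero
```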
[0177] This application embodiment determines the target operation score by comparing the number of operation steps between large and small models, which can achieve a quantitative evaluation of the improvement in model function efficiency (e.g., simplifying from 3 operation steps to 1 operation step). This allows the actual user experience to be considered when selecting a model, further improving the application effect after the model is deployed.
[0178] In some embodiments, the target emotional score can be determined through the following steps:
[0179] Step 81: Determine the target emotional weight based on emotional satisfaction;
[0180] Step 82: Obtain the preset initial emotion score;
[0181] Step 83: Determine the target emotional score based on the target emotional weight and the initial emotional score.
[0182] If the model can bring emotional value to the user, such as outputting voice with varying tone, then the emotional satisfaction value is greater than zero, and the target emotional weight w9=1 is given; if the model cannot bring emotional value to the user, then the emotional satisfaction value is equal to zero, and the target emotional weight w9=0 is given.
[0183] In this embodiment, the preset initial emotion score can be 10.
[0184] In this embodiment, the emotional value score vs3 = 10 * w9 is calculated as the target emotional score.
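Steps 81 to 83 reduce to a binary rule, sketched below. The function name is hypothetical, and representing emotional satisfaction as a non-negative number is an assumption consistent with the "greater than zero" and "equal to zero" cases described above.

```python
INITIAL_EMOTION_SCORE = 10.0   # preset initial emotion score (step 82)


def emotional_score(emotional_satisfaction: float) -> float:
    """vs3 = 10 * w9, with w9 = 1 if emotional satisfaction > 0 and 0 otherwise."""
    w9 = 1.0 if emotional_satisfaction > 0 else 0.0   # step 81
    return INITIAL_EMOTION_SCORE * w9                 # step 83


print(emotional_score(3))   # model outputs some emotional information
print(emotional_score(0))   # model outputs none
```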
[0185] This application embodiment determines the target emotional score by judging whether the model can bring emotional value to the user, thereby realizing a quantitative assessment of the user's emotional needs for the model. This allows the user's emotional needs to be considered when selecting a model, ensuring that the model can improve the user's emotional satisfaction after deployment.
[0186] In other embodiments, the large model maturity score can be calculated as score = s + p + vs; then the large model maturity score can be normalized as score_normal = score / 90; finally, model technology selection can be performed based on the normalized large model maturity score.
[0187] Take the application of the above model evaluation method to the evaluation of vehicle-mounted large models as an example. The vehicle field lacks clear evaluation standards to guide "what scale of model to use in what scenario to solve what problem", which has led to "overreaching" with large models: after large models replace traditional small models, the demand for software and hardware resources increases significantly, yet no actual value increment is obtained and the small models' advantages in flexible deployment and upgrading are lost. Figure 5 is a flowchart of the vehicle-mounted large model evaluation provided in the embodiments of this application. As shown in Figure 5, the large model maturity assessment includes three parts: task scenario adaptability, engineering implementation feasibility, and value increment. Each part has a total score of 30 points, and each sub-item is worth 10 points. The specific assessment process for the three parts is as follows:
[0188] (1) The evaluation of task suitability score is as follows:
[0189] ① Based on the task requirements, enumerate the number of scenarios n1 and n2 that the current model can cover;
② Calculate the scalable scenario weight w1 = min((n2-n1) / n1, 1);
③ Calculate the scalable scenario score s1 = 10 * w1;
④ Clarify whether the model's inference results need to be explained. If explanation is required, the interpretability weight w2 = 1; otherwise, w2 = 0;
⑤ Calculate the interpretability score s2 = 10 * w2;
⑥ Clarify whether the model needs to interact with the user in multiple rounds. If interaction is required, the interactivity weight w3 = 1; otherwise, w3 = 0;
⑦ Calculate the interaction feasibility score s3 = 10 * w3;
⑧ Calculate the task scenario adaptability score s = s1 + s2 + s3.
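The eight sub-steps above can be sketched as a single function. The function and parameter names are hypothetical, and n1 and n2 are used exactly as they appear in the formula, without assuming here which model each count belongs to. The example counts 4 and 6 are illustrative.

```python
def task_adaptability_score(n1: int, n2: int,
                            needs_explanation: bool,
                            needs_multi_round: bool) -> float:
    """s = s1 + s2 + s3, following sub-steps 1 to 8 of part (1)."""
    if n1 <= 0:
        raise ValueError("n1 must be positive")
    w1 = min((n2 - n1) / n1, 1.0)             # scalable scenario weight
    s1 = 10.0 * w1                            # scalable scenario score
    w2 = 1.0 if needs_explanation else 0.0    # interpretability weight
    s2 = 10.0 * w2                            # interpretability score
    w3 = 1.0 if needs_multi_round else 0.0    # interactivity weight
    s3 = 10.0 * w3                            # interaction feasibility score
    return s1 + s2 + s3


print(task_adaptability_score(4, 6, needs_explanation=True, needs_multi_round=False))
```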
[0190] (2) The assessment of the feasibility of project implementation (i.e., the feasibility of model deployment) is as follows:
[0191] ① Estimate the model parameters, available computing power, and other factors, and calculate the processing rates v1 and v2 of the large and small models respectively according to the formula;
② Calculate the processing rate weight w4 = min(v1 / v2, 1);
③ Calculate the processing rate score p1 = 10 * w4;
④ Estimate the transmission durations t1 and t2 of the large and small models based on the bandwidth configured for the vehicle and the number of model parameters, and calculate the model transmission weight w5 = min(t2 / t1, 1);
⑤ Calculate the model transmission score p2 = 10 * w5;
⑥ Clarify the model risk management requirements. If risk management is required, the risk management weight w6 = 1; otherwise, w6 = 0;
⑦ Calculate the risk management score p3 = 10 * w6;
⑧ Calculate the project feasibility score p = p1 + p2 + p3.
[0192] (3) The evaluation of value increment (i.e., the application value of the model) is as follows:
[0193] ① Calculate the technical indicators va1 and va2 of the large and small models, and calculate the weight of the technical indicators w7 = (va1 - va2) / va2;
② Calculate the technical indicator score vs1 = 10 * w7;
③ Calculate the number of steps required for user active operation step1 and step2 when using the large and small models respectively, and calculate the functional efficiency weight w8 = 1 - min(step1 / step2, 1);
④ Calculate the functional efficiency score vs2 = 10 * w8;
⑤ Calculate the functional emotional value weight. If the large model can bring emotional value to the user, then w9 = 1; otherwise, w9 = 0;
⑥ Calculate the emotional value score vs3 = 10 * w9;
⑦ Calculate the value increment score vs = vs1 + vs2 + vs3.
[0194] After completing the evaluation of the above three parts, the final large model maturity score is calculated as score = s + p + vs; then the large model maturity score is normalized as score_normal = score / 90; and model technology selection is carried out based on the normalized large model maturity score.
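The final aggregation can be sketched as follows. The function name and the three example part scores are hypothetical. The description does not state a decision threshold for model selection, so none is assumed here.

```python
MAX_TOTAL_SCORE = 90.0   # three parts of 30 points each


def normalized_maturity(s: float, p: float, vs: float) -> float:
    """score_normal = (s + p + vs) / 90.

    s: task scenario adaptability score;
    p: engineering feasibility score;
    vs: value increment score.
    """
    score = s + p + vs
    return score / MAX_TOTAL_SCORE


# Illustrative part scores only.
print(normalized_maturity(15.0, 15.0, 15.0))
```

Dividing by 90 places the result on a common scale so that candidate deployments for different tasks can be compared directly.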
[0195] Traditional technologies can only assess the hardware and software capabilities when using large models and perform grading, but cannot assess whether a particular task is suitable for using a large model, making it difficult to prevent the problem of overly aggressive application of large models. This embodiment evaluates the adaptability of large models to actual tasks, the engineering feasibility of using large models, and whether large models bring incremental value, and quantifies the results to provide model selection guidance for relevant decision-makers. This can drive resources toward directions that truly create value and avoid the phenomenon of "overly aggressive" application.
[0196] Figure 6 is a schematic diagram of the structure of the model evaluation device provided in this application. As shown in Figure 6, the model evaluation device 60 provided in this embodiment includes:
[0197] The acquisition module 601 is used to acquire multidimensional evaluation quantitative data. This multidimensional evaluation quantitative data includes at least one of the following: the task adaptability of the first model, the model deployment feasibility of deploying the first model to the target device, and the model application value generated by deploying the first model to the target device.
[0198] The determination module 602 is used to determine the maturity of the first model deployment applied to the target device based on multidimensional evaluation quantitative data.
[0199] In one possible implementation, the acquisition module can specifically be used to: acquire the first number of tasks and the second number of tasks; determine the task scenario scalability score of the first model based on the first number of tasks and the second number of tasks; and determine the task adaptability based on the task scenario scalability score. Here, the first number of tasks refers to the number of tasks the first model can handle on the target device, the second number of tasks refers to the number of tasks the second model can handle on the target device, and at least one of the following is smaller for the second model than for the first model: parameter count, training data volume, and computational resource consumption.
[0200] In one possible implementation, the acquisition module can specifically be used to: determine a first quantity weight based on the first number of tasks and the second number of tasks; acquire a preset second quantity weight and an initial scene score; and determine a task scene scalability score based on a target quantity weight and the initial scene score. The target quantity weight is one of the first quantity weight and the second quantity weight.
[0201] In one possible implementation, the acquisition module can specifically be used to: acquire an interpretability score if the target device has a first requirement; acquire an interactivity score if the target device has a second requirement; and determine task suitability based on at least one of the interpretability score and the interactivity score, and a task scenario extensibility score. Here, the first requirement is the need to explain the model's output, and the second requirement is the need for the user to interact with the model.
[0202] In one possible implementation, the acquisition module can specifically be used to: acquire a first processing rate, a second processing rate, a data transmission rate of the target device, a first model parameter quantity of a first model, and a third model parameter quantity of a third model; then determine a processing rate score based on the first processing rate and the second processing rate; and determine a data transmission score based on the data transmission rate, the first model parameter quantity, and the third model parameter quantity; and finally determine the feasibility of model deployment based on at least one of the processing rate score and the data transmission score.
[0203] Wherein, the first processing rate is the task processing rate of the first model, the second processing rate is the task processing rate of the third model, and the number of parameters of the first model is greater than the number of parameters of the third model.
[0204] In one possible implementation, the acquisition module can specifically be used to: determine a first rate weight based on a first processing rate and a second processing rate; acquire a preset initial rate score and a second rate weight; and determine a processing rate score based on a target rate weight and the initial rate score. The target rate weight is one of the first rate weight and the second rate weight.
[0205] In one possible implementation, the acquisition module can specifically be used to: determine a first duration based on the data transmission rate and the number of parameters in the first model; then determine a second duration based on the data transmission rate and the number of parameters in the second model; and determine a first transmission weight based on the first and second durations; finally, acquire a preset second transmission weight and an initial transmission score; and determine a data transmission score based on the target transmission weight and the initial transmission score. Here, the first duration is the time required to transmit the first model to the target device, the second duration is the time required to transmit the second model to the target device, and the target transmission weight is one of the first and second transmission weights.
[0206] In one possible implementation, the acquisition module can specifically be used to: acquire an initial risk control score if the target device has a third requirement; and determine the feasibility of model deployment based on at least one of a processing rate score and a data transmission score, as well as the initial risk control score. Here, the third requirement refers to the requirement for risk management of the model.
[0207] In one possible implementation, the acquisition module can specifically be used to: acquire a first technical indicator, a second technical indicator, a first number of operations, a second number of operations, and emotional satisfaction; determine a target indicator score based on the first and second technical indicators; determine a target operation score based on the first and second number of operations; determine a target emotional score based on emotional satisfaction; and determine the model application value based on at least one of the target indicator score, target operation score, and target emotional score. Here, the first technical indicator is the task completion indicator of the first model, the second technical indicator is the task completion indicator of the fourth model, the first number of operations is the number of user operation steps when the first model processes the task, the second number of operations is the number of user operation steps when the fourth model processes the task, and emotional satisfaction is the amount of emotional information output to the user when the first model processes the task. At least one of the following is smaller for the fourth model than for the first model: parameter count, training data volume, and computational resource consumption.
[0208] In one possible implementation, the acquisition module can be specifically used to: determine the weight of the first indicator based on the first technical indicator and the second technical indicator; then obtain a preset initial indicator score; and determine the target indicator score based on the weight of the first indicator and the initial indicator score.
[0209] In one possible implementation, the acquisition module can specifically be used to: determine a first operation quantity weight based on a first operation quantity and a second operation quantity; acquire a preset second operation quantity weight and an initial operation score; and determine a target operation score based on a target operation quantity weight and the initial operation score. The target operation quantity weight is one of the first operation quantity weight and the second operation quantity weight.
[0210] In one possible implementation, the acquisition module can be specifically used to: determine the target emotional weight based on the emotional satisfaction level; and obtain a preset initial emotional score; and determine the target emotional score based on the target emotional weight and the initial emotional score.
[0211] The model evaluation device provided in this embodiment can execute the method provided in the above method embodiment. Its implementation principle and technical effect are similar, and will not be described in detail here.
[0212] Figure 7 is a schematic diagram of the structure of the electronic device provided in this application. As shown in Figure 7, the electronic device 70 provided in this embodiment includes at least one processor 701 and a memory 702. Optionally, the device 70 further includes a communication component 703. The processor 701, memory 702, and communication component 703 are connected via a bus.
[0213] In a specific implementation, the at least one processor 701 executes computer-executable instructions stored in memory 702, causing the at least one processor 701 to perform the above-described method.
[0214] The specific implementation process of processor 701 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.
[0215] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.
[0216] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.
[0217] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0218] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0219] This application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the above-described method.
[0220] The aforementioned readable storage medium can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The readable storage medium can be any available medium accessible to a general-purpose or special-purpose computer.
[0221] An exemplary readable storage medium is coupled to a processor, enabling the processor to read information from and write information to the readable storage medium. Of course, the readable storage medium can also be a component of the processor. The processor and the readable storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and the readable storage medium can exist as discrete components in the device.
[0222] The division of units is merely a division by logical function; in actual implementation, other divisions are possible. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0223] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0224] In addition, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0225] If a function is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0226] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be completed by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.
[0227] Finally, it should be noted that other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein, and is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims
1. A model evaluation method, characterized by comprising: obtaining multidimensional evaluation quantitative data, which includes at least one of the following: the task adaptability of a first model, the model deployment feasibility of deploying the first model to a target device, and the model application value generated by deploying the first model to the target device; and determining, based on the multidimensional evaluation quantitative data, the maturity of deploying and applying the first model on the target device.
2. The method according to claim 1, characterized in that obtaining the task adaptability includes: obtaining a first number of tasks and a second number of tasks, wherein the first number of tasks is the number of tasks that the first model can handle in the target device, the second number of tasks is the number of tasks that a second model can handle in the target device, and at least one of the number of model parameters, the amount of training data, and the amount of computing resources consumed by the second model is less than that of the first model; determining a task scenario scalability score of the first model based on the first number of tasks and the second number of tasks; and determining the task adaptability based on the task scenario scalability score.
3. The method according to claim 2, characterized in that determining the task scenario scalability score of the first model based on the first number of tasks and the second number of tasks includes: determining a first quantity weight based on the first number of tasks and the second number of tasks; obtaining a preset second quantity weight and a preset initial scenario score; and determining the task scenario scalability score based on a target quantity weight and the initial scenario score, wherein the target quantity weight is one of the first quantity weight and the second quantity weight.
4. The method according to claim 2, characterized in that determining the task adaptability based on the task scenario scalability score includes: if the target device has a first requirement, obtaining an interpretability score, wherein the first requirement is a requirement to explain the model output results; if the target device has a second requirement, obtaining an interactivity score, wherein the second requirement is a requirement for the user to interact with the model; and determining the task adaptability based on the task scenario scalability score and at least one of the interpretability score and the interactivity score.
5. The method according to claim 1, characterized in that obtaining the model deployment feasibility includes: obtaining a first processing rate, a second processing rate, a data transmission rate of the target device, a first model parameter quantity of the first model, and a third model parameter quantity of a third model, wherein the first processing rate is the task processing rate of the first model, the second processing rate is the task processing rate of the third model, and the first model parameter quantity is greater than the third model parameter quantity; determining a processing rate score based on the first processing rate and the second processing rate; determining a data transmission score based on the data transmission rate, the first model parameter quantity, and the third model parameter quantity; and determining the model deployment feasibility based on at least one of the processing rate score and the data transmission score.
6. The method according to claim 5, characterized in that determining the processing rate score based on the first processing rate and the second processing rate includes: determining a first rate weight based on the first processing rate and the second processing rate; obtaining a preset initial rate score and a preset second rate weight; and determining the processing rate score based on a target rate weight and the initial rate score, wherein the target rate weight is one of the first rate weight and the second rate weight.
7. The method according to claim 5, characterized in that determining the data transmission score based on the data transmission rate, the first model parameter quantity, and the third model parameter quantity includes: determining a first duration based on the data transmission rate and the first model parameter quantity, wherein the first duration is the time required to transmit the first model to the target device; determining a second duration based on the data transmission rate and the third model parameter quantity, wherein the second duration is the time required to transmit the third model to the target device; determining a first transmission weight based on the first duration and the second duration; obtaining a preset second transmission weight and a preset initial transmission score; and determining the data transmission score based on a target transmission weight and the initial transmission score, wherein the target transmission weight is one of the first transmission weight and the second transmission weight.
8. The method according to any one of claims 5-7, characterized in that obtaining the model deployment feasibility further includes: if the target device has a third requirement, obtaining an initial risk control score, wherein the third requirement is a requirement for risk control of the model; and determining the model deployment feasibility based on the initial risk control score and at least one of the processing rate score and the data transmission score.
9. The method according to claim 1, characterized in that obtaining the model application value includes: obtaining a first technical indicator, a second technical indicator, a first operation quantity, a second operation quantity, and an emotional satisfaction, wherein the first technical indicator is the task completion indicator of the first model, the second technical indicator is the task completion indicator of a fourth model, the first operation quantity is the number of user operation steps when the first model processes a task, the second operation quantity is the number of user operation steps when the fourth model processes the task, the emotional satisfaction is the amount of emotional information output to the user when the first model processes the task, and at least one of the number of model parameters, the amount of training data, and the amount of computing resources consumed by the fourth model is less than that of the first model; determining a target indicator score based on the first technical indicator and the second technical indicator; determining a target operation score based on the first operation quantity and the second operation quantity; determining a target emotional score based on the emotional satisfaction; and determining the model application value based on at least one of the target indicator score, the target operation score, and the target emotional score.
10. The method according to claim 9, characterized in that determining the target indicator score based on the first technical indicator and the second technical indicator includes: determining a first indicator weight based on the first technical indicator and the second technical indicator; obtaining a preset initial indicator score; and determining the target indicator score based on the first indicator weight and the initial indicator score.
11. The method according to claim 9, characterized in that determining the target operation score based on the first operation quantity and the second operation quantity includes: determining a first operation quantity weight based on the first operation quantity and the second operation quantity; obtaining a preset second operation quantity weight and a preset initial operation score; and determining the target operation score based on a target operation quantity weight and the initial operation score, wherein the target operation quantity weight is one of the first operation quantity weight and the second operation quantity weight.
12. The method according to claim 9, characterized in that determining the target emotional score based on the emotional satisfaction includes: determining a target emotional weight based on the emotional satisfaction; obtaining a preset initial emotional score; and determining the target emotional score based on the target emotional weight and the initial emotional score.
13. A model evaluation apparatus, characterized by comprising: an acquisition module, configured to obtain multidimensional evaluation quantitative data, which includes at least one of the following: the task adaptability of a first model, the model deployment feasibility of deploying the first model to a target device, and the model application value generated by deploying the first model to the target device; and a determination module, configured to determine, based on the multidimensional evaluation quantitative data, the maturity of deploying and applying the first model on the target device.
14. An electronic device, characterized by comprising: a memory and a processor; wherein the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method according to any one of claims 1-12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1-12.
16. A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
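Illustrative note (not part of the claims): the weight-and-score pattern recited in the claims can be sketched in Python for the data transmission score and the overall maturity. The claims do not fix any formula, so everything concrete here is an assumption made only for illustration: the function names, the fixed `bytes_per_param` size, the duration ratio used as the first transmission weight, the rule of taking the larger weight as the target weight, and the unweighted mean used for maturity.

```python
from statistics import mean


def transfer_duration(param_count, rate_bytes_per_s, bytes_per_param=2.0):
    """Seconds needed to send a model, assuming a fixed size per parameter."""
    return param_count * bytes_per_param / rate_bytes_per_s


def data_transmission_score(rate_bytes_per_s, large_params, small_params,
                            second_weight=0.5, initial_score=100.0):
    """Score the cost of shipping the large model relative to a smaller one."""
    first_duration = transfer_duration(large_params, rate_bytes_per_s)
    second_duration = transfer_duration(small_params, rate_bytes_per_s)
    # Assumption: the first transmission weight is the ratio of durations,
    # so a large model that transfers much more slowly gets a lower weight.
    first_weight = second_duration / first_duration
    # Assumption: the target weight is the larger of the computed weight and
    # the preset weight, so the preset value acts as a floor.
    target_weight = max(first_weight, second_weight)
    return target_weight * initial_score


def maturity(dimension_scores):
    """Assumption: maturity is the unweighted mean of the supplied scores."""
    if not dimension_scores:
        raise ValueError("at least one evaluation dimension is required")
    return mean(dimension_scores.values())


if __name__ == "__main__":
    # Hypothetical inputs: 1 MB/s link, 7e9 vs 7e8 parameters.
    score = data_transmission_score(1e6, 7e9, 7e8)
    print(score)
    print(maturity({"deployment_feasibility": score, "task_adaptability": 70.0}))
```

The same shape (a computed weight, a preset fallback weight, and a preset initial score) would apply to the processing rate, task scenario scalability, and operation scores; only the inputs to the computed weight differ.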