A method and apparatus for evaluating the inference performance of large models

By monitoring the execution status of large model load testing tasks, performance, stability, and cost data are automatically acquired and integrated, solving the problem of incomplete evaluation in existing evaluation methods. This enables a comprehensive performance evaluation of large language model inference services and provides accurate decision-making basis.

CN122309318A · Pending · Publication Date: 2026-06-30 · ANQING (TIANJIN) COMPUTER CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ANQING (TIANJIN) COMPUTER CO LTD
Filing Date: 2026-06-03
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing large language model evaluation methods only focus on a single performance indicator and cannot comprehensively evaluate the overall performance of inference services, including cost-effectiveness and operational stability. This results in a lack of data support and inaccurate evaluation in hardware procurement decisions.

Method used

By monitoring the execution status of large model stress testing tasks, the system automatically acquires inference performance index data, time-series monitoring data, and static asset cost information, calculates performance scores, stability scores, and cost scores respectively, and integrates them into a comprehensive performance score, providing a unified quantitative evaluation framework.

Benefits of technology

It enables a comprehensive quantitative evaluation of the response quality, operational reliability, and economic benefits of inference services, solving the problems of incomplete evaluation and lack of data verifiability in existing evaluation schemes, and providing a unified basis for decision-making.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a method and apparatus for evaluating the performance of large-scale model inference, belonging to the field of artificial intelligence. The method includes: monitoring the execution status of the large-scale model load testing task; acquiring, based on the execution status, inference performance index data of the load testing task, time-series monitoring data of the large-scale model deployment hardware devices within the corresponding time period of the load testing task, and static asset cost information of the large-scale model deployment hardware devices; calculating a performance score based on the inference performance index data; calculating a stability score based on the time-series monitoring data; calculating a cost score based on the static asset cost information; and integrating the performance score, the stability score, and the cost score to calculate a comprehensive performance score for the large-scale model inference. The large-scale model inference performance evaluation method and apparatus provided in this application can comprehensively evaluate the overall performance of inference services, including performance indicators, cost-effectiveness, and operational stability.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence, and in particular to a method and apparatus for evaluating the performance of large model inference.

Background Technology

[0002] Performance evaluation of large language model inference services is a crucial basis for model selection, hardware procurement, and deployment optimization. In a typical evaluation process, the load testing client initiates concurrent requests to the target inference service and records end-to-end performance metrics such as the first-token latency compliance rate (TTFT compliance rate), the per-token output time compliance rate (TPOT compliance rate), and the request error rate, and then generates a load testing report. At the same time, operations personnel use a monitoring system to collect operating parameters such as GPU utilization, memory bandwidth utilization, and power consumption during the load testing period, and manually compare these parameters with the load testing report to determine whether performance bottlenecks exist. In addition, economic factors such as hardware procurement costs and power consumption are typically taken into account in procurement decisions, but this cost data is often calculated separately, outside the evaluation process, using spreadsheets or offline scripts.

[0003] From a practical perspective, the above process has several problems. First, the evaluation dimensions are incomplete. Existing load testing tools and monitoring systems cover the performance dimension and the hardware operating status dimension, respectively, but neither incorporates cost factors into a unified quantitative analysis framework. Operations personnel cannot directly obtain from the evaluation results a cost-effectiveness ranking of the hardware solutions that meet the performance requirements; procurement decisions still rely on manual estimation and lack data support. Second, bottleneck diagnosis relies on manual experience. When inference performance does not meet expectations, operations personnel need to manually compare the latency and total inference service throughput data in the load testing report with the memory bandwidth utilization, compute utilization, and other indicator curves in the monitoring system to determine whether the current configuration is limited by memory bandwidth, computing units, or other factors. This judgment relies heavily on personal experience, making it difficult to guarantee the accuracy and consistency of the conclusions. Third, data traceability is difficult. Monitoring data is subject to retention periods, and data expires even faster in laboratory environments. Hardware configurations and prices are also adjusted frequently. When historical evaluation conclusions are reviewed several months later, the original monitoring data and the cost parameters in effect at that time are often unavailable, so the report's conclusions cannot be verified, and auditing and review become very difficult. Fourth, as large model inference services evolve towards heterogeneous computing clusters, the same model often needs to be deployed on hardware devices with different physical characteristics.
Existing evaluation methods for measuring the operational stability of inference services typically employ uniform static scoring rules and fail to differentiate between the fault tolerance capabilities of different hardware devices. Data center-grade graphics cards are usually equipped with error-correcting memory and high-speed interconnect links, so their latency jitter baseline under high concurrency is relatively low, and a fixed scoring standard may mask potential hardware faults. Consumer-grade or workstation-grade graphics cards often rely on standard bus interconnects and lack comprehensive error correction protection, so their normal minor fluctuations may be excessively penalized by fixed scoring rules. This undifferentiated evaluation makes the stability scores of different hardware difficult to compare and fails to reflect the true overall service quality of each type of hardware in a heterogeneous environment. Therefore, there is an urgent need for a method that can automatically aggregate the aforementioned multi-source data, quantitatively evaluate the three dimensions of performance, stability, and cost at the same time, and provide a complete record of the evaluation process and conclusions.

Summary of the Invention

[0004] In view of this, this application provides a method and apparatus for evaluating the performance of large-scale language model inference, in order to solve the problem that existing large-scale language model evaluation methods only focus on a single performance indicator and cannot comprehensively evaluate the overall performance of inference services, including cost-effectiveness and operational stability.

[0005] Specifically, this application is implemented through the following technical solution:

[0006] The first aspect of this application provides a method for evaluating the inference performance of large models, the method comprising:

[0007] Monitor the execution status of the large model load testing task, and based on the execution status, obtain the inference performance index data of the load testing task, the time-series monitoring data of the large model deployment hardware device within the corresponding time period of the load testing task, and the static asset cost information of the large model deployment hardware device;

[0008] Calculate a performance score based on the aforementioned inference performance index data;

[0009] A stability score is calculated based on the aforementioned time-series monitoring data;

[0010] Calculate the cost score based on the static asset cost information;

[0011] The comprehensive performance score of the large model inference is calculated by fusing the performance score, the stability score, and the cost score; wherein the performance score is calculated by weighting at least two performance indicators read from the inference performance index data; the stability score adopts a base score deduction system and is obtained by multiplying the monitoring indicators in the time-series monitoring data by their corresponding penalty coefficients and deducting the results from a base score; and the cost score is obtained by mapping the cost-effectiveness ranking corresponding to the ratio of the total number of output tokens to the cost item.

[0012] A second aspect of this application provides a large model inference performance evaluation device, the device comprising an acquisition module and a calculation module;

[0013] The acquisition module is used to monitor the execution status of the large model load testing task, and acquire the inference performance index data of the load testing task, the time-series monitoring data of the large model deployment hardware device within the corresponding time period of the load testing task, and the static asset cost information of the large model deployment hardware device based on the execution status.

[0014] The calculation module is used to calculate a performance score based on the inference performance index data;

[0015] The calculation module is also used to calculate a stability score based on the time-series monitoring data;

[0016] The calculation module is also used to calculate a cost score based on the static asset cost information;

[0017] The calculation module is also used to calculate the comprehensive performance score of the large model inference by fusing the performance score, the stability score, and the cost score; wherein the performance score is calculated by weighting at least two performance indicators read from the inference performance index data; the stability score adopts a base score deduction system and is obtained by multiplying the monitoring indicators in the time-series monitoring data by their corresponding penalty coefficients and deducting the results from a base score; and the cost score is obtained by mapping the cost-effectiveness ranking corresponding to the ratio of the total number of output tokens to the cost item.

[0018] The large model inference performance evaluation method and apparatus provided in this application automatically trigger data acquisition by monitoring the execution status of the load testing task. Three types of data (inference performance index data, time-series monitoring data, and static asset cost information), which come from the model itself, the model deployment hardware, and the model execution task, respectively, are incorporated into a unified analysis framework that covers all core objects involved in large model inference. After the performance score, stability score, and cost score are calculated separately, they are fused into a comprehensive performance score. This achieves a comprehensive quantitative evaluation of the inference service in terms of response quality, operational reliability, and economic benefit, and solves the problem that existing evaluation schemes focus on only a single performance indicator and cannot comprehensively evaluate the overall performance of the inference service, including cost-effectiveness and operational stability. Specifically, by monitoring the execution status of the load testing task and automatically acquiring the three types of data based on that status, the performance data, monitoring data, and cost information are aligned in time, avoiding the data window deviations caused by manual operation. The performance score is calculated from the inference performance index data and quantifies the response speed and service quality of the inference service as a standardized value. The stability score is calculated from the time-series monitoring data and quantifies the operational reliability of the inference service during load testing. The cost score is calculated from the static asset cost information and brings economic factors such as hardware procurement and power consumption into the evaluation system, making economic benefit a measurable scoring dimension.
Finally, a comprehensive performance score is calculated by integrating the performance, stability, and cost scores, consolidating the evaluations of the three dimensions into a single comparable score and providing a unified quantitative basis for hardware selection and deployment optimization decisions.

Attached Figure Description

[0019] Figure 1 is a flowchart of Embodiment 1 of the large model inference performance evaluation method provided in this application;

[0020] Figure 2 is a schematic diagram of the structure of Embodiment 2 of the large model inference performance evaluation device provided in this application.

Detailed Implementation

[0021] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.

[0022] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a," "said," and "the" used herein are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

[0023] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0024] Embodiment 1

[0025] The following specific embodiments are given to illustrate the technical solution of this application in detail.

[0026] Figure 1 is a flowchart of an embodiment of the large model inference performance evaluation method provided in this application. Referring to Figure 1, the method provided in this embodiment may include:

[0027] S101. Monitor the execution status of the large model load testing task, and based on the execution status, obtain the inference performance index data of the load testing task, the time-series monitoring data of the large model deployment hardware device within the corresponding time period of the load testing task, and the static asset cost information of the large model deployment hardware device.

[0028] It should be noted that in the traditional load testing process, the collection of performance data, the alignment of hardware monitoring data, and the correlation of cost information often rely on manual switching between different systems, which is inefficient and prone to data time window misalignment due to inconsistent operation timing. This step monitors the execution status of the load testing task and automatically triggers the subsequent data acquisition process as soon as the task is completed, ensuring the integrity and time consistency of data collection.

[0029] It's important to note that the monitoring of the load testing task's execution status is handled by the scheduling service. The scheduling service continuously monitors changes in the load testing task's status, and when the task's status changes to "completed," the scheduling service sends a trigger signal to initiate the subsequent data acquisition process. Using status monitoring instead of manual triggering has two advantages: firstly, automated triggering eliminates the need for on-site maintenance personnel, and the system responds immediately after the load testing task completes, avoiding discrepancies between the time window of the time-series monitoring data and the load testing period caused by manual delays; secondly, status monitoring accurately captures the task's completion time, providing precise time boundaries for subsequent accurate retrieval of monitoring data within the load testing period from the time-series database.

[0030] It should also be noted that the scheduling service employs different triggering strategies for different types of load testing tasks. The scheduling service first identifies the task type of the current load testing task. When the task type is a comparative evaluation type, it determines the associated tasks based on the task type. The scheduling service queries the task association table to identify all subtasks associated with the current load testing task, packages the current load testing task and all associated subtasks into a combined load testing task, and continuously monitors the execution status of all subtasks in the combined load testing task. A trigger signal is only sent after all subtasks have been completed. When the task type is a basic performance evaluation type, the scheduling service sends a trigger signal directly as soon as the status of the current load testing task changes to "completed."

[0031] The scheduling service internally maintains a mapping table between task types and trigger conditions. When a load testing task is a basic performance evaluation type, the scheduling service sends a trigger signal immediately after the task status changes to "completed," ensuring the response efficiency of regular evaluations. When a load testing task is a comparative evaluation type, meaning the same model needs to be tested sequentially on multiple different hardware platforms, the scheduling service sends a trigger signal uniformly after all related subtasks are completed, ensuring that the cost score calculation is based on a complete set of comparative data. This differentiated triggering mechanism based on task type avoids the problem of missing data caused by inconsistent completion times of subtasks in comparative evaluation scenarios.
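The differentiated triggering strategy described in paragraphs [0030] and [0031] can be sketched as follows. This is a minimal illustration, not the implementation of this application: the names `TaskType`, `LoadTestTask`, `should_trigger`, and the dictionary-based task association table are hypothetical and introduced only for the example.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    BASIC = "basic_performance_evaluation"
    COMPARATIVE = "comparative_evaluation"


@dataclass
class LoadTestTask:
    task_id: str
    task_type: TaskType
    status: str = "running"


def should_trigger(task, tasks_by_id, association_table):
    """Return True when the trigger signal for data acquisition may be sent.

    association_table maps a task id to the ids of its associated subtasks
    (a hypothetical stand-in for the task association table).
    """
    if task.task_type is TaskType.BASIC:
        # Basic evaluation: trigger as soon as the task itself completes.
        return task.status == "completed"
    # Comparative evaluation: package the task with its associated subtasks
    # and trigger only after every member of the combined task completes.
    subtask_ids = association_table.get(task.task_id, [])
    combined = [task] + [tasks_by_id[i] for i in subtask_ids]
    return all(t.status == "completed" for t in combined)
```

With a comparative task `a` associated with subtask `b`, `should_trigger` stays false until `b` also reaches "completed", which is what keeps the cost score from being computed on an incomplete comparison set.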

[0032] It should be noted that in this embodiment, the entity that performs the monitoring, acquisition, calculation and fusion operations is the evaluation system, hereinafter referred to as the system.

[0033] Upon receiving the trigger signal, the system obtains inference performance metrics data of the load testing task, time-series monitoring data of the large model deployment hardware devices within the corresponding time period of the load testing task, and static asset cost information of the large model deployment hardware devices from multiple data sources. The following describes the acquisition methods of each data source and its content.

[0034] Specifically, the system obtains the inference performance index data of the load testing task. This data is derived from the statistical results recorded by the load testing client during execution of the load testing task and reflects the end-to-end performance of the inference service when processing requests. The inference performance index data includes at least the first-token latency compliance rate (TTFT compliance rate) and the per-token output time compliance rate (TPOT compliance rate). First-token latency refers to the time interval from sending a request to receiving the first token of the inference result. The TTFT compliance rate is the proportion of requests whose actual first-token latency is lower than the preset target latency out of the total number of requests. Per-token output time refers to the average time interval between successive output tokens during inference. The TPOT compliance rate is the proportion of requests whose actual per-token output time is lower than the preset target time out of the total number of requests. The inference performance index data also includes the total number of output tokens (Total Output Tokens), which is the total number of inference output tokens generated by all requests during the load testing period.
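As an illustration of how the two compliance rates are defined, the following sketch computes the proportion of requests whose measured value is strictly lower than a preset target. The function name `compliance_rate` and the sample values are hypothetical; the load testing client may compute these statistics differently.

```python
def compliance_rate(values, target):
    """Proportion of requests whose measured value is lower than the target.

    values: per-request measurements (first-token latency or per-token
            output time), all in the same unit as target.
    """
    if not values:
        raise ValueError("no requests recorded")
    return sum(1 for v in values if v < target) / len(values)


# Hypothetical per-request first-token latencies in seconds, target 0.5 s.
ttft_samples = [0.2, 0.4, 0.6, 0.8]
ttft_rate = compliance_rate(ttft_samples, 0.5)
```

Here two of the four requests are below the target, so `ttft_rate` is 0.5; the TPOT compliance rate is obtained in the same way from per-token output times.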

[0035] The system acquires time-series monitoring data of the large-scale model deployment hardware devices for the corresponding time period of the load testing task. This data originates from a time-series database. During the load testing task, various operating parameters of the deployed hardware devices are continuously collected and stored in the time-series database. The system constructs time range query conditions based on the start and end times of the load testing task and retrieves all monitoring data for that time period from the time-series database, rather than extracting only a snapshot value at a single moment. This is because the load on the inference service changes dynamically during the load testing process; data from a single moment cannot reflect the true operating status of the hardware throughout the entire load testing period. Extracting data for the complete time period is essential to support accurate diagnosis of hardware bottlenecks and stability assessment.
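The difference between a time range query and a single snapshot can be sketched with an in-memory list standing in for the time-series database. The function `query_window` and the sample data are hypothetical; a real deployment would issue the equivalent range query to its own time-series store.

```python
from bisect import bisect_left, bisect_right


def query_window(samples, start, end):
    """Return all (timestamp, value) samples with start <= timestamp <= end.

    samples must be sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return samples[lo:hi]


# Hypothetical GPU utilization samples: (seconds since epoch, percent).
samples = [(0, 10), (5, 50), (10, 90), (15, 40), (20, 5)]
window = query_window(samples, start=5, end=15)
peak_utilization = max(value for _, value in window)
```

A snapshot taken only at the end of the window (timestamp 15) would report 40 percent, while the full window exposes the 90 percent peak, which is why the complete time period is retrieved.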

[0036] The time-series monitoring data collection targets the operating parameters of the graphics processors deployed for inference services and the server nodes they reside on. This includes at least the average power consumption of the GPU, which is used to measure the power cost of subsequent inference calculations; the peak utilization rate of GPU computing units, reflecting the upper limit of the load on computing cores during inference; the peak utilization rate of GPU memory bandwidth, reflecting the upper limit of memory access load during inference; the peak utilization rate of KV cache, reflecting the upper limit of key-value cache utilization during inference; and time-series data on request error rate and latency of each request, which are used for subsequent stability score calculations.

[0037] The system obtains static asset cost information for the large model deployment hardware devices. This information comes from an asset configuration database. It is independent of the dynamic operation of the load testing task and describes the procurement and operating cost parameters of the hardware devices themselves. It mainly includes the unit price of the graphics processor (GPU), i.e., the procurement cost of that GPU model; the unit electricity price, i.e., the price per unit of electricity in the data center or laboratory environment; and the expected service life of the equipment in hours. By combining these static parameters with the load testing duration, the hardware amortization cost and electricity cost attributable to the load testing task can be calculated, providing cost-side input for the subsequent cost-effectiveness index calculation.
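One way the static parameters could be combined with the load testing duration is sketched below. The formulas are assumptions made for illustration (straight-line amortization over the expected service life, and electricity cost from the average GPU power drawn during the test); this passage does not state the exact formulas, and the function name `load_test_cost` and its parameters are hypothetical.

```python
def load_test_cost(gpu_unit_price, lifespan_hours, electricity_price_per_kwh,
                   avg_power_watts, duration_hours, gpu_count=1):
    """Return (amortization_cost, electricity_cost) for one load test.

    Assumption: hardware cost is amortized linearly over its service life,
    and energy use equals average power multiplied by test duration.
    """
    if lifespan_hours <= 0:
        raise ValueError("lifespan_hours must be positive")
    amortization = gpu_unit_price * gpu_count * duration_hours / lifespan_hours
    energy_kwh = avg_power_watts * gpu_count / 1000.0 * duration_hours
    electricity = energy_kwh * electricity_price_per_kwh
    return amortization, electricity


# Hypothetical figures: one GPU priced at 20000, 40000-hour service life,
# 0.5 per kWh, 400 W average power, 2-hour load test.
amortization, electricity = load_test_cost(20000, 40000, 0.5, 400, 2)
```

With these figures the amortization share is 1.0 and the electricity cost is 0.4, in the same currency unit as the inputs. The average power is taken from the time-series monitoring data, while the other inputs come from the asset configuration database.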

[0038] The data acquisition process was completed through the automatic aggregation of the aforementioned multi-source data. It's important to note that the three types of data were not simply collected and summarized in isolation; rather, they originated from three core objects involved in large-scale model inference: inference performance metrics data from the load testing client, reflecting the end-to-end performance of the model itself in the inference task; time-series monitoring data from a time-series database, reflecting the operating status and resource consumption of the hardware deployed by the model during the inference process; and static asset cost information from an asset configuration database, reflecting the cost investment of the hardware infrastructure upon which the model's tasks rely. By incorporating all elements of the data from the model itself, the hardware deployed by the model, and the tasks executed by the model into a unified analytical framework, subsequent scoring calculations can balance task execution effectiveness with cost optimization. Performance and stability scores measure the quality of the inference service from a task execution perspective, while cost scores measure the economic efficiency of the inference service from a hardware investment perspective. The comprehensive performance score, resulting from the fusion of these three metrics, more accurately reflects the overall performance of the inference service in a real deployment environment.

[0039] S102. Calculate the performance score based on the inference performance index data.

[0040] It should be noted that the purpose of calculating the performance score is to transform the response speed and service quality exhibited by the large model inference service during stress testing into a standardized numerical value, providing a unified scale input for subsequent integration with scores from other dimensions.

[0041] It is important to note that the performance of an inference service cannot be evaluated with a single metric. First-token latency determines the response speed perceived by the user, while per-token output time determines the output efficiency of the generated content. Both reflect the quality of the inference service from different perspectives, and their relative importance may differ across real-world deployment scenarios. Therefore, the performance score assigns a weight to each of the two metrics and considers them together.

[0042] Specifically, the calculation of the performance score based on the inference performance index data includes:

[0043] (1) Read the first performance index and the second performance index from the inference performance index data.

[0044] The system extracts the first-token latency compliance rate (TTFT compliance rate) and the per-token output time compliance rate (TPOT compliance rate) from the inference performance index data obtained in step S101 and uses them as the first performance index and the second performance index, respectively. The first-token latency compliance rate is the proportion of requests whose actual first-token latency is lower than the preset target latency out of the total number of requests, with a value ranging from 0 to 1; the closer the value is to 1, the better the compliance. The per-token output time compliance rate is the proportion of requests whose actual per-token output time is lower than the preset target time out of the total number of requests, also ranging from 0 to 1. Both indicators are positive indicators, meaning that the larger the value, the better the performance.

[0045] (2) Obtain the first weight corresponding to the first performance index and the second weight corresponding to the second performance index respectively.

[0046] The two performance indexes are assigned corresponding weights: the first weight corresponds to the first-token latency compliance rate, and the second weight corresponds to the per-token output time compliance rate; the sum of the two weights is 1. The weight values can be adjusted according to how much the actual deployment scenario emphasizes response latency versus output efficiency. In scenarios with high real-time interaction requirements, a higher weight can be assigned to the first-token latency compliance rate; in batch processing scenarios where output efficiency matters more, a higher weight can be assigned to the per-token output time compliance rate. By default, each weight is set to 0.5, balancing the two metrics.

[0047] (3) Calculate the first product of the first performance index and the first weight, calculate the second product of the second performance index and the second weight, and add the first product and the second product to obtain the weighted sum.

[0048] The first-token latency compliance rate is multiplied by the first weight to obtain the first product, and the per-token output time compliance rate is multiplied by the second weight to obtain the second product. The two products are added together to obtain the weighted sum. The weighted sum reflects the overall performance after the relative importance of each indicator is taken into account. The higher the weighted sum, the better the inference service performs in the performance dimensions that users care about.

[0049] The above calculation process can be expressed as:

[0050] $$S_p = \mathrm{Normalize}\left(w_1 \cdot R_{TTFT} + w_2 \cdot R_{TPOT}\right)$$

[0051] Where $S_p$ is the performance score, $R_{TTFT}$ is the first-token latency compliance rate, $R_{TPOT}$ is the per-token output time compliance rate, $w_1$ is the first weight corresponding to the first-token latency compliance rate, $w_2$ is the second weight corresponding to the per-token output time compliance rate, and $w_1 + w_2 = 1$. Normalize() is the normalization function, which can be expressed as $\mathrm{Normalize}(z) = \dfrac{z - z_{min}}{z_{max} - z_{min}}$, where $z$ is the original index value, and $z_{max}$ and $z_{min}$ are the maximum and minimum values of the indicator, respectively.

[0052] (4) Normalize the weighted sum and use the normalization result as the performance score.

[0053] It should be noted that the normalization process is used to linearly map the weighted sum calculated in step (3) from the original [0, 1] interval to a preset standard scoring interval, so that the performance score can be compared with the stability score and cost score on the same scoring scale. The weighted sum itself is between 0 and 1, but in order to incorporate the performance score into an evaluation system that is unified with other dimensions, the weighted sum needs to be normalized and mapped to a preset scoring interval. The normalization process is implemented by the normalization function Normalize(), which multiplies the weighted sum by a preset full score value to obtain the final performance score. In one feasible scheme, the full score is one hundred points. The normalized performance score is one of the three inputs for subsequent comprehensive score fusion, and its value directly reflects the comprehensive performance of the inference service in terms of response speed and service quality during stress testing.
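Steps (1) to (4) can be sketched as follows. This is an illustrative reading under stated assumptions: the normalization is taken to be min-max scaling over the weighted sum's range [0, 1] followed by multiplication by a full score of 100, and the names `normalize`, `performance_score`, and `full_score` are hypothetical.

```python
def normalize(z, z_min=0.0, z_max=1.0, full_score=100.0):
    """Map z linearly from [z_min, z_max] onto [0, full_score] (assumed form)."""
    if z_max <= z_min:
        raise ValueError("z_max must be greater than z_min")
    return (z - z_min) / (z_max - z_min) * full_score


def performance_score(r_ttft, r_tpot, w1=0.5, w2=0.5):
    """Weighted sum of the two compliance rates, normalized to the score scale."""
    if abs(w1 + w2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    weighted_sum = w1 * r_ttft + w2 * r_tpot  # step (3)
    return normalize(weighted_sum)            # step (4)


# Default weights balance both metrics.
balanced = performance_score(0.9, 0.8)
# A real-time interaction scenario that emphasizes first-token latency.
interactive = performance_score(0.9, 0.8, w1=0.7, w2=0.3)
```

With compliance rates of 0.9 and 0.8, the default weights give a score of about 85, and shifting weight towards first-token latency raises it to about 87, showing how the weights express the deployment scenario's priorities.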

[0054] S103. Calculate a stability score based on the time-series monitoring data.

[0055] It should be noted that the purpose of the stability score is to transform the fluctuations in operational reliability and service quality exhibited by the inference service during stress testing into standardized values, providing a unified scale of input for subsequent integration with scores from other dimensions.

[0056] It's important to note that the request error rate reflects the proportion of failures the inference service encounters when processing requests, while latency jitter variance reflects the degree of fluctuation in the inference service's response time. Both reflect the service's stability and predictability from different perspectives. The stability score is calculated using a base score deduction system, meaning it starts from a maximum score and deducts points based on the severity of each unstable factor, rather than starting with zero points. This is because the normal operation of the inference service should be stable and reliable; unstable factors represent abnormal deviations and should be penalized from the maximum score.

[0057] Specifically, the calculation of the stability score based on the time-series monitoring data includes:

[0058] (1) Extract the first monitoring data and the second monitoring data from the time-series monitoring data.

[0059] The request error rate and latency jitter variance are extracted from the acquired time-series monitoring data and used as the first and second monitoring data, respectively. The request error rate is the proportion of failed requests to the total number of requests during the load test, ranging from 0 to 1; a smaller value indicates a more reliable service. Latency jitter variance is the variance of the latency of each request during the load test, used to measure the consistency of the inference service response time. A smaller variance indicates that the response time of each request is closer to the average, and the service is more stable; a larger variance indicates more drastic fluctuations in response time, and poorer predictability of the user experience.

[0060] The latency jitter variance is calculated by first averaging the latency of all requests during the load test, then summing the squared deviations of each request's latency from the average, and finally dividing by the total number of requests. This calculation process can be expressed as:

[0061] $\mu = \frac{1}{N}\sum_{i=1}^{N} t_i$;

[0062] $V_{jitter} = \frac{1}{N}\sum_{i=1}^{N} (t_i - \mu)^2$;

[0063] Where N is the total number of requests within the time window, $t_i$ is the time taken by the i-th request, μ is the average latency, and $V_{jitter}$ is the latency jitter variance.
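The variance calculation described in [0060] can be sketched in Python as follows. This is a minimal illustration only: the function name `latency_jitter_variance` and the sample latencies are assumptions, not part of the original text. It uses the population form (dividing by N), as the description states.

```python
from typing import Sequence


def latency_jitter_variance(latencies: Sequence[float]) -> float:
    """Population variance of per-request latencies (divides by N, not N - 1)."""
    n = len(latencies)
    if n == 0:
        raise ValueError("at least one request latency is required")
    mean = sum(latencies) / n  # mu: average latency over all requests
    # V_jitter: mean of squared deviations from the average latency
    return sum((t - mean) ** 2 for t in latencies) / n


# Hypothetical latencies (seconds) of four requests in one load test.
sample_latencies = [1.0, 2.0, 3.0, 4.0]
jitter_variance = latency_jitter_variance(sample_latencies)
```

The standard library function `statistics.pvariance` computes the same population variance and could be used instead.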

[0064] (2) Obtain the preset base score, the first penalty coefficient corresponding to the first monitoring data, and the second penalty coefficient corresponding to the second monitoring data.

[0065] Obtain the preset stability base score and the penalty coefficients. The base score is the maximum stability score, on a 100-point scale. The first penalty coefficient corresponds to the request error rate and specifies how many points are deducted from the base score per unit of request error rate (e.g., per one percent of error rate). The second penalty coefficient corresponds to the latency jitter variance and specifies how many points are deducted from the base score per unit of latency jitter variance.

[0066] The penalty coefficients are not fixed; they are dynamically determined based on the task accuracy requirements of the load testing task and the fault tolerance of the hardware on which the large model is deployed. The specific steps are as follows. First, the initial penalty benchmark is extracted: the system parses the metadata of the current load testing task and extracts the task accuracy requirement level. Tasks with high accuracy requirements, such as financial analysis and logical reasoning, are matched with a higher initial penalty coefficient; tasks with a certain degree of semantic tolerance, such as text summarization and casual conversation, are matched with a lower initial penalty coefficient. Next, a hardware fault tolerance feature vector is constructed: the system obtains key physical parameters of the large model deployment hardware in real time through the hardware management interface, including whether memory error correction is supported, the interconnect topology bandwidth, and the memory capacity margin, and feeds these parameters to a preset hardware fault tolerance evaluation model, which outputs a quantified hardware fault tolerance index. Then, the initial penalty coefficient is attenuated according to the hardware fault tolerance index: for devices with a higher index, the penalty coefficient is adjusted downwards by a larger margin; for devices with a lower index, the penalty coefficient is kept unchanged or adjusted downwards by a smaller margin. The adjusted penalty coefficient is never lower than the preset lower limit of the penalty coefficient. The first and second penalty coefficients are both determined in this way, each with its own initial penalty benchmark and hardware fault tolerance correction amount.

[0067] This dynamic penalty mechanism, which combines task accuracy requirements with hardware fault tolerance characteristics, ensures that in heterogeneous computing power evaluation, the stability score will neither be overly lenient on potential failures of high-fault-tolerant devices nor overly penalize normal physical jitter of low-fault-tolerant devices, thus achieving adaptive calibration of the evaluation standard.
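One possible form of the attenuation step is sketched below. The text does not give the attenuation formula, so the linear attenuation, the `max_attenuation` parameter, the [0, 1] range of the fault tolerance index, and all numeric values are assumptions chosen for illustration only. What is taken from the description is the direction of the adjustment (a higher index gives a larger reduction) and the lower limit.

```python
def adjusted_penalty(
    initial: float,
    fault_tolerance_index: float,
    max_attenuation: float = 0.5,  # assumed: largest fractional reduction
    lower_limit: float = 0.1,      # assumed: preset lower limit of the coefficient
) -> float:
    """Attenuate an initial penalty coefficient by a hardware fault tolerance index.

    The index is assumed to lie in [0, 1]; a higher index reduces the coefficient
    more, and the result never drops below lower_limit.
    """
    if not 0.0 <= fault_tolerance_index <= 1.0:
        raise ValueError("fault_tolerance_index must be in [0, 1]")
    attenuated = initial * (1.0 - max_attenuation * fault_tolerance_index)
    return max(attenuated, lower_limit)


# Hypothetical initial benchmark for a high-accuracy task.
strict = adjusted_penalty(2.0, fault_tolerance_index=0.0)   # no attenuation
lenient = adjusted_penalty(2.0, fault_tolerance_index=1.0)  # largest attenuation
floored = adjusted_penalty(0.15, fault_tolerance_index=1.0) # clamped at the limit
```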

[0068] (3) Calculate the third product of the first monitoring data and the first penalty coefficient, and calculate the fourth product of the second monitoring data and the second penalty coefficient.

[0069] Multiplying the request error rate by the first penalty coefficient yields the third product, which reflects the total penalty for request failures. Multiplying the latency jitter variance by the second penalty coefficient yields the fourth product, which reflects the total penalty for response time fluctuations. The sum of these two penalty values is the total penalty for instability factors in this load test.

[0070] (4) Subtract the third product and the fourth product from the base score to obtain the intermediate value.

[0071] Subtract the sum of the two deduction values calculated in step (3) from the base score to obtain the intermediate value. The intermediate value reflects the remaining stability score of the inference service after the effects of the unstable factors have been deducted. When the inference service performs perfectly during the stress test, that is, when the request error rate is zero and the latency jitter variance is zero, the intermediate value equals the full base score.

[0072] (5) Compare the intermediate value with zero, and take the larger one as the stability score.

[0073] The intermediate value is compared with zero, and the larger of the two is taken as the final stability score. This boundary handling ensures that the stability score is never negative. In real-world scenarios, if the request error rate of the inference service is extremely high or the latency fluctuations are extremely severe, the total deduction may exceed the base score; in this case, the stability score is floored at zero.

[0074] The calculation process for the stability score mentioned above can be expressed as follows:

[0075] $S_s = \max\left(0,\ S_{base} - (\alpha \cdot E_{rate} + \beta \cdot V_{jitter})\right)$;

[0076] Where $S_s$ is the stability score, $S_{base}$ is the base score, $E_{rate}$ is the request error rate, $V_{jitter}$ is the latency jitter variance, α is the first penalty coefficient, β is the second penalty coefficient, and max() is the maximum function, which ensures that the stability score is not lower than zero.
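Steps (3) to (5) can be sketched as a single Python function. The function name, the default base score of 100, and the coefficient values in the example are illustrative assumptions. The error rate is passed as a fraction in [0, 1], so `alpha` here means points deducted per unit of that fraction.

```python
def stability_score(
    error_rate: float,
    jitter_variance: float,
    alpha: float,
    beta: float,
    base_score: float = 100.0,
) -> float:
    """Base-score deduction: subtract both penalties, then floor the result at zero."""
    third_product = alpha * error_rate        # penalty for request failures
    fourth_product = beta * jitter_variance   # penalty for latency fluctuation
    intermediate = base_score - (third_product + fourth_product)
    return max(0.0, intermediate)


# Hypothetical coefficients: alpha = 200 points per unit error rate, beta = 10.
perfect = stability_score(0.0, 0.0, alpha=200.0, beta=10.0)
typical = stability_score(0.02, 0.5, alpha=200.0, beta=10.0)
collapsed = stability_score(0.9, 10.0, alpha=200.0, beta=10.0)
```

With these assumed coefficients, a 2% error rate and a jitter variance of 0.5 deduct 4 and 5 points respectively, while the last case deducts more than the base score and is floored at zero.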

[0077] S104. Calculate the cost score based on the static asset cost information.

[0078] It should be noted that the purpose of calculating the cost score is to transform the question of how much money is spent to produce how many tokens into a standardized value, so as to provide a unified scale of input for subsequent integration with scores from other dimensions.

[0079] It's important to note that existing large-scale model inference evaluation schemes typically focus only on performance metrics and cannot answer purchasing decision questions such as which graphics card is the most cost-effective. This step quantifies and correlates hardware procurement costs, power consumption costs, and inference output (total number of output tokens), making economic efficiency a measurable and comparable scoring dimension. The cost score is not simply the raw cost-effectiveness index; instead, the index is ranked against historical cost-effectiveness samples of similar hardware, so that the cost score reflects the evaluated device's economic competitive position among similar hardware.

[0080] Specifically, the calculation of the cost score based on the static asset cost information includes:

[0081] (1) Read the total number of output tokens from the inference performance index data, read the hardware configuration unit price and electricity unit price from the static asset cost information, and calculate the hardware amortization cost and electricity cost according to the duration of the stress test task.

[0082] The system reads the total output token count from the acquired inference performance index data. This count is the number of output tokens generated by all requests during the load test and reflects the total output of the inference service in this test. At the same time, the system reads the hardware configuration unit price and the electricity unit price from the acquired static asset cost information.

[0083] The hardware amortization cost is calculated as follows: divide the hardware configuration unit price by the estimated service life in hours, and then multiply by the duration of this load test. The estimated service life in hours is the total expected usage time of the equipment. The unit price is thus amortized per hour over the equipment's service life, and the share attributed to this load test reflects the hardware depreciation cost it incurs.

[0084] The electricity cost is calculated as follows: first, obtain the average GPU power $P_{avg}$ (unit: watts, W) and divide it by 1000 to convert it to kilowatts (kW); then multiply by the operating time T (unit: hours, h) to obtain the total energy consumption in kilowatt-hours (kWh); finally, multiply the total energy consumption by the unit electricity price $C_{unit}$ (unit: yuan/kWh) to obtain the electricity cost. The corresponding calculation formula is: $C_{power} = \frac{P_{avg}}{1000} \times T \times C_{unit}$.

[0085] (2) Calculate the ratio of the total number of output tokens to the sum of the hardware amortization cost and the power cost to obtain the cost-performance index.

[0086] Using the total number of output tokens as the numerator and the sum of hardware amortization costs and electricity costs as the denominator, the ratio of the two is calculated to obtain the cost-effectiveness index. The cost-effectiveness index reflects the number of inference tokens that can be produced per unit of cost. The higher the cost-effectiveness index, the more tokens the inference service produces with the same hardware and electricity cost investment, and the better the economic efficiency.

[0087] The above calculation process can be expressed as:

[0088] $ROI = \frac{T_{out}}{C_{hw} + C_{power}}$;

[0089] Where ROI is the cost-effectiveness index, $T_{out}$ is the total number of output tokens (Total_Output_Tokens), $C_{hw}$ is the hardware amortization cost, and $C_{power}$ is the electricity cost. Electricity cost = (average power / 1000) × duration × unit electricity price. Hardware amortization cost = (equipment unit price / estimated service life in hours) × duration.
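The cost calculation in steps (1) and (2) can be sketched as follows. The function names and all figures in the example (device price, service life, power draw, electricity price, token count) are hypothetical values chosen for easy arithmetic, not data from the original text.

```python
def hardware_amortization_cost(
    unit_price: float, service_life_hours: float, duration_hours: float
) -> float:
    """(equipment unit price / estimated service life in hours) x duration."""
    return unit_price / service_life_hours * duration_hours


def electricity_cost(
    avg_power_watts: float, duration_hours: float, price_per_kwh: float
) -> float:
    """(average power / 1000) x duration x unit electricity price."""
    return avg_power_watts / 1000.0 * duration_hours * price_per_kwh


def cost_effectiveness_index(
    total_output_tokens: int, c_hw: float, c_power: float
) -> float:
    """Output tokens produced per unit of cost (hardware plus electricity)."""
    total_cost = c_hw + c_power
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return total_output_tokens / total_cost


# Hypothetical 2-hour load test on a 100,000-yuan device with a 40,000-hour life.
c_hw = hardware_amortization_cost(100_000.0, 40_000.0, 2.0)  # 5.0 yuan
c_power = electricity_cost(500.0, 2.0, 0.8)                  # 0.8 yuan
roi = cost_effectiveness_index(1_160_000, c_hw, c_power)     # tokens per yuan
```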

[0090] (3) Obtain a sample set of cost-effectiveness index of historical stress testing tasks.

[0091] A sample set of cost-effectiveness indices from historical load testing tasks is retrieved from the database. This sample set contains the cost-effectiveness index values calculated for similar hardware in previous load tests, forming a reference distribution for comparison. The values in the sample set reflect the historical cost-effectiveness of similar hardware under different models and configurations, providing a benchmark for the cost-effectiveness index of the current evaluation object.

[0092] (4) Count the number of samples in the sample set whose values are less than or equal to the cost-effectiveness index, divide the number of samples by the total number of samples, and then multiply by the preset full score to obtain the cost score.

[0093] Iterate through each sample value in the cost-effectiveness index sample set, count the number of samples whose values are less than or equal to the current evaluation object's cost-effectiveness index, divide this number by the total number of samples in the sample set to obtain the percentile ranking of the current cost-effectiveness index in the historical samples, and then multiply the percentile ranking by the preset full score to obtain the final cost score.

[0094] The above calculation process can be expressed as:

[0095] $S_c = f_{rank}(ROI_{target}, Set_{ROI})$;

[0096] Where $f_{rank}()$ is the ranking scoring function, which is calculated as follows:

[0097] $f_{rank}(x, S) = \frac{1}{M}\sum_{i=1}^{M} \mathbb{I}(s_i \le x) \times 100$;

[0098] Where x is the cost-effectiveness index for which the ranking score is to be calculated, S is the historical cost-effectiveness index sample set, M is the total number of samples in S, $s_i$ is the cost-effectiveness index value of the i-th sample in S, and $\mathbb{I}(\cdot)$ is the indicator function, which returns 1 when the condition $s_i \le x$ in the parentheses is true and 0 otherwise. The summation therefore gives the number of samples in S that are less than or equal to x.

[0099] Therefore, in this step, $ROI_{target}$ is the cost-effectiveness index of the current evaluation object, $Set_{ROI}$ is the sample set of cost-effectiveness indices of historical load testing tasks obtained from the database, and M is the total number of samples in $Set_{ROI}$. The cost score $S_c$ is the result of substituting $ROI_{target}$ and $Set_{ROI}$ into the $f_{rank}()$ function: the proportion of samples in $Set_{ROI}$ that are less than or equal to $ROI_{target}$, multiplied by the full score (i.e., 100 in the formula above). A higher cost score indicates a better economic competitive position for the evaluated object among similar hardware.
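The ranking scoring function can be sketched in a few lines of Python. The function name `rank_score` and the historical sample values are illustrative assumptions. The count uses "less than or equal to", as in step (4).

```python
from typing import Sequence


def rank_score(x: float, samples: Sequence[float], full_score: float = 100.0) -> float:
    """Share of historical samples that are <= x, scaled to the full score."""
    if not samples:
        raise ValueError("the historical sample set must not be empty")
    count = sum(1 for s in samples if s <= x)  # sum of the indicator function
    return count / len(samples) * full_score


# Hypothetical historical cost-effectiveness indices of similar hardware.
history = [100.0, 120.0, 150.0, 180.0, 200.0]
cost_score = rank_score(150.0, history)  # 3 of 5 samples are <= 150
```

Because the current value is compared only against stored samples, the score depends on how representative the historical set is. With few samples, the score moves in coarse steps (here, steps of 20 points).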

[0100] S105. Calculate the comprehensive performance score of the large model inference by integrating the performance score, the stability score, and the cost score.

[0101] It's important to note that the fusion of comprehensive performance scores is not a simple summation or averaging. Different users prioritize different aspects of performance, cost, and stability across various scenarios. In performance-first scenarios, response speed and inference service throughput are primary considerations; in cost-first scenarios, economic efficiency and cost-effectiveness are the core concerns; and in stability-first scenarios, service reliability and predictability are more critical. Therefore, the fusion of comprehensive scores employs a user-configurable weighting mechanism, allowing users to adjust the relative importance of each dimension according to their specific needs. The fusion formula is:

[0102] $S_{total} = W_p \cdot S_p + W_c \cdot S_c + W_s \cdot S_s$;

[0103] Where $S_{total}$ is the comprehensive performance score; $S_p$, $S_c$, and $S_s$ are the performance score, cost score, and stability score, respectively; and $W_p$, $W_c$, and $W_s$ are the corresponding user-configured weights, which satisfy $W_p + W_c + W_s = 1$.
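A minimal sketch of the weighted fusion follows. The function name, the tolerance used to check that the weights sum to 1, and the example scores and weights are assumptions for illustration.

```python
def fuse_scores(
    s_p: float, s_c: float, s_s: float,
    w_p: float, w_c: float, w_s: float,
) -> float:
    """Weighted sum of performance, cost, and stability scores."""
    if abs(w_p + w_c + w_s - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_p * s_p + w_c * s_c + w_s * s_s


# Hypothetical performance-first configuration.
total_score = fuse_scores(s_p=80.0, s_c=60.0, s_s=90.0, w_p=0.5, w_c=0.2, w_s=0.3)
```

Since each input score lies on the same 0 to 100 scale and the weights sum to 1, the fused score stays on that scale as well.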

[0104] It should be noted that the base score $S_{base}$ of the stability score is not a fixed value, but is dynamically determined based on the processor type of the deployed hardware and the performance curves of that processor type in previous stress tests. Before this method is used for the first time, a processor type baseline table, baseline score conversion rules, and correction rules need to be established in advance. The specific steps for establishing them include:

[0105] First, operational data for each processor type is collected under preset standard loads. For each known processor type, a benchmark inference task is run under standard load conditions, and the baseline values of the request error rate and latency jitter variance are recorded. Standard load conditions include a preset number of concurrent requests, input sequence length, and output sequence length, to ensure that different processor types are compared under the same conditions.

[0106] Secondly, baseline score conversion rules are established. The request error rate baseline value and latency jitter variance baseline value of each processor type under standard load are converted into the initial base score value for that processor type according to a preset normalized mapping relationship. The principle for setting the mapping is: the lower the request error rate baseline value and the smaller the latency jitter variance baseline value, the higher the initial base score value, indicating better stability of the processor under standard load. The initial base score value of each processor type is written into the processor type baseline table together with the processor type identifier.

[0107] Then, correction rules are established, defining the mapping relationship between the base score correction amount and the performance curve trend. Performance curve variation characteristics are extracted from historical stress test data, including the slope of request error rate as concurrency increases and the distribution variance of latency jitter variance across different load ranges. A correspondence table between these variation characteristics and the base score correction amount is established. In the table, a larger slope indicates a greater downward correction of the base score, reflecting a faster decrease in processor stability under high concurrency; a larger distribution variance indicates a greater downward correction of the base score, reflecting more severe latency fluctuations under different loads. The corrected base score value is no lower than the preset lower limit and no higher than the preset upper limit.

[0108] After the processor type baseline table, baseline score conversion rules, and correction rules are established, the base score used in each stability score calculation is dynamically determined according to the following steps. The specific steps for dynamic determination include: First, reading the processor type identifier of the large-scale model deployment hardware device used in this load testing task; second, querying the processor type baseline table to determine whether the processor type is being tested for the first time. If so, the initial testing process is executed; otherwise, the dynamic correction process is executed.

[0109] The initial evaluation process includes: collecting baseline values for the request error rate and latency jitter variance of the processor type under a preset standard load; calculating the initial base score for the processor type according to the aforementioned baseline score conversion rules; and writing the initial base score into the record corresponding to the processor type identifier in the processor type baseline table. Different processor types have different stability baselines in inference scenarios. For example, the request error rate baseline of one processor model is naturally higher than that of another model under high concurrency. Using a uniform base score would make it impossible to directly compare the stability scores between different processors. By determining independent initial base scores for each processor type through the initial evaluation process, the comparability of stability scores between different processor types is ensured.

[0110] The dynamic correction process includes: retrieving the currently effective base score for the processor type from the processor type baseline table, and obtaining the request error rate and latency jitter variance data recorded in previous load tests for the processor type from the database; calculating the slope of the request error rate as concurrency increases, and the distribution variance of latency jitter variance in different load intervals; determining the base score correction amount according to the aforementioned correction rules based on the slope and distribution variance; adding the currently effective base score to the base score correction amount to obtain the corrected base score, and updating the base score record corresponding to the processor type in the processor type baseline table. As load test data for this type of processor continues to accumulate, the dynamic correction process is triggered multiple times, allowing the base score to be continuously adjusted according to the changing trend of the performance curve, and the stability score to reflect the relative stability of this type of processor among similar hardware.
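The dynamic correction process might look like the sketch below. The text does not specify how the slope is estimated or what the correspondence table contains, so the least-squares slope, the table `SLOPE_CORRECTIONS`, and the bounds of 60 and 100 are assumptions. Only the direction (a larger slope gives a larger downward correction) and the clamping to preset limits come from the description.

```python
from typing import List, Sequence, Tuple


def least_squares_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Slope of the best-fit line of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


# Hypothetical correspondence table: (minimum slope, correction), steepest first.
SLOPE_CORRECTIONS: List[Tuple[float, float]] = [(0.002, -10.0), (0.0005, -5.0)]


def lookup_correction(value: float, table: List[Tuple[float, float]]) -> float:
    """Return the correction of the first row whose threshold the value reaches."""
    for threshold, correction in table:
        if value >= threshold:
            return correction
    return 0.0


def corrected_base_score(current: float, correction: float,
                         lower: float = 60.0, upper: float = 100.0) -> float:
    """Apply the correction and clamp to the preset lower and upper limits."""
    return min(upper, max(lower, current + correction))


# Hypothetical history: error rate rises by 0.001 per additional concurrent request.
slope = least_squares_slope([10, 20, 30], [0.01, 0.02, 0.03])
new_base = corrected_base_score(95.0, lookup_correction(slope, SLOPE_CORRECTIONS))
```

A second lookup on the distribution variance of the latency jitter variance would follow the same pattern and is omitted here for brevity.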

[0111] Before fusing and calculating the overall performance score, the system first receives and weights the evaluation preference parameters. Specifically, before calculating the overall performance score of the large model inference by fusing the performance score, the stability score, and the cost score, the following steps are also included:

[0112] (1) Receive the evaluation preference parameters input by the user.

[0113] The system provides an entry point for configuring assessment preferences through a user interface, allowing users to select preset preference types. Assessment preference parameters include three options: performance priority, cost priority, and stability priority. Different preference choices correspond to different scoring weight combinations, enabling the overall score to better reflect the user's decision-making tendencies.

[0114] (2) Based on the evaluation preference parameters, match the corresponding target weight combination from the preset weight mapping table. The target weight combination includes performance weight, stability weight and cost weight.

[0115] The weight mapping table records the weight combination corresponding to each evaluation preference parameter. When the evaluation preference parameter is performance-oriented, the performance weight in the weight combination is higher; when it is cost-oriented, the cost weight is higher; and when it is stability-oriented, the stability weight is higher. The sum of all weights is always 1.

[0116] It should be noted that the target weight combination matched above is not fixed, but dynamically adjusted based on the cumulative deployment time of the large model's hardware, the accuracy requirements of this load test, and evaluation preference parameters. Each correction mapping table is pre-configured based on historical test data, and the specific steps for dynamic adjustment include:

[0117] First, the cost weight is adjusted based on the cumulative deployment time. The cumulative deployment time of the large model deployment hardware since its initial deployment is obtained, and the decay coefficient of the cost weight is calculated; the decay coefficient is negatively correlated with the cumulative deployment time. The cost weight adjustment amount obtained from the decay coefficient is then added to the matched cost weight to obtain the adjusted cost weight. The adjusted cost weight does not exceed the preset cost weight upper limit.

[0118] Secondly, the performance weights are adjusted according to task precision requirements. The task precision requirement level for this load testing task is obtained; a higher precision requirement level indicates stricter requirements on inference service response latency and total inference service throughput. The corresponding correction amount is read from the performance weight adjustment table based on the precision requirement level; the higher the precision requirement level, the larger the correction amount. This correction amount is then added to the matched performance weights to obtain the adjusted performance weights.

[0119] Then, the stability weights are adjusted for task accuracy requirements and evaluation preferences. The first adjustment amount is read from the stability weight adjustment table based on the task accuracy requirement level; the higher the accuracy requirement level, the larger the first adjustment amount. Simultaneously, the second adjustment amount is read from the stability preference adjustment table based on the evaluation preference parameter; when the evaluation preference parameter is stability priority, the second adjustment amount is positive. The matched stability weights are then added to the first and second adjustment amounts to obtain the adjusted stability weights.

[0120] Finally, the corrected performance weights, corrected stability weights, and corrected cost weights are normalized so that their sum equals 1 again. The normalized weight combination is then used as the final target weight combination for fusion calculation.
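The adjustment and renormalization steps can be sketched as follows. The correction amounts would come from the pre-configured correction tables, whose contents are not given in the text. The values below, the cost weight upper limit of 0.6, and the function name are illustrative assumptions.

```python
from typing import Tuple

Weights = Tuple[float, float, float]  # (performance, stability, cost)


def adjust_and_normalize(
    matched: Weights, corrections: Weights, cost_upper_limit: float = 0.6
) -> Weights:
    """Add per-dimension corrections, cap the cost weight, then rescale to sum to 1."""
    w_p = matched[0] + corrections[0]
    w_s = matched[1] + corrections[1]
    w_c = min(matched[2] + corrections[2], cost_upper_limit)
    if min(w_p, w_s, w_c) < 0:
        raise ValueError("adjusted weights must not be negative")
    total = w_p + w_s + w_c
    if total <= 0:
        raise ValueError("adjusted weights must have a positive sum")
    return (w_p / total, w_s / total, w_c / total)


# Hypothetical performance-first weights and correction amounts.
target_weights = adjust_and_normalize((0.5, 0.3, 0.2), (0.1, 0.05, -0.05))
```

Dividing by the sum keeps the relative proportions of the adjusted weights while restoring the constraint that they add up to 1.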

[0121] (3) During the fusion calculation, the performance score, the stability score and the cost score are weighted and summed according to the target weight combination to obtain the comprehensive performance score.

[0122] The system multiplies the performance score calculated in S102 by a performance weight, the stability score calculated in S103 by a stability weight, and the cost score calculated in S104 by a cost weight. The sum of these three products yields the overall performance score. The overall performance score is a standardized numerical value; a higher score indicates better overall performance of the inference service under the current evaluation preferences.

[0123] After generating a comprehensive performance score, the system also performs bottleneck diagnosis based on time-series monitoring data. The diagnosis is based on a pre-defined diagnostic rule base. It's important to note that the construction of this rule base is flexible. The system extracts time-series monitoring data accumulated from historical load testing tasks and the corresponding manually labeled bottleneck types to form rule primitives. Each rule primitive is essentially a conditional mapping between monitoring metric values and bottleneck types. When the rule base is created, the rule primitives are indexed and categorized according to hardware type. During subsequent use, this allows the rule subset matching the hardware type of the large model deployment hardware to be retrieved quickly, without traversing all rules.

[0124] Specifically, the establishment of the diagnostic rule base includes:

[0125] (i) Collect time-series monitoring data from historical load testing tasks and the corresponding manually labeled bottleneck types.

[0126] The system retrieves time-series monitoring data recorded in historical load testing tasks from the database, as well as bottleneck type conclusions manually marked by operations and maintenance personnel after each load test. The bottleneck type marking is based on the analysis of monitoring data and judgment of actual deployment experience. For example, when the video memory bandwidth utilization is close to full load and the total throughput of the inference service no longer increases with the increase of concurrent requests, the operations and maintenance personnel mark the bottleneck type of this load test as video memory bandwidth limited; when the KV cache occupancy rate exceeds a preset threshold, it is marked as video memory capacity limited.

[0127] (ii) Extract monitoring indicators and threshold ranges corresponding to the monitoring indicators from the time-series monitoring data, and encapsulate the correspondence between the monitoring indicators, the threshold ranges and the bottleneck types into rule primitives.

[0128] For each historical load test task, the system extracts key monitoring indicators that trigger bottlenecks and their threshold ranges from time-series monitoring data. For example, from a load test marked as a memory bandwidth-limited bottleneck, two monitoring features are extracted: the peak memory bandwidth utilization exceeds 90%, and the slope of the total inference service throughput increasing with concurrency is lower than a preset threshold. The system encapsulates the monitoring indicator name, conditional threshold (e.g., greater than 90%, slope lower than threshold), and the corresponding bottleneck type (memory bandwidth-limited) into a rule primitive. The rule primitive contains monitoring indicator name fields, conditional operator fields, conditional threshold fields, and bottleneck type fields, forming a complete mapping from condition to conclusion.

[0129] (iii) Generate a unique index identifier for each rule primitive and classify the rule primitives according to hardware type.

[0130] Specifically, the index identifier adopts a segmented encoding structure, consisting of three segments: hardware type code, rule classification code, and sequence number. The hardware type code is the first segment of the index identifier, occupying a fixed prefix position, and is used to identify the hardware type to which the rule primitive applies. The rule classification code is the second segment, used to identify the bottleneck type corresponding to the rule primitive. The sequence number is the third segment, an incremental number for rule primitives under the same hardware type and bottleneck type. During classification, based on the hardware type of the load testing task from which the rule primitive originates, the corresponding hardware type code is extracted and filled into the hardware type code field of the index identifier. The bottleneck type field in the rule primitive is parsed to extract the corresponding rule classification code, which is then filled into the rule classification code field of the index identifier. A sequence number is generated to ensure that the index identifier under the same hardware type code and rule classification code is unique. After classification, rule primitives with the same hardware type code prefix are grouped into the same hardware type category. By categorizing by hardware type, the diagnostic rules for different types of processors are logically independent of each other. For example, a memory bandwidth limitation rule applicable to a certain type of GPU will not be mistakenly applied to the diagnosis of another GPU with different memory specifications.
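The three-segment index identifier can be sketched as follows. The separator, the four-digit sequence width, and the codes `GPUA`, `GPUB`, and `MBW` are hypothetical. The text specifies only the segment order and that sequence numbers increase within the same hardware type and bottleneck type.

```python
from collections import defaultdict
from itertools import count
from typing import DefaultDict, Iterator, Tuple


class IndexAllocator:
    """Issues identifiers of the form <hardware code>-<rule class code>-<sequence>."""

    def __init__(self) -> None:
        # One independent counter per (hardware code, rule classification code) pair.
        self._counters: DefaultDict[Tuple[str, str], Iterator[int]] = defaultdict(
            lambda: count(1)
        )

    def next_id(self, hardware_code: str, rule_class_code: str) -> str:
        sequence = next(self._counters[(hardware_code, rule_class_code)])
        return f"{hardware_code}-{rule_class_code}-{sequence:04d}"


allocator = IndexAllocator()
first_id = allocator.next_id("GPUA", "MBW")
second_id = allocator.next_id("GPUA", "MBW")
other_id = allocator.next_id("GPUB", "MBW")
```

Because the hardware type code is a fixed prefix, all rule primitives for one hardware type can be selected with a simple prefix match on the identifier.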

[0131] (iv) Store the classified rule primitives and their index identifiers in the rule base to form a diagnostic rule base.

[0132] The system stores the categorized rule primitives in the diagnostic rule base according to hardware type. The diagnostic rule base uses a storage structure partitioned by hardware type, with the hardware type code as the partition index key. Each hardware type corresponds to an independent rule subset, and the rules within each subset are sorted by rule classification code. When a rule is added or modified, the system locates the corresponding rule subset based on the hardware type code of the rule primitive, so only the rule subset for that hardware type is affected and the rules for other hardware types are not. During diagnostic matching, the system directly locates and iterates through the rule subset matching the hardware type code of the hardware on which the large model is deployed, without traversing all rules. Users can select tags for specific hardware types and bottleneck types from the rule base to generate customized diagnostic rule combinations as needed.

[0133] It's important to note that the rule primitives in the diagnostic rule base are not fixed. The system dynamically adjusts the weights of these primitives based on feedback from diagnostic results. Each rule primitive maintains a confidence weight, initially set to 1. When a rule primitive is triggered and generates a diagnostic result, if the operations and maintenance personnel correct this result (e.g., changing the automatically diagnosed bottleneck type to another), the system records this correction event and reduces the confidence weight of the original rule primitive by a preset decay step. When the confidence weight of a rule primitive falls below a preset minimum confidence threshold, the rule primitive is marked as pending review and no longer participates in diagnostic matching. It will only become effective again after the operations and maintenance personnel reconfirm or adjust the rule conditions. This dynamic adjustment mechanism for rule confidence based on diagnostic feedback enables the diagnostic rule base to self-evolve, and the diagnostic accuracy gradually improves with usage time.
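The confidence feedback mechanism can be sketched with a small data class. The decay step of 0.2 and the minimum confidence threshold of 0.5 are assumed values, since the text says only that both are preset. The class and method names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class RuleConfidence:
    """Confidence state of one rule primitive; the weight starts at 1."""

    weight: float = 1.0
    pending_review: bool = False

    def record_correction(self, decay_step: float = 0.2,
                          min_threshold: float = 0.5) -> None:
        """Lower the weight after a manual correction; suspend the rule if too low."""
        self.weight = max(0.0, self.weight - decay_step)
        if self.weight < min_threshold:
            self.pending_review = True  # excluded from matching until reconfirmed

    def reconfirm(self) -> None:
        """Operations staff reconfirm the rule, which makes it effective again."""
        self.weight = 1.0
        self.pending_review = False


rule_state = RuleConfidence()
rule_state.record_correction()
rule_state.record_correction()
still_active = not rule_state.pending_review  # weight is about 0.6
rule_state.record_correction()                # weight is about 0.4, below 0.5
```

Resetting the weight to 1 on reconfirmation is a design choice made for this sketch. The text states only that the rule becomes effective again after reconfirmation or adjustment.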

[0134] After the diagnostic rule base is established, the system matches the time-series monitoring data acquired by S101 with the rule base and outputs the diagnostic results. Specifically, these include:

[0135] (1) Match the time-series monitoring data with the preset diagnostic rule library, traverse each diagnostic rule in the diagnostic rule library, and determine whether the value of each monitoring indicator in the time-series monitoring data meets the preset condition threshold in the diagnostic rule.

[0136] The system retrieves from the diagnostic rule base the rule subset corresponding to the hardware type on which the large model is deployed in the current load testing task. It then iterates through each rule primitive in the subset, checking whether the value of each monitoring indicator in the time-series monitoring data meets the threshold conditions defined in the rule primitive. Matching against a hardware-specific rule subset avoids traversing the rules of every hardware type, improving matching efficiency and diagnostic accuracy.

[0137] (2) When the value of the corresponding monitoring indicator in the time-series monitoring data meets the condition threshold, the corresponding diagnostic rule is triggered to generate a diagnostic result containing the bottleneck type and optimization suggestions.

[0138] When the conditions of a certain rule primitive are met, the system triggers the rule and generates a diagnostic record. The diagnostic record includes: the bottleneck type (e.g., memory bandwidth-limited, compute unit-limited, memory capacity-limited, communication bandwidth-limited, etc.), a description of the triggering condition (e.g., peak memory bandwidth utilization exceeds 90% and the total inference service throughput growth stagnates), and corresponding optimization suggestions (e.g., increasing tensor parallelism, enabling KV cache quantization, increasing batch processing size, etc.). The diagnostic results are output in conjunction with the overall performance score, allowing users to see both the overall performance and where the problems lie, as well as how to improve it.

[0139] After generating the diagnostic results, the system further locates the specific hardware components corresponding to the bottleneck, including:

[0140] (i) Based on the bottleneck type in the diagnostic results, query the preset anomaly location mapping table, which records the hardware component identifiers corresponding to each bottleneck type.

[0141] The system maintains an anomaly location mapping table, which associates each bottleneck type with a specific hardware component identifier. These hardware component identifiers include memory bandwidth unit identifiers, computing unit identifiers, and communication interface unit identifiers, with each bottleneck type corresponding to a unique identifier or a set of hardware component identifiers. For example, a memory bandwidth-limited bottleneck corresponds to a memory bandwidth unit identifier, a computing unit-limited bottleneck corresponds to a computing unit identifier, and a communication bandwidth-limited bottleneck corresponds to a communication interface unit identifier.

[0142] (ii) Read the target component identifier corresponding to the bottleneck type from the anomaly location mapping table, and determine the target component identifier as the anomaly location in the large model deployment hardware device.

[0143] The system reads the corresponding component identifier from the mapping table, marks it as an abnormal location, and outputs it along with the diagnostic results. Based on this, maintenance personnel can directly locate the hardware components that need attention or replacement.

[0144] (3) The diagnostic results are correlated with the comprehensive performance score and output.

[0145] The system associates the diagnostic results with the comprehensive performance score by task identifier and presents them in the same output report or the same set of database records, so that the evaluation conclusions for the inference service are reflected at both the overall level and the level of specific problems.
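Steps (1) to (3) above, together with the anomaly location lookup in (i) and (ii), can be sketched in Python as follows. This is a simplified illustration: each rule checks a single indicator against a single threshold, whereas the application also describes compound conditions. The rule fields, metric names, and mapping table entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str
    threshold: float
    bottleneck_type: str
    suggestion: str


# Hypothetical anomaly location mapping table: bottleneck type -> component identifiers.
ANOMALY_LOCATION = {
    "memory-bandwidth-limited": ["memory_bandwidth_unit"],
    "compute-limited": ["compute_unit"],
    "communication-bandwidth-limited": ["communication_interface_unit"],
}


def diagnose(task_id, peak_metrics, rules, overall_score):
    """Match peak indicator values against rules and attach the results to the score."""
    findings = []
    for rule in rules:
        value = peak_metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            findings.append({
                "bottleneck_type": rule.bottleneck_type,
                "trigger": f"{rule.metric}={value} exceeds {rule.threshold}",
                "suggestion": rule.suggestion,
                # Steps (i)-(ii): look up the component identifiers for this bottleneck.
                "anomaly_location": ANOMALY_LOCATION.get(rule.bottleneck_type, []),
            })
    # Step (3): results and overall score are linked by the task identifier.
    return {"task_id": task_id, "overall_score": overall_score, "diagnostics": findings}


rules = [
    Rule("mem_bw_util", 0.90, "memory-bandwidth-limited", "enable KV cache quantization"),
    Rule("compute_util", 0.95, "compute-limited", "increase tensor parallelism"),
]
report = diagnose("task-001", {"mem_bw_util": 0.93, "compute_util": 0.40}, rules, 82.5)
```

Here only the memory bandwidth rule fires, so the report carries one diagnostic record whose anomaly location is the memory bandwidth unit identifier.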

[0146] To support the long-term traceability of evaluation conclusions, after completing all scoring calculations and diagnostic analyses, the system stores all raw data and calculation results in snapshot form. The method also includes:

[0147] (1) After completing the calculation of the performance score, the stability score, the cost score and the comprehensive performance score, the inference performance index data, the time series monitoring data, the static asset cost information and each score result are assembled into a structured data object.

[0148] The system assembles inference performance index data, time-series monitoring data, and static asset cost information, along with calculated performance scores, stability scores, cost scores, comprehensive performance scores, and diagnostic results, into a complete structured data object according to a preset data structure. This structured data object contains end-to-end data from the original input to the final conclusion, and can completely reconstruct the calculation basis of this evaluation.

[0149] (2) Serialize the structured data object into text in a lightweight data exchange format.

[0150] The system serializes the structured data object into text in a lightweight data exchange format, such as JSON, YAML, or XML. These formats are language-independent, highly readable, and easy to parse, making them suitable as persistent storage formats for snapshot data. The serialized text preserves all fields and values of the structured data object in plain text.

[0151] (3) Obtain the task identifier of the current load test task, associate the text with the task identifier, and write it into the snapshot storage table of the database.

[0152] The system obtains the unique task identifier of the current load testing task, associates the task identifier with the serialized text, and writes both to a snapshot storage table in the database that is dedicated to snapshot data. Each record corresponds to a complete analysis snapshot of one load testing task, with the task identifier serving as the primary key to support fast retrieval by task identifier.
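A minimal sketch of the three snapshot steps, assuming JSON as the exchange format and an SQLite table as the snapshot storage table. The table name, column names, and the sample field values are hypothetical; the application does not name a specific database.

```python
import json
import sqlite3


def save_snapshot(conn, task_id, snapshot):
    """Serialize the structured data object and store it under the task identifier."""
    text = json.dumps(snapshot, ensure_ascii=False, sort_keys=True)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshot ("
        "task_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO snapshot (task_id, body) VALUES (?, ?)", (task_id, text))
    conn.commit()
    return text


conn = sqlite3.connect(":memory:")
snapshot = {
    "inference_metrics": {"total_output_tokens": 1200000},
    "monitoring": {"error_rate": 0.02, "latency_jitter_variance": 1.5},
    "asset_cost": {"hardware_unit_price": 3.0, "electricity_unit_price": 0.5},
    "scores": {"performance": 88.0, "stability": 75.0, "cost": 50.0, "overall": 73.9},
    "diagnostics": [],
}
save_snapshot(conn, "task-001", snapshot)
```

Using the task identifier as the primary key means a second write for the same task fails rather than silently overwriting an existing snapshot, which is one possible way to keep stored conclusions immutable.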

[0153] It should be noted that the data retention strategy for the snapshot storage table is configured according to the type and importance of the evaluation task. Snapshot data generated by routine performance evaluation tasks is stored for a preset default retention period, and snapshot data older than the retention period is automatically cleaned up. For tasks marked as important benchmark evaluations, such as hardware selection decisions and procurement assessments, the system marks the snapshot data as permanently retained, and the default retention period does not apply. Operations personnel can set the retention level of a task when creating it, and the system automatically matches the corresponding data cleanup strategy based on the retention level. This differentiated retention strategy ensures the long-term traceability of critical evaluation data while preventing the snapshot storage table from running out of storage space as large amounts of routine evaluation data accumulate.
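The differentiated cleanup rule can be sketched as a simple filter. The 90-day default retention period, the `retention_level` values, and the record layout are assumptions for illustration only.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=90)  # assumed default retention period


def expired_task_ids(snapshots, now, retention=DEFAULT_RETENTION):
    """Return task identifiers whose snapshots are eligible for cleanup.

    Snapshots marked "permanent" are never returned, regardless of age.
    """
    return [
        s["task_id"]
        for s in snapshots
        if s["retention_level"] != "permanent" and now - s["created_at"] > retention
    ]


records = [
    {"task_id": "t-old", "retention_level": "routine", "created_at": datetime(2026, 1, 1)},
    {"task_id": "t-new", "retention_level": "routine", "created_at": datetime(2026, 5, 1)},
    {"task_id": "t-benchmark", "retention_level": "permanent", "created_at": datetime(2025, 1, 1)},
]
to_delete = expired_task_ids(records, now=datetime(2026, 6, 1))
```

In this example only the old routine snapshot is selected for cleanup; the benchmark snapshot is older still but is kept because of its retention level.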

[0154] Taking JSON text as an example, when it is necessary to review historical evaluation results, the system provides a snapshot restoration function, which specifically includes:

[0155] (1) Receive the task identifier of the historical stress test task input by the user.

[0156] Users can input the task identifier of the historical load testing task that needs to be revisited through the system interface or API.

[0157] (2) Based on the task identifier, query the corresponding JSON text from the snapshot storage table of the database.

[0158] The system uses the task identifier as the query condition to retrieve the corresponding record from the snapshot storage table and extract the stored JSON text.

[0159] (3) Deserialize the JSON text to restore the inference performance index data, time series monitoring data, static asset cost information and each score result of the historical stress test task.

[0160] The system performs deserialization on the JSON text, restoring the JSON-formatted text into a structured data object. From the object, it extracts inference performance metrics, time-series monitoring data, static asset cost information, as well as performance scores, stability scores, cost scores, and overall performance scores. The restored data is completely consistent with the data at the time of the evaluation.

[0161] (4) Output the restored data to reproduce the historical evaluation conclusions.

[0162] The system presents the restored data and scoring results across all dimensions to the user in their original format, allowing the user to fully trace the entire evaluation process. Even if the original monitoring data in the time-series database has been cleaned up due to retention period limitations, or if the unit price of hardware configurations has changed due to market adjustments, historical evaluation conclusions are still verifiable, ensuring the accuracy of the audit and the traceability of the report.
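The restoration steps (1) to (4) can be sketched as the inverse of snapshot storage. As before, the SQLite table and column names are hypothetical, and the stored record is inserted here only so the example is self-contained.

```python
import json
import sqlite3


def restore_snapshot(conn, task_id):
    """Query the stored JSON text by task identifier and deserialize it."""
    row = conn.execute(
        "SELECT body FROM snapshot WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no snapshot stored for task {task_id!r}")
    return json.loads(row[0])


# Example setup: an in-memory table holding one previously stored snapshot.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (task_id TEXT PRIMARY KEY, body TEXT NOT NULL)")
stored = {
    "asset_cost": {"hardware_unit_price": 3.0},
    "scores": {"performance": 88.0, "stability": 75.0, "cost": 50.0},
}
conn.execute(
    "INSERT INTO snapshot (task_id, body) VALUES (?, ?)",
    ("task-001", json.dumps(stored)),
)

restored = restore_snapshot(conn, "task-001")
```

Because the snapshot carries its own copy of the unit prices and monitoring values, the restored object does not depend on the current state of the time-series database or on current hardware prices.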

[0163] It should be noted that, along with the overall performance score and diagnostic results, the system also generates a trend analysis report of the overall score. The system arranges the overall performance score of this load test task with the overall scores of previous load tests on the hardware platform in a time series, calculating the score change trend. If the overall score shows a continuous upward trend, it indicates that the inference service is undergoing continuous optimization; if the overall score drops significantly, the system automatically retrieves configuration items or software versions that changed between two adjacent load tests, prompting operations personnel to pay attention to changes that may affect performance. The trend analysis report is stored in the database along with the snapshot data, providing a continuous reference for the long-term performance management of the inference service.
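One way to sketch the trend analysis is shown below. The definition of a "significant" drop (a fall of at least 5 points between two adjacent load tests) and the representation of configuration items as a dictionary are assumptions; the application does not fix either.

```python
def analyze_trend(history, drop_threshold=5.0):
    """Analyze overall scores in chronological order.

    history: list of (overall_score, config_dict); the last entry is the
    current load test. drop_threshold is an assumed cut-off for a
    "significant" drop.
    """
    scores = [score for score, _ in history]
    rising = len(scores) >= 2 and all(b > a for a, b in zip(scores, scores[1:]))
    report = {"trend": "rising" if rising else "mixed", "changed_items": {}}

    if len(history) >= 2:
        (prev_score, prev_cfg), (cur_score, cur_cfg) = history[-2], history[-1]
        if prev_score - cur_score >= drop_threshold:
            report["trend"] = "significant_drop"
            # List configuration items or versions that differ between the two tests.
            keys = prev_cfg.keys() | cur_cfg.keys()
            report["changed_items"] = {
                k: (prev_cfg.get(k), cur_cfg.get(k))
                for k in sorted(keys)
                if prev_cfg.get(k) != cur_cfg.get(k)
            }
    return report


history = [
    (80.0, {"driver": "1.0", "batch_size": 8}),
    (82.0, {"driver": "1.0", "batch_size": 8}),
    (70.0, {"driver": "1.1", "batch_size": 8}),
]
trend_report = analyze_trend(history)
```

In this example the score falls by 12 points after the driver version changes, so the report flags a significant drop and lists `driver` as the changed item.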

[0164] The method provided in this embodiment automatically triggers data acquisition by monitoring the execution status of load testing tasks, and incorporates inference performance indicators, time-series monitoring data, and static asset cost information into a unified analysis framework. For the first time, it introduces a cost score in addition to the performance and stability scores, and fuses the scores of the three dimensions into a comprehensive performance score through a user-configurable weighting mechanism. This achieves a comprehensive quantitative evaluation of the inference service across three dimensions: response quality, operational reliability, and economic benefits, solving the problem that existing evaluation schemes focus on only a single performance indicator. Specifically, by monitoring the execution status of load testing tasks and automatically triggering data acquisition upon task completion, the method avoids time window deviations caused by manual operation and ensures that performance data, monitoring data, and cost information are aligned in time. The performance score is calculated by weighting and normalizing the first-character latency compliance rate and the character output time compliance rate, so that it covers both response latency and output efficiency and avoids the one-sidedness of a single indicator. The stability score uses a base score deduction system: starting from the full score, points are deducted according to the request error rate and the latency jitter variance, quantifying the operational reliability of the inference service. Instability factors are expressed as penalty terms, and the penalty coefficients are determined dynamically from the fault tolerance of the hardware device and the accuracy requirements of the task, ensuring the comparability of stability scores in heterogeneous hardware environments. A cost-effectiveness index is calculated as the ratio of the total number of output tokens to the sum of the hardware amortization cost and the power cost. This index is then ranked within a historical set of cost-effectiveness samples from similar hardware and mapped to a cost score, transforming economic benefits from absolute values into standardized scores that reflect competitive position. Furthermore, by receiving user-inputted evaluation preference parameters and matching them with the corresponding target weight combination, the scores of the three dimensions are weighted and summed according to the performance, stability, and cost weights. This ensures that the comprehensive performance score reflects the user's actual decision-making preferences, providing direct quantitative evidence for hardware selection and deployment optimization. In addition, this embodiment constructs an adaptive stability evaluation model for heterogeneous computing power environments. To address the evaluation distortion that arises in heterogeneous hardware clusters when existing stress testing tools apply a static, uniform scoring scheme, a dynamic penalty mechanism decouples the accuracy requirements at the task level from the hardware fault tolerance of the physical base. The stability deduction coefficient is thereby given a non-linear negative correlation with the hardware fault tolerance, which avoids both the masking of failures on high-fault-tolerance equipment and the over-penalization of normal fluctuations on low-fault-tolerance equipment. This significantly improves the objectivity and accuracy of selection and evaluation for large-scale heterogeneous computing power clusters.
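The scoring pipeline summarized above can be sketched in Python as follows. The stability score, cost score, and weighted fusion follow the calculations recited in the claims of this application; the normalization used for the performance score (dividing by the total weight and scaling to a maximum score), the 100-point scale, the penalty coefficient values, and the weight mapping table entries are assumptions made for the example. The dynamic derivation of penalty coefficients from hardware fault tolerance is not modeled; the coefficients are simply passed in.

```python
def performance_score(first_char_rate, output_time_rate, w1=0.5, w2=0.5, max_score=100.0):
    """Weighted sum of two compliance rates (each in [0, 1]), then normalized.

    The normalization shown here is an assumed form.
    """
    weighted_sum = first_char_rate * w1 + output_time_rate * w2
    return weighted_sum / (w1 + w2) * max_score


def stability_score(error_rate, jitter_variance, k_error, k_jitter, base=100.0):
    """Base score deduction: subtract penalty terms, floor at zero."""
    return max(0.0, base - error_rate * k_error - jitter_variance * k_jitter)


def cost_effectiveness(total_output_tokens, hardware_cost, power_cost):
    """Output tokens per unit of (hardware amortization + power) cost."""
    return total_output_tokens / (hardware_cost + power_cost)


def cost_score(index, history, max_score=100.0):
    """Share of historical samples not exceeding the index, scaled to max_score."""
    not_better = sum(1 for h in history if h <= index)
    return not_better / len(history) * max_score


# Hypothetical weight mapping table keyed by evaluation preference parameter.
WEIGHT_TABLE = {
    "balanced": {"performance": 0.4, "stability": 0.3, "cost": 0.3},
    "cost_first": {"performance": 0.2, "stability": 0.2, "cost": 0.6},
}


def comprehensive_score(perf, stab, cost, preference="balanced"):
    w = WEIGHT_TABLE[preference]
    return perf * w["performance"] + stab * w["stability"] + cost * w["cost"]


perf = performance_score(0.9, 0.8)
stab = stability_score(error_rate=0.02, jitter_variance=1.5, k_error=500.0, k_jitter=10.0)
index = cost_effectiveness(1_000_000, hardware_cost=3000.0, power_cost=1000.0)
cost = cost_score(index, history=[100.0, 200.0, 300.0, 400.0])
overall = comprehensive_score(perf, stab, cost, "balanced")
```

Because the cost score is a rank within the historical sample set rather than an absolute value, the same cost-effectiveness index yields a lower score as better-performing hardware enters the sample set.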

[0165] Example 2

[0166] Corresponding to the aforementioned embodiment of a large model inference performance evaluation method, this application also provides an embodiment of a large model inference performance evaluation device.

[0167] Figure 2 is a schematic diagram of the structure of Embodiment 2 of the large model inference performance evaluation device provided in this application. Referring to Figure 2, the apparatus provided in this embodiment includes an acquisition module 210 and a calculation module 220.

[0168] The acquisition module 210 is used to monitor the execution status of the large model stress test task, and acquire the inference performance index data of the stress test task, the time-series monitoring data of the large model deployment hardware device within the corresponding time period of the stress test task, and the static asset cost information of the large model deployment hardware device based on the execution status.

[0169] The calculation module 220 is used to calculate a performance score based on the inference performance index data;

[0170] The calculation module 220 is also used to calculate a stability score based on the time-series monitoring data;

[0171] The calculation module 220 is also used to calculate a cost score based on the static asset cost information;

[0172] The calculation module 220 is also used to calculate the comprehensive performance score of the large model inference by integrating the performance score, the stability score and the cost score; The performance score is calculated by weighting at least two performance indicators read from the inference performance index data; the stability score adopts a base score deduction system, which is obtained by deducting points after multiplying the monitoring indicators in the time-series monitoring data with the corresponding penalty coefficients; the cost score is obtained by mapping the cost-effectiveness ranking corresponding to the ratio of the total number of output tokens to the cost item.

[0173] The apparatus of this embodiment can be used to perform the steps of the method embodiment shown in Figure 1. Its implementation principle and process are similar and will not be repeated here.

[0174] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0175] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0176] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A method for evaluating the inference performance of large models, characterized in that, The method includes: Monitor the execution status of the large model load testing task, and based on the execution status, obtain the inference performance index data of the load testing task, the time-series monitoring data of the large model deployment hardware device within the corresponding time period of the load testing task, and the static asset cost information of the large model deployment hardware device; Calculate a performance score based on the aforementioned inference performance index data; A stability score is calculated based on the aforementioned time-series monitoring data; Calculate the cost score based on the static asset cost information; The comprehensive performance score of the large model inference is calculated by combining the performance score, the stability score, and the cost score. The performance score is calculated by weighting at least two performance indicators read from the inference performance index data; the stability score adopts a base score deduction system, which is obtained by deducting points after multiplying the monitoring indicators in the time-series monitoring data with the corresponding penalty coefficients; the cost score is obtained by mapping the cost-effectiveness ranking corresponding to the ratio of the total number of output tokens to the cost item.

2. The method according to claim 1, characterized in that, After calculating the overall performance score of the large model inference, the method further includes: The time-series monitoring data is matched with a preset diagnostic rule base, and each diagnostic rule in the diagnostic rule base is traversed to determine whether the value of each monitoring indicator in the time-series monitoring data meets the preset condition threshold in the diagnostic rule. When the value of the corresponding monitoring indicator in the time-series monitoring data meets the condition threshold, the corresponding diagnostic rule is triggered to generate a diagnostic result containing the bottleneck type and optimization suggestions. The diagnostic results are then correlated with the overall performance score and output.

3. The method according to claim 2, characterized in that, After generating the diagnostic result containing the bottleneck type and optimization suggestions, the method further includes: Based on the bottleneck type in the diagnostic result, query a preset anomaly location mapping table, the anomaly location mapping table recording the hardware component identifiers corresponding to each bottleneck type; Read the target component identifier corresponding to the bottleneck type from the anomaly location mapping table, and determine the target component identifier as the anomaly location in the large model deployment hardware device.

4. The method according to claim 2, characterized in that, The establishment of the diagnostic rule base includes: Collect time-series monitoring data from historical load testing tasks and the corresponding manually labeled bottleneck types; Extract monitoring indicators and threshold ranges corresponding to the monitoring indicators from the time-series monitoring data, and encapsulate the correspondence between the monitoring indicators, the threshold ranges and the bottleneck types into rule primitives; A unique index identifier is generated for each rule primitive, and the rule primitives are classified according to hardware type; The categorized rule primitives and their index identifiers are stored in the rule base to form a diagnostic rule base.

5. The method according to claim 1, characterized in that, The calculation of the performance score based on the inference performance index data includes: Read the first performance index and the second performance index from the inference performance index data; Obtain the first weight corresponding to the first performance indicator and the second weight corresponding to the second performance indicator, respectively. Calculate the first product of the first performance index and the first weight, calculate the second product of the second performance index and the second weight, and add the first product and the second product to obtain a weighted sum; The weighted sum is normalized, and the normalized result is used as the performance score.

6. The method according to claim 1, characterized in that, The calculation of the stability score based on the time-series monitoring data includes: Extract the first monitoring data and the second monitoring data from the time-series monitoring data; Obtain a preset base score, a first penalty coefficient corresponding to the first monitoring data, and a second penalty coefficient corresponding to the second monitoring data; Calculate the third product of the first monitoring data and the first penalty coefficient, and calculate the fourth product of the second monitoring data and the second penalty coefficient; Subtract the third product and the fourth product from the base score to obtain an intermediate value; The intermediate value is compared with zero, and the larger one is taken as the stability score.

7. The method according to claim 1, characterized in that, The calculation of the cost score based on the static asset cost information includes: The total number of output tokens is read from the inference performance index data, the hardware configuration unit price and the electricity unit price are read from the static asset cost information, and the hardware amortization cost and the power cost are calculated based on the duration of the load testing task; The ratio of the total number of output tokens to the sum of the hardware amortization cost and the power cost is calculated to obtain the cost-effectiveness index; Obtain a sample set of cost-effectiveness indices from historical load testing tasks; The cost score is obtained by counting the number of samples in the sample set whose values are less than or equal to the cost-effectiveness index, dividing that number by the total number of samples, and then multiplying the result by a preset maximum score.

8. The method according to claim 1, characterized in that, The method further includes: After calculating the performance score, stability score, cost score, and comprehensive performance score, the inference performance index data, the time-series monitoring data, the static asset cost information, and each score result are assembled into a structured data object. The structured data object is serialized into text in a lightweight data interchange format; Obtain the task identifier of the current load testing task, associate the text with the task identifier, and write it to the snapshot storage table of the database.

9. The method according to claim 1, characterized in that, Before calculating the comprehensive performance score of the large model inference by fusing the performance score, the stability score, and the cost score, the method further includes: Receive user-inputted evaluation preference parameters; Based on the evaluation preference parameters, a corresponding target weight combination is matched from a preset weight mapping table. The target weight combination includes performance weight, stability weight, and cost weight. During the fusion calculation, the performance score, the stability score, and the cost score are weighted and summed according to the target weight combination to obtain the comprehensive performance score.

10. A device for evaluating the inference performance of large models, characterized in that, The device includes an acquisition module and a calculation module; The acquisition module is used to monitor the execution status of the large model load testing task, and acquire the inference performance index data of the load testing task, the time-series monitoring data of the large model deployment hardware device within the corresponding time period of the load testing task, and the static asset cost information of the large model deployment hardware device based on the execution status. The calculation module is used to calculate a performance score based on the inference performance index data; The calculation module is also used to calculate a stability score based on the time-series monitoring data; The calculation module is also used to calculate a cost score based on the static asset cost information; The calculation module is also used to calculate the comprehensive performance score of the large model inference by fusing the performance score, the stability score and the cost score; The performance score is calculated by weighting at least two performance indicators read from the inference performance index data; the stability score adopts a base score deduction system, which is obtained by deducting points after multiplying the monitoring indicators in the time-series monitoring data with the corresponding penalty coefficients; the cost score is obtained by mapping the cost-effectiveness ranking corresponding to the ratio of the total number of output tokens to the cost item.