Integrated benchmark method of large language model
The integrated benchmark method addresses the gap in evaluating LLMs by aligning hardware and model performance data, providing a comprehensive assessment of LLMs on diverse hardware devices.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- KOREA ELECTRONICS TECH INST
- Filing Date
- 2024-12-13
- Publication Date
- 2026-06-18
AI Technical Summary
Current benchmarking methods for large language models (LLMs) primarily focus on evaluating AI model performance without considering hardware performance, leading to a lack of comprehensive evaluation in real-world scenarios.
An integrated benchmark method that aligns and samples hardware and model performance data using techniques like Nearest-Neighbor algorithms, time windowing, and regression analysis to evaluate LLMs on various hardware devices.
Enables comprehensive performance evaluation of LLMs by correlating hardware and model performance, optimizing device settings for execution time, power usage, temperature, and cost.
Abstract
Description
Integrated benchmarking method for large language models
[0001] The present invention relates to an integrated benchmark method for large language models, and specifically, to an integrated benchmark method that enables comprehensive performance evaluation by monitoring and sampling the performance of hardware processing large language models along with a benchmark for measuring the performance of large language models.
[0002] This invention results from research conducted by the Korea Electronics Technology Institute under the research task "Development of SW Technology for Large Artificial Neural Network Processing PIM-NPU Support System" of the research project "Development of SW Technology for Large Artificial Neural Network Artificial Intelligence Semiconductor", with support from the Ministry of Science and ICT (Project No. 00228255, Project Unique No. 1711195774).
[0003] With the rapid development of large language models (LLMs), various benchmarks are being developed to evaluate the language generation capability and accuracy of these models, and metrics such as accuracy and perplexity are evaluated through them. Table 1 below lists results for several language models of the Llama family on various benchmarks, together with the metric used by each benchmark.
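[0004] Table 1
Category | Benchmark | # Shots | Metric | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B
General | MMLU | 5 | macro_avg / acc_char | 66.7 | 66.7 | 79.5 | 79.3 | 85.2
General | MMLU-Pro (CoT) | 5 | macro_avg / acc_char | 36.2 | 37.1 | 55.0 | 53.8 | 61.6
General | AGIEval English | 3-5 | average / acc_char | 47.1 | 47.8 | 63.0 | 64.6 | 71.6
General | CommonSenseQA | 7 | acc_char | 72.6 | 75.0 | 83.8 | 84.1 | 85.8
General | Winogrande | 5 | acc_char | - | 60.5 | - | 83.3 | 86.7
General | BIG-Bench Hard (CoT) | 3 | average / em | 61.1 | 64.2 | 81.3 | 81.6 | 85.9
General | ARC-Challenge | 25 | acc_char | 79.4 | 79.7 | 93.1 | 92.9 | 96.1
Knowledge reasoning | TriviaQA-Wiki | 5 | em | 78.5 | 77.6 | 89.7 | 89.8 | 91.8
Reading comprehension | SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3
Reading comprehension | QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6
Reading comprehension | BoolQ | 0 | acc_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0
Reading comprehension | DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8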
[0005] In the metric column, acc (accuracy) is the percentage of correct predictions, and f1 (F1 score) is the harmonic mean of precision and recall, which is used to avoid the misleading results that accuracy can give on imbalanced data. However, these benchmark performance metrics only consider the AI model itself and do not take into account the hardware performance of the test bed used to measure the model's performance.
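For reference, the two metrics named above are conventionally defined as follows. The symbols are standard notation chosen here for illustration and do not appear in the source.

```latex
\mathrm{acc} = \frac{N_{\text{correct}}}{N_{\text{total}}}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```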
[0006] Various high-efficiency and low-power inference devices are emerging, including not only NVIDIA GPUs but also AMD’s MI-200 and MI-300, Intel’s Gaudi-3, and devices from startups such as Rebellion, HyperX, and DeepX. Consequently, methods for evaluating the performance of devices running LLMs are becoming more diverse; however, evaluations currently rely on specifications published by device manufacturers or vendors, and standardized evaluation methods that comprehensively analyze the performance of AI models and devices are lacking.
[0007] Since LLM performance evaluation and hardware performance measurement are conducted separately, a comprehensive benchmarking method is needed that can verify LLM execution performance in real-world scenarios while simultaneously evaluating device efficiency.
[0008] The present invention was conceived in response to the aforementioned lack and need, and the objective of the present invention is to provide an integrated benchmark method that enables comprehensive performance evaluation by monitoring and sampling the performance of hardware processing a large language model, along with a benchmark for measuring the performance of the large language model.
[0009] An integrated benchmark method for a large language model according to an embodiment of the present invention for achieving the above objectives comprises the steps of: preparing an LLM (large language model); preparing an input data set for benchmarking the LLM; generating individual input prompts for the input data set; obtaining hardware performance data by sampling hardware performance information of a test bed that executes the benchmark; generating an answer by inputting the generated input prompts into the LLM; obtaining model performance data including the generated answer and the correct answer of the input data set; and sorting the model performance data and the hardware performance data according to time and outputting result data.
[0010] In this case, the model performance data includes a processing start time (time_stamp) and a processing elapsed time (elapsed_time) for each individual input prompt, the hardware performance data includes the periodic sampling time at which the hardware performance information was sampled, and the outputting step may include aligning the model performance data and the hardware performance data based on the processing start time and the sampling time.
[0011] Here, the outputting step may perform the alignment by applying a Nearest-Neighbor (NN) algorithm to the processing start time and the sampling time.
[0012] Additionally, when the hardware performance information is sampled multiple times during the processing time of a single input prompt, the outputting step may use time windowing to condense the multiple hardware sampling data and align them with the model performance data for that single input prompt.
[0013] In this case, the outputting step may output result data in which the average of the values of the condensed hardware sampling data is aligned with the single model performance data.
[0014] Additionally, the outputting step may include, when the hardware performance information is sampled once during the processing time of a plurality of input prompts, extending and aligning the hardware performance data sampled once with respect to the model performance data for the plurality of input prompts.
[0015] In this case, the outputting step may include obtaining a plurality of hardware performance data corresponding to the plurality of model performance data using regression analysis.
[0016] Meanwhile, a computer-readable recording medium according to one embodiment of the present invention stores a program for executing the aforementioned integrated benchmark method on a computer.
[0017] Furthermore, a computing system for an integrated benchmark of a large language model according to one embodiment of the present invention includes a storage unit that stores an LLM (large language model) and an input data set for benchmarking the LLM, and a processor that generates individual input prompts of the input data set, obtains hardware performance data by sampling hardware performance information of a test bed that executes the benchmark, generates an answer by inputting the generated input prompts into the LLM, obtains model performance data including the generated answer and the correct answer of the input data set, sorts the model performance data and the hardware performance data according to time, and outputs result data.
[0018] Here, the model performance data includes a processing start time (time_stamp) and a processing elapsed time (elapsed_time) for each individual input prompt, the hardware performance data includes the periodic sampling time at which the hardware performance information was sampled, and the processor can align the model performance data and the hardware performance data by applying a Nearest-Neighbor (NN) algorithm to the processing start time and the sampling time.
[0019] In this case, if the hardware performance information is sampled multiple times during the processing time of a single input prompt, the processor can use time windowing to condense the multiple hardware sampling data and align them with the model performance data for that single input prompt.
[0020] In addition, when the processor samples the hardware performance information once during the processing time of a plurality of input prompts, it can extend and align the hardware performance data sampled once with respect to the model performance data for the plurality of input prompts.
[0021] According to the embodiments of the present invention described above, hardware information for executing an LLM in a test bed can be sampled and measured while performance results are simultaneously measured for the individual input samples that make up a language model benchmark. Hardware performance information and model performance information are aligned together, making it possible to cross-analyze LLM performance on the execution device and hardware information such as the device's power, execution time, and memory usage. Based on this, the settings of the device executing the LLM can be optimized in various aspects such as execution time, power usage, temperature, and cost.
[0022] FIG. 1 is a block diagram showing the configuration of a computing system for an integrated benchmark of a large language model according to an embodiment of the present invention;
[0023] FIG. 2 is a block diagram showing the specific configuration of the computing system of FIG. 1;
[0024] FIG. 3 is a block diagram schematically showing a software module stored in the storage unit of the computing system of FIG. 2;
[0025] FIG. 4 is a block diagram sequentially listing stepwise components for performing an integrated benchmark of a large language model according to an embodiment of the present invention;
[0026] FIG. 5 is a diagram schematically illustrating the time relationship of operations processing an integrated benchmark of a large language model according to an embodiment of the present invention;
[0027] FIG. 6 is a diagram illustrating an example of extended alignment of model performance data and hardware performance;
[0028] FIG. 7 is a diagram illustrating an example of model performance data and hardware performance arranged in a condensed manner; and,
[0029] FIG. 8 is a flowchart illustrating an integrated benchmark method for a large language model according to one embodiment of the present invention.
[0030] The present invention will be described in more detail below with reference to the drawings. Furthermore, in describing the present invention, detailed descriptions of related known functions or configurations are omitted if it is determined that such detailed descriptions would unnecessarily obscure the essence of the invention. Additionally, the terms described below are defined considering their functions in the present invention, and these may vary depending on the intentions or relationships of the user or operator. Therefore, their definitions should be based on the content throughout this specification.
[0031]
[0032] FIG. 1 is a block diagram showing the configuration of a computing system for an integrated benchmark of a large language model according to one embodiment of the present invention.
[0033] Referring to FIG. 1, the computing system (100) includes a storage unit (110) and a processor (120).
[0034] The computing system (100) runs benchmarks for AI models, particularly large language models (LLMs). Thus, the computing system (100) provides a test bed for benchmarking LLMs. The computing system (100) may be a single computer in a local area, or it may be two or more computers connected by network communication and performing distributed tasks.
[0035] The storage unit (110) stores an LLM (Large Language Model) to be benchmarked and an input data set for benchmarking the LLM. The input data set may be provided by a benchmark tool. The benchmark tool may be, for example, MMLU (Massive Multitask Language Understanding).
[0036] MMLU is a multiple-choice test consisting of questions with four options (A, B, C, D) covering 57 general knowledge domains, with only one correct answer, and is grouped into categories such as "Humanities," "Social Sciences," and "STEM."
[0037] The storage unit (110) can be implemented as a locally connected storage drive, but can also be implemented as a storage such as a cloud connected to a communication network.
[0038] The processor (120) performs operations to control the operation of the computing system (100) and each component.
[0039] The processor (120) generates individual input prompts for the input data set of the storage unit (110). The input prompts are input values for asking questions to the LLM and obtaining estimated answers. For example, one input prompt of the MMLU is as follows.
[0040] Question: What is the embryological origin of the hyoid bone?
Choices:
- A. The first pharyngeal arch
- B. The first and second pharyngeal arches
- C. The second pharyngeal arch
- D. The second and third pharyngeal arches
Correct answer:
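A minimal sketch of how such an input prompt might be assembled from one question of the data set. The function name `format_prompt` and its argument layout are assumptions made for illustration; only the prompt layout follows the example above.

```python
def format_prompt(question: str, choices: list[str]) -> str:
    """Build a multiple-choice prompt in the layout of the example above."""
    lines = [f"Question: {question}", "Choices:"]
    # Label the options A, B, C, D in order.
    for label, choice in zip("ABCD", choices):
        lines.append(f"- {label}. {choice}")
    # The prompt ends open so that the model supplies the answer letter.
    lines.append("Correct answer:")
    return "\n".join(lines)


prompt = format_prompt(
    "What is the embryological origin of the hyoid bone?",
    [
        "The first pharyngeal arch",
        "The first and second pharyngeal arches",
        "The second pharyngeal arch",
        "The second and third pharyngeal arches",
    ],
)
print(prompt)
```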
[0041] The processor (120) obtains hardware performance data by sampling hardware performance information of the test bed running the benchmark. To do this, the processor (120) starts hardware monitoring to check the hardware performance information of the test bed before LLM input. Hardware monitoring can be performed by utilizing a hardware profiling API (e.g., a hardware performance counter). The processor (120) can sample hardware performance information at a preset interval. An example of a log sampling the hardware performance information of the test bed every 0.5 seconds while performing an LLM benchmark may be as follows.
[0042] Time (s),GPU Utilization (%),Memory Usage (MB),Power Draw (W),Temperature (C),System Power (W)
1730444735.3667684,0,26920.5625,74.669,47,456
1730444735.9427137,94,26920.5625,282.37,56,544
1730444736.514267,95,26920.5625,243.7,58,544
1730444737.0783165,95,26920.5625,288.54,58,656
1730444737.6464677,94,26920.5625,277.255,58,656
1730444738.2424746,93,26920.5625,242.645,58,656
1730444738.8103955,94,26920.5625,267.834,58,656
1730444739.38651,95,26920.5625,289.65,58,656
1730444739.9505324,95,26920.5625,287.143,59,656
1730444740.526566,94,26920.5625,289.068,59,656
1730444741.0905929,93,26920.5625,265.202,59,656
1730444741.6587694,94,26920.5625,258.73,59,656
[0043] The hardware performance information sampled in the example includes sampling time based on system time, GPU utilization, memory usage, graphics card (VGA) power draw, temperature, and total system power. Then, the processor (120) inputs an input prompt into the LLM to generate an answer. Then, the processor (120) generates model performance data including the generated answer and the correct answer of the input data set.
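The periodic sampling described above could be sketched as a background monitor that runs alongside the benchmark. This is only an illustration: `read_metrics` is a hypothetical stand-in for whatever hardware profiling API the test bed provides, and the interval and field names are assumptions.

```python
import threading
import time
from typing import Callable


def read_metrics() -> dict:
    """Hypothetical stand-in for a call into a hardware profiling API."""
    return {"gpu_util": 0, "mem_mb": 0.0, "power_w": 0.0, "temp_c": 0}


def sample_hardware(stop: threading.Event, interval_s: float,
                    reader: Callable[[], dict], out: list) -> None:
    """Append one timestamped sample per interval until `stop` is set."""
    while not stop.is_set():
        out.append({"time": time.time(), **reader()})
        stop.wait(interval_s)  # waits one period, but returns early on stop


samples: list = []
stop = threading.Event()
monitor = threading.Thread(
    target=sample_hardware, args=(stop, 0.05, read_metrics, samples))
monitor.start()      # monitoring starts before the first prompt is issued
time.sleep(0.2)      # placeholder for the benchmark run
stop.set()
monitor.join()
print(len(samples), "samples collected")
```

Waiting on the event rather than calling `time.sleep` inside the loop lets the monitor stop promptly once the benchmark finishes.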
[0044] The following is an example of the logs accumulated and generated for each input prompt by the MMLU benchmark tool.
[0045] time_stamp,subject,iseq_len,kv_shape,elapsed_time,answer,predict
1730444735.2959201,high_school_government_and_politics,586,,0.08112883567810059,D,D
1730444735.3810372,high_school_government_and_politics,579,,0.08009529113769531,B,A
1730444735.464026,high_school_government_and_politics,508,,0.05876493453979492,B,B
1730444735.526574,high_school_government_and_politics,506,,0.04993176460266113,A,A
1730444735.5792358,high_school_government_and_politics,522,,0.053229570388793945,A,A
1730444735.63502,high_school_government_and_politics,525,,0.05431985855102539,A,A
1730444735.6918955,high_school_government_and_politics,498,,0.05124211311340332,B,B
1730444735.7456422,high_school_government_and_politics,521,,0.05376720428466797,D,D
1730444735.8020325,high_school_government_and_politics,522,,0.054204702377319336,B,D
1730444735.8600094,high_school_government_and_politics,498,,0.05009770393371582,C,C
1730444735.9131217,high_school_government_and_politics,547,,0.0543670654296875,D,D
[0046] In this example, the model performance information for each input prompt includes the LLM's processing start time (time_stamp), the MMLU question subject (subject), the input prompt length (iseq_len), the processing elapsed time (elapsed_time), the correct answer (answer), and the model's estimated answer (predict). It also has a field for the size of the key-value pairs (kv_shape), but this field is not recorded in this example. Then, the processor (120) aligns the model performance data and hardware performance data by time and outputs the result data.
[0047] As in the example above, model performance data includes the processing start time (time_stamp) and processing elapsed time (elapsed_time) for individual input prompts, and hardware performance data includes a periodic sampling time (Time) for sampling hardware performance information.
[0048] Therefore, model performance data and hardware performance data obtained by LLM benchmarking and hardware sampling performed at the same time are correlated with each other by the time information contained therein.
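The time fields that make this correlation possible can be recorded per prompt, as in the following sketch. The helper `run_prompt`, the `dummy_model` callable, and the way `iseq_len` is counted are assumptions for illustration; the kv_shape field is omitted.

```python
import time
from typing import Callable


def run_prompt(model: Callable[[str], str], prompt: str,
               subject: str, answer: str) -> dict:
    """Run one prompt and return a record with the log fields described above."""
    start = time.time()              # time_stamp: processing start time
    predict = model(prompt)          # estimated answer from the model
    elapsed = time.time() - start    # elapsed_time: processing duration
    return {
        "time_stamp": start,
        "subject": subject,
        # Assumption: length counted in whitespace-separated words,
        # whereas a real tool would more likely count tokens.
        "iseq_len": len(prompt.split()),
        "elapsed_time": elapsed,
        "answer": answer,
        "predict": predict,
    }


def dummy_model(prompt: str) -> str:
    """Hypothetical model that always answers 'A'."""
    return "A"


record = run_prompt(dummy_model, "Question: ... Correct answer:", "demo", "A")
print(record["answer"] == record["predict"])
```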
[0049] The processor (120) aligns the model performance data and the hardware performance data by applying the nearest-neighbor (NN) algorithm to the processing start time and the sampling time.
[0050] In this regard, referring to FIG. 5, FIG. 5 schematically illustrates the time relationship of operations processing an integrated benchmark of a large language model according to one embodiment of the present invention.
[0051] Referring first to FIG. 5a, benchmarking (510) and hardware monitoring (520) for the LLM are processed in parallel over time (t). Benchmarking (510) for the LLM inputs questions (531, 532, 533) in sequence and takes a considerable amount of time to obtain an estimated answer for each question. While the LLM answers the questions, the hardware monitor (520) samples hardware performance information several times at a sampling period (T).
[0052] FIG. 5b is similar to FIG. 5a, but the sampling period of the hardware monitor (520') has been extended. The extended sampling period (T') reduces the adverse effect that frequent sampling may have on the computations the test bed performs for LLM benchmarking.
[0053] Accordingly, FIG. 5a is a case where hardware performance information is sampled multiple times during the processing time of a single input prompt for one question of the benchmark. FIG. 5b is a case where hardware performance information is sampled once (541') during the processing time of multiple input prompts (question 2, question 3) for multiple questions of the benchmark.
[0054] The Nearest-Neighbor (NN) algorithm is a method that relates data points closest in time to each other. Depending on the number of variables, the nearest-neighbor technique can be generalized into a mathematical formula that calculates Euclidean distance as follows.
[0055]
[0056] In this embodiment, nearest neighbor data is determined by considering only time (t).
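As an illustration of this reduction, the standard Euclidean distance over n variables collapses to an absolute time difference when only time is considered. The symbols are assumptions introduced here, not taken from the source: t_i is the processing start time of the i-th input prompt, s_j is the j-th hardware sampling time, and j*(i) is the index of the sample aligned with prompt i.

```latex
d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
\quad\xrightarrow{\; n = 1 \;}\quad
d(t_i, s_j) = |t_i - s_j|, \qquad
j^{*}(i) = \arg\min_{j} \, |t_i - s_j|
```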
[0057] In the embodiment of FIG. 5a, the sampling data (540) does not overlap in time with the processing of a question; the nearest benchmark data to be aligned with it is the benchmark data for Question 3, whose processing start time is closest to the sampling time of the sampling data (540). In the embodiment of FIG. 5b, no sample is taken while Question 3 is being processed; the nearest sampling data to be aligned with the benchmark data (533) of Question 3 is the sample (541').
[0058]
[0059] Referring again to FIG. 1, when the hardware performance information is sampled multiple times during the processing time of one input prompt, the processor (120) uses time windowing to condense the multiple hardware sampling data and align them with the model performance data for that input prompt.
[0060] Here, time windowing refers to a method of setting the time (elapsed_time) (or period) during which the LLM processes a single input prompt as the time window, and mapping the multiple sampling data received within this period to that prompt.
[0061] To map the performance metrics of a model processing a specific question to a single hardware performance metric, multiple hardware performance data are condensed and aligned with a single model performance data.
[0062] Here, the processor can align the average of the values of the condensed hardware sampling data with the single model performance data. Specifically, condensed alignment may involve calculating the average of each value across the multiple hardware performance data and aligning hardware performance data composed of these averages with the single model performance data. For example, if the GPU utilization (%) obtained from three hardware samplings between the time a question's input prompt is entered and the time the LLM answers is 95, 94, and 93, the GPU utilization for processing that question is determined to be the average, 94%, and is aligned with the model performance data for that single input prompt. In addition to averages, cumulative sums may be calculated; for example, the total time accumulated over the three hardware samplings.
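The averaging inside a time window can be sketched as follows. The function name `condense` and the sample timings are hypothetical; only the utilization values 95, 94, and 93 come from the example in the text.

```python
from statistics import mean


def condense(start: float, elapsed: float,
             samples: list[tuple[float, float]]):
    """Average the sample values whose time falls inside the prompt's window."""
    end = start + elapsed
    in_window = [value for t, value in samples if start <= t <= end]
    # No sample inside the window means there is nothing to condense.
    return mean(in_window) if in_window else None


# (sampling time, GPU utilization %); the last sample lies outside the window.
gpu_util = [(10.0, 95), (10.5, 94), (11.0, 93), (11.5, 90)]
result = condense(start=9.9, elapsed=1.2, samples=gpu_util)
print(result)  # 94
```

A window that contains no sample returns `None`, which is the case handled by extended alignment instead.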
[0063] Additionally, when the hardware performance information is sampled once during the processing time of a plurality of input prompts, the processor (120) extends the hardware performance data sampled once and aligns it with the model performance data for the plurality of input prompts. That is, even if the times do not coincide, the plurality of model performance data are aligned with the one hardware performance data.
[0064] Extended alignment is a method for filling in missing data when two time series do not correspond one-to-one and one side contains missing values. The simplest form of extended alignment fills each missing value with the nearest valid value (previous or subsequent) in chronological order.
[0065] As another method, regression analysis can be used for extended alignment of hardware performance data. Specifically, the processor (120) can obtain multiple hardware performance data corresponding to the multiple model performance data using regression analysis.
[0066] Regression analysis is a statistical method that fits a model relating observed continuous variables and measures its goodness of fit. Briefly, even if no hardware performance data was sampled at the time of a specific model performance statistic, if the load on hardware resources is found to be roughly proportional to the value of that statistic (e.g., input prompt length), regression analysis estimates the hardware performance data that would have corresponded to it according to that trend or mean. As seen in the example above, since model performance information includes multiple variables (time_stamp, subject, iseq_len, elapsed_time, answer, predict) and hardware performance information includes multiple variables (Time, GPU Utilization, Memory Usage, Power Draw, Temperature, System Power), multiple regression analysis can be performed during extended alignment.
[0067]
[0068] Referring to FIG. 6, FIG. 6 illustrates an example of extended alignment of model performance data and hardware performance.
[0069] The alignment relationship between the two sets of data is determined by the difference between the time (TIME) values of the hardware performance data and the time (time_stamp) values of the model performance data. Specifically, the first four columns of Table 2 list the values obtained by subtracting four (yellow, green, purple, orange) hardware performance sampling times (TIME) from the benchmark time (time_stamp) of each model performance data (e.g., 1730444735.2959201 - 1730444735.3667684 = -0.0708483).
[0070] From the start of the first LLM benchmark until the acquisition of model performance data for the 6th question, the hardware performance data sampled from the first hardware performance information is extended and aligned (yellow).
[0071] From the 7th model performance data, the difference with the second hardware performance sampling time (-0.25) is closer than the difference with the first hardware performance sampling time (0.33). Therefore, the 7th model performance data is aligned with the second hardware performance data (green) (Nearest Neighbor Algorithm).
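The comparison above can be reproduced with a small nearest-neighbor lookup over time only. The function name `nearest_sample` is an assumption for illustration; the timestamps are the first two sampling times and the 6th and 7th time_stamp values from the example logs.

```python
def nearest_sample(time_stamp: float, sample_times: list[float]) -> int:
    """Return the index of the sampling time closest to `time_stamp`."""
    return min(range(len(sample_times)),
               key=lambda i: abs(time_stamp - sample_times[i]))


sample_times = [1730444735.3667684, 1730444735.9427137]
sixth, seventh = 1730444735.63502, 1730444735.6918955

print(nearest_sample(sixth, sample_times))    # 0: first hardware sample
print(nearest_sample(seventh, sample_times))  # 1: second hardware sample

# Differences quoted in the text for the 7th model performance data.
diff_first = round(seventh - sample_times[0], 2)
diff_second = round(seventh - sample_times[1], 2)
print(diff_first, diff_second)
```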
[0072] Regression is used to extend and align the first sampled hardware performance data from Hardware Monitoring Table 1 with the six model performance data from Benchmark Results Table 2 (Type #1).
[0073] In this embodiment, since only the first sampled value is considered, the values {u1, u2, ..., u6} of each item (GPU Utilization, Memory Usage, Power Draw, Temperature, System Power) of the hardware performance data extended and aligned to the six model performance data of the first set are all identical, which corresponds to a constant regression with zero slope. That is, the missing values are all filled with the nearby value.
[0074] Next, the second hardware performance data (green in Table 1) is regressed to have a linear or non-linear relationship with the values of the first hardware performance data over time, and is extended and aligned with the second set of 10 model performance data (green in Table 2) (Type #2).
[0075]
[0076] Next, referring to FIG. 7, FIG. 7 illustrates an example of model performance data and hardware performance arranged in a condensed manner.
[0077] The same data Tables 1 and 2 as in the embodiment of FIG. 6 are used, but in the example of FIG. 7, the six model performance data (yellow) of Table 2 are condensed and aligned with one hardware performance data (yellow) of Table 1.
[0078] The first set of hardware performance data and six series of model performance data, which are temporal nearest neighbor data, are condensed into a single integrated model performance data set consisting of Total seq len, Average seq len, Total elapsed time, Average elapsed time, Accuracy (Acc), and the number of correct answers out of total questions (Acc cnt). This allows the performance information of the LLM processing the six question prompts provided by the MMLU benchmark, along with the hardware performance information for the same time period, to be provided together. The multiple model performance data sets in the next nearest neighbor relationship can then be condensed and aligned with each individual hardware performance data set.
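A minimal sketch of this condensation, using the first six rows of the example benchmark log. The function name `summarize` and the dictionary keys are assumptions chosen to mirror the item names listed above.

```python
# First six rows of the log: (iseq_len, elapsed_time, answer, predict).
rows = [
    (586, 0.08112883567810059, "D", "D"),
    (579, 0.08009529113769531, "B", "A"),
    (508, 0.05876493453979492, "B", "B"),
    (506, 0.04993176460266113, "A", "A"),
    (522, 0.053229570388793945, "A", "A"),
    (525, 0.05431985855102539, "A", "A"),
]


def summarize(rows):
    """Condense several per-prompt records into one integrated record."""
    n = len(rows)
    total_len = sum(r[0] for r in rows)
    total_time = sum(r[1] for r in rows)
    correct = sum(1 for r in rows if r[2] == r[3])
    return {
        "total_seq_len": total_len,
        "avg_seq_len": total_len / n,
        "total_elapsed_time": total_time,
        "avg_elapsed_time": total_time / n,
        "acc": correct / n,
        "acc_cnt": f"{correct}/{n}",
    }


summary = summarize(rows)
print(summary["total_seq_len"], summary["acc_cnt"])  # 3226 5/6
```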
[0079]
[0080] FIG. 2 is a block diagram showing the specific configuration of the computing system of FIG. 1.
[0081] Referring to FIG. 2, the computing system (100) includes a storage unit (110) and a processor (120). Additionally, the computing system (100) further includes an input unit (130), a display unit (140), and a communication unit (150).
[0082] Here, the storage unit (110) and the processor (120) are as described above for the same components in FIG. 1, and redundant descriptions are omitted.
[0083] The input unit (130) provides means for user operation and commands of the computing system (100). The input unit (130) can be implemented as a peripheral device such as a keyboard, mouse, or touchpad.
[0084] The display unit (140) outputs a screen. The display unit (140) may include a display panel such as an LCD, OLED, TFT, or IPS. The display unit (140) can output a predetermined pixel value at a designated location from image data received by a driving driver. The display unit (140) may be implemented as an integrated touch display together with the input unit (130).
[0085] The communication unit (150) establishes a communication connection with another device. The communication unit (150) can be connected to a server via a broadband network to receive a benchmark tool or another language model (LLM) that is the target of the benchmark. Additionally, the communication unit (150) can transmit result data to another device or user terminal.
[0086] The processor (120) controls the components of the computing system (100) overall.
[0087] The processor (120) includes RAM (121), ROM (122), a main CPU (123), a graphics processing unit (124), first to n-th interfaces (125-1 to 125-n), and a bus (126).
[0088] The RAM (121), ROM (122), main CPU (123), graphics processing unit (124), first to n-th interfaces (125-1 to 125-n), etc. can be connected to each other via the bus (126).
[0089] The first to n-th interfaces (125-1 to 125-n) are connected to the various components (110, 130, 140, 150). One of the interfaces may be a network interface connected to an external device via a network. For example, the input unit (130) may be connected via short-range wireless communication or a Bluetooth link, and the storage unit (110) may be provided on a cloud server reached via broadband communication.
[0090] The main CPU (123) accesses the storage unit (110) and performs booting using the O/S stored in the storage unit (110). Then, various calculations for the integrated benchmark of the computing system (100) can be performed using various programs, content, data, etc. stored in the storage unit (110).
[0091] A set of instructions for booting the system is stored in the ROM (122). When a turn-on command is input and power is supplied, the main CPU (123) copies the O/S stored in the storage unit (110) to the RAM (121) according to the instructions stored in the ROM (122), and executes the O/S to boot the system. When booting is complete, the main CPU (123) copies various application programs stored in the storage unit (110) to the RAM (121), and executes the application programs copied to the RAM (121) to perform various operations.
[0092] The graphics processing unit (124) generates a screen containing various objects such as icons, images, and text using a calculation unit (not shown) and a rendering unit (not shown). The calculation unit (not shown) calculates attribute values such as coordinate values, shape, size, and color for each object to be displayed according to the layout of the screen based on a received control command. The rendering unit (not shown) generates a screen of various layouts containing objects based on the attribute values calculated by the calculation unit (not shown).
[0093] In particular, the graphics processing unit (124) can implement objects generated by the main CPU (123) into a GUI (Graphic User Interface), icons, user interface screens, etc. The implemented screens can be displayed on the display unit (140).
[0094] Meanwhile, the storage unit (110) stores at least one software module for controlling the computing system (100).
[0095]
[0096] FIG. 3 is a diagram showing the configuration of a storage unit (110) in which a software module for realizing the function of the computing system (100) of FIG. 2 is stored.
[0097] Referring to FIG. 3, the storage unit (110) includes an LLM benchmark module (111), a HW sampling module (112), a data alignment module (113), and a result data output module (114).
[0098] The LLM benchmark module (111) provides a tool for benchmarking large language models. The benchmark tool may be MMLU, BIG-Bench, GLUE, etc. The LLM benchmark module (111) may include problems and answers for evaluating performance.
[0099] The HW sampling module (112) samples hardware performance information of the test bed running the benchmark. The HW sampling module (112) may include a hardware profiling API (e.g., a hardware performance counter).
[0100] The data alignment module (113) aligns model performance data and hardware performance data according to time. The data alignment module (113) resolves one-to-many correspondences or mismatches caused by the difference between the time at which the model is benchmarked and the time at which the hardware is sampled. To this end, the data alignment module (113) can provide a nearest-neighbor algorithm for establishing time-based relationships between the data, regression analysis for extended alignment, a time window for reduced alignment, and statistical operations such as averaging and accumulation.
[0101] The result data output module (114) enables the output of result data including aligned model performance data and hardware performance data. Additionally, the result data output module (114) can output normalized data of model performance according to hardware specifications in a report of a certain format based on the model performance data and hardware performance data.
[0102]
[0103] FIG. 4 is a block diagram sequentially listing stepwise components for performing an integrated benchmark of a large language model according to one embodiment of the present invention.
[0104] Each block of FIG. 4 can be executed by a processor (120) of a computing system (100). First, input data (410) is provided by a benchmark tool, e.g., MMLU. The input data (410) is stored in a storage unit (110). Next, for each question of the input data (410), an input prompt is generated by a prompt generation unit (420). Next, the model execution unit (430) estimates the answer to the input prompt while, in parallel, the hardware monitor (470) monitors and samples the hardware performance of the test bed. Next, in block (440), the prediction result (estimated answer) output from the model execution unit (430) is evaluated by comparison with the correct answer of the input data (410), e.g., in terms of accuracy (acc). Then, the model performance data (benchmark information) obtained by benchmarking the target LLM and the data sampled from the hardware performance information (profiling information) are aligned by time (450). As previously explained, the alignment may include extended alignment or reduced alignment if a one-to-many relationship is established by the nearest-neighbor algorithm. Next, integrated analysis information is generated based on the aligned benchmark and profiling information (the model performance data and hardware performance data after extended or reduced alignment) (460).
[0105]
[0106] FIG. 8 is a flowchart illustrating an integrated benchmark method for a large language model according to one embodiment of the present invention.
[0107] Referring to FIG. 8, the integrated benchmark method for a large language model includes the step (S810) of preparing an LLM (large language model). The LLM may be an existing pre-trained language model or a language model additionally trained on a new dataset.
[0108] Next, the integrated benchmark method includes the step (S820) of preparing an input data set for benchmarking the LLM. The input data set may be provided by a benchmark tool such as MMLU.
[0109] Next, the integrated benchmark method includes the step (S830) of generating individual input prompts for the input data set. That is, it generates input data in a form for inputting the problem into the LLM.
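As an illustration of step S830, the following is a minimal sketch of turning one multiple-choice item into an input prompt. The field names (`question`, `choices`, `answer`) and the prompt layout are assumptions made for this example; the description does not specify a prompt format.

```python
import string
from typing import Any, Dict


def build_prompt(item: Dict[str, Any]) -> str:
    """Format one multiple-choice item as a text prompt (hypothetical layout)."""
    lines = [str(item["question"])]
    for letter, choice in zip(string.ascii_uppercase, item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical item; real benchmark items would come from the input data set.
item = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"}
prompt = build_prompt(item)
```

The correct answer is deliberately left out of the prompt so that it can be compared with the generated answer later.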
[0110] Next, the integrated benchmark method includes a step (S840) of obtaining hardware performance data by sampling hardware performance information of a test bed running the benchmark. Sampling may be performed at periodic sampling times, and the hardware performance data may include the sampling time at which the hardware performance information is sampled.
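The periodic sampling of step S840 could be sketched as a background thread that records a sampling time with every reading. The `read_metrics` callable and the metric names are hypothetical stand-ins for whatever hardware profiling API the test bed exposes; this is an illustration under those assumptions, not the implementation described here.

```python
import threading
import time
from typing import Callable, Dict, List


def sample_hardware(read_metrics: Callable[[], Dict[str, float]],
                    period_s: float,
                    stop: threading.Event,
                    out: List[Dict[str, float]]) -> None:
    """Append one record per period until `stop` is set (at least one record)."""
    while True:
        record = dict(read_metrics())
        record["sample_time"] = time.time()  # sampling time kept with the data
        out.append(record)
        # wait() returns True as soon as stop is set, ending the loop early
        if stop.wait(period_s):
            break


def fake_metrics() -> Dict[str, float]:
    # Placeholder values; a real test bed would query its profiling API here.
    return {"power_w": 42.0, "temp_c": 55.0}


samples: List[Dict[str, float]] = []
stop_event = threading.Event()
worker = threading.Thread(target=sample_hardware,
                          args=(fake_metrics, 0.01, stop_event, samples))
worker.start()
time.sleep(0.05)  # stands in for the benchmark running in parallel
stop_event.set()
worker.join()
```

Running the sampler in its own thread mirrors the parallel execution of answer generation and hardware monitoring described for FIG. 4.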
[0111] Next, the integrated benchmark method includes the step (S850) of generating an answer by inputting the generated input prompt into the LLM. The time required for the LLM to generate an answer may vary depending on the performance of the LLM and the specifications of the hardware.
[0112] Next, the integrated benchmark method includes the step (S860) of generating model performance data including the generated answer and the correct answer of the input data set. The model performance data may include a comparison to determine the correct answer rate or accuracy. Additionally, the model performance data may include a processing start time (time_stamp) and a processing elapsed time (elapse_time) for individual input prompts.
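A minimal sketch of the model performance data of step S860 follows. The field names `time_stamp` and `elapse_time` come from the description; the `ModelRecord` class, the `run_prompt` helper, the exact-match comparison, and the toy model are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModelRecord:
    time_stamp: float    # processing start time
    elapse_time: float   # processing elapsed time, in seconds
    answer: str          # answer generated by the LLM
    correct_answer: str  # correct answer from the input data set

    @property
    def is_correct(self) -> bool:
        return self.answer.strip() == self.correct_answer.strip()


def run_prompt(model: Callable[[str], str], prompt: str,
               correct_answer: str) -> ModelRecord:
    """Time one model call and keep the answer next to the correct answer."""
    start = time.time()
    t0 = time.perf_counter()
    answer = model(prompt)
    elapsed = time.perf_counter() - t0
    return ModelRecord(start, elapsed, answer, correct_answer)


def accuracy(records: List[ModelRecord]) -> float:
    return sum(r.is_correct for r in records) / len(records) if records else 0.0


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for the LLM: always answers "B".
    return "B"


records = [run_prompt(toy_model, "first question", "B"),
           run_prompt(toy_model, "second question", "C")]
```

Here `accuracy(records)` evaluates to 0.5, since the toy model matches one of the two correct answers.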
[0113] Next, the integrated benchmark method includes a step (S870) of aligning model performance data and hardware performance data over time and outputting result data. To align the model performance data and the hardware performance data, a Nearest-Neighbor (NN) algorithm may be applied to each processing start time and sampling time.
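The nearest-neighbor matching of processing start times to sampling times can be sketched as below. The function name and the tie-breaking rule (ties go to the earlier sample) are assumptions of this example; the description only states that an NN algorithm may be used.

```python
from bisect import bisect_left
from typing import List


def nearest_sample_index(sample_times: List[float], t: float) -> int:
    """Index of the sampling time closest to t.

    `sample_times` must be non-empty and sorted in ascending order.
    """
    i = bisect_left(sample_times, t)
    if i == 0:
        return 0
    if i == len(sample_times):
        return len(sample_times) - 1
    before, after = sample_times[i - 1], sample_times[i]
    # On a tie, prefer the earlier sample (an arbitrary choice for this sketch).
    return i if (after - t) < (t - before) else i - 1


sample_times = [0.0, 1.0, 2.0, 3.0]   # periodic hardware sampling times
start_times = [0.2, 1.6, 2.5, 9.0]    # processing start times (time_stamp)
matches = [nearest_sample_index(sample_times, t) for t in start_times]
# matches == [0, 2, 2, 3]
```

Two prompts mapping to the same sample index (index 2 above) is exactly the one-to-many situation that the extended and reduced alignment described next are meant to handle.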
[0114] When the hardware performance information is sampled multiple times during the processing time of a single input prompt, the multiple hardware sampling data can be compressed and aligned with respect to the model performance data for the single input prompt using time windowing. In this case, the average of the values of the multiple hardware sampling data compressed onto the single model performance data can be placed in the result data.
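A minimal sketch of the time-window reduction: all hardware samples that fall inside one prompt's processing interval are averaged into a single value. The half-open window `[time_stamp, time_stamp + elapse_time)` and the record layout are assumptions of this example.

```python
from statistics import mean
from typing import Dict, List, Optional


def window_average(samples: List[Dict[str, float]],
                   time_stamp: float,
                   elapse_time: float,
                   key: str) -> Optional[float]:
    """Average `key` over samples taken while one prompt was being processed.

    Returns None when no sample falls inside the window.
    """
    end = time_stamp + elapse_time
    values = [s[key] for s in samples if time_stamp <= s["sample_time"] < end]
    return mean(values) if values else None


# Hypothetical samples taken once per second.
samples = [
    {"sample_time": 0.0, "power_w": 10.0},
    {"sample_time": 1.0, "power_w": 20.0},
    {"sample_time": 2.0, "power_w": 30.0},
    {"sample_time": 3.0, "power_w": 40.0},
]
avg = window_average(samples, time_stamp=0.5, elapse_time=2.0, key="power_w")
# avg == 25.0 (samples at 1.0 s and 2.0 s fall inside the window)
```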
[0115] When the hardware performance information is sampled once during the processing time of multiple input prompts, the hardware performance data sampled once can be extended and aligned with respect to the model performance data for the multiple input prompts. In this case, multiple hardware performance data corresponding to the multiple model performance data can be obtained by using regression analysis for extended alignment.
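One way to picture extended alignment is to fit a regression over the sparse hardware samples and evaluate it at each prompt's processing start time. The description does not name a regression model; the sketch below assumes an ordinary least-squares line over (sampling time, value) pairs, and all function names are hypothetical.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct sampling times")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def extend_alignment(sample_times: List[float],
                     sample_values: List[float],
                     prompt_times: List[float]) -> List[float]:
    """Estimate one hardware value per prompt from sparsely sampled data."""
    slope, intercept = fit_line(sample_times, sample_values)
    return [slope * t + intercept for t in prompt_times]


sample_times = [0.0, 10.0, 20.0]        # sparse hardware sampling times
temps_c = [50.0, 60.0, 70.0]            # hypothetical temperature readings
prompt_starts = [2.0, 5.0, 12.0, 18.0]  # several prompts per sampling period
estimated = extend_alignment(sample_times, temps_c, prompt_starts)
# estimated == [52.0, 55.0, 62.0, 68.0]
```

A linear fit is only reasonable when the measured quantity changes smoothly between samples; other regression models could be substituted without changing the surrounding logic.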
[0116]
[0117] Meanwhile, a non-transitory computer-readable medium may be provided that stores a program for sequentially executing the integrated benchmark method of a large language model according to the present invention.
[0118] A non-transitory readable medium refers to a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only briefly, such as a register, cache, or memory. Specifically, the various applications or programs described above may be stored and provided on non-transitory readable media such as CDs, DVDs, hard disks, Blu-ray discs, USB drives, memory cards, and ROMs.
[0119] Furthermore, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Those skilled in the art can make various modifications without departing from the essence of the invention as set forth in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.
Claims
1. An integrated benchmark method for a large language model, comprising: preparing an LLM (Large Language Model); preparing an input data set for benchmarking the LLM; generating individual input prompts for the input data set; obtaining hardware performance data by sampling hardware performance information of a test bed executing the benchmark; generating an answer by inputting the generated input prompt into the LLM; obtaining model performance data including the generated answer and the correct answer of the input data set; and aligning the model performance data and the hardware performance data according to time and outputting result data.
2. The integrated benchmark method of claim 1, wherein the model performance data includes a processing start time (time_stamp) and a processing elapsed time (elapse_time) for each individual input prompt, the hardware performance data includes a periodic sampling time at which the hardware performance information is sampled, and the outputting step comprises aligning the model performance data and the hardware performance data based on the processing start time and the sampling time.
3. The integrated benchmark method of claim 2, wherein the outputting step performs the alignment using a Nearest-Neighbor (NN) algorithm on the processing start time and the sampling time.
4. The integrated benchmark method of claim 2, wherein, when the hardware performance information is sampled multiple times during the processing time of a single input prompt, the outputting step comprises compressing and aligning the multiple hardware sampling data with respect to the model performance data for the single input prompt using time windowing.
5. The integrated benchmark method of claim 4, wherein the outputting step outputs result data in which the average of the values of the compressed and aligned multiple hardware sampling data is aligned with the single model performance data.
6. The integrated benchmark method of claim 2, wherein, when the hardware performance information is sampled once during the processing time of multiple input prompts, the outputting step comprises extending and aligning the once-sampled hardware performance data with respect to the model performance data for the multiple input prompts.
7. The integrated benchmark method of claim 6, wherein the outputting step comprises obtaining multiple hardware performance data corresponding to the multiple model performance data using regression analysis.
8. A computer-readable recording medium storing a program for executing, on a computer, the integrated benchmark method of any one of claims 1 to 7.
9. A computing system for integrated benchmarking of a large language model, comprising: a storage unit that stores an LLM (Large Language Model) and an input data set for benchmarking the LLM; and a processor that generates individual input prompts for the input data set, obtains hardware performance data by sampling hardware performance information of a test bed running the benchmark, generates an answer by inputting the generated input prompt into the LLM, obtains model performance data including the generated answer and the correct answer of the input data set, and aligns the model performance data and the hardware performance data according to time and outputs result data.
10. The computing system of claim 9, wherein the model performance data includes a processing start time (time_stamp) and a processing elapsed time (elapse_time) for each individual input prompt, the hardware performance data includes a periodic sampling time at which the hardware performance information is sampled, and the processor aligns the model performance data and the hardware performance data using a Nearest-Neighbor (NN) algorithm on the processing start time and the sampling time.
11. The computing system of claim 10, wherein, when the hardware performance information is sampled multiple times during the processing time of a single input prompt, the processor uses time windowing to compress and align the multiple hardware sampling data with respect to the model performance data for the single input prompt.
12. The computing system of claim 10, wherein, when the hardware performance information is sampled once during the processing time of multiple input prompts, the processor extends and aligns the once-sampled hardware performance data with respect to the model performance data for the multiple input prompts.